Run jobs on Dataproc

Dataproc clusters come with pre-installed software such as:

  • Apache Hive: for SQL-like processing of structured data. HiveQL is a declarative language.
  • Apache Pig: for cleaning data and turning semi-structured data into structured data. Pig Latin is a procedural dataflow language; it describes transformation steps without dictating resource allocation, which can make it a better fit in a pipeline.
  • Apache Spark: for general data processing and building pipelines, well suited to unstructured data.
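As a quick sanity check that these tools are available, you can print their versions from the cluster's master node. A minimal sketch, assuming a default Dataproc image where all three are on the PATH:

```shell
# Run on the cluster's master node (e.g. after SSHing in).
# On a default Dataproc image each tool is pre-installed and on the PATH.
hive --version          # Apache Hive
pig --version           # Apache Pig
spark-submit --version  # Apache Spark
```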

To submit a job, we can establish an SSH connection to the cluster's master node and run Hive, Pig, or Spark there, or submit jobs remotely with the `gcloud dataproc jobs submit` command.
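Both submission paths can be sketched with `gcloud`. This is a sketch only: the cluster name `my-cluster`, region/zone, table `my_table`, and the `gs://my-bucket/...` script paths are hypothetical placeholders, not values from the source.

```shell
# Option 1: SSH to the master node (Dataproc names it <cluster-name>-m)
# and run hive/pig/spark-submit interactively there.
gcloud compute ssh my-cluster-m --zone=us-central1-a

# Option 2: submit jobs remotely through the Dataproc Jobs API, no SSH needed.
gcloud dataproc jobs submit hive \
    --cluster=my-cluster --region=us-central1 \
    --execute="SELECT COUNT(*) FROM my_table;"

gcloud dataproc jobs submit pig \
    --cluster=my-cluster --region=us-central1 \
    --file=gs://my-bucket/clean_data.pig

gcloud dataproc jobs submit pyspark gs://my-bucket/job.py \
    --cluster=my-cluster --region=us-central1
```

Remote submission via the Jobs API is generally preferable for automation, since each job is tracked in the Dataproc console and no interactive session is required.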