Run jobs on Dataproc

Dataproc clusters come with pre-installed software such as:

  • Apache Hive: for SQL-like processing of structured data. HiveQL is a declarative language.
  • Apache Pig: for cleaning data and turning semi-structured data into structured data. Pig Latin is a procedural dataflow language; it describes transformation steps without dictating resource allocation, which can make it a better fit in a pipeline.
  • Apache Spark: for general data processing and building pipelines, well suited to unstructured data.
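As a quick sanity check that these tools are available, you can print their versions from the cluster's master node. A minimal sketch, assuming a default Dataproc image where all three are on the PATH:

```shell
# Run on the cluster's master node (e.g. after SSHing in).
# On a default Dataproc image each tool is pre-installed and on the PATH.
hive --version          # Apache Hive
pig --version           # Apache Pig
spark-submit --version  # Apache Spark
```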

To submit a job, we can establish an SSH connection to the cluster's master node and run Hive, Pig, or Spark there, or submit jobs remotely with the `gcloud dataproc jobs submit` command.
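Both submission paths can be sketched with `gcloud`. This is a sketch only: the cluster name `my-cluster`, region/zone, table `my_table`, and the `gs://my-bucket/...` script paths are hypothetical placeholders, not values from the source.

```shell
# Option 1: SSH to the master node (Dataproc names it <cluster-name>-m)
# and run hive/pig/spark-submit interactively there.
gcloud compute ssh my-cluster-m --zone=us-central1-a

# Option 2: submit jobs remotely through the Dataproc Jobs API, no SSH needed.
gcloud dataproc jobs submit hive \
    --cluster=my-cluster --region=us-central1 \
    --execute="SELECT COUNT(*) FROM my_table;"

gcloud dataproc jobs submit pig \
    --cluster=my-cluster --region=us-central1 \
    --file=gs://my-bucket/clean_data.pig

gcloud dataproc jobs submit pyspark gs://my-bucket/job.py \
    --cluster=my-cluster --region=us-central1
```

Remote submission via the Jobs API is generally preferable for automation, since each job is tracked in the Dataproc console and no interactive session is required.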