Run jobs on Dataproc
Dataproc comes with pre-installed softwares such as :
- Apache Hive : for SQL-like processing of structured data. HiveQL is an imperative language.
- Pig : for cleaning data and turning semi-structured data into structured data. Pig is a declarative language, and does not decide of the resource allocation. It can fit better in a pipeline.
- and Spark : for data processing and pipelines, ideal for unstructured data
To submit a job, we can establish a SSH tunnel to the cluster and run Pig/Spark.