If you haven’t seen it, I recommend reading first my tutorial on Hadoop MapReduce. In this series of articles, we’ll talk about Spark. Spark is an Apache Software Foundation project that was developed to speed up the Hadoop Computational software process.
What is Apache Spark?
Spark is an open-source large scale data processing and data analysis tool, coded in Scala, a functional programming language, highly typed, less verbal than Java.
Spark is available through 4 different APIs :
- Scala: Most popular API, in which the language is typed and compiled. Therefore, code errors are checked before launch, which avoids errors after 10 hours of computation for example. Scala’s API is often used in production.
- Python: Second most popular API. The language is however only checked at runtime. This API is mostly used in POCs and R&D environments.
- Java: If the company’s infrastructure is built in Java, this might be the best option.
- R: This API is rarely used, but there’s a community around it.
Spark is compatible with most databases and distributed (or not) file systems :
- S3 (AWS),
- Google Storage
Spark has an active community of over 1000 contributors, producing around 100 commits/week.
The main feature of Spark is that is stores the working dataset on the cluster’s cache memory, to allow faster computing.
Spark leverages task parallelization on multiple workers, just like MapReduce. Spark works the same way :
- On a single machine for tests and small data samples, local, without any cluster manager
- On a cluster for large data volumes
Spark is not a modified version of Hadoop, but Hadoop is a way to implement Spark. Indeed, Spark can be built with Hadoop components in the following ways :
- Spark on top of HDFS
- Spark on top of HDFS and Yarn
- Spark on top of HDFS and in MapReduce
- Spark Core, the underlying general execution engine for Spark platform. Other functionalities are built upon Spark Core, which provides in-memory computing and referencing datasets in external storage systems.
- Spark SQL: SparkSQL data-frames allow the use of RDD (Resilient Distributed Datasets), providing support for the structured and semi-structured data
- Spark Streaming: Spark Streaming allows pseudo-streaming, i.e. ingesting micro-batch (e.g every 20ms) and performing RDD transformations on those micro-batches.
- MLib: SparkML is a distributed machine learning framework above Spark because of the distributed memory-based Spark architecture. It includes the whole ML framework: pre-processing, cross-validation, algorithms… in a distributed system.
- GraphX: A distributed graph-processing framework on top of Spark, to model user-defined graphs by using Pregel abstraction API.
- 2003-4: MapReduce is introduced by Google
- 2006: Hadoop is developed by Yahoo!
- 2009: Spark Research projects start at UC Berkley
- 2010 : Spark Research paper
- 2010: Spark is open-sourced
- 2012 : RDD Research paper
- 2013: Spark joins the Apache Software Foundation
- 2013 : Spark Streaming paper
- 2014 : GraphX paper
- 2015 : SparkSQL paper
- 2016 : MLib paper
- 2017: Deep Learning pipelines in MLib
Why use Spark?
Hadoop applications spend more than 90% of their time doing HDFS read-write operations. Using data frames in cache greatly improves the processing time of Spark. Nowadays, Spark is widely used, and the community is way more active than around Hadoop.
Directly Acyclic Graph (DAG)
Spark offers a Directed Acyclic Graph service (DAG). The DAG :
- Cuts the script in stages: groups of tasks that do not require any data transfer between the nodes of the cluster
- Identifies the stages that can be used in parallel
- Determines where a task should be executed to minimize data transfer
- Optimizes data treatment pipeline
- Optimizes intermediate lost data collection
Resilient Distributed Datasets (RDD) is a fundamental data structure of Spark. It is an immutable, read-only, distributed collection of objects. Each dataset in RDD is divided into logical partitions, which may be computed on different nodes of the cluster. RDDs can contain any type of Python, Java, or Scala objects, including user-defined classes.
There are two ways to create RDDs :
- parallelizing an existing collection in your driver program,
- or referencing a dataset in an external storage system, such as a shared file system, HDFS, HBase, or any data source offering a Hadoop Input Format.
In Spark, there are several data formats :
- RDD: one of the version versions of Spark data structure. It takes the form of distributed tuples, e.g
[(1,2,3,4), (10,20,30,40), ...]. There is no column structure in RDDs.
- DataFrame: DataFrames are layered on top of RDDs. It looks like a Pandas DataFrame, with columns, but is indeed an RDD of rows.
- DataSet: DataSets are layers on top of DataFrames which allow specifying types for columns.
How does Spark work?
There are 2 types of Spark RDD operations :
- Actions are operations that return values (collect, count, take…).
- Transformations are operations that can be chained on your working dataset (filter, drop, map, union).
Only actions are evaluated. This is called lazy evaluation. You can define several transformations, are they’ll all be evaluated when you run the next action. Actions are really expensive, so you should limit the number of actions in your code.
The main transformations are :
There are 2 types of transformations :
- Narrow transformation: In Narrow transformation, all the elements that are required to compute the records in single partition live in the single partition of parent RDD.
- Wide transformation: In wide transformation, all the elements that are required to compute the records in the single partition may live in many partitions of parent RDD.
Actions, unlike transformations, do not produce new RDDs but values. Actions might be :
- count() : the number of elements in RDD
- collect() : returns the entire RDDs content
- take(n) : returns n number of elements from RDD
There are some key operations one can do in Spark that are interesting for parallel computation.
Since we work on a multi-node system, shuffling makes an efficient redistribution of our data on the working nodes. By shuffling the data across the cluster nodes, we can execute tasks much faster.
However, this operation requires to write data on disk and is therefore really expensive in terms on time.
Persistence is the operation that stores your dataset into fast RAM.
- In some application, and especially in machine learning, we need to access the same data multiple times.
- With .persist() we force Spark to put a data collection into RAM memory.
- If you persist your client’s dataset, each node will put his part of the dataset into cache memory.
The partitioning describes how to split datasets on a cluster. The number of partitions determines the number of tasks that will be created by the scheduler.
When you’re working with a more complex dataset, you may consider using a partition key, just like SQL. This key will determine which part of the dataset goes on which node.
Broadcasting is a way to force caching a dataset, on every node of the cluster.
Joins in Spark are similar to joins in SQL (INNER, LEFT, RIGHT, FULL).
Conclusion: All those concepts still need illustration, but it’s the purpose of the next article. I hope this high-level overview was clear and helpful. I’d be happy to answer any question you might have in the comments section.
Like it? Buy me a coffee