How to install Spark?

We’ll explore three ways to install Spark:

  • using Jupyter Notebooks
  • using the Scala API
  • using the Python API (PySpark)

Using Jupyter Notebooks

Programming in Scala in Jupyter notebooks requires installing a package that provides a Scala kernel:

pip install spylon-kernel
python -m spylon_kernel install

Then, simply start a new notebook and select the spylon-kernel.
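To check that the kernel works, you can run a first cell like the one below. This is only a sketch (the variable name is arbitrary); spylon-kernel normally initializes Spark on the first execution and exposes a spark session:

val sampleData = spark.range(1, 1000)   // small test Dataset built from the session provided by spylon-kernel
println(sampleData.count())             // should print 999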

Using Scala

To run Spark in Scala locally, first download the Java SE Development Kit (“Java SE Development Kit 8u181” at the time of writing) from Oracle’s website. Make sure to use version 8, since there are some conflicts with higher versions.

Then, download the latest release from the Apache Spark website. When I first did this install, version 2.3.1 for Hadoop 2.7 was the latest.

Save the downloaded release in your home directory. To find out where that is, type echo $HOME in your terminal. It is usually /Users/YourName/.
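The release comes as a compressed .tgz archive. If your browser did not already unpack it, extract it in your home directory first (adjust the filename to the version you downloaded):

tar -xzf spark-2.3.1-bin-hadoop2.7.tgz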

To make sure that the installation is working, type the following in your terminal, from your home directory (replace the version with yours):

cd spark-2.3.1-bin-hadoop2.7/bin
./spark-shell

Your terminal should look like this:

[Image: spark-shell starting up in the terminal]

A user interface, called the Spark Shell application UI, should also be accessible on localhost:4040.

[Image: the Spark shell application UI at localhost:4040]
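To see a job show up in that UI, you can run a small computation from the shell prompt. The snippet below is only a sketch (the variable name is arbitrary); any Spark action will do:

val numbers = spark.range(0, 1000000)   // a Dataset of one million longs, using the built-in `spark` session
numbers.selectExpr("sum(id)").show()    // the action triggers a job you can inspect at localhost:4040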

Finally, we need to install SBT, an open-source build tool for Scala and Java projects, similar to Java’s Maven and Ant.

Its main features are:

  • Native support for compiling Scala code and integrating with many Scala test frameworks
  • Continuous compilation, testing, and deployment
  • Incremental testing and compilation (only changed sources are re-compiled, only affected tests are re-run, etc.)
  • Build descriptions written in Scala using a DSL
  • Dependency management using Ivy (which supports Maven-format repositories)
  • Integration with the Scala interpreter for rapid iteration and debugging
  • Support for mixed Java/Scala projects

Install it from the terminal using:

brew install sbt
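To get a feel for the Scala-based build DSL mentioned above, here is a minimal build.sbt for a project that pulls in Spark as a dependency. It is only a sketch: the project name is arbitrary, and the Scala and Spark versions should match the ones you installed (here 2.11.8 and 2.3.1):

// build.sbt — minimal Spark project definition
name := "spark-sandbox"            // arbitrary project name
version := "0.1"
scalaVersion := "2.11.8"           // Spark 2.3.1 is built against Scala 2.11

// Spark modules, resolved by Ivy from Maven-format repositories
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "2.3.1",
  "org.apache.spark" %% "spark-sql"  % "2.3.1"
)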

To check that the Spark installation is fully working, run (from the bin directory, as before):

./spark-shell

You should see a Scala interpreter:

Welcome to
   ____              __  
  / __/__  ___ _____/ /__
 _\ \/ _ \/ _ `/ __/  '_/
/___/ .__/\_,_/_/ /_/\_\   version 2.3.1
   /_/

Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_181)
Type in expressions to have them evaluated.
Type :help for more information.

scala>

Using PySpark

For PySpark, simply run:

pip install pyspark

Then, in your terminal, launch:

pyspark

Observe that you now have access to a Python interpreter instead of a Scala one.

Welcome to
   ____              __
  / __/__  ___ _____/ /__
 _\ \/ _ \/ _ `/ __/  '_/
/__ / .__/\_,_/_/ /_/\_\   version 2.3.1
   /_/

Using Python version 3.6.5 (default, Apr 26 2018 08:42:37)
SparkSession available as 'spark'.
>>> 

With this installation, you are also able to use PySpark in Jupyter notebooks by running:

import pyspark
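From there, a minimal notebook cell to get a working session could look like this (a sketch: the application name is arbitrary, and local[*] simply uses all local cores):

import pyspark
from pyspark.sql import SparkSession

# Create (or reuse) a local SparkSession for the notebook
spark = SparkSession.builder \
    .master("local[*]") \
    .appName("notebook") \
    .getOrCreate()

print(spark.range(10).count())  # should print 10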

Conclusion

I hope this tutorial was helpful. I’d be happy to answer any question you might have in the comments section.