How to install Spark?
We’ll explore three ways to install Spark:
- using Jupyter Notebooks
- using the Scala API
- using the Python API (PySpark)
Using Jupyter Notebooks
Programming in Scala in Jupyter notebooks requires installing a package to activate a Scala kernel:

pip install spylon-kernel
python -m spylon_kernel install

Then, simply start a new notebook and select the spylon-kernel.
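To confirm the kernel responds, you can run a trivial cell. This snippet is just a sanity check, nothing Spark-specific:

// A trivial Scala cell to confirm the spylon-kernel is active
val greeting = "Hello from Scala"
println(greeting)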
Using the Scala API
To install Scala locally, download the Java SE Development Kit 8 (“Java SE Development Kit 8u181”) from Oracle’s website. Make sure to use version 8, since there are some conflicts with higher versions.
Then, on the Apache Spark website, download the latest version. When I did my first install, version 2.3.1 for Hadoop 2.7 was the latest.
Download the release and save it in your home directory. To find out where that is, type echo $HOME in your terminal; on macOS it is usually /Users/<your-username>.
To make sure that the installation is working, open your terminal in your home directory and type (replacing the version with yours):

cd spark-2.3.1-bin-hadoop2.7/bin
./spark-shell
Your terminal should display Spark’s startup logs and drop you into a scala> prompt.
A user interface, called the Spark Shell application UI, should also be accessible on localhost:4040.
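Once inside the shell, you can run a quick sanity check. This snippet is only illustrative; in the Spark shell, a SparkContext is already bound to sc:

// `sc` (a SparkContext) is predefined in the Spark shell
val numbers = sc.parallelize(1 to 100) // distribute the numbers 1 to 100
println(numbers.sum())                 // sums them: prints 5050.0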
Finally, we need to install SBT, an open-source build tool for Scala and Java projects, similar to Java’s Maven and Ant.
Its main features are:
- Native support for compiling Scala code and integrating with many Scala test frameworks
- Continuous compilation, testing, and deployment
- Incremental testing and compilation (only changed sources are re-compiled, only affected tests are re-run, etc.)
- Build descriptions written in Scala using a DSL
- Dependency management using Ivy (which supports Maven-format repositories)
- Integration with the Scala interpreter for rapid iteration and debugging
- Support for mixed Java/Scala projects
Install it from the terminal (on macOS, via Homebrew) using:
brew install sbt
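As an aside, the Scala-based build DSL mentioned above is how you would later describe a Spark project. A minimal build.sbt sketch might look like this (the project name is illustrative; the versions match this tutorial’s install):

// build.sbt — a minimal sketch for a Spark 2.3.1 project
name := "spark-demo"
version := "0.1"
scalaVersion := "2.11.8"
// spark-sql pulls in spark-core as a transitive dependency
libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.3.1"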
To check that the installation is fully working, launch the Spark shell again from the bin directory:

./spark-shell

You should see a Scala interpreter:
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.3.1
      /_/

Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_181)
Type in expressions to have them evaluated.
Type :help for more information.

scala>
Using the Python API (PySpark)
For PySpark, simply run:
pip install pyspark
Then, in your terminal, launch:

pyspark
Observe that you now have access to a Python interpreter instead of a Scala one.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.3.1
      /_/

Using Python version 3.6.5 (default, Apr 26 2018 08:42:37)
SparkSession available as 'spark'.
>>>
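As in the Scala shell, you can run a quick sanity check. This snippet is only illustrative, relying on the spark session shown above:

# `spark` (a SparkSession) is predefined in the PySpark shell
df = spark.range(1, 101)                      # the numbers 1 to 100
print(df.agg({"id": "sum"}).collect()[0][0])  # sums them: prints 5050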
With this install, you are also able to use PySpark in Jupyter notebooks, since the pyspark package is now importable from Python.
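For example, a minimal notebook cell to spin up a local session might look like this (the app name is arbitrary):

# In a Jupyter notebook cell: build a local SparkSession from the pip-installed package
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("notebook-demo").getOrCreate()
print(spark.range(5).count())  # prints 5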
Conclusion
I hope this tutorial was helpful. I’d be happy to answer any questions you might have in the comments section.