In the next four articles, we’ll launch Hadoop MapReduce jobs (WordCount on a large file) using the Hortonworks Sandbox.

Getting started with the VM

Install and launch

The Hortonworks Sandbox is a straightforward, pre-configured learning environment that contains the latest developments from Apache Hadoop, packaged as the Hortonworks Data Platform (HDP). The Sandbox ships as a virtual appliance that can run in the cloud or on your own machine. Hortonworks also offers a data-in-motion framework for IoT solutions called Hortonworks DataFlow (HDF). If you would rather configure Hadoop from scratch on a Linux VM, this tutorial might be useful: https://www.tutorialspoint.com/hadoop/hadoop_enviornment_setup.htm

Hortonworks lets you use the Hadoop tools by connecting to the virtual machine over SSH for command-line work. Numerous web interfaces are also available.

Download the HDP Sandbox from the Hortonworks website. The Sandbox makes it easy to get started with Apache Hadoop, Apache Spark, Apache Hive, Apache HBase, Druid and Data Analytics Studio (DAS). Choose the VirtualBox installation type.

Fill in the form and download version 2.6.5 of the Sandbox. The Sandbox requires around 15 GB of disk space.

Once the download is complete (it might take a while), import the appliance into VirtualBox (a command-line alternative is sketched right after this list):

  • Open VirtualBox
  • Go to “File -> Import Appliance”
  • Select the Sandbox image (.ova) you just downloaded
  • Start the VM
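
If you prefer the command line, the same import can be done with VBoxManage, the CLI bundled with VirtualBox. This is only a sketch: the .ova file name and the VM name below are examples, so replace them with the ones from your own download.

# Import the appliance (the file name is an example, use your actual download)
VBoxManage import ~/Downloads/HDP_2.6.5_virtualbox.ova

# List the registered VMs to find the exact name of the imported Sandbox
VBoxManage list vms

# Start it without opening a window (replace the name with the one listed above)
VBoxManage startvm "Hortonworks Sandbox HDP" --type headless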

The first boot takes a while, so time for a break!

Access the Sandbox

The Sandbox is now up and running.

User

Once started, the Sandbox is accessible from your local machine over SSH on port 2222. We will use the pre-configured user raj_ops with the password raj_ops.
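
For convenience, you can add an entry to the ~/.ssh/config file on your host so you do not have to remember the port and user every time. This is just a suggestion, and the alias hdp-sandbox is an arbitrary name:

# ~/.ssh/config on your local machine (the Host alias is arbitrary)
Host hdp-sandbox
    HostName localhost
    Port 2222
    User raj_ops

With this entry in place, ssh hdp-sandbox is equivalent to the full command shown in the SSH section below.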

There are several other pre-configured users for the Hortonworks Sandbox.

SSH

Launch your terminal and access the Sandbox over SSH:

ssh raj_ops@localhost -p 2222

If you are asked whether you want to continue connecting (this permanently adds ‘[localhost]:2222’ to your list of known hosts), answer yes.
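
Once logged in, a couple of standard commands make a quick sanity check that the Hadoop client tools are available inside the Sandbox:

# Print the Hadoop version bundled with the HDP Sandbox
hadoop version

# Confirm which user you are logged in as (should print raj_ops)
whoami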

Ambari Dashboard

A graphical view of the services running on the Hadoop cluster is provided by Apache Ambari, which is reachable from your host at:

http://localhost:8080
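
Before opening it in a browser, you can check from a terminal on your host that the port answers (this assumes curl is installed; the services can take a few minutes to come up after boot):

# Prints the HTTP status code returned by the web UI (e.g. 200 or a redirect)
curl -s -o /dev/null -w "%{http_code}\n" http://localhost:8080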

You will be asked to log in; the raj_ops credentials mentioned above work here as well.

Once logged in, you land on a dashboard listing the cluster’s services.

You are now connected to the Sandbox and you can access the HDFS file system. We’ll dive deeper into this in the next articles.
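
As a small teaser, you can already list the HDFS home directories of the pre-configured users from the SSH session opened earlier:

# List the home directories stored in HDFS; raj_ops should appear among them
hdfs dfs -ls /user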

Conclusion: I hope this tutorial was clear and helpful. I’d be happy to answer any question you might have in the comments section.