Execute MapReduce Job in Python locally -

Alternatively, instead of using HortonWorks Sandbox, we can execute MapReduce jobs locally.

1) First of all, download the Hadoop compressed file from Apache’s website

2) Unzip this file, and put it at your root: Users/yourname

3) Create a folder on your working directory (for example on your desktop).

4) Create a short text file called file.txt

5) Test the Hadoop MapReduce utility by running the following command in your terminal :

mapred streaming -input Path-To-Input-File/file.txt -output Path-To-Input-File/Output -mapper /bin/cat -reducer /usr/bin/wc

If the MapReduce utility works correctly, you should have an output folder created, and 2 new files inside of it. If you have a problem, make sure you have the environment variable JAVA_HOME set (refer to Java SKD download).

6) To submit MapReduce Jobs, as previously, create 2 files: mapper.py and reducer.py.

mapper.py :

#!/usr/bin/python3
import sys
def main(argv):
    for line in sys.stdin:
        wordlist = line.strip().split()
        for word in wordlist:
            print(word+"\t"+"1")

if __name__ == "__main__":
    main(sys.argv)

reducer.py :

#!/usr/bin/python3
import sys
def main(argv):
    current_word = None
    wordcount = 0
    for line in sys.stdin:
        word, n = line.strip().split("\t",1)
        n = int(n)
        if current_word == word:
            wordcount += n
        else:
            if current_word:
                print(current_word+"\t"+str(wordcount))
                wordcount = n
            current_word = word
            print(current_word+"\t"+str(wordcount))

if __name__ == "__main__":
    main(sys.argv)

You should now be able to execute this MapReduce job with the following command :

mapred streaming -input Path-To-Input-File/file.txt -output Path-To-Input-File/Output -mapper "python Path-To-Mapper/mapper.py" -reducer "python Path-To-Reducer/reducer.py"

I’m explicitly specifying the Python instruction to execute the mapper and reducer files to avoid access denial.

You should now observe an output with 2 files: a Sucess file, and part-0000 which contains the output of the WordCount.

Conclusion: I hope this short tutorial was helpful. I’d be happy to answer any question you might have in the comments section.

Execute MapReduce Job in Python locally

Maël Fabien