Instead of using the Hortonworks Sandbox, we can also execute MapReduce jobs locally.
1) Download the Hadoop compressed archive from Apache’s website.
2) Extract the archive and place the resulting folder in your home directory: /Users/yourname
3) Create a working folder (for example, on your desktop).
4) Create a short text file called
5) Test the Hadoop MapReduce utility by running the following command in your terminal:
mapred streaming -input Path-To-Input-File/file.txt -output Path-To-Input-File/Output -mapper /bin/cat -reducer /usr/bin/wc
If the MapReduce utility works correctly, an output folder is created with 2 new files inside it. If you run into a problem, make sure the JAVA_HOME environment variable is set (refer to the Java SDK download).
6) To submit MapReduce jobs, create 2 files as before, mapper.py and reducer.py:
#!/usr/bin/python3
import sys

def main(argv):
    for line in sys.stdin:
        wordlist = line.strip().split()
        for word in wordlist:
            print(word+"\t"+"1")

if __name__ == "__main__":
    main(sys.argv)
#!/usr/bin/python3
import sys

def main(argv):
    current_word = None
    wordcount = 0
    for line in sys.stdin:
        word, n = line.strip().split("\t",1)
        n = int(n)
        if current_word == word:
            wordcount += n
        else:
            if current_word:
                print(current_word+"\t"+str(wordcount))
            wordcount = n
            current_word = word
    print(current_word+"\t"+str(wordcount))

if __name__ == "__main__":
    main(sys.argv)
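Before submitting the job, it can help to check the mapper and reducer logic without Hadoop. The sketch below is only a simulation under stated assumptions, not part of Hadoop: the function names `map_words`, `reduce_counts`, and `run_local` are hypothetical, and Python's `sorted` stands in for the sort that Hadoop Streaming performs between the mapper and the reducer.

```python
from itertools import groupby


def map_words(lines):
    """Mimic mapper.py: emit one "word<TAB>1" record per word."""
    for line in lines:
        for word in line.strip().split():
            yield word + "\t1"


def reduce_counts(sorted_records):
    """Mimic reducer.py: sum the counts of consecutive identical words."""
    pairs = (record.split("\t", 1) for record in sorted_records)
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        yield word + "\t" + str(sum(int(n) for _, n in group))


def run_local(lines):
    """Chain map, sort, and reduce the way a streaming job would."""
    return list(reduce_counts(sorted(map_words(lines))))


if __name__ == "__main__":
    # Hypothetical sample input, used only for illustration.
    sample = ["the quick brown fox", "the lazy dog", "the end"]
    for record in run_local(sample):
        print(record)
```

The reducer only works because identical words arrive next to each other, which is why the sort step matters. On a Unix-like shell, a comparable check on the real scripts would be to pipe the input file through mapper.py, then `sort`, then reducer.py.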
You should now be able to execute this MapReduce job with the following command:
mapred streaming -input Path-To-Input-File/file.txt -output Path-To-Input-File/Output -mapper "python Path-To-Mapper/mapper.py" -reducer "python Path-To-Reducer/reducer.py"
I’m explicitly invoking Python to run the mapper and reducer files, which avoids a permission-denied error if the scripts are not marked as executable.
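The same command can also be assembled from Python, which makes the placeholder paths explicit. This is a rough sketch rather than an official Hadoop interface: `build_streaming_command` and `run_streaming_job` are hypothetical helpers, the paths in the example are placeholders, and only the `-input`, `-output`, `-mapper`, and `-reducer` options from the command above are used. Hadoop normally refuses to start a job whose output directory already exists, so the output path should not exist yet.

```python
import shlex
import subprocess


def build_streaming_command(input_path, output_path, mapper_path, reducer_path,
                            interpreter="python"):
    """Return the argument list for the streaming command used above."""
    return [
        "mapred", "streaming",
        "-input", input_path,
        "-output", output_path,
        # The interpreter is named explicitly so the scripts do not need
        # to be marked executable.
        "-mapper", f"{interpreter} {mapper_path}",
        "-reducer", f"{interpreter} {reducer_path}",
    ]


def run_streaming_job(command):
    """Run the job; raises CalledProcessError if the command exits with an error."""
    subprocess.run(command, check=True)


if __name__ == "__main__":
    # Placeholder paths; replace them with real locations before running.
    cmd = build_streaming_command("input/file.txt", "output",
                                  "mapper.py", "reducer.py")
    # Print the shell-quoted command instead of running it.
    print(shlex.join(cmd))
```

Passing the arguments as a list avoids shell quoting problems with the mapper and reducer strings, which each contain a space.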
You should now see an output folder with 2 files: a _SUCCESS file, and part-00000, which contains the output of the WordCount.
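To inspect the result programmatically, the part file can be parsed back into a dictionary. This is a minimal sketch, assuming the output folder is on the local filesystem and each line has the `word<TAB>count` shape that the reducer prints; `load_counts` is a hypothetical helper name, and the `part-*` pattern is an assumption meant to cover jobs that write more than one part file.

```python
from pathlib import Path


def load_counts(output_dir):
    """Collect word counts from every part file in a streaming output folder."""
    counts = {}
    for part in sorted(Path(output_dir).glob("part-*")):
        with part.open(encoding="utf-8") as handle:
            for line in handle:
                line = line.rstrip("\n")
                if not line:
                    continue  # skip blank lines
                word, n = line.split("\t", 1)
                # Add up counts in case a word appears in several part files.
                counts[word] = counts.get(word, 0) + int(n)
    return counts
```

With the placeholder path from the command above, a call would look like `load_counts("Path-To-Input-File/Output")`.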
Conclusion: I hope this short tutorial was helpful. I’d be happy to answer any questions you might have in the comments section.