This tutorial outlines the basic steps to get started with Hadoop on Mac OS. This is a single-node cluster with YARN as the resource manager.
1. Install Xcode
Xcode can be installed from the Mac App Store. It is Apple’s Integrated Development Environment (IDE), a large suite of software development tools and libraries from Apple.
2. Install the Apple command line tools
Once Xcode is installed, install the command line tools via the “Xcode” menu –> “Preferences” –> “Command Line Tools”, and click the Install button. This may take a while. Once installed, you can verify from a Terminal window:
$ xcode-select -h
The Xcode Command Line Tools are part of Xcode. They include a GCC compiler, which many common Unix-based tools require.
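Alternatively, the Command Line Tools can usually be installed directly from the Terminal without opening Xcode (the exact prompt you see varies by OS X version):

$ xcode-select --install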
3. Install homebrew
Homebrew is a package manager for macOS. In a Terminal window, type:
$ ruby -e "$(curl -fsSL https://raw.github.com/mxcl/homebrew/go)"
Verify that brew is installed properly by typing the following in a Terminal window:
$ brew doctor
4. Install Java
Hadoop is open-source software built on Java, so a Java installation is required.
$ brew cask install java
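After the installation you can confirm which Java version is available from the Terminal:

$ java -version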
You will also need Scala if you choose to write your Spark code in Scala; you can also code in Java.
$ brew install scala
5. Install wget
Wget is a handy tool to download files from the internet using the HTTP, HTTPS, or FTP protocols.
$ brew install wget
You can check which folder brew installed it in with:
$ brew info wget
6. Download Hadoop with wget
As we were using Spark 2.3.2 in the earlier tutorials, let’s download Hadoop 2.7.7 from http://mirror.nohup.it/apache/hadoop/common.
$ cd /usr/local
$ sudo wget http://mirror.nohup.it/apache/hadoop/common/hadoop-2.7.7/hadoop-2.7.7.tar.gz
$ sudo tar xvzf hadoop-2.7.7.tar.gz
$ sudo rm -f hadoop-2.7.7.tar.gz
Hadoop will be installed in “/usr/local/hadoop-2.7.7”, and this path is used in all the steps outlined below.
Note: Alternatively, you can use “brew install hadoop”, in which case Hadoop will be installed under “/usr/local/Cellar/hadoop”.
7. Make sure that the folders have the right permissions
$ sudo chown -R <user>:admin /usr/local/hadoop-2.7.7/
8. Configure environment variables for Hadoop
Open the ~/.bashrc or ~/.bash_profile file in your home directory and add the following environment variables.
## java variables
export JAVA_HOME=$(/usr/libexec/java_home)

## hadoop variables
export HADOOP_HOME=/usr/local/hadoop-2.7.7
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib"
export PATH=$HADOOP_HOME/bin:$PATH
export PATH=$HADOOP_HOME/sbin:$PATH
Reload the configuration so the new environment variables take effect:
$ source ~/.bash_profile
Test it with:
$ echo $HADOOP_HOME
Output:
/usr/local/hadoop-2.7.7
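You can also confirm that the Hadoop binaries added to the PATH above are picked up:

$ hadoop version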
9. Check hadoop-env.sh
$ cd $HADOOP_HOME/etc/hadoop
$ vi hadoop-env.sh
and set JAVA_HOME as shown below:
# The java implementation to use.
export JAVA_HOME=$(/usr/libexec/java_home)
“/usr/libexec/java_home” is an OS X utility that returns the path to the installed JDK.
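You can run the utility directly to see the path it resolves; with the -v flag it can also request a specific major version (1.8 here is just an example and only works if that JDK is installed):

$ /usr/libexec/java_home
$ /usr/libexec/java_home -v 1.8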
10. Hadoop config core-site.xml
$ cd $HADOOP_HOME/etc/hadoop
$ sudo vi core-site.xml
The “core-site.xml” should look like
<configuration>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/usr/local/hadoop-2.7.7/hdfs/tmp</value>
    <description>base for other temp directories</description>
  </property>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
11. Hadoop config hdfs-site.xml
$ cd $HADOOP_HOME/etc/hadoop
$ sudo vi hdfs-site.xml
In “hdfs-site.xml” we set the replication factor to 1, which means HDFS does not keep multiple copies of data blocks in our single-node cluster. This is not recommended for production environments, where it should be set to at least 3 to prevent data loss in case of node failures.
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
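Once the cluster is running and files are stored in HDFS, you can verify or change the replication factor of a file from the command line. A small sketch, using the /data/marks/test.csv file created later in this tutorial as the example path:

$ hdfs dfs -stat %r /data/marks/test.csv      # show the file's replication factor
$ hdfs dfs -setrep -w 1 /data/marks/test.csv  # set it explicitly and wait for completion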
12. Hadoop resource manager config yarn-site.xml
$ cd $HADOOP_HOME/etc/hadoop
$ sudo vi yarn-site.xml
<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
    <value>org.apache.hadoop.mapred.ShuffleHandler</value>
  </property>
</configuration>
13. Hadoop mapred-site.xml
$ cd $HADOOP_HOME/etc/hadoop
$ sudo vi mapred-site.xml
YARN will take care of resource management on our single node cluster.
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>yarn</value>
  </property>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>
14. Format the name node
Before starting the daemons, you must format the NameNode so it starts fresh.
$ hdfs namenode -format
15. SSH to localhost
In order to connect to the Hadoop cluster, you need to set up a secure connection. Open System Preferences -> Sharing and enable the Remote Login option.
If you have not already done so, generate a public/private key pair and append the public key to authorized_keys as shown below. The keys are generated in the “~/.ssh” folder (“~” means your Unix home directory).
$ ssh-keygen -t rsa
$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
Now see if you can ssh to the localhost.
$ ssh localhost
If you can ssh, you are good to move to the next step.
16. Start the namenode, datanode and the secondary namenode
$ start-dfs.sh
17. jps command to list the Java processes
$ jps -lm
Output:
3585 org.apache.hadoop.hdfs.server.namenode.NameNode
3683 org.apache.hadoop.hdfs.server.datanode.DataNode
4377 sun.tools.jps.Jps -lm
3805 org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode
18. start-yarn.sh to start NodeManager & ResourceManager
$ start-yarn.sh
$ jps -lm
Output:
3585 org.apache.hadoop.hdfs.server.namenode.NameNode
3938 org.apache.hadoop.yarn.server.resourcemanager.ResourceManager
3683 org.apache.hadoop.hdfs.server.datanode.DataNode
4377 sun.tools.jps.Jps -lm
4041 org.apache.hadoop.yarn.server.nodemanager.NodeManager
3805 org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode
You can also install telnet with
$ brew install telnet
and test if you can connect to the name node with:
$ telnet localhost 9000
The host & port are configured with the value “hdfs://localhost:9000” in the “core-site.xml” file.
We are now ready to run Hadoop file system commands with either “hadoop fs -xxxx ….” or “hdfs dfs -xxxx ….”. Refer to the “Hadoop Commands Guide” for the full list of possible commands and options.
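A few of the most common file system commands look like this (the /path/to/file shown is just a placeholder):

$ hdfs dfs -ls /                 # list the HDFS root directory
$ hdfs dfs -cat /path/to/file    # print a file's contents
$ hdfs dfs -get /path/to/file .  # copy a file from HDFS to the local directory
$ hdfs dfs -rm /path/to/file     # delete a file from HDFS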
19. Create a new folder on HDFS
$ hdfs dfs -mkdir -p /user/<user>
$ hdfs dfs -ls /
20. Add a file from local file system to HDFS
$ touch test.csv
$ vi test.csv
The “test.csv” file data looks like
Maths, 85
English, 94
Science, 75
$ hdfs dfs -mkdir -p /data/marks
$ hdfs dfs -put ./test.csv /data/marks
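To confirm the upload worked, list the target folder and read the file back from HDFS:

$ hdfs dfs -ls /data/marks
$ hdfs dfs -cat /data/marks/test.csv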
21. The Web UIs
NameNode web interface http://localhost:50070/
YARN ResourceManager http://localhost:8088
NodeManager http://localhost:8042/node
22. stop-xxx.sh to stop the servers
$ stop-yarn.sh
$ stop-dfs.sh
23. start-all.sh & stop-all.sh
You can use the following commands to start or stop both the HDFS and YARN services:
$ stop-all.sh
$ start-all.sh
In the previous tutorials I have covered how to install Scala, Scala IDE for Eclipse, Sbt (i.e. Scala Build Tool), and Spark.
In the next tutorials we will use:
Spark to read a file from HDFS (i.e. the Hadoop Distributed File System): Getting started with Spark & Hadoop – client & cluster modes
Hive to process data in HDFS: Setting up & getting started with Hive