Step 1: Download the latest version of “Apache Hadoop Common” from http://apache.claz.org/hadoop using wget, curl, or a browser. This tutorial uses “http://apache.claz.org/hadoop/core/hadoop-2.7.1/”.
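For example, with wget, downloading the tarball and unpacking it into ~/hadoop-eco (the directory used by the HADOOP_HOME setting in the next step). The exact archive name is an assumption; adjust it to the release you actually downloaded.

$ mkdir -p ~/hadoop-eco
# archive name assumed; adjust to the release you downloaded
$ wget http://apache.claz.org/hadoop/core/hadoop-2.7.1/hadoop-2.7.1.tar.gz
$ tar -xzf hadoop-2.7.1.tar.gz -C ~/hadoop-eco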
Step 2: You can set the Hadoop environment variables by appending the following lines to your ~/.bashrc file.
export JAVA_HOME=/Library/Java/JavaVirtualMachines/jdk1.8.0_60.jdk/Contents/Home
export M3_HOME=~/tools/apache-maven-3.3.9
export HADOOP_HOME=~/hadoop-eco/hadoop-2.7.1
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export PATH=$PATH:$M3_HOME/bin:$JAVA_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
export HADOOP_INSTALL=$HADOOP_HOME
You can apply the changes in a Unix shell with:
$ source ~/.bashrc
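To double-check that the new variables are visible in your current shell, you can echo one of them (the output will show the expanded path under your home directory):

$ echo $HADOOP_HOME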
Step 3: You can verify that Hadoop has been set up properly with:
$ hadoop version
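If the environment is set up correctly, the first line of the output names the version you installed, for example:

Hadoop 2.7.1

followed by build and checksum details.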
Step 4: The file $HADOOP_HOME/etc/hadoop/hadoop-env.sh holds the JAVA_HOME setting used by the Hadoop scripts.
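If JAVA_HOME is not being picked up from your environment, you can set it explicitly in hadoop-env.sh, for example using the same JDK path as in the ~/.bashrc snippet above:

# in $HADOOP_HOME/etc/hadoop/hadoop-env.sh
export JAVA_HOME=/Library/Java/JavaVirtualMachines/jdk1.8.0_60.jdk/Contents/Home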
Step 5: Edit the file $HADOOP_HOME/etc/hadoop/core-site.xml:
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/users/arulk/tmp/hadoop</value>
    <description>A base for other temporary directories.</description>
  </property>
</configuration>
The core-site.xml file holds settings such as the port used by the Hadoop instance, the memory allocated to the file system, the memory limit for storing data, and the size of read/write buffers.
Step 6: Edit the file $HADOOP_HOME/etc/hadoop/hdfs-site.xml:
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
  <property>
    <name>dfs.name.dir</name>
    <value>file:///myhadoop/home/hadoopinfra/hdfs/namenode</value>
  </property>
  <property>
    <name>dfs.data.dir</name>
    <value>file:///myhadoop/home/hadoopinfra/hdfs/datanode</value>
  </property>
</configuration>
The hdfs-site.xml file sets the replication factor and the namenode and datanode paths on your local file system. These directories hold the data for the Hadoop infrastructure.
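If you use local paths like the ones above, it can help to create them up front and make sure they are writable by the user running Hadoop (a precaution rather than a required step):

# paths taken from the dfs.name.dir / dfs.data.dir values above
$ mkdir -p /myhadoop/home/hadoopinfra/hdfs/namenode
$ mkdir -p /myhadoop/home/hadoopinfra/hdfs/datanode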
Step 7: Edit the file $HADOOP_HOME/etc/hadoop/yarn-site.xml:
<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
</configuration>
This configures YARN (Yet Another Resource Negotiator), also known as MapReduce 2.0, for your site. YARN is a software rewrite that decouples MapReduce’s resource management and scheduling capabilities from the data processing component, enabling Hadoop to support more varied processing approaches and a broader array of applications.
Step 8: Edit the file $HADOOP_HOME/etc/hadoop/mapred-site.xml:
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>
This file indicates which MapReduce framework to use. It does not exist by default, so first copy mapred-site.xml.template to mapred-site.xml using the “cp” command, as shown below.
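Assuming the template shipped with Hadoop 2.7.1 is in place, the copy looks like this:

$ cp $HADOOP_HOME/etc/hadoop/mapred-site.xml.template $HADOOP_HOME/etc/hadoop/mapred-site.xml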
Step 9: Format the namenode. The namenode holds the metadata about where the actual data is stored on the datanodes.
$ hdfs namenode -format

Hadoop HDFS
Step 10: Start the Hadoop file system with “start-dfs.sh”.
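The script lives in $HADOOP_HOME/sbin, which is already on the PATH from Step 2:

$ start-dfs.sh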
To check if all services have started:
$ jps
Which outputs:
9618 NameNode
10075 Jps
9707 DataNode
9820 SecondaryNameNode
Ignore the warning: WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform… using builtin-java classes where applicable.
Step 11: Start YARN with “start-yarn.sh”.
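As with start-dfs.sh, the script is in $HADOOP_HOME/sbin:

$ start-yarn.sh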
Note: From Hadoop 2.2.0 onwards, with the YARN framework, there is no JobTracker. Its functionality is split between the ResourceManager and the NodeManager.
$ jps
10290 Jps
9618 NameNode
10213 NodeManager
10121 ResourceManager
9707 DataNode
9820 SecondaryNameNode
Step 12: Hadoop can be accessed via a browser at “http://localhost:50070“ (the NameNode web UI). The Hadoop cluster details can be accessed via “http://localhost:8088“ (the ResourceManager web UI).
Step 13: You can now create directories, list files, put a file from the local file system into the Hadoop file system, get a file from the Hadoop file system back to the local file system, and so on.
HDFS is not a POSIX file system, so you have to go through the Hadoop tooling, such as the “hadoop fs” shell commands shown below.
To list the files:
$ hadoop fs -ls /
To create a directory:
$ hadoop fs -mkdir /hadoop-tutorial
List the files again:
$ hadoop fs -ls /
You will get:
Found 1 items
drwxr-xr-x - hadoop supergroup 0 2016-01-27 05:55 /hadoop-tutorial
Create a local file and put it into the Hadoop file system:
$ touch ~/hadoop.test
$ echo "some text" > ~/hadoop.test
$ hadoop fs -put ~/hadoop.test /hadoop-tutorial/hadoop.test
$ hadoop fs -cat /hadoop-tutorial/hadoop.test
Outputs: some text
To get a file from the Hadoop file system to the local file system:
$ hadoop fs -get /hadoop-tutorial/hadoop.test ~/hadoop.copy
$ cat ~/hadoop.copy
Step 14: To stop the Hadoop file system:
$ stop-dfs.sh
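If you also started YARN in Step 11, it can be stopped with the companion script:

$ stop-yarn.sh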