01: Apache Hadoop HDFS Tutorial

Step 1: Download the latest version of “Apache Hadoop common” from http://apache.claz.org/hadoop using wget, curl or a browser. This tutorial uses “http://apache.claz.org/hadoop/core/hadoop-2.7.1/”.
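The download in Step 1 can be sketched with wget as follows. This is only a minimal sketch: the tarball name assumes the usual hadoop-<version>.tar.gz naming convention, download_hadoop is a hypothetical helper, and the mirror may have changed since this tutorial was written.

```shell
# Version and mirror come from the tutorial; the tarball name is an
# assumption based on the usual hadoop-<version>.tar.gz convention.
HADOOP_VERSION="2.7.1"
HADOOP_MIRROR="http://apache.claz.org/hadoop/core"
HADOOP_TARBALL="hadoop-${HADOOP_VERSION}.tar.gz"
HADOOP_URL="${HADOOP_MIRROR}/hadoop-${HADOOP_VERSION}/${HADOOP_TARBALL}"

# Hypothetical helper: download the tarball and unpack it into the
# directory given as the first argument (defaults to $HOME).
download_hadoop() {
    target_dir="${1:-$HOME}"
    wget "$HADOOP_URL" -O "/tmp/${HADOOP_TARBALL}" &&
        tar -xzf "/tmp/${HADOOP_TARBALL}" -C "$target_dir"
}

echo "Hadoop download URL: $HADOOP_URL"
```

Calling download_hadoop "$HOME" would then leave the distribution in $HOME/hadoop-2.7.1.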

Step 2: You can set the Hadoop environment variables by appending export commands to the ~/.bashrc file.
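The lines to append might look like the sketch below. The install location $HOME/hadoop-2.7.1 is an assumption, so adjust it to wherever you unpacked Hadoop.

```shell
# Assumption: Hadoop was unpacked to $HOME/hadoop-2.7.1.
export HADOOP_HOME="$HOME/hadoop-2.7.1"
export HADOOP_CONF_DIR="$HADOOP_HOME/etc/hadoop"
export HADOOP_COMMON_HOME="$HADOOP_HOME"
export HADOOP_HDFS_HOME="$HADOOP_HOME"
export HADOOP_MAPRED_HOME="$HADOOP_HOME"
export YARN_HOME="$HADOOP_HOME"
# bin holds the hadoop and hdfs launchers; sbin holds start-dfs.sh,
# start-yarn.sh and the matching stop scripts.
export PATH="$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin"
```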

You can apply the changes to the current session by running “source ~/.bashrc” in a Unix command prompt.

Step 3: You can verify that Hadoop has been set up properly with the “hadoop version” command.
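A slightly more defensive version of this check is sketched below. It assumes only that the hadoop launcher should be on your PATH after Step 2, and hadoop_found is a hypothetical variable name.

```shell
# Print the installed Hadoop version if the launcher is on PATH.
if command -v hadoop >/dev/null 2>&1; then
    hadoop version
    hadoop_found=yes
else
    echo "hadoop not found on PATH; check HADOOP_HOME and PATH" >&2
    hadoop_found=no
fi
```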

Step 4: Set JAVA_HOME in the Hadoop file $HADOOP_HOME/etc/hadoop/hadoop-env.sh.
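The JAVA_HOME line in hadoop-env.sh might look like the sketch below. The JDK path is hypothetical and is used only when JAVA_HOME is not already set, so point it at your own JDK.

```shell
# Line for $HADOOP_HOME/etc/hadoop/hadoop-env.sh.
# The fallback JDK path is a hypothetical example.
export JAVA_HOME="${JAVA_HOME:-/usr/lib/jvm/java-8-openjdk-amd64}"
echo "JAVA_HOME=$JAVA_HOME"
```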

Step 5: Edit the Hadoop file $HADOOP_HOME/etc/hadoop/core-site.xml.

The core-site.xml file contains information such as the port number used for the Hadoop instance, the memory allocated for the file system, the memory limit for storing the data, and the size of the read/write buffers.
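A minimal core-site.xml for a single-node setup could be written as sketched below. The host, port 9000 and buffer size are assumed values, and conf_dir is a hypothetical variable that falls back to a scratch directory when HADOOP_CONF_DIR is not set, so the sketch can be tried without touching a real installation. Note that it overwrites any existing core-site.xml in that directory.

```shell
# Write into $HADOOP_CONF_DIR when set, else into a scratch directory.
conf_dir="${HADOOP_CONF_DIR:-./hadoop-conf-sketch}"
mkdir -p "$conf_dir"

# fs.defaultFS: address and port of the NameNode (assumed values).
# io.file.buffer.size: read/write buffer size in bytes (assumed value).
cat > "$conf_dir/core-site.xml" <<'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
  <property>
    <name>io.file.buffer.size</name>
    <value>131072</value>
  </property>
</configuration>
EOF
```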

Step 6: Edit the Hadoop file $HADOOP_HOME/etc/hadoop/hdfs-site.xml.

The hdfs-site.xml file contains the replication factor and the namenode and datanode paths on your local file system, that is, the places where the Hadoop infrastructure stores its data.
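A single-node hdfs-site.xml could be written as in the following sketch. The replication factor of 1 and the storage paths are assumptions, and conf_dir, data_root and HADOOP_DATA_ROOT are hypothetical names introduced for illustration.

```shell
# Write into $HADOOP_CONF_DIR when set, else into a scratch directory.
conf_dir="${HADOOP_CONF_DIR:-./hadoop-conf-sketch}"
# Hypothetical location for the namenode and datanode storage directories.
data_root="${HADOOP_DATA_ROOT:-$HOME/hadoop-data}"
mkdir -p "$conf_dir"

# Unquoted EOF so that $data_root is expanded inside the file.
cat > "$conf_dir/hdfs-site.xml" <<EOF
<?xml version="1.0" encoding="UTF-8"?>
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file://$data_root/namenode</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file://$data_root/datanode</value>
  </property>
</configuration>
EOF
```

A replication factor of 1 suits a single machine because there is only one datanode to hold each block.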

Step 6: Edit the Hadoop file $HADOOP_HOME/etc/hadoop/yarn-site.xml.

This file configures YARN (Yet Another Resource Negotiator), also known as MapReduce 2.0, for your site. YARN is a software rewrite that decouples MapReduce’s resource management and scheduling capabilities from the data processing component, enabling Hadoop to support more varied processing approaches and a broader array of applications.
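A minimal yarn-site.xml might be written as sketched here. It assumes that the only setting you need is the shuffle auxiliary service used by MapReduce jobs, and conf_dir is a hypothetical variable that falls back to a scratch directory.

```shell
# Write into $HADOOP_CONF_DIR when set, else into a scratch directory.
conf_dir="${HADOOP_CONF_DIR:-./hadoop-conf-sketch}"
mkdir -p "$conf_dir"

# Enable the shuffle service that MapReduce needs on each NodeManager.
cat > "$conf_dir/yarn-site.xml" <<'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
</configuration>
EOF
```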

Step 7: Edit the Hadoop file $HADOOP_HOME/etc/hadoop/mapred-site.xml.

This file indicates which MapReduce framework to use. First, copy mapred-site.xml.template to mapred-site.xml using the “cp” command.
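The copy and the framework setting can be sketched as follows. It assumes you want MapReduce to run on YARN, and that the shipped template holds nothing you need to keep, since the sketch overwrites the copy with a minimal configuration. conf_dir, template and target are hypothetical variable names.

```shell
# Write into $HADOOP_CONF_DIR when set, else into a scratch directory.
conf_dir="${HADOOP_CONF_DIR:-./hadoop-conf-sketch}"
mkdir -p "$conf_dir"
template="$conf_dir/mapred-site.xml.template"
target="$conf_dir/mapred-site.xml"

# Start from the shipped template when it is present.
if [ -f "$template" ]; then
    cp "$template" "$target"
fi

# Replace the contents with a configuration that selects YARN.
cat > "$target" <<'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>
EOF
```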

Step 8: Format the namenode. The namenode holds the metadata about where the actual data is stored on the datanodes.
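Formatting can be sketched as below, assuming the hdfs launcher from Step 2 is on your PATH; format_attempted is a hypothetical variable name. Formatting wipes any existing HDFS metadata, so run it only on a fresh installation.

```shell
# Initialise the namenode storage directory (destroys existing metadata).
if command -v hdfs >/dev/null 2>&1; then
    hdfs namenode -format
    format_attempted=yes
else
    echo "hdfs not found on PATH; check HADOOP_HOME and PATH" >&2
    format_attempted=no
fi
```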

Step 9: Start the Hadoop file system with “start-dfs.sh”.

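To check if all services have started, run the “jps” command.

The output should list the NameNode, DataNode, and SecondaryNameNode processes, each with its process ID.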

Ignore the warning: WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform… using builtin-java classes where applicable.
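The start and check in Step 9 can be sketched together as follows. It assumes start-dfs.sh (from $HADOOP_HOME/sbin) and jps (from the JDK) are both on your PATH, and dfs_started is a hypothetical variable name.

```shell
# Start the HDFS daemons, then list the running Java processes.
if command -v start-dfs.sh >/dev/null 2>&1 && command -v jps >/dev/null 2>&1; then
    start-dfs.sh   # starts NameNode, DataNode and SecondaryNameNode
    jps            # prints the PID and class name of each running JVM
    dfs_started=yes
else
    echo "start-dfs.sh or jps not found on PATH" >&2
    dfs_started=no
fi
```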

Step 10: Start YARN with “start-yarn.sh”.

Note: From Hadoop 2.2.0 onwards, which uses the YARN framework, there is no JobTracker. Its functionality is split between the ResourceManager and a per-application ApplicationMaster, while the NodeManager replaces the TaskTracker.

Step 11: The HDFS NameNode web interface can be accessed via a browser using the URL “http://localhost:50070”. The Hadoop cluster (YARN ResourceManager) details can be accessed via the URL “http://localhost:8088”.

Step 12: You can now create files, list files, put a file from the local file system to the Hadoop file system, get a file from the Hadoop file system to the local file system, etc.

HDFS is not a POSIX file system, so you have to use the Hadoop file system commands instead of the usual Unix tools.

To list the files:

To create a directory:

List the files again:

The listing now includes the new directory.

Create a local file, put it into the Hadoop file system, and print its contents.

Outputs: some text

To get a file from the Hadoop file system to the local file system:
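The operations in Step 12 can be sketched with the “hdfs dfs” file system shell as follows. The directory /demo and the file names under /tmp are hypothetical, hdfs_demo is a hypothetical helper, and the sketch assumes HDFS is already running.

```shell
# Hypothetical names used for illustration only.
demo_dir="/demo"
local_file="/tmp/hdfs-demo.txt"
local_copy="/tmp/hdfs-demo-copy.txt"

hdfs_demo() {
    hdfs dfs -ls /                                          # list the HDFS root
    hdfs dfs -mkdir -p "$demo_dir"                          # create a directory
    hdfs dfs -ls /                                          # list again; /demo now appears
    echo "some text" > "$local_file"                        # create a local file
    hdfs dfs -put -f "$local_file" "$demo_dir/"             # local file system -> HDFS
    hdfs dfs -cat "$demo_dir/hdfs-demo.txt"                 # print the file stored in HDFS
    rm -f "$local_copy"                                     # -get refuses to overwrite
    hdfs dfs -get "$demo_dir/hdfs-demo.txt" "$local_copy"   # HDFS -> local file system
}

# Run the demo only when the hdfs launcher is available.
if command -v hdfs >/dev/null 2>&1; then
    hdfs_demo
else
    echo "hdfs not found on PATH; skipping the demo" >&2
fi
```

The older “hadoop fs” form accepts the same options, so either launcher can be used.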

Step 13: To stop the Hadoop file system, run “stop-dfs.sh”.
