08. Setting up & getting started with Hadoop on Mac

This tutorial outlines the basic steps to get started with Hadoop on macOS. We will set up a single-node cluster with YARN as the resource manager.

1. Install Xcode

Xcode can be installed via the Apple App Store. Xcode is Apple's Integrated Development Environment (IDE): a large suite of software development tools and libraries from Apple.

2. Install the Apple command line tools

Once Xcode is installed, install the command line tools via the "Xcode" menu –> "Preferences" –> "Command Line Tools", and click the install button. This may take a while. Once installed, you can verify the installation from a Terminal window.
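For example (the version numbers will differ on your machine):

xcode-select -p
gcc --version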

The Xcode Command Line Tools are part of Xcode. They include a GCC compiler, which many common Unix-based tools require.

3. Install homebrew

Homebrew is a package manager for macOS. It is installed from a Terminal window.
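The exact install command changes over time, so check https://brew.sh for the current one; a recent version looks like this:

/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"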

Verify that brew is installed properly from a Terminal window.
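Either of these is a quick check:

brew --version
brew doctor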

4. Install Java

Hadoop is open-source software built on Java, so you need a JDK installed.
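Hadoop 2.7.x runs on Java 7 or 8. Assuming you install a JDK 8 (e.g. from Oracle or AdoptOpenJDK), you can verify it with:

java -version
/usr/libexec/java_home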

Also, you will need Scala for Spark if you choose to code Spark in Scala. You can also code in Java.

5. Install wget

Wget is a handy tool to download files from the internet using the http, https or ftp protocols.
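Install it with Homebrew:

brew install wget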

You can check which folder brew installed it into.
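For example:

brew list wget
which wget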

6. Download Hadoop with wget

(Screenshot: the Spark download page showing which Hadoop version it is pre-built for.)

As we were using Spark 2.3.2 in the earlier tutorials, let's download Hadoop 2.7.7 from http://mirror.nohup.it/apache/hadoop/common.
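A possible sequence of commands (the exact mirror directory and archive name may differ; any Apache mirror that carries 2.7.7 will do):

cd ~/Downloads
wget http://mirror.nohup.it/apache/hadoop/common/hadoop-2.7.7/hadoop-2.7.7.tar.gz
tar -xzf hadoop-2.7.7.tar.gz
sudo mv hadoop-2.7.7 /usr/local/hadoop-2.7.7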

Hadoop will be installed in "/usr/local/hadoop-2.7.7", and this path is reflected in all the steps outlined below.

Note: Alternatively, you can use "brew install hadoop". Hadoop will then be installed under "/usr/local/Cellar/hadoop".

7. Make sure that the folders have the right permissions
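For example, assuming the install path from step 6 and that your own user should own it:

sudo chown -R $(whoami) /usr/local/hadoop-2.7.7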

8. Configure environment variables for Hadoop

Open the ~/.bashrc or ~/.bash_profile file in your home directory and add the following environment variables.
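A minimal set, assuming the install path from step 6 (adjust HADOOP_HOME if you installed via brew):

export HADOOP_HOME=/usr/local/hadoop-2.7.7
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin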

Reload the configuration into memory so that the new environment variables take effect.
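Depending on which file you edited:

source ~/.bash_profile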

Test it with:
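For example:

hadoop version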

Output: the Hadoop version ("Hadoop 2.7.7") along with the build details.

9. Check hadoop-env.sh

Open "$HADOOP_HOME/etc/hadoop/hadoop-env.sh" and set JAVA_HOME as shown below:
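A common way is to delegate to the macOS java_home utility:

export JAVA_HOME=$(/usr/libexec/java_home)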

"/usr/libexec/java_home" is a macOS utility that allows you to easily generate the path to the installed JDK.

10. Hadoop config core-site.xml

The "core-site.xml" (under "$HADOOP_HOME/etc/hadoop") should look like the following.
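A minimal single-node configuration, pointing the default file system at hdfs://localhost:9000:

<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>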

11. Hadoop config hdfs-site.xml

In "hdfs-site.xml" we set the replication factor to 1, which means HDFS will not keep multiple copies of data blocks in our single-node cluster. This is not recommended for production environments, where it should be set to at least 3 to prevent data loss in case of node failures.
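A minimal sketch:

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>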

12. Hadoop resource manager config yarn-site.xml
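A minimal single-node "yarn-site.xml", enabling the MapReduce shuffle service on the NodeManager:

<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
</configuration>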

13. Hadoop mapred-site.xml

YARN will take care of resource management on our single-node cluster, so MapReduce is configured to run on top of YARN.
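A minimal "mapred-site.xml" (in Hadoop 2.7 you may need to copy "mapred-site.xml.template" to "mapred-site.xml" first):

<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>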

14. Format the name node

Before starting the daemons, you must format the NameNode so it starts fresh.
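For example:

hdfs namenode -format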

15. SSH to localhost

In order to connect to the Hadoop cluster, you need to set up a secure (SSH) connection to localhost. Open System Preferences -> Sharing and enable the Remote Login option.

If you have not already done so, generate a public/private key pair and append the public key to "authorized_keys" as shown below. The keys will be generated in the "~/.ssh" folder, where "~" means your Unix home directory.
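For example (skip the key generation if "~/.ssh/id_rsa" already exists):

ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 0600 ~/.ssh/authorized_keys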

Now see if you can ssh to localhost.
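If everything is set up, this should log you in without asking for a password:

ssh localhost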

If you can ssh, you are good to move to the next step.

16. Start the namenode, datanode and the secondary namenode
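The start script lives in "$HADOOP_HOME/sbin", which was added to the PATH in step 8:

start-dfs.sh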

17. jps command to list the Java processes
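"jps" ships with the JDK:

jps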

Output: you should see the NameNode, DataNode and SecondaryNameNode processes (plus Jps itself) listed with their process ids.

18. start-yarn.sh to start NodeManager & ResourceManager
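Also from "$HADOOP_HOME/sbin":

start-yarn.sh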

Output: running "jps" again should now also list the ResourceManager and NodeManager processes.

You can also install telnet.
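For example, via Homebrew:

brew install telnet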

Then test if you can connect to the name node.
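For example:

telnet localhost 9000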

The host & port are configured with the value “hdfs://localhost:9000” in the “core-site.xml” file.

We are now ready to run Hadoop command-line operations with either "hadoop fs -xxxx …." or "hdfs dfs -xxxx …." commands. Refer to the "Hadoop Commands Guide" for the full list of possible commands and options.

19. Create a new folder on HDFS
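For example (the folder names below are just illustrative):

hdfs dfs -mkdir -p /user/$(whoami)
hdfs dfs -mkdir /user/$(whoami)/input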

20. Add a file from local file system to HDFS
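For example, assuming a small "test.csv" in the current local directory and the folders created in step 19:

hdfs dfs -put test.csv /user/$(whoami)/input/
hdfs dfs -ls /user/$(whoami)/input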

The "test.csv" file is just a small sample file of comma-separated records.

21. The Web UIs

Hadoop NameNode web UI: http://localhost:50070/

YARN ResourceManager web UI: http://localhost:8088

NodeManager web UI (single node): http://localhost:8042/node

22. stop-xxx.sh to stop the servers
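For example:

stop-yarn.sh
stop-dfs.sh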

23. start-all.sh & stop-all.sh

You can use the following commands to start or stop both the HDFS & YARN services.
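These wrapper scripts are deprecated in Hadoop 2.x but still work; they simply invoke the dfs and yarn scripts for you:

start-all.sh
stop-all.sh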

In the previous tutorials I have covered how to install Scala, Scala IDE for Eclipse, Sbt (i.e. Scala Build Tool), and Spark.

In the next tutorials we will use:

Spark to read a file from HDFS (i.e. the Hadoop Distributed File System): Getting started with Spark & Hadoop – client & cluster modes.

Hive to process data in HDFS: Setting up & getting started with Hive.

Learn more about Hadoop in Q&As style:

Hadoop overview Interview FAQs

HDFS Interview FAQs

Hadoop MapReduce Interview FAQs

