Blog Archives

01: ♥ Spark tutorial – writing a file from a local file system to HDFS

This tutorial assumes that you have set up Cloudera as per the “cloudera quickstart vm tutorial installation” YouTube videos, which you can find by searching Google or YouTube. You can install it on VMware (non-commercial use) or on VirtualBox. I am using VMware. Cloudera requires at least 8GB of RAM, and 16GB is...

Members Only Content

This content is for the members with any one of the following paid subscriptions:

45-Day-Java-JEE-Career-Companion, 90-Day-Java-JEE-Career-Companion, 180-Day-Java-JEE-Career-Companion, 365-Day-Java-JEE-Career-Companion and 2-Year-Java-JEE-Career-Companion Log In | Register | Try free FAQs | Home


01: ♥ Spark RDD joins in Scala tutorial

This tutorial extends Setting up Spark and Scala with Maven.

Step 1: Let’s take a simple example of joining a student to a department. In the SQL world this would be written as:

Step 2: Let’s create classes to represent Student and Department data.
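Since the excerpt stops before the code, here is a minimal sketch of the join semantics using plain Java collections (the full tutorial uses Scala RDDs); the Student/Department names and data below are hypothetical:

```java
import java.util.*;
import java.util.stream.*;

// Plain-Java sketch of the student-to-department join the tutorial builds with RDDs.
// SQL equivalent: SELECT s.name, d.name FROM student s JOIN department d ON s.deptId = d.id
public class StudentDeptJoin {
    // students: deptId -> student name; departments: deptId -> department name.
    public static List<String> innerJoin(Map<Integer, String> students,
                                         Map<Integer, String> departments) {
        // An inner join keeps only keys present on both sides.
        return students.entrySet().stream()
                .filter(e -> departments.containsKey(e.getKey()))
                .map(e -> e.getValue() + " -> " + departments.get(e.getKey()))
                .sorted()
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        Map<Integer, String> students = Map.of(1, "Alice", 2, "Bob", 3, "Carol");
        Map<Integer, String> departments = Map.of(1, "Physics", 2, "Maths");
        // Carol's deptId 3 has no matching department, so the inner join drops her.
        System.out.println(innerJoin(students, departments)); // [Alice -> Physics, Bob -> Maths]
    }
}
```

Spark's `rdd1.join(rdd2)` applies the same keep-only-matching-keys rule, just distributed across partitions.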

Read more ›



01: Apache Flume with JMS source (Websphere MQ) and HDFS sink

Apache Flume is used in the Hadoop ecosystem for ingesting data. In this example, let’s ingest data from Websphere MQ. Step 1: Apache Flume is config driven. The hierarchical Flume config file is flumeWebsphereMQQueue.conf. You need to define the “source“, “ … Read more ›...



01: Apache Hadoop HDFS Tutorial

Step 1: Download the latest version of “Apache Hadoop common” from http://apache.claz.org/hadoop using wget, curl or a browser. This tutorial uses “http://apache.claz.org/hadoop/core/hadoop-2.7.1/”.

Step 2: You can set Hadoop environment variables by appending the following commands to ~/.bashrc file.

You can run this in a Unix command prompt as

Step 3: You can verify if Hadoop has been setup properly with

Step 4: The file $HADOOP_HOME/etc/hadoop/hadoop-env.sh has the JAVA_HOME setting.

Read more ›



01: Learn Hadoop API by examples in Java

These Hadoop tutorials assume that you have installed Cloudera QuickStart, which has the Hadoop ecosystem components like HDFS, Spark, Hive, HBase, YARN, etc.

What is Hadoop & HDFS? Hadoop-based data hub architecture & basics | Hadoop ecosystem basics in Q&As style.

Read more ›



01a: Convert XML file To Sequence File – writing & reading – Local File System

Sequence files are good for saving raw data into HDFS. Sequence files are compressible and splittable. They are also useful for combining a number of smaller files into a single, say, 64MB or larger sequence file, as HDFS is better suited to larger files. … Read more ›...
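As a rough illustration of the flat binary key/value layout described above, here is a plain-Java sketch using length-prefixed strings (real Hadoop SequenceFiles add a header, sync markers and optional compression; the file name/content pair below is hypothetical):

```java
import java.io.*;
import java.util.*;

// Sketch of the sequence-file idea: a flat stream of binary key/value records.
public class KeyValueRecordStream {
    // Serialize each (key, value) pair as two length-prefixed UTF strings.
    public static byte[] write(Map<String, String> records) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            for (Map.Entry<String, String> e : records.entrySet()) {
                out.writeUTF(e.getKey());   // e.g. the small file's name
                out.writeUTF(e.getValue()); // e.g. the small file's contents (XML)
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    // Read the records back in the order they were written.
    public static Map<String, String> read(byte[] data) {
        Map<String, String> out = new LinkedHashMap<>();
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            while (in.available() > 0) {
                out.put(in.readUTF(), in.readUTF());
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out;
    }

    public static void main(String[] args) {
        byte[] data = write(Map.of("report.xml", "<report/>"));
        System.out.println(read(data)); // {report.xml=<report/>}
    }
}
```

Packing many small files as key/value records into one large file is exactly why sequence files suit HDFS, which performs best with a small number of large files.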



01b: Convert XML file To Sequence File – writing & reading – Hadoop File System (i.e. HDFS)

This extends Convert XML file To Sequence File – writing & reading – Local File System. Step 1: Upload “report.xml” onto HDFS, e.g. using the Cloudera HUE, to the path “/user/cloudera/report-data”. You need to create the “report-data” folder. The uploaded file on Hue: Step 2: Change the code to read...



01B: Spark tutorial – writing to HDFS from Spark using Hadoop API

Step 1: The “pom.xml” that defines the dependencies for Spark & Hadoop APIs. Step 2: The Spark job that writes numbers 1 to 10 to 10 different files on HDFS. Step 3: Build the “jar” … Read more ›...



02: ♥ Spark tutorial – reading a file from HDFS

This extends Spark tutorial – writing a file from a local file system to HDFS.

This tutorial assumes that you have set up Cloudera as per the “cloudera quickstart vm tutorial installation” YouTube videos, which you can find by searching Google or YouTube. You can install it on VMware (non-commercial use) or on VirtualBox.

Read more ›



02: Apache Flume with Custom classes for JMS Source & HDFS Sink

This post extends 01: Apache Flume with JMS source (Websphere MQ) and HDFS sink to write Flume customization code. We will customize three things: 1) a customized JMS source message converter to capture the JMS headers “JMSMessageId” and “JMSCorrelationID” into the FlumeEvent header. … Read more ›...



02: Convert XML file To Sequence File with Apache Spark – writing & reading

This extends the Convert XML file To Sequence File With Hadoop libraries, by using Apache Spark. Step 1: The pom.xml file should include the Apache Spark libraries as shown below. Step 2: The XML file report.xml. Step 3: The Java class “ … Read more ›...



02: Java to write from/to Local to HDFS File System

This extends Hadoop MapReduce Basic Tutorial and Apache Hadoop HDFS Tutorial. This could have been done on the command-line as shown below after running “start-dfs.sh” to start the name and data nodes.

The focus of this tutorial is to do the same via Java and Hadoop APIs.

Read more ›



02: Learn Spark & AVRO Write & Read in Java by example

These Hadoop tutorials assume that you have installed Cloudera QuickStart, which has the Hadoop ecosystem components like HDFS, Spark, Hive, HBase, YARN, etc. AVRO (i.e. row oriented) and Parquet (i.e. column oriented) file formats are HDFS (i.e. Hadoop Distributed File System) friendly binary data formats as they store data compressed...



02: Spark RDD grouping with groupBy & cogroup in Scala tutorial

This Spark tutorial extends Spark RDD joins in Scala tutorial and Setting up Spark and Scala with Maven. Step 1: Let’s take a simple example of joining a student to a department. In the SQL world this would be written as: Step 2: Let’s create classes to represent Student and Department...
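The difference between the two groupings can be sketched with plain Java collections (the tutorial itself uses Scala RDDs; the student/department data below is hypothetical): groupBy gathers one dataset's values per key, while cogroup lines up two datasets side by side per key.

```java
import java.util.*;
import java.util.stream.*;

// Plain-Java sketch of Spark's groupBy vs cogroup semantics on key/value data.
public class GroupingSketch {
    // groupBy: one dataset, grouped by a key (here: department name).
    public static Map<String, List<String>> groupBy(List<String[]> students) {
        // Each row is {studentName, departmentName}.
        return students.stream().collect(
                Collectors.groupingBy(s -> s[1],
                        Collectors.mapping(s -> s[0], Collectors.toList())));
    }

    // cogroup: two datasets grouped by the same key, values kept side by side.
    public static Map<String, List<List<String>>> cogroup(
            Map<String, List<String>> left, Map<String, List<String>> right) {
        Set<String> keys = new TreeSet<>(left.keySet());
        keys.addAll(right.keySet()); // a key missing on one side yields an empty group
        Map<String, List<List<String>>> out = new LinkedHashMap<>();
        for (String k : keys) {
            out.put(k, List.of(left.getOrDefault(k, List.of()),
                               right.getOrDefault(k, List.of())));
        }
        return out;
    }

    public static void main(String[] args) {
        List<String[]> students = List.of(new String[]{"Alice", "Maths"},
                                          new String[]{"Bob", "Maths"});
        System.out.println(groupBy(students)); // {Maths=[Alice, Bob]}
    }
}
```

Unlike a join, cogroup never drops a key: unmatched keys simply get an empty list on the other side.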



02a: Learn Spark writing to Avro using Avro IDL

What is Avro IDL? Avro IDL (i.e. Interface Description Language) is a high-level language to write Avro schemata. You can generate Java, C++, and Python objects from the Avro IDL files. These files generally have the “.avdl” extension. Step 1: Write the “order.avdl” … Read more ›...



03: Convert XML file To an Avro File – writing & reading

This extends the Convert XML file To Sequence File With Hadoop libraries. Avro files are schema driven & support schema evolution, which means you can add new columns & modify existing columns. Step 1: The pom.xml file should include the Apache Spark libraries as shown below. … Read more ›...



03: Create or append a file to HDFS – Hadoop API tutorial

Step 1: Create a simple maven project named “simple-hadoop“. Step 2: Import the “simple-hadoop” maven project into eclipse or the IDE of your choice. Step 3: Modify the pom.xml file to include 1) relevant Hadoop libraries 2) the shade plugin to create a single jar (i.e. … Read more ›...



03: Learn Spark & Parquet Write & Read in Java by example

These Hadoop tutorials assume that you have installed Cloudera QuickStart, which has the Hadoop ecosystem components like HDFS, Spark, Hive, HBase, YARN, etc. AVRO (i.e. row oriented) and Parquet (i.e. column oriented) file formats are HDFS (i.e. Hadoop Distributed File System) friendly binary data formats as they store data compressed...



03: Spark tutorial – reading a Sequence File from HDFS

This extends Spark submit – reading a file from HDFS. A SequenceFile is a flat file consisting of binary key/value pairs. It is extensively used in MapReduce as an input/output format. Like CSV, sequence files do not store metadata, hence the only schema evolution option is appending new fields to the end...



04: Convert XML file To an Avro File with Apache Spark – writing & reading

This extends Convert XML file To an Avro File – writing & reading. Step 1: The pom.xml file should include the Apache Spark & Avro libraries as shown below. Step 2: The report.xml file under “src/main/resources/data”. … Read more ›...



04: Create new or append to an existing AVRO file tutorial

This extends Create or append a file to HDFS – Hadoop API tutorial to write an AVRO file to HDFS. Step 1: Include the AVRO library files in the pom.xml file. Step 2: The AVRO files are schema based. … Read more ›...



04: Learn how to connect to HBase from Spark using Java API

These Hadoop tutorials assume that you have installed Cloudera QuickStart, which has the Hadoop ecosystem components like HDFS, Spark, Hive, HBase, YARN, etc. What is HBase? Apache HBase is a NoSQL database used for random and real-time read/write access to your Big Data. It is built on top of the...



04: Running a Simple Spark Job in local & cluster modes

Step 1: Create a simple maven Spark project using “-B” for non-interactive mode. Step 2: Import the maven project “simple-spark” into eclipse. Step 3: The pom.xml file should have the relevant dependency jars as shown below. Step 4: Write the simple Spark job “SimpleSparkJob.java” … Read more ›...



05: Convert XML file To an Avro File with avro-maven-plugin & Apache Spark

This extends 04: Convert XML file To an Avro File with Apache Spark – writing & reading. Instead of using the GenericRecord, let’s generate an avro schema object from the avro schema. Step 1: The pom.xml file should include the Apache Spark & … Read more ›...



05: Create or append a Sequence file to HDFS – Hadoop API tutorial

The following tutorial extends Create or append a file to HDFS – Hadoop API tutorial, and Create or append an AVRO file to HDFS – Hadoop & AVRO API tutorial. In this tutorial we will write to a Sequence file, … Read more ›...



05: Learn Hive to write to and read from AVRO & Parquet files by examples

These Hadoop tutorials assume that you have installed Cloudera QuickStart, which has the Hadoop ecosystem components like HDFS, Spark, Hive, HBase, YARN, etc. What is Apache Hive? Hive allows SQL developers to write Hive Query Language (HQL) statements that are similar to standard SQL statements. … Read more ›...



05: Spark SQL & CSV with DataFrame Tutorial

Step 1: Create a simple maven project. Step 2: Import the “simple-spark” maven project into eclipse or the IDE of your choice. Step 3: Modify the pom.xml file to include 1) relevant Spark libraries 2) the shade plugin to create a single jar (i.e. … Read more ›...



05a: Spark DataFrame simple tutorial

A DataFrame is an immutable distributed collection of data like an RDD, but unlike an RDD, the data is organized into named columns, like a table in a relational database. This makes processing easier by imposing a structure onto a distributed collection of data. From Spark 2.0 onwards, … Read more...



06: Avro Schema evolution tutorial

Q1. What do you understand by the term “AVRO schema evolution“?
A1. Schema evolution is the term used for how the store behaves when the Avro schema is changed after data has been written to the store using an older version of that schema.
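The core idea can be illustrated with a plain-Java sketch of the reader-supplies-default behaviour Avro uses when a field was added after the data was written (real Avro resolves this via the writer and reader schemas; the field names and defaults below are hypothetical):

```java
import java.util.*;

// Sketch of Avro-style schema evolution: a field added to the reader schema
// after data was written gets its declared default when old records are read.
public class SchemaEvolutionSketch {
    // Defaults declared by the "new" reader schema; "phone" was added later.
    static final Map<String, Object> READER_DEFAULTS =
            Map.of("name", "", "age", 0, "phone", "unknown");

    public static Object read(Map<String, Object> writtenRecord, String field) {
        // Old records have no "phone" entry; the reader falls back to the default.
        return writtenRecord.getOrDefault(field, READER_DEFAULTS.get(field));
    }

    public static void main(String[] args) {
        Map<String, Object> oldRecord = new HashMap<>(Map.of("name", "Alice", "age", 30));
        System.out.println(read(oldRecord, "phone")); // unknown
    }
}
```

This is why adding a new field with a default is a backward-compatible Avro change, while adding one without a default is not.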

Read more ›



06: Learn how to access Hive from Spark via SparkSQL & Dataframes by example

These Hadoop tutorials assume that you have installed Cloudera QuickStart, which has the Hadoop ecosystem components like HDFS, Spark, Hive, HBase, YARN, etc. This example extends Learn Hive to write to and read from AVRO & Parquet files by examples to access the Hive metastore via Spark SQL. … Read more...



06: Spark Streaming with Flume Avro Sink Tutorial

This extends Running a Simple Spark Job in local & cluster modes and Apache Flume with JMS source (Websphere MQ) and HDFS sink. In this tutorial a Flume agent will ingest the data from a source like JMS, HDFS, etc. and pass it to an “ … Read more ›...



07: Avro IDL (e.g. avdl) to Java objects & Avro Schemas (i.e. avsc) tutorial

Avro IDL (i.e. Interface Definition Language) schemas can be specified with two types of files: “avpr” (i.e. AVro PRotocol file) & “avdl” (i.e. AVro iDL). Step 1: Create a maven based Java project from a command-line. Step 2: Import it into eclipse as a maven project. … Read more ›...



07: Learn Spark Dataframes to do ETL in Java with examples

These Hadoop tutorials assume that you have installed Cloudera QuickStart, which has the Hadoop ecosystem components like HDFS, Spark, Hive, HBase, YARN, etc. What is a Spark Dataframe? A DataFrame is an immutable distributed collection of data like an RDD, but unlike an RDD, the data is organized into named columns...



07: spark-xml to split & read very large XML files

Processing very large XML files can be a bit tricky as they cannot be processed line by line in parallel as you would do with CSV files. An XML record has to be kept intact by matching its start and end entity tags, and if the tags are distributed in parts...



08: Learn Spark how to convert RDD in Java to Dataframe with examples

These Hadoop tutorials assume that you have installed Cloudera QuickStart, which has the Hadoop ecosystem components like HDFS, Spark, Hive, HBase, YARN, etc. Java RDD to Dataframe: the following code reads a text file, orders.txt (shown below), into a Java RDD. The Java code that reads the text file...



08: Spark writing RDDs to multiple text files & HAR to solve small files issue

We know that the following code snippets in Spark will write each JavaRDD element to a single file. What if you want to write each employee history to a separate file? Step 1: Create a JavaPairRDD from the JavaRDD. Step 2: Create a MultipleOutputFormat, … Read more ›...
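The one-output-file-per-key idea can be sketched with plain Java file I/O (the tutorial itself does this with a JavaPairRDD and a MultipleOutputFormat on HDFS; the employee records and file-name pattern below are hypothetical):

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.*;
import java.util.*;

// Sketch of writing each key's records to its own file, one file per key.
public class PerKeyFiles {
    // Writes each employee's history lines to dir/emp-<id>.txt and
    // returns the file names written, in key order.
    public static List<String> writePerKey(Path dir, Map<String, List<String>> historyByEmployee) {
        try {
            Files.createDirectories(dir);
            List<String> written = new ArrayList<>();
            for (Map.Entry<String, List<String>> e : new TreeMap<>(historyByEmployee).entrySet()) {
                String name = "emp-" + e.getKey() + ".txt"; // one file per key
                Files.write(dir.resolve(name), e.getValue());
                written.add(name);
            }
            return written;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Self-contained demo against a temporary directory.
    public static List<String> demo() {
        try {
            Path dir = Files.createTempDirectory("per-key");
            return writePerKey(dir, Map.of(
                    "1001", List.of("joined 2015", "promoted 2017"),
                    "1002", List.of("joined 2016")));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo()); // [emp-1001.txt, emp-1002.txt]
    }
}
```

In Spark the same partitioning-by-key step happens first (the JavaPairRDD), and the output format then routes each key's values to its own file.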



09: Append to AVRO from Spark with distributed ZooKeeper locking using Apache’s Curator framework

Step 1: The pom.xml file that has all the relevant dependencies to the Spark, Avro & Hadoop libraries. Step 2: The Avro schema /schema/employee.avsc file under the src/main/resources folder. Step 3: A Spark job that creates random data in an RDD named “ … Read more ›...



09: Running a Spark job on YARN cluster in Cloudera

This assumes that you have installed Cloudera QuickStart, which has the Hadoop ecosystem components like HDFS, Spark, Hive, HBase, YARN, etc. It is also important to enable the History server as per Before running a Spark job on YARN. Step 1: Open a “ … Read more ›...



1. Apache Pig Getting started

Input Data

scores.data in the folder /Users/arulk/projects, representing marks of 4 students in 3 subjects:

Calculate the max mark for each subject.

Step 1: Download Apache Pig from http://apache.mirror.digitalpacific.com.au/pig/ and extract the tar file.
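The underlying computation, the maximum mark per subject, can be sketched in plain Java (the tutorial performs it with Pig; the sample rows below are hypothetical stand-ins for scores.data):

```java
import java.util.*;
import java.util.stream.*;

// Sketch of the Pig job's logic: from (student, subject, mark) rows,
// compute the maximum mark for each subject.
public class MaxMarkPerSubject {
    public static Map<String, Integer> maxBySubject(List<String[]> rows) {
        // Each row is {student, subject, mark}.
        return rows.stream().collect(Collectors.toMap(
                r -> r[1],                    // key: subject
                r -> Integer.parseInt(r[2]),  // value: mark
                Math::max));                  // merge duplicates by keeping the maximum
    }

    public static void main(String[] args) {
        List<String[]> rows = List.of(
                new String[]{"john", "maths", "90"},
                new String[]{"jane", "maths", "85"},
                new String[]{"john", "physics", "70"});
        System.out.println(maxBySubject(rows)); // {maths=90, physics=70} (map order may vary)
    }
}
```

In Pig the same logic is a GROUP BY subject followed by MAX over each group.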

Read more ›



1. Hadoop MapReduce Basic Tutorial

Input Data & How Hadoop reads the Data

scores.data in the folder /Users/arulk/projects

Mapper Input

The Hadoop “org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat” class reads the input as key/value pairs. The default delimiter is the tab character.
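The split that KeyValueTextInputFormat performs can be sketched in plain Java: everything before the first tab on a line becomes the key, and the rest becomes the value.

```java
import java.util.AbstractMap;
import java.util.Map;

// Sketch of KeyValueTextInputFormat's per-line split on the first tab.
public class KeyValueLineParser {
    public static Map.Entry<String, String> parse(String line) {
        int tab = line.indexOf('\t');
        if (tab < 0) {
            // No delimiter: the whole line becomes the key, the value is empty.
            return new AbstractMap.SimpleEntry<>(line, "");
        }
        return new AbstractMap.SimpleEntry<>(line.substring(0, tab), line.substring(tab + 1));
    }

    public static void main(String[] args) {
        System.out.println(parse("Science\t80,75,90")); // Science=80,75,90
    }
}
```

The real input format also lets you change the delimiter via the mapreduce.input.keyvaluelinerecordreader.key.value.separator property.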

Read more ›



1. HBase Tutorial Getting Started

In standalone mode, HBase does not use HDFS — it uses the local filesystem instead — and it runs all HBase daemons and a local ZooKeeper in the same JVM. ZooKeeper binds to a well-known port so clients may talk to HBase.

Step 1: Download HBase from “http://hbase.apache.org/”.

Read more ›



10: Solving AlreadyBeingCreatedException & LeaseExpiredException thrown from your Spark jobs

What is wrong with the following Spark code snippet? You are likely to get AlreadyBeingCreatedException & LeaseExpiredException thrown as multiple executors try to either create or append to the same file in HDFS in parallel. HDFS allows only one writer. … Read more ›...



10: Spark RDDs to HBase & HBase to Spark RDDs

Step 1: pom.xml with library dependencies. It is important to note that 1) “https://repository.cloudera.com/artifactory/cloudera-repos/” is added as the “Cloudera Maven Repository” and 2) the hbase-spark dependency is used for writing to HBase from Spark RDDs & … Read more ›...



11: Spark streaming with “textFileStream” simple tutorial

Using Spark Streaming, data can be ingested from many sources like Kafka, Flume, HDFS, Unix/Windows file systems, etc. In this example, let’s run Spark in local mode to ingest data from a Unix file system. Step 1: The pom.xml file. Using textFileStream(..): textFileStream watches a directory for new...



11. What are part- files in Hadoop & 6 ways to merge them

What are the part-xxxx files generated by Hadoop? When you invoke rdd.saveAsTextFile(…) or rdd.saveAsNewAPIHadoopFile(…) from Spark you will get part- files. When you run an “INSERT INTO” command in Hive, the execution results in multiple part files in HDFS. You will have one part-xxxx file per partition in the RDD you...
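One of the simpler merge strategies (read the part- files in name order and concatenate them) can be sketched in plain Java; on a real cluster you would typically use hdfs dfs -getmerge, or coalesce/repartition the RDD before writing. The file names below are hypothetical.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.*;
import java.util.*;
import java.util.stream.*;

// Sketch of merging part-xxxx files into one output, preserving name order.
public class PartFileMerger {
    public static List<String> merge(Path dir) {
        try (Stream<Path> files = Files.list(dir)) {
            List<Path> parts = files
                    .filter(p -> p.getFileName().toString().startsWith("part-"))
                    .sorted() // part-00000, part-00001, ... in order
                    .collect(Collectors.toList());
            List<String> merged = new ArrayList<>();
            for (Path p : parts) {
                merged.addAll(Files.readAllLines(p));
            }
            return merged;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Self-contained demo against a temporary directory with two part files.
    public static List<String> demo() {
        try {
            Path dir = Files.createTempDirectory("parts");
            Files.write(dir.resolve("part-00000"), List.of("order-1"));
            Files.write(dir.resolve("part-00001"), List.of("order-2"));
            return merge(dir);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo()); // [order-1, order-2]
    }
}
```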



12: Spark streaming with “fileStream” and “PortableDataStream” simple tutorial

This extends the Spark streaming with “textFileStream” simple tutorial to use fileStream(…) and PortableDataStream. The pom.xml file is the same as in the previous Spark streaming tutorial. Step 1: Using “fileStream(…)”. What if you want to process the files already in the folder when the streaming job started?… Read more ›...



12: XML Processing in Spark with XmlInputFormat

Step 1: Read the XML snippet in between the tags “<Record>”. Upload this file to HDFS “/user/cloudera/xml/orders.xml”. Step 2: You need the XmlInputFormat class as shown below. You can find this in the Mahout library. … Read more ›...
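The record-splitting idea behind XmlInputFormat can be sketched in plain Java: carve the input into records delimited by the configured start/end tags. The real class does this over HDFS input splits; this version works on an in-memory string.

```java
import java.util.*;
import java.util.regex.*;

// Sketch of XmlInputFormat's job: split input into <Record>...</Record> chunks.
public class RecordSplitter {
    public static List<String> split(String xml) {
        List<String> records = new ArrayList<>();
        // Non-greedy match so each <Record> element becomes one record;
        // DOTALL lets a record span multiple lines.
        Matcher m = Pattern.compile("<Record>.*?</Record>", Pattern.DOTALL).matcher(xml);
        while (m.find()) {
            records.add(m.group());
        }
        return records;
    }

    public static void main(String[] args) {
        String xml = "<orders><Record>1</Record><Record>2</Record></orders>";
        System.out.println(split(xml).size()); // 2
    }
}
```

Each extracted record can then be parsed independently, which is what makes per-record parallel processing possible even though XML as a whole is not line-oriented.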



12A. Installing & getting started with Cloudera QuickStart on VMWare for windows in 17 steps

Prerequisite: At least 12GB+ RAM (i.e. 4GB+ for the operating system & 8GB+ for Cloudera), although 16GB+ is preferred, and an 80GB hard disk. Cloudera runs on CentOS, a community edition of Red Hat Enterprise Linux. The Windows system must support 64-bit.

Install VMWare for Windows

Step 1: Download the VMware Player for Windows from https://my.vmware.com/web/vmware/free and then select VMware Workstation Player.

Read more ›



13: Q98 – Q104 Hive Basics Interview Q&As and Tutorial

Q98. What is Hive? A98. Hive is used for accessing and analyzing data in Hadoop using SQL syntax, known as HiveQL. Q99. What is the difference between Hive internal tables & external tables? A99. … Read more ›...



13: Spark inner & outer joins in Java with JavaPairRDDs

RDD inner join via JavaPairRDD Here is an inner join displaying all the orders with line items. Outputs: RDD left outer join with filtering via JavaPairRDD Here is a left outer join with filtering to display all the orders without any line items. … Read more ›...



14: Spark joins with SQLContext & JavaPairRDD

This extends the last tutorial, Spark inner & outer joins in Java with JavaPairRDDs. In this tutorial let’s read the orders via a Hive table using SQLContext & Dataframe. RDD left outer join with filtering via JavaPairRDD: here is a left outer join with filtering to display all the...



15: Spark joins with Dataframes & SQLContext

Create LineItems Hive table Step 1: Create a file “line-item1.txt” on HDFS under “/user/cloudera/learn-hdfs/lineitems” as Step 2: You create a Hive table “lineitems” in the database “learnhadoop”. Step 3: The “orders” table can be created in a similar manner as shown above. … Read more ›...



2. Apache Pig: Regex (Regular expressions)

This extends the tutorial 1. Apache Pig Getting started.

Input Data

scores.xml in the folder /Users/arulk/projects, representing marks of 4 students in 3 subjects:

Step 1: Start pig in local file system mode.

Read more ›



2. Hadoop MapReduce Basic Tutorial

This extends the Part 1 tutorial 1. Hadoop MapReduce Basic Tutorial. The key difference in this tutorial is the use of a “TextInputFormat” instead of “KeyValueTextInputFormat“.

TextInputFormat reads

The key as line offset number starting from 0 and the values as “Science,
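The offsets that TextInputFormat hands to the mapper as keys are byte positions, which can be demonstrated with a small stand-alone computation (the “subject,mark” record format is assumed from the tutorial series):

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class LineOffsetSketch {
    /** Byte offset of each line's first character -- the same value
     *  TextInputFormat supplies as the mapper key. */
    static List<Long> offsets(String contents) {
        List<Long> keys = new ArrayList<>();
        long offset = 0;
        for (String line : contents.split("\n")) {
            keys.add(offset);
            offset += line.getBytes(StandardCharsets.UTF_8).length + 1; // +1 for '\n'
        }
        return keys;
    }

    public static void main(String[] args) {
        // Two sample records: the second line starts at byte 11
        // ("Science,80" is 10 bytes, plus the newline).
        System.out.println(offsets("Science,80\nMaths,90\n")); // [0, 11]
    }
}
```

KeyValueTextInputFormat, by contrast, would split each line at a separator and use the first field as the key instead of the offset.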

Read more ›



2. HBase Shell commands

This extends 1. HBase Tutorial Getting Started. You can learn HBase basics at HBase interview Questions & Answers.

Step 1: Unlike relational databases, NoSQL databases are semi-structured, so you can add new columns on the fly. In HBase, you define the table name and the column families first; new columns for a column family can then be added programmatically as data arrives.
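The “columns on the fly” idea can be modelled with nested maps: a row maps a column family to an open-ended set of qualifiers, which is roughly how HBase stores cells. This is a plain-Java sketch of the storage model, not the HBase client API:

```java
import java.util.Map;
import java.util.TreeMap;

public class HBaseRowSketch {
    // row -> column family -> qualifier -> value. Only the table name and the
    // column families are fixed up front; qualifiers appear as data arrives.
    static final Map<String, Map<String, Map<String, String>>> table = new TreeMap<>();

    static void put(String row, String family, String qualifier, String value) {
        table.computeIfAbsent(row, r -> new TreeMap<>())
             .computeIfAbsent(family, f -> new TreeMap<>())
             .put(qualifier, value);   // a brand-new qualifier needs no schema change
    }

    public static void main(String[] args) {
        put("student1", "marks", "science", "80");
        put("student1", "marks", "maths", "90");  // added "on the fly"
        System.out.println(table.get("student1").get("marks"));
        // -> {maths=90, science=80}
    }
}
```

In the real client API the equivalent operation is a `Put` with a family/qualifier/value triple; no ALTER TABLE is needed for new qualifiers.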

Read more ›



3. Apache Pig: XPath for XML

This extends the tutorial 1. Apache Pig Getting started and 2. Apache Pig: Regex (Regular expressions).

Input Data

scores.xml in the folder /Users/arulk/projects, representing the marks of 4 students in 3 subjects:

Step 1: Start pig in local file system mode.
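Before wiring an XPath expression into Pig, it can be sanity-checked with the JDK’s built-in javax.xml.xpath. The scores.xml structure shown here is an assumption; the real file is in the tutorial:

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;

public class XPathSketch {
    /** Evaluates an XPath expression against an XML string and returns
     *  the string value of the selected node. */
    static String evaluate(String xml, String expr) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        return (String) XPathFactory.newInstance().newXPath()
                .evaluate(expr, doc, XPathConstants.STRING);
    }

    public static void main(String[] args) throws Exception {
        // Layout assumed for illustration only.
        String xml = "<scores><student name=\"Sam\">"
                   + "<subject name=\"Maths\">76</subject></student></scores>";
        System.out.println(
            evaluate(xml, "/scores/student[@name='Sam']/subject[@name='Maths']"));
        // -> 76
    }
}
```

The same path expression, once verified, can be handed to Pig’s XPath UDF over the real scores.xml.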

Read more ›



3. Understanding HBase (NoSQL) database basics in 7 steps

Extends Getting started with HBase. HBase is a NoSQL, columnar database. #1. In HBase, you create the table & column families: use the “create” command to create a table with its column families. You can later use “ … Read more ›...



4. ♦ Setting up HBase with zookeeper to be used in Java via maven project

HBase is a NoSQL database used in Hadoop world to store “Big Data”. This extends Understanding HBase (NoSQL) database basics in 7 steps.

Step 1: Create a Maven-based Java project

Step 2: Import the project into Eclipse & modify the pom.xml file to add Hadoop & HBase dependencies.
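The dependency section for Step 2 might look roughly like this. The artifact versions are assumptions; match them to your Cloudera distribution:

```xml
<!-- Versions are illustrative; align them with your CDH release. -->
<dependencies>
  <dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-client</artifactId>
    <version>2.7.1</version>
  </dependency>
  <dependency>
    <groupId>org.apache.hbase</groupId>
    <artifactId>hbase-client</artifactId>
    <version>1.2.0</version>
  </dependency>
</dependencies>
```

With these on the classpath, the Java client can talk to HBase via the ZooKeeper quorum configured in hbase-site.xml.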

Read more ›



5. HBase atomic operations by examples

Step 1: Create an HBase table named “sequence_numbers” with one column family “i”. You can create this via the HBase shell. Step 2: You can use the HBase API in Java to insert a new record via “assignCurrentValue()”, and perform atomic operations in “ … Read more ›...
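HBase provides row-level atomic increments (e.g. the client API’s incrementColumnValue), so concurrent clients bumping the same counter cell never lose updates. The single-JVM analogue of that guarantee is an AtomicLong, sketched here (“assignCurrentValue()” above appears to be the tutorial author’s own helper, not shown here):

```java
import java.util.concurrent.atomic.AtomicLong;

public class SequenceSketch {
    // Single-JVM stand-in for an HBase atomic counter: many threads can call
    // next() concurrently and each receives a distinct value, just as
    // concurrent clients incrementing the same HBase cell each do.
    private final AtomicLong current = new AtomicLong(0);

    long next() {
        return current.incrementAndGet();
    }

    public static void main(String[] args) throws InterruptedException {
        SequenceSketch seq = new SequenceSketch();
        Runnable task = () -> { for (int i = 0; i < 1000; i++) seq.next(); };
        Thread t1 = new Thread(task), t2 = new Thread(task);
        t1.start(); t2.start();
        t1.join(); t2.join();
        System.out.println(seq.current.get()); // 2000 -- no lost updates
    }
}
```

Doing the same with an unsynchronized `long++` would intermittently drop increments, which is exactly the problem HBase’s atomic operations solve across a cluster.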



Before running a Spark job on a YARN cluster in Cloudera and about the Spark history server

Problem: When you run a Spark job via the “spark-submit” command on a “YARN” cluster as shown below in a terminal, it creates a folder and files in HDFS. The files will be named with “application ids” like “application_1510971789108_0015”. This will cause a permission issue, … Read more ›...



MapReduce to HBase: Read from & write to

This tutorial extends 1. HBase Tutorial Getting Started & 1. Hadoop MapReduce Basic Tutorial to read an HBase table with data from the mapper and write the max marks for each subject to another HBase table from the reducer.

Input Data: marks secured by 4 students

Step 1: Let’s store this data in the HBase database as described in tutorial 1.
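The reducer side of this job boils down to a max-per-key aggregation. It can be sketched without Hadoop, with the (subject, mark) pairs that would arrive from the mapper held in an in-memory list (the subjects and marks are illustrative):

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class MaxMarksSketch {
    /** What the reducer computes: the maximum mark seen for each subject.
     *  In the real job the pairs come from the mapper reading the HBase
     *  input table; the result row goes to the HBase output table. */
    static Map<String, Integer> maxPerSubject(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> max = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> p : pairs) {
            max.merge(p.getKey(), p.getValue(), Math::max);
        }
        return max;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> marks = List.of(
                Map.entry("science", 80), Map.entry("science", 95),
                Map.entry("maths", 90), Map.entry("maths", 85));
        System.out.println(maxPerSubject(marks)); // {science=95, maths=90}
    }
}
```

MapReduce performs the same fold per key, except the grouping by subject happens in the shuffle phase rather than inside the map.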

Read more ›



Understanding Cloudera Hadoop users

Step 1: A number of special users are created by default when installing and using CDH and Cloudera Manager. For example:

Unix user id: hdfs
groups: hdfs hadoop

Unix user id: spark
groups: spark

Unix user id: hive
groups: hive

and so on.

Read more ›


