Blog Archives

01: ♥ Spark tutorial – writing a file from a local file system to HDFS

This tutorial assumes that you have set up Cloudera by following the “cloudera quickstart vm tutorial installation” videos, which you can find by searching Google or YouTube. You can install it on VMware (non-commercial use) or on VirtualBox. I am using VMware. Cloudera requires at least 8GB RAM and 16GB is...

Members Only Content

This content is for the members with any one of the following paid subscriptions:

45-Day-Java-JEE-Career-Companion, 90-Day-Java-JEE-Career-Companion, 180-Day-Java-JEE-Career-Companion, 365-Day-Java-JEE-Career-Companion and 2-Year-Java-JEE-Career-Companion


01B: Spark tutorial – writing to HDFS from Spark using Hadoop API

Step 1: The “pom.xml” that defines the dependencies for Spark & Hadoop APIs. Step 2: The Spark job that writes numbers 1 to 10 to 10 different files on HDFS. Step 3: Build the “jar” …
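
As a rough illustration of Step 2, the sketch below writes the numbers 1 to 10 to ten separate files. It is plain Java against a local temporary directory, not the tutorial's Spark job: the class name `NumberFileWriter`, the `part-N.txt` naming, and the use of `java.nio.file` in place of the Hadoop `FileSystem` API are all assumptions made so that the example runs without a cluster.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: stands in for a job that writes one number per file.
class NumberFileWriter {

    // Writes the numbers 1..count into count files under dir and returns their paths.
    static List<Path> writeNumbers(Path dir, int count) throws IOException {
        List<Path> written = new ArrayList<>();
        for (int i = 1; i <= count; i++) {
            // Assumed naming scheme; on HDFS this would be an HDFS path instead.
            Path file = dir.resolve("part-" + i + ".txt");
            Files.write(file, String.valueOf(i).getBytes(StandardCharsets.UTF_8));
            written.add(file);
        }
        return written;
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("numbers");
        for (Path p : writeNumbers(dir, 10)) {
            String content = new String(Files.readAllBytes(p), StandardCharsets.UTF_8);
            System.out.println(p.getFileName() + " contains " + content);
        }
    }
}
```

With the Hadoop API, the local `Path` and `Files.write` calls would typically be replaced by an `org.apache.hadoop.fs.Path` and an output stream obtained from `FileSystem.create(...)`.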



02: ♥ Spark tutorial – reading a file from HDFS

This extends Spark tutorial – writing a file from a local file system to HDFS.

This tutorial assumes that you have set up Cloudera by following the “cloudera quickstart vm tutorial installation” videos, which you can find by searching Google or YouTube. You can install it on VMware (non-commercial use) or on VirtualBox.




03: Spark tutorial – reading a Sequence File from HDFS

This extends Spark submit – reading a file from HDFS. A SequenceFile is a flat file consisting of binary key/value pairs. It is used extensively in MapReduce as an input/output format. Like CSV, SequenceFiles do not store metadata, hence the only schema evolution supported is appending new fields to the end...
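
To make "a flat file consisting of binary key/value pairs" concrete, here is a minimal sketch in plain Java. It is not the Hadoop SequenceFile format (no header, sync markers, or compression) and does not use the Hadoop API; the class name `KeyValueFlatFile` and the int-key/string-value layout are assumptions chosen for illustration only.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical simplified format: a flat run of (int key, UTF string value) pairs.
class KeyValueFlatFile {

    // Serializes the records back to back; no column names or types are stored.
    static byte[] write(Map<Integer, String> records) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            for (Map.Entry<Integer, String> e : records.entrySet()) {
                out.writeInt(e.getKey());
                out.writeUTF(e.getValue());
            }
        }
        return bytes.toByteArray();
    }

    // The reader must already know the layout, because the data carries no metadata.
    static Map<Integer, String> read(byte[] data) throws IOException {
        Map<Integer, String> records = new LinkedHashMap<>();
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            while (in.available() > 0) {
                int key = in.readInt();
                String value = in.readUTF();
                records.put(key, value);
            }
        }
        return records;
    }

    public static void main(String[] args) throws IOException {
        Map<Integer, String> records = new LinkedHashMap<>();
        records.put(1, "alice,engineering");
        records.put(2, "bob,finance");
        System.out.println(read(write(records)));
    }
}
```

Because nothing in the bytes describes the fields, a reader written for the old layout only keeps working if new fields are added at the end of each value.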



04: Running a Simple Spark Job in local & cluster modes

Step 1: Create a simple Maven Spark project using “-B” for non-interactive mode. Step 2: Import the Maven project “simple-spark” into Eclipse. Step 3: The pom.xml file should have the relevant dependency jars as shown below. Step 4: Write the simple Spark job “SimpleSparkJob.java” …



05: Spark SQL & CSV with DataFrame Tutorial

Step 1: Create a simple Maven project. Step 2: Import the “simple-spark” Maven project into Eclipse or the IDE of your choice. Step 3: Modify the pom.xml file to include 1) the relevant Spark libraries and 2) the shade plugin to create a single jar (i.e. …



05a: Spark DataFrame simple tutorial

A DataFrame is an immutable distributed collection of data like an RDD, but unlike an RDD, the data is organized into named columns, like a table in a relational database. This makes processing easier by imposing a structure onto a distributed collection of data. From Spark 2.0 onwards, …



06: Spark Streaming with Flume Avro Sink Tutorial

This extends Running a Simple Spark Job in local & cluster modes and Apache Flume with JMS source (WebSphere MQ) and HDFS sink. In this tutorial a Flume sink will ingest the data from a source like JMS, HDFS, etc., and pass it to an “ …



07: spark-xml to split & read very large XML files

Processing very large XML files can be a bit tricky as they cannot be processed line by line in parallel as you would do with CSV files. The XML has to stay intact between the matching start and end entity tags, and if the tags are distributed in parts...
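
The point about keeping a record intact between its start and end tags can be sketched in plain Java. This is not how spark-xml is implemented and it does not use that library; the class `XmlRecordSplitter`, its method, and the `book` tag are hypothetical, and the sketch assumes start tags without attributes and no nesting of the same tag.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: extracts complete <tag>...</tag> records from a string.
class XmlRecordSplitter {

    // Returns every complete record, even when it spans several lines.
    static List<String> splitRecords(String xml, String tag) {
        String open = "<" + tag + ">";
        String close = "</" + tag + ">";
        List<String> records = new ArrayList<>();
        int from = 0;
        while (true) {
            int start = xml.indexOf(open, from);
            if (start < 0) {
                break; // no further start tag
            }
            int end = xml.indexOf(close, start);
            if (end < 0) {
                break; // start tag without its end tag: the record is incomplete
            }
            end += close.length();
            records.add(xml.substring(start, end));
            from = end;
        }
        return records;
    }

    public static void main(String[] args) {
        String xml = "<books>\n<book>\n<title>Spark</title>\n</book>\n"
                + "<book><title>Hadoop</title></book>\n</books>";
        // Splitting on "\n" would cut the first record into pieces;
        // matching the tags keeps each record whole.
        for (String record : splitRecords(xml, "book")) {
            System.out.println(record.replace("\n", " "));
        }
    }
}
```

A chunk that ends after a start tag but before its end tag yields no record here, which is the situation a distributed reader has to handle when a file is cut into parts.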



08: Spark writing RDDs to multiple text files & HAR to solve small files issue

We know that the following code snippets in Spark will write each JavaRDD element to a single file. What if you want to write each employee history to a separate file? Step 1: Create a JavaPairRDD from JavaRDD. Step 2: Create a MultipleOutputFormat, …



09: Append to AVRO from Spark with distributed ZooKeeper locking using Apache’s Curator framework

Step 1: The pom.xml file that has all the relevant dependencies for the Spark, Avro & Hadoop libraries. Step 2: The Avro schema file /schema/employee.avsc under the src/main/resources folder. Step 3: The Spark job that generates random data into an RDD named “ …



10: Spark RDDs to HBase & HBase to Spark RDDs

Step 1: pom.xml with library dependencies. It is important to note that 1) “https://repository.cloudera.com/artifactory/cloudera-repos/” is added as the “Cloudera Maven Repository” and 2) hbase-spark dependency is used for writing to HBase from Spark RDDs & …



11: Spark streaming with “textFileStream” simple tutorial

Using Spark Streaming, data can be ingested from many sources like Kafka, Flume, HDFS, Unix/Windows file systems, etc. In this example, let’s run Spark in local mode to ingest data from a Unix file system. Step 1: The pom.xml file. Using textFileStream(..): textFileStream watches a directory for new...
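
The idea of watching a directory for new files can be sketched without Spark. The class below is a hypothetical `DirectoryPoller` in plain Java, not the Spark Streaming API: each call to `poll()` plays the role of one batch interval, and the choice to skip files that already exist when the poller starts is an assumption of this sketch.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical stand-in for a streaming source that reports newly arrived files.
class DirectoryPoller {
    private final Path dir;
    private final Set<Path> seen = new HashSet<>();

    DirectoryPoller(Path dir) throws IOException {
        this.dir = dir;
        // Assumption: files present at start-up are treated as old and never reported.
        seen.addAll(listFiles());
    }

    // Lists the regular files currently in the directory, sorted by path.
    private List<Path> listFiles() throws IOException {
        List<Path> files = new ArrayList<>();
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(dir)) {
            for (Path p : stream) {
                if (Files.isRegularFile(p)) {
                    files.add(p);
                }
            }
        }
        Collections.sort(files);
        return files;
    }

    // Returns only the files that appeared since the previous call.
    List<Path> poll() throws IOException {
        List<Path> fresh = new ArrayList<>();
        for (Path p : listFiles()) {
            if (seen.add(p)) {
                fresh.add(p);
            }
        }
        return fresh;
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("incoming");
        DirectoryPoller poller = new DirectoryPoller(dir);
        Files.write(dir.resolve("events.txt"), "hello".getBytes());
        System.out.println("new files: " + poller.poll());
        System.out.println("new files on next poll: " + poller.poll());
    }
}
```

A real streaming job would run the poll on a timer and hand each batch of new files to the processing logic.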



12: Spark streaming with “fileStream” and “PortableDataStream” simple tutorial

This extends the Spark streaming with “textFileStream” simple tutorial to use fileStream(…) and PortableDataStream. The pom.xml file is the same as in the previous Spark streaming tutorial. Step 1: Using “fileStream(…)”. What if you want to process the files already in the folder when the streaming job started? …


