Blog Archives

01: Spark tutorial – writing a file from a local file system to HDFS

This tutorial assumes that you have set up Cloudera as per the “cloudera quickstart vm tutorial installation” videos, which you can find by searching Google or YouTube. You can install it on VMware (non-commercial use) or on VirtualBox. I am using VMware. Cloudera requires at least 8GB RAM and 16GB is...


01B: Spark tutorial – writing to HDFS from Spark using Hadoop API

Step 1: The “pom.xml” that defines the dependencies for the Spark & Hadoop APIs. Step 2: The Spark job that writes the numbers 1 to 10 to 10 different files on HDFS. Step 3: Build the “jar” …


02: Spark tutorial – reading a file from HDFS

This extends Spark tutorial – writing a file from a local file system to HDFS.

This tutorial assumes that you have set up Cloudera as per the “cloudera quickstart vm tutorial installation” videos, which you can find by searching Google or YouTube. You can install it on VMware (non-commercial use) or on VirtualBox.


03: Spark tutorial – reading a Sequence File from HDFS

This extends Spark submit – reading a file from HDFS. A SequenceFile is a flat file consisting of binary key/value pairs. It is extensively used in MapReduce as an input/output format. Like CSV, SequenceFiles do not store metadata, hence the only schema evolution they support is appending new fields to the end...
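The point about appending fields can be illustrated with a minimal sketch. It uses only plain java.io, not the Hadoop SequenceFile API or its on-disk format, and the class name, the field names (name, age, city) and the "v1"/"v2" layouts are all hypothetical. It assumes record values are written positionally with no embedded schema, and shows why a reader that knows only the older layout can still read a value that has a new field appended at the end.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Illustration only: plain java.io, NOT the Hadoop SequenceFile API or file format.
class PositionalRecordSketch {

    // Hypothetical "v1" view of a record value: two positional fields.
    static final class Person {
        final String name;
        final int age;

        Person(String name, int age) {
            this.name = name;
            this.age = age;
        }
    }

    // Writer using the newer "v2" layout: name, age, then an appended city field.
    static byte[] encodeV2(String name, int age, String city) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bytes);
            out.writeUTF(name);
            out.writeInt(age);
            out.writeUTF(city); // new field, appended at the end
            out.flush();
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Reader that knows only the "v1" layout: reads name and age by position
    // and ignores whatever bytes follow.
    static Person decodeV1(byte[] value) {
        try {
            DataInputStream in = new DataInputStream(new ByteArrayInputStream(value));
            String name = in.readUTF();
            int age = in.readInt();
            return new Person(name, age);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Person p = decodeV1(encodeV2("alice", 30, "Sydney"));
        System.out.println(p.name + " " + p.age);
    }
}
```

Because the fields are identified only by position, inserting or removing a field in the middle would shift every later field and break the older reader. Appending at the end leaves the existing positions untouched.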


04: Running a Simple Spark Job in local & cluster modes

Step 1: Create a simple Maven Spark project using “-B” for non-interactive mode. Step 2: Import the Maven project “simple-spark” into Eclipse. Step 3: The pom.xml file should have the relevant dependency jars as shown below. Step 4: Write the simple Spark job “SimpleSparkJob.java” …


05: Spark SQL & CSV with DataFrame Tutorial

Step 1: Create a simple Maven project. Step 2: Import the “simple-spark” Maven project into Eclipse or the IDE of your choice. Step 3: Modify the pom.xml file to include 1) the relevant Spark libraries and 2) the shade plugin to create a single jar (i.e. …


05a: Spark DataFrame simple tutorial

A DataFrame is an immutable distributed collection of data like an RDD, but unlike an RDD, its data is organized into named columns, like a table in a relational database. This makes processing easier by imposing a structure onto a distributed collection of data. From Spark 2.0 onwards, …

