Blog Archives
01: Spark tutorial – writing a file from a local file system to HDFS

This tutorial assumes that you have set up Cloudera by following the “cloudera quickstart vm tutorial installation” videos, which you can find by searching Google or YouTube. You can install it on VMware (for non-commercial use) or on VirtualBox. I am using VMware. Cloudera requires at least 8GB RAM and 16GB is...



01B: Spark tutorial – writing to HDFS from Spark using Hadoop API

Step 1: The “pom.xml” that defines the dependencies for Spark & Hadoop APIs. Step 2: The Spark job that writes numbers 1 to 10 to 10 different files on HDFS. Step 3: Build the “jar” …



02: Spark tutorial – reading a file from HDFS

This extends Spark tutorial – writing a file from a local file system to HDFS.

This tutorial assumes that you have set up Cloudera by following the “cloudera quickstart vm tutorial installation” videos, which you can find by searching Google or YouTube. You can install it on VMware (for non-commercial use) or on VirtualBox.




03: Spark tutorial – reading a Sequence File from HDFS

This extends Spark tutorial – reading a file from HDFS. A SequenceFile is a flat file consisting of binary key/value pairs. It is used extensively in MapReduce as an input/output format. Like CSV, SequenceFiles do not store metadata, so the only schema evolution they support is appending new fields to the end...



04: Running a Simple Spark Job in local & cluster modes

Step 1: Create a simple Maven Spark project using “-B” for non-interactive mode. Step 2: Import the Maven project “simple-spark” into Eclipse. Step 3: The pom.xml file should have the relevant dependency jars as shown below. Step 4: Write the simple Spark job “SimpleSparkJob.java” …



05: Spark SQL & CSV with DataFrame Tutorial

Step 1: Create a simple Maven project. Step 2: Import the “simple-spark” Maven project into Eclipse or the IDE of your choice. Step 3: Modify the pom.xml file to include 1) the relevant Spark libraries 2) the shade plugin to create a single jar (i.e. …



05a: Spark DataFrame simple tutorial

A DataFrame is an immutable distributed collection of data like an RDD, but unlike an RDD, its data is organized into named columns, like a table in a relational database. This makes processing easier by imposing a structure onto a distributed collection of data. From Spark 2.0 onwards, …


