Blog Archives

01: Spark tutorial – writing a file from a local file system to HDFS

This tutorial assumes that you have set up Cloudera as per the “cloudera quickstart vm tutorial installation” videos, which you can find by searching Google or YouTube. You can install it on VMWare …
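
For orientation, here is a minimal sketch of the end goal using the Hadoop FileSystem API; the quickstart host name and file paths are assumptions, not the post’s exact values:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CopyToHdfs {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://quickstart.cloudera:8020"); // assumed quickstart VM address
        FileSystem fs = FileSystem.get(conf);
        // Copy /tmp/sample.txt from the local file system into HDFS
        fs.copyFromLocalFile(new Path("/tmp/sample.txt"), new Path("/user/cloudera/sample.txt"));
        fs.close();
    }
}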



01B: Spark tutorial – writing to HDFS from Spark using Hadoop API

Step 1: The “pom.xml” that defines the dependencies for Spark & Hadoop APIs.
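
As an illustration, the core of such a pom.xml might declare the Spark and Hadoop client dependencies along these lines (the versions shown are placeholders, not necessarily the post’s):

<dependencies>
    <!-- Spark core for the Java RDD API -->
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-core_2.11</artifactId>
        <version>2.2.0</version>
    </dependency>
    <!-- Hadoop client for the HDFS FileSystem API -->
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-client</artifactId>
        <version>2.7.3</version>
    </dependency>
</dependencies>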



02: Spark tutorial – reading a file from HDFS

This extends Spark tutorial – writing a file from a local file system to HDFS.

This tutorial assumes that you have set up Cloudera as per “cloudera quickstart vm tutorial …
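
A minimal sketch of the read side, assuming the quickstart VM host name and a path written by the previous tutorial:

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class ReadFromHdfs {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("ReadFromHdfs").setMaster("local[2]");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            // Path written by the earlier "writing to HDFS" tutorial (assumed)
            JavaRDD<String> lines = sc.textFile("hdfs://quickstart.cloudera:8020/user/cloudera/sample.txt");
            lines.collect().forEach(System.out::println); // bring results to the driver and print
        }
    }
}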



03: Spark tutorial – reading a Sequence File from HDFS

This extends Spark tutorial – reading a file from HDFS. A SequenceFile is a flat file consisting of binary key/value pairs. It is extensively used in MapReduce as an input/output format. Like …
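
As a rough sketch, a SequenceFile loads as a pair RDD; the path and the Text key/value types below are assumptions and must match whatever the file was written with:

import org.apache.hadoop.io.Text;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class ReadSequenceFile {
    public static void main(String[] args) {
        try (JavaSparkContext sc = new JavaSparkContext(
                new SparkConf().setAppName("ReadSeq").setMaster("local[2]"))) {
            // Writable key/value classes must match those the file was written with
            JavaPairRDD<Text, Text> pairs = sc.sequenceFile(
                "hdfs://quickstart.cloudera:8020/user/cloudera/data.seq", Text.class, Text.class);
            // Hadoop reuses Writable instances, so convert to Strings before collecting
            pairs.map(t -> t._1().toString() + " -> " + t._2().toString())
                 .collect()
                 .forEach(System.out::println);
        }
    }
}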



04: Running a Simple Spark Job in local & cluster modes

Step 1: Create a simple Maven Spark project using “-B” for non-interactive mode.
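
For example, such a command might look like this, followed by spark-submit runs in local and cluster modes (the group/artifact IDs, main class, and jar name are placeholders):

mvn archetype:generate -B \
    -DarchetypeGroupId=org.apache.maven.archetypes \
    -DarchetypeArtifactId=maven-archetype-quickstart \
    -DgroupId=com.mytutorial -DartifactId=simple-spark

# local mode: driver and executors run inside one JVM on this machine
spark-submit --class com.mytutorial.App --master "local[2]" target/simple-spark-1.0.jar

# cluster mode: the driver and executors run on the YARN cluster
spark-submit --class com.mytutorial.App --master yarn --deploy-mode cluster target/simple-spark-1.0.jar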



05: Spark SQL & CSV with DataFrame Tutorial

Step 1: Create a simple Maven project.
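
A sketch of where the tutorial is heading: reading a CSV file into a DataFrame and querying it with Spark SQL (the file path and options are assumptions):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class CsvToDataFrame {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
            .appName("CsvToDataFrame").master("local[2]").getOrCreate();
        Dataset<Row> df = spark.read()
            .option("header", "true")      // first line holds the column names
            .option("inferSchema", "true") // let Spark guess the column types
            .csv("/tmp/people.csv");
        df.printSchema();
        df.createOrReplaceTempView("people"); // register the DataFrame for Spark SQL
        spark.sql("SELECT * FROM people").show();
        spark.stop();
    }
}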



05a: Spark DataFrame simple tutorial

A DataFrame is an immutable distributed collection of data like an RDD, but unlike an RDD, the data is organized into named columns, as in a table in a relational database. This makes …
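
A toy illustration of the idea; the schema and rows here are made up for the example:

import java.util.Arrays;
import java.util.List;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

public class SimpleDataFrame {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
            .appName("SimpleDataFrame").master("local[2]").getOrCreate();
        // Named columns, as in a relational table
        StructType schema = DataTypes.createStructType(Arrays.asList(
            DataTypes.createStructField("name", DataTypes.StringType, false),
            DataTypes.createStructField("age", DataTypes.IntegerType, false)));
        List<Row> rows = Arrays.asList(
            RowFactory.create("Alice", 30),
            RowFactory.create("Bob", 25));
        Dataset<Row> df = spark.createDataFrame(rows, schema);
        df.filter("age > 26").show(); // declarative, column-aware operations
        spark.stop();
    }
}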



06: Spark Streaming with Flume Avro Sink Tutorial

This extends Running a Simple Spark Job in local & cluster modes and Apache Flume with JMS source (Websphere MQ) and HDFS sink. In this tutorial, a Flume sink will ingest …
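
On the Spark side, the receiving end of such a pipeline looks roughly like this (the host/port must match the Flume Avro sink’s configuration, and the sketch assumes the spark-streaming-flume artifact is on the classpath):

import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaReceiverInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.streaming.flume.FlumeUtils;
import org.apache.spark.streaming.flume.SparkFlumeEvent;

public class FlumeAvroSinkStream {
    public static void main(String[] args) throws InterruptedException {
        SparkConf conf = new SparkConf().setAppName("FlumeStream").setMaster("local[2]");
        JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(10));
        // Listen on the host/port the Flume Avro sink is configured to send to
        JavaReceiverInputDStream<SparkFlumeEvent> events =
            FlumeUtils.createStream(jssc, "localhost", 41414);
        events.count().print(); // report how many events arrived per batch
        jssc.start();
        jssc.awaitTermination();
    }
}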



07: spark-xml to split & read very large XML files

Processing very large XML files can be a bit tricky, as they cannot be processed line by line in parallel as you would do with CSV files. The XML file has …
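
A sketch of the approach with the Databricks spark-xml package: you name the element that delimits each record, and Spark splits the file on that tag in parallel (the rowTag value and path are assumptions):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class LargeXmlRead {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
            .appName("LargeXmlRead").master("local[2]").getOrCreate();
        // Requires the com.databricks:spark-xml dependency on the classpath
        Dataset<Row> df = spark.read()
            .format("com.databricks.spark.xml")
            .option("rowTag", "record") // each <record>...</record> becomes one row
            .load("/tmp/very-large.xml");
        df.printSchema();
        spark.stop();
    }
}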



08: Spark writing RDDs to multiple text files & HAR to solve small files issue

We know that the following code snippets in Spark will write each JavaRDD element to a single file

What …
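
The snippets themselves are truncated above; as a rough illustration of the underlying issue, saveAsTextFile writes one part file per partition, so heavily partitioned RDDs produce many small files (paths are assumptions):

import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class SmallFilesDemo {
    public static void main(String[] args) {
        try (JavaSparkContext sc = new JavaSparkContext(
                new SparkConf().setAppName("SmallFiles").setMaster("local[4]"))) {
            JavaRDD<String> rdd = sc.parallelize(Arrays.asList("a", "b", "c", "d"), 4);
            // One part-0000N file per partition: 4 partitions -> 4 small files
            rdd.saveAsTextFile("hdfs://quickstart.cloudera:8020/user/cloudera/many-files");
            // Reducing partitions before writing is one way to cut the file count
            rdd.coalesce(1).saveAsTextFile("hdfs://quickstart.cloudera:8020/user/cloudera/one-file");
        }
    }
}

Packing such output directories into a Hadoop Archive afterwards, e.g. with the hadoop archive command, is presumably the HAR remedy the title refers to.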



09: Append to AVRO from Spark with distributed Zookeeper locking using Apache’s Curator framework

Step 1: The pom.xml file that has all the relevant dependencies for the Spark, Avro & Hadoop libraries.
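
The locking ingredient of this recipe, in rough outline, is Curator’s InterProcessMutex; the ZooKeeper connect string and lock path below are assumptions:

import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.framework.recipes.locks.InterProcessMutex;
import org.apache.curator.retry.ExponentialBackoffRetry;

public class AvroAppendLock {
    public static void main(String[] args) throws Exception {
        CuratorFramework client = CuratorFrameworkFactory.newClient(
            "localhost:2181", new ExponentialBackoffRetry(1000, 3));
        client.start();
        InterProcessMutex lock = new InterProcessMutex(client, "/locks/avro-append");
        lock.acquire(); // only one writer may append to the Avro file at a time
        try {
            // ... append records to the Avro file on HDFS here ...
        } finally {
            lock.release();
            client.close();
        }
    }
}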



10: Spark RDDs to HBase & HBase to Spark RDDs

Step 1: pom.xml with library dependencies. It is important to note that 1) “https://repository.cloudera.com/artifactory/cloudera-repos/” is added as the “Cloudera Maven Repository”, and 2) the hbase-spark dependency is used for writing to HBase …
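
With that dependency in place, writing an RDD to HBase can be sketched with JavaHBaseContext.bulkPut; the table name, column family and row format here are made up for the example:

import java.util.Arrays;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.spark.JavaHBaseContext;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class RddToHBase {
    public static void main(String[] args) {
        try (JavaSparkContext sc = new JavaSparkContext(
                new SparkConf().setAppName("RddToHBase").setMaster("local[2]"))) {
            Configuration hbaseConf = HBaseConfiguration.create(); // reads hbase-site.xml from the classpath
            JavaHBaseContext hbaseContext = new JavaHBaseContext(sc, hbaseConf);
            JavaRDD<String> rdd = sc.parallelize(Arrays.asList("row1,v1", "row2,v2"));
            // Turn each "rowKey,value" element into a Put against the assumed cf:col column
            hbaseContext.bulkPut(rdd, TableName.valueOf("my_table"), line -> {
                String[] parts = line.split(",");
                Put put = new Put(Bytes.toBytes(parts[0]));
                put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("col"), Bytes.toBytes(parts[1]));
                return put;
            });
        }
    }
}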



11: Spark streaming with “textFileStream” simple tutorial

Using Spark Streaming, data can be ingested from many sources like Kafka, Flume, HDFS, Unix/Windows file systems, etc. In this example, let’s run Spark in local mode to ingest …
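
A minimal sketch of the textFileStream example; the batch interval and watched directory are assumptions:

import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

public class TextFileStreamDemo {
    public static void main(String[] args) throws InterruptedException {
        SparkConf conf = new SparkConf().setAppName("TextFileStream").setMaster("local[2]");
        JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(5));
        // Picks up NEW files dropped into the directory after the job starts
        JavaDStream<String> lines = jssc.textFileStream("/tmp/streaming-in");
        lines.print();
        jssc.start();
        jssc.awaitTermination();
    }
}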


