Blog Archives

06: Spark Streaming with Flume Avro Sink Tutorial

This extends “Running a Simple Spark Job in local & cluster modes” and “Apache Flume with JMS source (Websphere MQ) and HDFS sink”. In this tutorial, a Flume sink will ingest data from a source such as JMS, HDFS, etc. and pass it to an “…



07: spark-xml to split & read very large XML files

Processing very large XML files can be tricky because they cannot be processed line by line in parallel as you would with CSV files. The XML file has to be intact whilst matching the start and end entity tags, and if the tags are distributed in parts...



08: Spark writing RDDs to multiple text files & HAR to solve small files issue

We know that the following code snippets in Spark will write each JavaRDD element to a single file. What if you want to write each employee history to a separate file? Step 1: Create a JavaPairRDD from JavaRDD. Step 2: Create a MultipleOutputFormat, …



09: Append to AVRO from Spark with distributed Zookeeper locking using Apache’s Curator framework

Step 1: The pom.xml file that has all the relevant dependencies for the Spark, Avro & Hadoop libraries. Step 2: Avro schema /schema/employee.avsc file under the src/main/resources folder. Step 3: Spark job that creates random data into an RDD named “…



10: Spark RDDs to HBase & HBase to Spark RDDs

Step 1: pom.xml with library dependencies. It is important to note that 1) “https://repository.cloudera.com/artifactory/cloudera-repos/” is added as the “Cloudera Maven Repository” and 2) the hbase-spark dependency is used for writing to HBase from Spark RDDs & …



11: Spark streaming with “textFileStream” simple tutorial

Using Spark Streaming, data can be ingested from many sources like Kafka, Flume, HDFS, Unix/Windows file systems, etc. In this example, let’s run Spark in local mode to ingest data from a Unix file system. Step 1: The pom.xml file. Using textFileStream(..): textFileStream watches a directory for new...



12: Spark streaming with “fileStream” and “PortableDataStream” simple tutorial

This extends the Spark streaming with “textFileStream” simple tutorial to use fileStream(…) and PortableDataStream. The pom.xml file is the same as in the previous Spark streaming tutorial. Step 1: Using “fileStream(…)”. What if you want to process the files already in the folder when the streaming job started? …


