This tutorial assumes that you have set up Cloudera as per the “cloudera quickstart vm tutorial installation” videos, which you can find by searching Google or YouTube. You can install it on VMware …
This extends Spark tutorial – writing a file from a local file system to HDFS.
This tutorial assumes that you have set up Cloudera as per “cloudera quickstart vm tutorial …
This extends Spark submit – reading a file from HDFS. A SequenceFile is a flat file consisting of binary key/value pairs. It is extensively used in MapReduce as an input/output format. Like …
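To make the idea of "a flat file consisting of binary key/value pairs" concrete, here is a minimal sketch in plain Java. It is not Hadoop's SequenceFile format or API (the real format also carries a header, sync markers and optional compression). The class name `FlatKeyValueFile` and the String/int record layout are assumptions made purely for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration only: a flat run of binary key/value records.
public class FlatKeyValueFile {

    // Writes each entry as a length-prefixed UTF key followed by a 4-byte int value.
    static byte[] write(Map<String, Integer> records) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            for (Map.Entry<String, Integer> entry : records.entrySet()) {
                out.writeUTF(entry.getKey());
                out.writeInt(entry.getValue());
            }
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
        return bytes.toByteArray();
    }

    // Reads records back in the order they were written, until the bytes run out.
    static Map<String, Integer> read(byte[] data) {
        Map<String, Integer> records = new LinkedHashMap<>();
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            while (in.available() > 0) {
                String key = in.readUTF();
                int value = in.readInt();
                records.put(key, value);
            }
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
        return records;
    }

    public static void main(String[] args) {
        Map<String, Integer> salaries = new LinkedHashMap<>();
        salaries.put("john", 35000);
        salaries.put("peter", 42000);
        byte[] flat = write(salaries);
        System.out.println(flat.length + " bytes, round trip: " + read(flat));
    }
}
```

Because the records are binary rather than delimited text, a reader has to know the key and value types up front, which is why a SequenceFile records the key and value class names in its header.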
Step 1: Create a simple Maven Spark project using “-B” for non-interactive mode.
[cloudera@quickstart ~]$
Step 1: Create a simple Maven project.
mvn archetype:generate -B -DgroupId=com.mytutorial -DartifactId=
A DataFrame is an immutable distributed collection of data like an RDD, but unlike an RDD, its data is organized into named columns, like a table in a relational database. This makes …
This extends Running a Simple Spark Job in local & cluster modes and Apache Flume with JMS source (WebSphere MQ) and HDFS sink. In this tutorial, a Flume sink will ingest …
Processing very large XML files can be a bit tricky, as they cannot be processed line by line in parallel the way CSV files can. The XML file has …
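The reason line-by-line splitting fails is that one XML record usually spans several lines, so a record boundary is an element boundary rather than a newline. The sketch below shows this with the JDK's StAX streaming parser, which walks the document element by element without loading it all into memory. It does not use Spark, and the `<employee>` and `<name>` element names and the class name `XmlRecordSplitter` are hypothetical, chosen only for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical illustration only: records are <employee> elements, not lines.
public class XmlRecordSplitter {

    // Returns the text of every <name> element found inside an <employee> record.
    static List<String> employeeNames(String xml) {
        List<String> names = new ArrayList<>();
        try {
            XMLStreamReader reader = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new StringReader(xml));
            boolean inEmployee = false;
            while (reader.hasNext()) {
                int event = reader.next();
                if (event == XMLStreamConstants.START_ELEMENT) {
                    if ("employee".equals(reader.getLocalName())) {
                        inEmployee = true;   // a new record starts here
                    } else if (inEmployee && "name".equals(reader.getLocalName())) {
                        names.add(reader.getElementText().trim());
                    }
                } else if (event == XMLStreamConstants.END_ELEMENT
                        && "employee".equals(reader.getLocalName())) {
                    inEmployee = false;      // the record ends here
                }
            }
            reader.close();
        } catch (XMLStreamException ex) {
            throw new IllegalStateException("Malformed XML", ex);
        }
        return names;
    }

    public static void main(String[] args) {
        // Each record spans three lines, so no single line is a complete record.
        String xml = "<employees>\n"
                + "  <employee>\n    <name>John</name>\n  </employee>\n"
                + "  <employee>\n    <name>Peter</name>\n  </employee>\n"
                + "</employees>";
        System.out.println(employeeNames(xml)); // prints [John, Peter]
    }
}
```

A distributed job faces the same constraint at a larger scale: input splits have to be aligned to record start and end tags rather than to newlines.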
We know that the following code snippet in Spark writes the elements of a JavaRDD as lines of text to part files under a single output directory:
employeesRdd.saveAsTextFile(pathToHdfs)
What …
Step 1: The pom.xml file that has all the relevant dependencies for the Spark, Avro & Hadoop libraries.
Step 1: pom.xml with library dependencies. It is important to note that 1) “https://repository.cloudera.com/artifactory/cloudera-repos/” is added as the “Cloudera Maven Repository” and 2) the hbase-spark dependency is used for writing to HBase …
Using Spark Streaming, data can be ingested from many sources such as Kafka, Flume, HDFS, and Unix/Windows file systems. In this example, let’s run Spark in local mode to ingest …