Blog Archives

01: Learn Hadoop API by examples in Java

These Hadoop tutorials assume that you have installed Cloudera QuickStart, which has the Hadoop ecosystem components like HDFS, Spark, Hive, HBase, YARN, etc.

What is Hadoop & HDFS? Hadoop-based data hub architecture & basics | Hadoop ecosystem basics in Q&A style.




02: Learn Spark & AVRO Write & Read in Java by example

These Hadoop tutorials assume that you have installed Cloudera QuickStart, which has the Hadoop ecosystem components like HDFS, Spark, Hive, HBase, YARN, etc. AVRO (i.e. row-oriented) and Parquet (i.e. column-oriented) file formats are HDFS (i.e. Hadoop Distributed File System) friendly binary data formats, as they store data compressed...
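
For example, a minimal write/read sketch in Java (the input path, output path, and column layout are assumptions; the "avro" format name works in Spark 2.4+ with the spark-avro module on the classpath, while older CDH builds use "com.databricks.spark.avro"):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class AvroReadWrite {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder().appName("AvroReadWrite").getOrCreate();

    // Read a JSON file (hypothetical path) into a DataFrame
    Dataset<Row> df = spark.read().json("hdfs:///user/cloudera/orders.json");

    // Write it back out in the row-oriented Avro binary format
    df.write().format("avro").save("hdfs:///user/cloudera/orders_avro");

    // Read the Avro files back and verify
    spark.read().format("avro").load("hdfs:///user/cloudera/orders_avro").show();

    spark.stop();
  }
}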



02a: Learn Spark writing to Avro using Avro IDL

What is Avro IDL? Avro IDL (i.e. Interface Description Language) is a high-level language for writing Avro schemata. You can generate Java, C++, and Python objects from Avro IDL files. These files generally have the “.avdl” extension. Step 1: Write the “order.avdl” …
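
For illustration, a hypothetical “order.avdl” sketch (the protocol and field names are assumptions, not the post's actual schema):

// order.avdl -- a minimal Avro IDL sketch; names are hypothetical
@namespace("com.example.avro")
protocol OrderProtocol {
  record Order {
    long orderId;
    string product;
    double price;
  }
}

Running the avro-tools "idl" command over such a file produces the equivalent schema, from which the language bindings are generated.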



03: Learn Spark & Parquet Write & Read in Java by example

These Hadoop tutorials assume that you have installed Cloudera QuickStart, which has the Hadoop ecosystem components like HDFS, Spark, Hive, HBase, YARN, etc. AVRO (i.e. row-oriented) and Parquet (i.e. column-oriented) file formats are HDFS (i.e. Hadoop Distributed File System) friendly binary data formats, as they store data compressed...
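
A minimal Parquet counterpart in Java (paths are assumptions; Parquet support is built into Spark SQL, so no extra package is needed):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class ParquetReadWrite {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder().appName("ParquetReadWrite").getOrCreate();

    // Read a JSON file (hypothetical path) into a DataFrame
    Dataset<Row> df = spark.read().json("hdfs:///user/cloudera/orders.json");

    // Write it out in the column-oriented Parquet binary format
    df.write().parquet("hdfs:///user/cloudera/orders_parquet");

    // Read the Parquet files back and verify
    spark.read().parquet("hdfs:///user/cloudera/orders_parquet").show();

    spark.stop();
  }
}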



04: Learn how to connect to HBase from Spark using Java API

These Hadoop tutorials assume that you have installed Cloudera QuickStart, which has the Hadoop ecosystem components like HDFS, Spark, Hive, HBase, YARN, etc. What is HBase? Apache HBase is a NoSQL database used for random, real-time read/write access to your Big Data. It is built on top of the...
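
As a taste of that random read/write access, a plain HBase Java client sketch (the table, column family, and row key are assumptions; the post itself covers calling HBase from Spark):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class HBasePutGet {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create(); // picks up hbase-site.xml from the classpath
    try (Connection conn = ConnectionFactory.createConnection(conf);
         Table table = conn.getTable(TableName.valueOf("orders"))) { // hypothetical table
      // Write one cell keyed by row
      Put put = new Put(Bytes.toBytes("row1"));
      put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("product"), Bytes.toBytes("pen"));
      table.put(put);

      // Random real-time read of the same row
      Result result = table.get(new Get(Bytes.toBytes("row1")));
      System.out.println(Bytes.toString(
          result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("product"))));
    }
  }
}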



05: Learn Hive to write to and read from AVRO & Parquet files by examples

These Hadoop tutorials assume that you have installed Cloudera QuickStart, which has the Hadoop ecosystem components like HDFS, Spark, Hive, HBase, YARN, etc. What is Apache Hive? Hive allows SQL developers to write Hive Query Language (HQL) statements that are similar to standard SQL statements. …
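
For example, a sketch of HQL statements run over JDBC from Java (the HiveServer2 URL, credentials, and table are assumptions; the post may drive Hive directly rather than via JDBC):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveHqlSketch {
  public static void main(String[] args) throws Exception {
    Class.forName("org.apache.hive.jdbc.HiveDriver");
    try (Connection conn = DriverManager.getConnection(
             "jdbc:hive2://quickstart.cloudera:10000/default", "cloudera", "cloudera");
         Statement stmt = conn.createStatement()) {
      // HQL reads like standard SQL; STORED AS picks the underlying file format
      stmt.execute("CREATE TABLE IF NOT EXISTS orders_avro (id INT, product STRING) STORED AS AVRO");
      stmt.execute("INSERT INTO orders_avro VALUES (1, 'pen')");
      try (ResultSet rs = stmt.executeQuery("SELECT id, product FROM orders_avro")) {
        while (rs.next()) System.out.println(rs.getInt(1) + " " + rs.getString(2));
      }
    }
  }
}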



06: Learn how to access Hive from Spark via SparkSQL & Dataframes by example

These Hadoop tutorials assume that you have installed Cloudera QuickStart, which has the Hadoop ecosystem components like HDFS, Spark, Hive, HBase, YARN, etc. This example extends Learn Hive to write to and read from AVRO & Parquet files by examples to access the Hive metastore via Spark SQL. …
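
A minimal sketch of the Spark side (Spark 2.x API; on older releases the same idea is expressed with a HiveContext, and the table name is an assumption):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class SparkHiveSketch {
  public static void main(String[] args) {
    // enableHiveSupport() points Spark SQL at the Hive metastore
    // (hive-site.xml must be on the classpath)
    SparkSession spark = SparkSession.builder()
        .appName("SparkHiveSketch")
        .enableHiveSupport()
        .getOrCreate();

    // Query a Hive table directly as a DataFrame
    Dataset<Row> orders = spark.sql("SELECT * FROM orders_avro");
    orders.show();

    spark.stop();
  }
}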



07: Learn Spark Dataframes to do ETL in Java with examples

These Hadoop tutorials assume that you have installed Cloudera QuickStart, which has the Hadoop ecosystem components like HDFS, Spark, Hive, HBase, YARN, etc. What is a Spark DataFrame? A DataFrame is an immutable distributed collection of data like an RDD, but unlike an RDD, the data is organized into named columns...
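
For example, a compact extract-transform-load sketch (the CSV path, column names, and filter threshold are assumptions):

import static org.apache.spark.sql.functions.col;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class DataFrameEtl {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder().appName("DataFrameEtl").getOrCreate();

    // Extract: read a CSV file with a header row into named columns
    Dataset<Row> orders = spark.read()
        .option("header", "true").option("inferSchema", "true")
        .csv("hdfs:///user/cloudera/orders.csv");

    // Transform: filter and project by column name, which a plain RDD cannot do
    Dataset<Row> bigOrders = orders
        .filter(col("amount").gt(100))
        .select("orderId", "amount");

    // Load: persist the result as Parquet
    bigOrders.write().parquet("hdfs:///user/cloudera/big_orders");

    spark.stop();
  }
}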



08: Learn how to convert a Java RDD to a DataFrame in Spark with examples

These Hadoop tutorials assume that you have installed Cloudera QuickStart, which has the Hadoop ecosystem components like HDFS, Spark, Hive, HBase, YARN, etc. Java RDD to DataFrame: the following code reads a text file (orders.txt) into a Java RDD. The Java code that reads the text file...
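
A minimal sketch of that conversion (the path, the comma-separated layout, and the column names are assumptions):

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

public class RddToDataFrame {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder().appName("RddToDataFrame").getOrCreate();

    // Read the text file into a Java RDD of lines
    JavaRDD<String> lines = spark.read()
        .textFile("hdfs:///user/cloudera/orders.txt").javaRDD();

    // Map each "id,product" line to a Row
    JavaRDD<Row> rows = lines.map(line -> {
      String[] parts = line.split(",");
      return RowFactory.create(Integer.parseInt(parts[0].trim()), parts[1].trim());
    });

    // Pair the rows with an explicit schema to get a DataFrame
    StructType schema = new StructType(new StructField[] {
        DataTypes.createStructField("orderId", DataTypes.IntegerType, false),
        DataTypes.createStructField("product", DataTypes.StringType, false)
    });
    Dataset<Row> df = spark.createDataFrame(rows, schema);
    df.show();

    spark.stop();
  }
}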



09: Running a Spark job on YARN cluster in Cloudera

This assumes that you have installed Cloudera QuickStart, which has the Hadoop ecosystem components like HDFS, Spark, Hive, HBase, YARN, etc. It is also important to enable the History Server, as described in Before running a Spark job on YARN. Step 1: Open a “ …
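
The submission itself boils down to one command; a sketch with assumed class name, jar, and resource sizes:

spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --class com.example.MySparkJob \
  --num-executors 2 \
  --executor-memory 1g \
  my-spark-job.jar

With --deploy-mode cluster the driver runs inside a YARN container; switch to client mode to keep the driver in your terminal while debugging.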



10: Solving AlreadyBeingCreatedException & LeaseExpiredException thrown from your Spark jobs

What is wrong with the following Spark code snippet? You are likely to get AlreadyBeingCreatedException & LeaseExpiredException thrown as multiple executors try to either create or append to the same file in HDFS in parallel; HDFS allows only one writer at a time. How do you fix the above issue? …
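
A sketch of the fix (paths are assumptions): let Spark write the files rather than opening a shared HDFS path from your own code, so each partition gets exactly one writer and one part-xxxx file.

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class SingleWriterPerFile {
  public static void main(String[] args) {
    JavaSparkContext jsc = new JavaSparkContext(new SparkConf().setAppName("SingleWriterPerFile"));
    JavaRDD<String> output = jsc.textFile("hdfs:///user/cloudera/input.txt");

    // Anti-pattern: creating/appending to one shared HDFS file from
    // foreach/foreachPartition makes executors fight over the single lease,
    // which surfaces as AlreadyBeingCreatedException / LeaseExpiredException.

    // Fix: saveAsTextFile writes one part-xxxx file per partition, so there is
    // never more than one writer per file; coalesce first for fewer files.
    output.coalesce(1).saveAsTextFile("hdfs:///user/cloudera/output");

    jsc.stop();
  }
}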



11: What are part-xxxx files in Hadoop & 6 ways to merge them

What are the part-xxxx files generated by Hadoop? When you invoke rdd.saveAsTextFile(…) or rdd.saveAsNewAPIHadoopFile(…) from Spark, you will get part- files. When you run an “INSERT INTO” command in Hive, the execution also results in multiple part files in HDFS. You will have one part-xxxx file per partition in the RDD you...
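
Two of the merge options as a sketch (paths are assumptions): reduce partitions before saving with rdd.coalesce(1).saveAsTextFile(…), or concatenate after the fact as below.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.Path;

public class MergePartFiles {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // Concatenate every part-xxxx file under the source dir into one HDFS file.
    // FileUtil.copyMerge exists in Hadoop 2.x; it was removed in Hadoop 3.0.
    FileUtil.copyMerge(fs, new Path("/user/cloudera/output"),
        fs, new Path("/user/cloudera/merged/orders.txt"),
        false, conf, null);
  }
}

From the shell, "hdfs dfs -getmerge /user/cloudera/output/part-* merged.txt" achieves the same thing onto the local file system.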



12: XML Processing in Spark with XmlInputFormat

Step 1: Read the XML snippet between the “<Record>” tags. Upload this file to HDFS at “/user/cloudera/xml/orders.xml”. Step 2: You need the XmlInputFormat class, as shown below; you can find it in the Mahout library. The following class works with the new Hadoop API. …
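
Once those two steps are done, the Spark side is short; a sketch (the record tag and HDFS path follow the steps above, the rest is assumed, and the XmlInputFormat package name varies across Mahout versions):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.mahout.text.wikipedia.XmlInputFormat;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class XmlRecords {
  public static void main(String[] args) {
    JavaSparkContext jsc = new JavaSparkContext(new SparkConf().setAppName("XmlRecords"));

    // Tell XmlInputFormat which tags delimit one record
    Configuration conf = new Configuration();
    conf.set(XmlInputFormat.START_TAG_KEY, "<Record>");
    conf.set(XmlInputFormat.END_TAG_KEY, "</Record>");

    // Each value is the full XML text of one <Record>...</Record> block
    JavaPairRDD<LongWritable, Text> records = jsc.newAPIHadoopFile(
        "hdfs:///user/cloudera/xml/orders.xml",
        XmlInputFormat.class, LongWritable.class, Text.class, conf);

    // Convert the reused Hadoop Text objects to Strings before collecting
    records.values().map(t -> t.toString()).collect().forEach(System.out::println);

    jsc.stop();
  }
}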




