1. Apache Pig Getting started

Input Data

The file scores.data in the folder /Users/arulk/projects holds the marks of 4 students in 3 subjects.

Calculate the max mark for each subject.

Step 1: Download Apache Pig from http://apache.mirror.digitalpacific.com.au/pig/ and extract the tar file.

Step 2: Set the PIG_HOME environment variable and add $PIG_HOME/bin to the PATH via .profile or .bashrc.

Step 3: Start Pig. Running the pig command with the ‘-x local’ option starts Pig in local mode, whereas running it without any options starts Pig in cluster (MapReduce) mode. In local mode, Pig can access files on the local file system. In cluster mode, Pig can access files on HDFS.

Step 4: Read the file “/Users/arulk/projects/scores.data”.

Note: If a data type is not provided, the default data type is “bytearray”.
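A minimal sketch of the load statement, assuming the file is comma-delimited and each line holds a student name followed by three marks. The field names (name, english, maths, science) are hypothetical, since the subjects are not named here. The second statement shows the same load without declared types, in which case every field is a bytearray.

```pig
-- Assumed layout: name,mark,mark,mark (comma-delimited).
-- The field names below are hypothetical placeholders.
student_marks = LOAD '/Users/arulk/projects/scores.data'
    USING PigStorage(',')
    AS (name:chararray, english:int, maths:int, science:int);

-- Same load without declared types: every field defaults to bytearray.
student_marks_untyped = LOAD '/Users/arulk/projects/scores.data'
    USING PigStorage(',')
    AS (name, english, maths, science);
```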

Here “student_marks” is known as a “relation”, NOT a variable. Pig is a data flow language. A Pig relation is a bag of tuples. It is similar to a table in a relational database, where the tuples in the bag correspond to the rows in the table.

A tuple is just like a row in a table. It is a comma-separated list of fields.

A bag is an unordered collection of tuples.

Three handy commands to check the structure (aka schema), how the data flow was derived, and the actual data are:

illustrate student_marks
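Pig's diagnostic operators DESCRIBE, ILLUSTRATE, and DUMP cover these three checks. The sketch below assumes the comma-delimited layout and hypothetical field names (name, english, maths, science) for student_marks. The LOAD is repeated so the snippet runs on its own.

```pig
-- Hypothetical schema; adjust the field names to match the real file.
student_marks = LOAD '/Users/arulk/projects/scores.data'
    USING PigStorage(',')
    AS (name:chararray, english:int, maths:int, science:int);

DESCRIBE student_marks;    -- prints the schema of the relation
ILLUSTRATE student_marks;  -- walks a small sample of data through each step
DUMP student_marks;        -- runs the data flow and prints the tuples
```

EXPLAIN student_marks; is also available when the logical, physical, and execution plans are of interest.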


Step 5: Put the student mark columns into a bag.
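One way to sketch this step is with the built-in TOBAG and TOTUPLE functions, pairing each mark with a subject label so that the three columns become three (subject, mark) tuples in one bag per student. The relation name records_bag, the alias marks, and the field names are assumptions for illustration.

```pig
-- Step 4 repeated so the snippet runs on its own (hypothetical field names).
student_marks = LOAD '/Users/arulk/projects/scores.data'
    USING PigStorage(',')
    AS (name:chararray, english:int, maths:int, science:int);

-- Wrap each mark in a (subject, mark) tuple and collect the tuples in a bag.
records_bag = FOREACH student_marks GENERATE
    name,
    TOBAG(TOTUPLE('english', english),
          TOTUPLE('maths',   maths),
          TOTUPLE('science', science)) AS marks;
```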

Step 6: Flatten the bag.
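FLATTEN un-nests a bag, producing one output tuple per tuple in the bag. A sketch under the same assumptions as before (hypothetical field names, comma-delimited input), with an AS clause to name the flattened fields:

```pig
-- Steps 4-5 repeated so the snippet runs on its own (hypothetical field names).
student_marks = LOAD '/Users/arulk/projects/scores.data'
    USING PigStorage(',')
    AS (name:chararray, english:int, maths:int, science:int);
records_bag = FOREACH student_marks GENERATE
    name,
    TOBAG(TOTUPLE('english', english),
          TOTUPLE('maths',   maths),
          TOTUPLE('science', science)) AS marks;

-- One row per (student, subject): name, subject, mark.
records_flattened = FOREACH records_bag GENERATE
    name,
    FLATTEN(marks) AS (subject:chararray, mark:int);
```

Each student now contributes three tuples, one per subject, which is the shape needed for grouping by subject.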

Step 7: Group the records by subject.
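A sketch of the grouping, again with hypothetical field names. GROUP produces one tuple per subject, made up of a field named group (the subject) and a bag named after the grouped relation that holds all the matching tuples.

```pig
-- Steps 4-6 repeated so the snippet runs on its own (hypothetical field names).
student_marks = LOAD '/Users/arulk/projects/scores.data'
    USING PigStorage(',')
    AS (name:chararray, english:int, maths:int, science:int);
records_flattened = FOREACH student_marks GENERATE
    name,
    FLATTEN(TOBAG(TOTUPLE('english', english),
                  TOTUPLE('maths',   maths),
                  TOTUPLE('science', science)))
        AS (subject:chararray, mark:int);

-- One tuple per subject: (group, {bag of (name, subject, mark) tuples}).
records_group = GROUP records_flattened BY subject;
```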

illustrate records_group


Step 8: Final result
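The whole flow can be sketched as one script that ends with the built-in MAX function applied to each subject's bag. The field names, relation names, and comma delimiter are assumptions for illustration, not taken from the input file.

```pig
-- Load: hypothetical schema of a name followed by three marks.
student_marks = LOAD '/Users/arulk/projects/scores.data'
    USING PigStorage(',')
    AS (name:chararray, english:int, maths:int, science:int);

-- Turn the three mark columns into (subject, mark) rows.
records_flattened = FOREACH student_marks GENERATE
    name,
    FLATTEN(TOBAG(TOTUPLE('english', english),
                  TOTUPLE('maths',   maths),
                  TOTUPLE('science', science)))
        AS (subject:chararray, mark:int);

-- Group by subject, then take the highest mark in each group.
records_group = GROUP records_flattened BY subject;
max_marks = FOREACH records_group GENERATE
    group AS subject,
    MAX(records_flattened.mark) AS max_mark;

DUMP max_marks;  -- prints one (subject, max_mark) tuple per subject
```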

