2. Apache Pig: Regex (Regular expressions)

This extends the tutorial “1. Apache Pig Getting started”.

Input Data

The file scores.xml in the folder /Users/arulk/projects represents the marks of 4 students in 3 subjects:

Step 1: Start Pig in local file system mode (the “-x local” option).

Step 2: Extract the “Subjects” from the input XML file.

Dump the output:
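To illustrate what Step 2 does conceptually, here is a minimal Python sketch (not Pig Latin, and not the tutorial's own script) that pulls every “Subject” element out of an XML string as raw text. The sample XML, its element layout, and the helper name extract_elements are all assumptions made for illustration; the real scores.xml may be structured differently.

```python
import re

# Hypothetical sample data; NOT the tutorial's actual scores.xml.
SAMPLE_XML = """<Scores>
  <Student name="student1">
    <Subject name="Maths">85</Subject>
    <Subject name="English">72</Subject>
  </Student>
  <Student name="student2">
    <Subject name="Maths">64</Subject>
  </Student>
</Scores>"""


def extract_elements(xml_text, tag):
    """Return every <tag ...>...</tag> block as a raw string, in document order."""
    # Non-greedy match so each element is returned separately.
    pattern = re.compile(
        r"<{0}\b[^>]*>.*?</{0}>".format(re.escape(tag)), re.DOTALL
    )
    return pattern.findall(xml_text)


subjects = extract_elements(SAMPLE_XML, "Subject")
for element in subjects:
    print(element)
```

Each extracted element is still an unparsed string at this point, which is why the next step applies a regex to it.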

Step 3: Regex to extract each “Subject” and its corresponding marks.

Dump it:
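The regex idea behind Step 3 can be sketched in Python as two capture groups: one for the subject name and one for the marks. The element layout (a name attribute with the marks as element text), the pattern, and the helper parse_subject are assumptions for illustration only. Pig's built-in REGEX_EXTRACT and REGEX_EXTRACT_ALL functions use Java regular expressions, whose syntax matches Python's for a simple pattern like this one.

```python
import re

# Assumed layout: <Subject name="...">marks</Subject> (hypothetical).
SUBJECT_RE = re.compile(r'<Subject\s+name="([^"]+)"\s*>\s*(\d+)\s*</Subject>')


def parse_subject(element):
    """Return (subject, marks) for one Subject element, or None if it does not match."""
    match = SUBJECT_RE.search(element)
    if match is None:
        return None
    # Group 1 is the subject name, group 2 the marks as digits.
    return match.group(1), int(match.group(2))


# Hypothetical input rows, one raw element per row.
elements = [
    '<Subject name="Maths">85</Subject>',
    '<Subject name="English">72</Subject>',
]
rows = [parse_subject(e) for e in elements]
print(rows)  # [('Maths', 85), ('English', 72)]
```

Returning None for a non-matching row keeps malformed elements visible instead of silently dropping them.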

Step 4: Put it all into a single “marks_by_subjects.pig” script.

Run the above Pig script:

Outputs:

If you run the script without the “-x local” option, it runs in “Map-Reduce” mode against HDFS (e.g. hdfs://localhost:9000). The name node and data nodes need to be running; otherwise you will get a “Caused by: java.net.ConnectException: Connection refused” error.
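Before running in Map-Reduce mode, one quick way to see whether a “Connection refused” error is likely is to check that something is listening on the name node port. The sketch below is a plain TCP check in Python; the helper name is_port_open is hypothetical, and the host and port are taken from the example URL above, so adjust them to your own configuration. A successful connection only shows that the port is open, not that HDFS is healthy.

```python
import socket


def is_port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds, else False."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers "connection refused", timeouts and name-resolution failures.
        return False


# Host and port assumed from the example URL hdfs://localhost:9000.
print(is_port_open("localhost", 9000))
```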

Mapreduce mode

