2. Apache Pig: Regex (Regular expressions)

This extends the tutorial “1. Apache Pig Getting started”.

Input Data

scores.xml in the folder /Users/arulk/projects, representing the marks of 4 students in 3 subjects.

Step 1: Start Pig in local file system mode (pig -x local).

Step 2: Extract the “Subjects” from the input XML file.

Dump the output:
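A minimal sketch of what Step 2 might look like, not the original listing. It assumes the Piggybank XMLLoader is used, that the repeating element in scores.xml is named Subject, and that piggybank.jar sits at a placeholder path; the alias and field names are hypothetical.

```pig
-- Hypothetical path: point this at the piggybank.jar of your Pig installation.
REGISTER /path/to/piggybank.jar;

-- XMLLoader('Subject') emits one record per <Subject>...</Subject> element,
-- holding the element's raw XML text in a single chararray field.
-- The tag name 'Subject' is an assumption about the layout of scores.xml.
subjects = LOAD '/Users/arulk/projects/scores.xml'
           USING org.apache.pig.piggybank.storage.XMLLoader('Subject')
           AS (subject_xml:chararray);

-- Print every extracted element to the console.
DUMP subjects;
```

Each dumped record should then be one XML fragment, tags included, ready for the regex in the next step.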

Step 3: Regex to extract each “Subject” and its corresponding marks.

Dump it:
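One way Step 3 could be written, as a sketch under stated assumptions: a relation named subjects holds one Subject element per record in a chararray field subject_xml, and each element contains a Name tag followed by a Marks tag. These names and the element layout are hypothetical, since the layout of scores.xml is not shown here.

```pig
-- Assumes `subjects` has the schema (subject_xml:chararray), one element per record.
-- REGEX_EXTRACT_ALL must match the WHOLE string, hence the leading and trailing .*
-- (?s) lets . match newlines, because an XML element usually spans several lines.
marks_by_subject = FOREACH subjects GENERATE
    FLATTEN(REGEX_EXTRACT_ALL(subject_xml,
        '(?s).*<Name>(.*?)</Name>.*<Marks>([0-9]+)</Marks>.*'))
    AS (subject:chararray, marks:chararray);

-- Print one (subject, marks) tuple per input element.
DUMP marks_by_subject;
```

The pattern deliberately avoids backslash escapes (it uses [0-9] rather than \d) so that it does not depend on how Pig string literals treat backslashes. The two capture groups become the two output fields; marks stays a chararray and can be cast to int in a later FOREACH if arithmetic is needed.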

Step 4: Put it all into a single “marks_by_subjects.pig” script.
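A possible shape for marks_by_subjects.pig, combining the load and regex steps. This is a sketch, not the original script: the piggybank.jar path, the Subject, Name and Marks tag names, and all aliases are assumptions.

```pig
-- marks_by_subjects.pig (hypothetical contents)
-- Local mode run: pig -x local marks_by_subjects.pig

-- Placeholder path: adjust to your Pig installation.
REGISTER /path/to/piggybank.jar;

-- One record per <Subject>...</Subject> element (tag name assumed).
subjects = LOAD '/Users/arulk/projects/scores.xml'
           USING org.apache.pig.piggybank.storage.XMLLoader('Subject')
           AS (subject_xml:chararray);

-- Pull the subject name and its marks out of each element (inner tags assumed).
marks_by_subject = FOREACH subjects GENERATE
    FLATTEN(REGEX_EXTRACT_ALL(subject_xml,
        '(?s).*<Name>(.*?)</Name>.*<Marks>([0-9]+)</Marks>.*'))
    AS (subject:chararray, marks:chararray);

DUMP marks_by_subject;
```

Keeping the statements in a script file means the same steps can be rerun without retyping them in the Grunt shell.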

Run the above Pig script in local mode, for example with pig -x local marks_by_subjects.pig.


If you run without the “-x local” option, Pig runs in “Map-Reduce” mode against HDFS (e.g. hdfs://localhost:9000). The name node and data nodes need to be running; otherwise you will get a “Caused by: java.net.ConnectException: Connection refused” error.

Mapreduce mode
