This tutorial extends “1. Apache Pig Getting started”.
Input Data
The input file scores.xml, in the folder /Users/arulk/projects, holds the marks of 4 students in 3 subjects:
    <scores>
      <subject>
        <name>Science</name>
        <marks>
          <mark>80</mark>
          <mark>75</mark>
          <mark>89</mark>
          <mark>90</mark>
        </marks>
      </subject>
      <subject>
        <name>Maths</name>
        <marks>
          <mark>90</mark>
          <mark>87</mark>
          <mark>78</mark>
          <mark>92</mark>
        </marks>
      </subject>
      <subject>
        <name>English</name>
        <marks>
          <mark>78</mark>
          <mark>88</mark>
          <mark>65</mark>
          <mark>99</mark>
        </marks>
      </subject>
    </scores>
Step 1: Start Pig in local file system mode.
    pig -x local
Step 2: Extract the “Subjects” from the input XML file.
    grunt> SUBJECTS_EXTRACT = LOAD '/Users/arulk/projects/scores.xml' using org.apache.pig.piggybank.storage.XMLLoader('subject') as (xmlContents:chararray);
Dump the output:
    grunt> dump SUBJECTS_EXTRACT;
    (<subject> <name>Science</name> <marks> <mark>80</mark> <mark>75</mark> <mark>89</mark> <mark>90</mark> </marks> </subject>)
    (<subject> <name>Maths</name> <marks> <mark>90</mark> <mark>87</mark> <mark>78</mark> <mark>92</mark> </marks> </subject>)
    (<subject> <name>English</name> <marks> <mark>78</mark> <mark>88</mark> <mark>65</mark> <mark>99</mark> </marks> </subject>)
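To make the effect of XMLLoader('subject') more concrete, here is a rough Python sketch of the same idea: cut the document into one record per <subject> element. It is only an illustration, not the piggybank implementation. The helper name split_by_tag is hypothetical, and the sample data is the scores.xml content from above with the whitespace compacted.

```python
import re

# Sample input: the scores.xml content shown above, whitespace compacted.
SCORES_XML = """<scores>
<subject><name>Science</name><marks><mark>80</mark><mark>75</mark><mark>89</mark><mark>90</mark></marks></subject>
<subject><name>Maths</name><marks><mark>90</mark><mark>87</mark><mark>78</mark><mark>92</mark></marks></subject>
<subject><name>English</name><marks><mark>78</mark><mark>88</mark><mark>65</mark><mark>99</mark></marks></subject>
</scores>"""


def split_by_tag(xml_text, tag):
    """Hypothetical helper: return one string per <tag>...</tag> block, in document order."""
    # Non-greedy match so each block stops at its own closing tag;
    # DOTALL lets a block span several lines.
    pattern = re.compile(r"<{0}>.*?</{0}>".format(re.escape(tag)), re.DOTALL)
    return pattern.findall(xml_text)


records = split_by_tag(SCORES_XML, "subject")
for record in records:
    print(record)
```

Each printed record corresponds to one tuple in the dump of SUBJECTS_EXTRACT, which is why the regex in the next step can work on a single subject at a time.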
Step 3: Use a regex to extract each “Subject” and its corresponding marks.
    grunt> MARKS_FOR_SUBJECT_CSV = foreach SUBJECTS_EXTRACT GENERATE FLATTEN(REGEX_EXTRACT_ALL(xmlContents,'<subject>\\s*<name>(.*)</name>\\s*<marks>\\s*<mark>(.*)</mark>\\s*<mark>(.*)</mark>\\s*<mark>(.*)</mark>\\s*<mark>(.*)</mark>\\s*</marks>\\s*</subject>'));
Dump it:
    grunt> dump MARKS_FOR_SUBJECT_CSV;
    (Science,80,75,89,90)
    (Maths,90,87,78,92)
    (English,78,88,65,99)
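The regex can be tried outside Pig before running it on the real file. The sketch below applies the same pattern to one <subject> record in Python. Pig's '\\s' is written as \s in a Python raw string, and fullmatch is used on the assumption that REGEX_EXTRACT_ALL requires the whole record to match. The function name extract_marks is hypothetical.

```python
import re

# Same pattern as in the Pig statement above, split over several lines for readability.
SUBJECT_PATTERN = re.compile(
    r"<subject>\s*<name>(.*)</name>\s*<marks>\s*"
    r"<mark>(.*)</mark>\s*<mark>(.*)</mark>\s*"
    r"<mark>(.*)</mark>\s*<mark>(.*)</mark>\s*"
    r"</marks>\s*</subject>"
)


def extract_marks(xml_contents):
    """Hypothetical helper: return (name, mark1..mark4) as strings, or None if the record does not match."""
    match = SUBJECT_PATTERN.fullmatch(xml_contents)
    return match.groups() if match else None


# One record as XMLLoader('subject') would hand it over.
record = """<subject>
  <name>Science</name>
  <marks>
    <mark>80</mark>
    <mark>75</mark>
    <mark>89</mark>
    <mark>90</mark>
  </marks>
</subject>"""

print(extract_marks(record))  # ('Science', '80', '75', '89', '90')
```

Note that the pattern expects exactly four <mark> elements. A subject with three or five marks does not match, so the pattern has to be adjusted if the number of students changes.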
Step 4: Put it all into a single “marks_by_subjects.pig” script.
    -- Load one record per <subject> element.
    SUBJECTS_EXTRACT = LOAD '/Users/arulk/projects/scores.xml' using org.apache.pig.piggybank.storage.XMLLoader('subject') as (xmlContents:chararray);

    -- Pull the subject name and the four marks out of each record.
    MARKS_FOR_SUBJECT_CSV = foreach SUBJECTS_EXTRACT GENERATE FLATTEN(REGEX_EXTRACT_ALL(xmlContents,'<subject>\\s*<name>(.*)</name>\\s*<marks>\\s*<mark>(.*)</mark>\\s*<mark>(.*)</mark>\\s*<mark>(.*)</mark>\\s*<mark>(.*)</mark>\\s*</marks>\\s*</subject>'));

    dump MARKS_FOR_SUBJECT_CSV;
Run the above Pig script:
    $ pig -x local marks_by_subjects.pig
Outputs:
    (Science,80,75,89,90)
    (Maths,90,87,78,92)
    (English,78,88,65,99)
If you run without the “-x local” option, Pig runs in “Map-Reduce” mode against HDFS (e.g. hdfs://localhost:9000). The name node and data nodes need to be running; otherwise you will get a “Caused by: java.net.ConnectException: Connection refused” error.
Mapreduce mode
Either of the following starts Pig in MapReduce mode, since it is the default:
    pig -x mapreduce
    pig
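A “Connection refused” error usually means nothing is listening on the name node address. As a quick check before starting Pig in MapReduce mode, the following Python sketch tests whether a TCP port accepts connections. The function name is_port_open is hypothetical, and localhost:9000 is simply the example address mentioned above; substitute the host and port from your own Hadoop configuration.

```python
import socket


def is_port_open(host, port, timeout=2.0):
    """Hypothetical helper: return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name-resolution failures.
        return False


if __name__ == "__main__":
    # localhost:9000 is the example name node address used in this tutorial.
    print(is_port_open("localhost", 9000))
```

If this prints False, start the Hadoop daemons first. A True result only shows that something is listening on the port, not that HDFS is healthy.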