Input Data & How Hadoop reads the Data
scores.data in folder: /Users/arulk/projects

```
Science, 80, 75, 89, 90
Maths, 90, 87, 78, 92
English, 78, 88, 65, 99
```
Mapper Input
The Hadoop "org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat" class reads the input as key/value pairs. The default delimiter is a tab, but our data uses ",". You can change the delimiter with the "mapreduce.input.keyvaluelinerecordreader.key.value.separator" property, as in:
```java
Configuration conf = new Configuration();
conf.set("mapreduce.input.keyvaluelinerecordreader.key.value.separator", ",");
```
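To see what that separator does, here is a plain-Java sketch (no Hadoop classes; the class and method names are illustrative) of how KeyValueTextInputFormat splits each line at the first occurrence of the separator:

```java
public class SeparatorDemo {

    // Mimics KeyValueTextInputFormat's splitting rule: everything before the
    // first separator is the key, everything after it is the value.
    static String[] splitLine(String line, char separator) {
        int pos = line.indexOf(separator);
        if (pos == -1) {
            // No separator: the whole line becomes the key, the value is empty.
            return new String[] { line, "" };
        }
        return new String[] { line.substring(0, pos), line.substring(pos + 1) };
    }

    public static void main(String[] args) {
        String[] kv = splitLine("Science, 80, 75, 89, 90", ',');
        // key = "Science", value = " 80, 75, 89, 90"
        System.out.println("key=" + kv[0] + " value=" + kv[1]);
    }
}
```

Note that only the first "," splits the line; the remaining commas stay inside the value, which is why the mapper still has to tokenize the value itself.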

MapReduce reading values as key/value pairs using KeyValueTextInputFormat
Mapper Output
The mapper will go through the comma-separated marks of each value, e.g. 80, 75, 89, 90, and convert them to 80_75_89_90_.
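That transformation can be tried on its own, outside Hadoop. A minimal sketch (the class and method names are illustrative, not part of the tutorial's code):

```java
import java.util.StringTokenizer;

public class MapperLogicDemo {

    // The same string transformation the mapper performs on each value:
    // "80, 75, 89, 90" -> "80_75_89_90_"
    static String toUnderscoreString(String csv) {
        StringTokenizer st = new StringTokenizer(csv, ",");
        StringBuilder sb = new StringBuilder();
        while (st.hasMoreTokens()) {
            // Trim the space left over after each comma, then append "_".
            sb.append(st.nextToken().trim()).append("_");
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(toUnderscoreString("80, 75, 89, 90")); // 80_75_89_90_
    }
}
```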

Mapper output key/value pairs.
Reducer Input
Same as mapper output.
Reducer Output
The reducer splits each "_" delimited value such as 80_75_89_90_ into a string array [80, 75, 89, 90], finds the maximum score for each key (i.e. each subject, such as Science), and stores the result as "max score is: 90".
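The reduce-side computation is equally easy to verify in isolation. A plain-Java sketch of the split-and-max logic (names are illustrative):

```java
public class ReducerLogicDemo {

    // The same computation the reducer performs on each "_" delimited value:
    // split into individual scores and keep the maximum.
    static long maxScore(String underscored) {
        long max = 0;
        // "80_75_89_90_".split("_") yields ["80", "75", "89", "90"];
        // the trailing empty string after the last "_" is dropped by split().
        for (String s : underscored.split("_")) {
            long score = Long.parseLong(s.trim());
            if (score > max) {
                max = score;
            }
        }
        return max;
    }

    public static void main(String[] args) {
        System.out.println("max score is: " + maxScore("80_75_89_90_")); // max score is: 90
    }
}
```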

Hadoop Reducer output
Hadoop MapReduce Steps
Step 1: Create a simple Maven project from a Unix command prompt. Press enter for all the questions.
```shell
mvn archetype:generate -DgroupId=com.mytutorial -DartifactId=simple-hadoop-mapreduce
```
Import the new Maven project into Eclipse or the IDE of your choice.
Step 2: Add the Hadoop dependency to the pom.xml.
```xml
<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-client</artifactId>
    <version>2.7.1</version>
    <scope>provided</scope>
</dependency>
```
Step 3: The Hadoop-based mapper class "ScoreMapper" can be executed in parallel by multiple nodes. It processes each input line as a key/value pair, e.g. Science / 80, 75, 89, 90.
```java
package com.mytutorial;

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class ScoreMapper extends Mapper<Text, Text, Text, Text> {

    @Override
    public void map(Text key, Text value, Context context)
            throws IOException, InterruptedException {
        // Split the comma-separated scores, e.g. "80, 75, 89, 90"
        StringTokenizer st = new StringTokenizer(value.toString(), ",");
        StringBuilder mappedString = new StringBuilder();
        while (st.hasMoreTokens()) {
            mappedString.append(st.nextToken().trim()).append("_");
        }
        // Emit e.g. Science -> "80_75_89_90_"
        context.write(key, new Text(mappedString.toString()));
    }
}
```
Step 4: The Hadoop-based reducer class "ScoreReducer" can also be executed in parallel by multiple nodes. It processes each input key/value pair, e.g. Science / 80_75_89_90_, and emits output key/value pairs such as Science / max score is: 90.
```java
package com.mytutorial;

import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class ScoreReducer extends Reducer<Text, Text, Text, Text> {

    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        long maxScore = 0;
        for (Text val : values) {
            // e.g. "80_75_89_90_" -> ["80", "75", "89", "90"]
            String[] split = val.toString().split("_");
            for (String score : split) {
                long tempValue = Long.parseLong(score.trim());
                if (tempValue > maxScore) {
                    maxScore = tempValue;
                }
            }
        }
        context.write(key, new Text("max score is: " + maxScore));
    }
}
```
Step 5: Finally, the executable main Java class "MaxScoreMain" ties everything together. Note that it sets the key/value separator to "," and uses KeyValueTextInputFormat, so that the mapper receives Text/Text pairs as described above.
```java
package com.mytutorial;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class MaxScoreMain {

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Split each input line into key/value at the first "," (default is tab)
        conf.set("mapreduce.input.keyvaluelinerecordreader.key.value.separator", ",");

        Job job = Job.getInstance(conf, "maxscoremain");
        job.setJarByClass(MaxScoreMain.class);
        job.setMapperClass(ScoreMapper.class);
        job.setReducerClass(ScoreReducer.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(Text.class);
        // KeyValueTextInputFormat feeds the mapper Text/Text pairs
        job.setInputFormatClass(KeyValueTextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);

        FileInputFormat.addInputPath(job, new Path("/Users/arulk/projects/scores.data"));
        FileOutputFormat.setOutputPath(job, new Path("/Users/arulk/tempMapreduce"));

        boolean result = job.waitForCompletion(true);
        System.exit(result ? 0 : 1);
    }
}
```
Step 6: The results will be written to the folder "/Users/arulk/tempMapreduce" in a file named "part-r-00000". The contents of this file will be:
```
English	max score is: 99
Maths	max score is: 92
Science	max score is: 90
```
This tutorial was created in a Unix environment. You may face additional challenges on a Windows machine and may need a Unix emulator like Cygwin to run it.