
How to calculate Median and Standard deviation using MapReduce

MapReduce is a programming model for analysing large amounts of data in a scalable and fault-tolerant manner. Many different kinds of analysis can be expressed with it; in this post we will calculate the median and standard deviation of a dataset. Let's go through it step by step.

Before starting the actual implementation, let's review the two statistics we are going to calculate:

  • Median – the middle value of the dataset when the values are listed in ascending order. If the number of values is even, it is the average of the two middle values.
  • Standard Deviation – a measure of how dispersed the values are around the mean (average) of the numbers.
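
For example (an illustrative set of numbers, not taken from the dataset used below): for the values 10, 20, 30, 40, the median is (20 + 30) / 2 = 25; the mean is also 25, and the sample standard deviation is sqrt(((10-25)² + (20-25)² + (30-25)² + (40-25)²) / 3) ≈ 12.91.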

[Diagram] MapReduce architecture to calculate median and standard deviation

Explanation of the diagram

As we can see in the diagram, the input file is split into chunks (blocks) and read line by line. A map operation runs on each block, extracts the department and salary from every record, and writes the (department, salary) pairs out for the reducer. In the reduce phase, the actual median and standard deviation calculations are performed per department.
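
To make the flow concrete, here is a hypothetical view of the intermediate key-value pairs (the numbers are made up for illustration): if the mappers emit (fire, 85000), (police, 90000), (fire, 97000) and (police, 102000), the shuffle phase groups them by key, so the reducer receives fire → [85000, 97000] and police → [90000, 102000] and computes the median and standard deviation of each list.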

Problem statement

We are given employee salary data that includes, among other fields, each employee's department (for example fire and police) and salary. You have to write a MapReduce program to calculate the median and standard deviation of the salary for each department.
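
The mapper below assumes a comma-separated record with nine fields, where the department name is at index 3 and the salary at index 7. A hypothetical input line (the exact column layout of your file may differ) could look like this:

1001,JOHN DOE,FIREFIGHTER,fire,F,Salary,40,97440.00,2023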

Steps to implement

Any Hadoop MapReduce program generally has three parts: a Driver, a Mapper, and a Reducer. We will implement the Mapper and Reducer first and finish with the Driver, which configures the job and wires everything together.

Create a Mapper class

Step 1 – Create a file named StdMapper.java and import all the required libraries

import java.io.IOException;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

Step 2 – Create the StdMapper class

In the mapper class, we read the data one record at a time. Each record is split on commas; we pick out the department name and the salary by their field indexes and write them as the mapper output, which will eventually reach the reducer as its input.

public class StdMapper extends Mapper<Object, Text, Text, DoubleWritable> {

    private Text departmentName = new Text();
    private DoubleWritable salaryWritable = new DoubleWritable(0);

    @Override
    public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
        // Each input value is one comma-separated record
        String[] field = value.toString().split(",", -1);
        // Expect 9 fields: department name at index 3, salary at index 7
        if (field.length == 9 && field[7].length() > 0) {
            double salary = Double.parseDouble(field[7]);
            departmentName.set(field[3]);
            salaryWritable.set(salary);
            // Emit (department, salary) as the mapper output
            context.write(departmentName, salaryWritable);
        }
    }
}

Create a Reducer class

Step 1 – Create a file named StdReducer.java and import all the required libraries

import java.io.IOException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

Step 2 – Create the StdReducer class

In the reduce phase, we read the mapper output: for each department we receive the list of salaries emitted for it. We calculate the median and the standard deviation of that list and write both to the output file.

public class StdReducer extends Reducer<Text, DoubleWritable, Text, Text> {

    private List<Double> list = new ArrayList<Double>();

    @Override
    public void reduce(Text key, Iterable<DoubleWritable> values, Context context) throws IOException, InterruptedException {
        double sum = 0;
        double count = 0;
        list.clear();
        // Collect all salaries for this department and accumulate their sum
        for (DoubleWritable doubleWritable : values) {
            sum = sum + doubleWritable.get();
            count = count + 1;
            list.add(doubleWritable.get());
        }
        // Median: middle value of the sorted list
        // (average of the two middle values when the count is even)
        Collections.sort(list);
        int length = list.size();
        double median;
        if (length % 2 == 0) {
            median = (list.get((length / 2) - 1) + list.get(length / 2)) / 2;
        } else {
            median = list.get(length / 2);
        }
        // Sample standard deviation: square root of the sum of squared
        // deviations from the mean, divided by (n - 1)
        double mean = sum / count;
        double sumOfSquares = 0;
        for (double salary : list) {
            sumOfSquares += (salary - mean) * (salary - mean);
        }
        double stdDev = Math.sqrt(sumOfSquares / (count - 1));
        context.write(key, new Text(median + " " + stdDev));
    }
}
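
To sanity-check the calculation outside Hadoop, here is a minimal standalone sketch of the same median and sample standard deviation logic; the class name and the sample values are made up for illustration and are not part of the original program.

import java.util.Arrays;
import java.util.List;

public class StatsCheck {
    public static void main(String[] args) {
        // Hypothetical salaries for one department, small enough to sort in memory
        List<Double> list = Arrays.asList(85000.0, 90000.0, 97000.0, 102000.0);
        int n = list.size();
        // The list is already sorted ascending; for an even count the median
        // is the average of the two middle values
        double median = (n % 2 == 0)
                ? (list.get(n / 2 - 1) + list.get(n / 2)) / 2
                : list.get(n / 2);
        double sum = 0;
        for (double v : list) {
            sum += v;
        }
        double mean = sum / n;
        double sumOfSquares = 0;
        for (double v : list) {
            sumOfSquares += (v - mean) * (v - mean);
        }
        // Sample standard deviation (divide by n - 1), matching the reducer above
        double stdDev = Math.sqrt(sumOfSquares / (n - 1));
        System.out.println(median + " " + stdDev);
    }
}

For these four values the program prints a median of 93500.0 and a standard deviation of roughly 7505.55.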

Create a Driver class

Step 1 – Create a file named DriverClass.java and import all the required libraries

The Driver class is used to configure and launch the job and to check that our MapReduce program works as expected. In it we specify which classes are the mapper and the reducer, what the input and output formats are, and what the key and value types are for the mapper output and the reducer output. Here is the actual implementation of the Driver class.

import java.io.File;
import org.apache.commons.io.FileUtils;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class DriverClass {
    public static void main(String[] args) throws Exception {

        // Specify the input and output path locations
        args = new String[] {
            "Input Path location",
            "output Path location"
        };

        /* delete the output directory before running the job */
        FileUtils.deleteDirectory(new File(args[1]));

        /* set the hadoop system parameter */
        System.setProperty("hadoop.home.dir", "Replace this string with hadoop home directory location");

        if (args.length != 2) {
            System.err.println("Please specify the input and output path");
            System.exit(-1);
        }

        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf);
        job.setJarByClass(DriverClass.class);
        job.setJobName("Standard Deviation Job");
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        job.setMapperClass(StdMapper.class);
        job.setReducerClass(StdReducer.class);
        // The mapper emits (Text, DoubleWritable); the reducer emits (Text, Text)
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(DoubleWritable.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
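
Assuming the three classes are compiled and packaged into a jar (the jar name and the paths below are placeholders, not taken from the original post), and that the hard-coded args array in the driver is removed so the command-line paths are used, the job could be launched with the standard hadoop jar command:

hadoop jar median-stddev.jar DriverClass /path/to/input /path/to/output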

Output

The output of the above program is shown below:

  • fire 91080 15035.98
  • police 96060 27624.39

Each output line contains the department followed by the median and the standard deviation of its salaries.
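
When the job runs against HDFS, the result is written to a part file inside the output directory; a typical way to inspect it (the output path here is a placeholder) is:

hdfs dfs -cat /path/to/output/part-r-00000

For a local run like the driver above, the same part-r-00000 file appears inside the local output directory.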

Conclusion

Here we have seen how to calculate the median and standard deviation using the MapReduce model. I hope this helps you understand how Hadoop MapReduce performs this kind of statistical analysis, and that you can build more complex data analysis on top of this implementation.
