How to find a distinct set of elements using MapReduce

MapReduce is a programming model used to perform different types of analysis on large amounts of data. Today we will see how to find the distinct list of words in any document. Consider the following scenario: Martin Luther King Jr. delivers a speech. If we can identify the distinct set of terms used in it, we can get a sense of the speech’s theme.

MapReduce diagram to find a distinct set of words in any document

The architecture of the MapReduce program is given below:

[Diagram: the MapReduce flow to find a distinct set of elements]

Explanation of the diagram

Let’s understand the diagram now. First of all, we read an input file, which can be huge, so we need to divide the data into smaller chunks. In Hadoop terms, such a chunk is called a block. Each block is processed by a different mapper function. The mapper emits every word as a key with a null value. During the shuffle phase, records with the same key are grouped together, so the reducer receives each distinct word exactly once and simply writes it to the output file.

Steps to implement

There are three kinds of classes in a MapReduce application: the driver class, the mapper class, and the reducer class. First, the mapper class takes the input in the form of key-value pairs. The output of the mapper is then passed to the reducer as input. The reducer performs the final step that produces the distinct set of elements in the given text.

Next, let’s understand the driver class. It helps us run the MapReduce program and answers questions such as:

  • What formats will be used for the input and output?
  • What will the output key and value types of the mapper and the reducer be?

Input data

Let’s see what the input for this exercise is:

India is a country and US is also country
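To make the flow concrete, here are the intermediate key-value pairs for this line. The mapper emits one (word, null) pair per token; the shuffle phase then groups the duplicated keys (“is” and “country”) so that the reducer sees each word only once:

 (India, null)
 (is, null)
 (a, null)
 (country, null)
 (and, null)
 (US, null)
 (is, null)
 (also, null)
 (country, null)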

Mapper class

The implementation of the mapper class is given below:

  • First, all the required libraries are imported.
  • Then a mapper class is created. Its input key is a LongWritable (the byte offset of the line) and its input value is the Text of the line; its output key is Text and its output value is NullWritable.
  • Next, we read the line from the value and split it on spaces into the fields array. We then iterate over the words in the array and write each one to the context, with the word as the key and null as the value. This is how the words are sent on to the reducer.

The mapper implementation is given below; we will see the reducer side after it.

 public static class Map extends Mapper<LongWritable, Text, Text, NullWritable> {
        // reusable Text object that holds the current word
        private Text word = new Text();

        @Override
        public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
            // read a line
            String line = value.toString();
            // split the line on spaces to get the list of words
            String[] fields = line.split(" ");
            for (int i = 0; i < fields.length; i++) {
                // emit each word as the key with a null value
                word.set(fields[i]);
                context.write(word, NullWritable.get());
            }
        }
    }

Reducer class

Here, the mapper output arrives at the reducer as input. Because the shuffle phase groups records by key, the reducer is called once for each distinct word. The reduce method simply writes the key to the output with a null value, so the final output is the unique list of words.

 public static class Reduce extends Reducer<Text, NullWritable, Text, NullWritable> {
        @Override
        public void reduce(Text key, Iterable<NullWritable> values, Context context) throws IOException, InterruptedException {
            // each distinct key (word) arrives here exactly once; write it out
            context.write(key, NullWritable.get());
        }
    }
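As a side note: because the reducer’s input and output types are identical, the same Reduce class could also be registered as a combiner, so duplicates are dropped on each mapper before the shuffle. This is an optional optimization, not part of the program above; a minimal sketch of the extra driver line:

 // optional: run the reducer logic as a combiner to deduplicate
 // words locally on each mapper before the data is shuffled
 job.setCombinerClass(Reduce.class);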

Driver class

In the driver class, we first import all the necessary libraries:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

import java.io.IOException;

Here, we have created a class with a main method. From the main method we call the driver through ToolRunner, which applies the configuration and creates a job that we will launch through the Hadoop command.

Now, let’s look at the driver implementation. It wires the mapper and reducer together, sets the input and output formats and paths, and submits the job.

public class DistinctValuesExample extends Configured implements Tool {

public int run(String[] args) throws Exception {
        Configuration conf = this.getConf();

        //input and output paths passed from cli
        Path hdfsInputPath = new Path(args[0]);
        Path hdfsOutputPath = new Path(args[1]);

        //create job
        Job job = Job.getInstance(conf, "Distinct Values Example 1");
        job.setJarByClass(DistinctValuesExample.class);

        //set mapper and reducer
        job.setMapperClass(Map.class);
        job.setReducerClass(Reduce.class);

        //set output key and value classes
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(NullWritable.class);

        //set input format and path
        FileInputFormat.addInputPath(job, hdfsInputPath);
        job.setInputFormatClass(TextInputFormat.class);

        //set output format and path
        FileOutputFormat.setOutputPath(job, hdfsOutputPath);
        job.setOutputFormatClass(TextOutputFormat.class);

        //run job return status
        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        int res = ToolRunner.run(new Configuration(), new DistinctValuesExample(), args);
        System.exit(res);
    }
}
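Once the class is compiled and packaged into a jar, the job can be submitted with the hadoop command. The jar name and HDFS paths below are placeholders for illustration; substitute your own:

 # submit the job (distinct.jar, /user/input and /user/output are example names)
 hadoop jar distinct.jar DistinctValuesExample /user/input /user/output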

Output and Explanation

The output of the above program is given below:

  • India
  • is
  • a
  • country
  • and
  • US
  • also

As we can see, there is no repetition of words: each word appears exactly once.

Conclusion

We have seen here how to use a Hadoop MapReduce program to find a distinct set of elements. Hopefully this helps you understand how it works internally, and how other problem statements can be handled with the same MapReduce approach.
