How to find an inverted index of the sample using MapReduce

MapReduce is a programming model for running analyses over large amounts of data in parallel. Here we use MapReduce to generate an inverted index over sample data.

Let’s first understand what an inverted index is. An inverted index is like the index at the back of a book, which tells you which word appears on which page.

Say, for example, you want to find the word Hadoop in a book, but you don’t know which page it is on. You would have to start at the first page and keep reading until you reach the word. If the word is on the last page, you end up going through the whole book, which is tedious and very time-consuming. This is where the inverted index comes to the rescue. It stores keywords and their page numbers separately, so to find the keyword Hadoop you only need to scan the inverted index, which is far smaller than the book itself. You will learn that the word Hadoop is on page 375 and can go there directly without searching the first 374 pages. Isn’t that helpful and time-saving? So, let’s implement an inverted index now.
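Before moving to MapReduce, the book analogy can be sketched in plain Java. This is a minimal in-memory illustration, not part of the Hadoop job: the class name BookIndexSketch, its methods, and the three one-line "pages" are all made up for this example.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical helper class, used only to illustrate the idea.
class BookIndexSketch {

    // Build an inverted index: word -> sorted set of page numbers containing it.
    // pages.get(i) holds the text of page i + 1.
    static Map<String, Set<Integer>> buildIndex(List<String> pages) {
        Map<String, Set<Integer>> index = new TreeMap<>();
        for (int i = 0; i < pages.size(); i++) {
            for (String word : pages.get(i).toLowerCase().split("\\W+")) {
                if (!word.isEmpty()) {
                    index.computeIfAbsent(word, w -> new TreeSet<>()).add(i + 1);
                }
            }
        }
        return index;
    }

    // Look a word up in the index instead of scanning every page again.
    static Set<Integer> lookup(Map<String, Set<Integer>> index, String word) {
        return index.getOrDefault(word.toLowerCase(), Collections.emptySet());
    }

    public static void main(String[] args) {
        // Made-up page contents.
        List<String> pages = Arrays.asList(
                "MapReduce is a programming model",
                "An inverted index maps words to pages",
                "Hadoop runs MapReduce jobs");
        Map<String, Set<Integer>> index = buildIndex(pages);
        System.out.println("hadoop -> " + lookup(index, "Hadoop"));       // hadoop -> [3]
        System.out.println("mapreduce -> " + lookup(index, "MapReduce")); // mapreduce -> [1, 3]
    }
}
```

The index is built once, and every later lookup is a single map access. The MapReduce job in this article produces the same kind of structure, only distributed across many files.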

MapReduce diagram to find an inverted index

Figure: MapReduce architecture for finding the inverted index of a word

Explanation of the diagram

Let us walk through the diagram. Because we are generating an inverted index, we deal with multiple input files, so the job reads a directory containing several files. Each file is split into blocks, and the blocks go as input to different mappers. In the mapper, we emit the first name as the key and the name of the file containing it as the value; these pairs become the input of the reducer. In the reducer, we build the inverted index with the first name as the key and the list of pages (files) on which that first name appears as the value. The result is written to an output file.
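The flow described above can be imitated in plain Java without Hadoop. The sketch below is only a single-process simulation of the map, shuffle, and reduce steps, using three of the sample records introduced later in this article. The class MapReduceFlowSketch and its method names are assumptions made for illustration and are not Hadoop APIs.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical single-process imitation of the job's data flow.
class MapReduceFlowSketch {

    // Map step: emit one (name, fileName) pair per record of a file.
    static List<String[]> map(String fileName, List<String> records) {
        List<String[]> pairs = new ArrayList<>();
        for (String record : records) {
            String name = record.split(",", -1)[0];
            if (!name.isEmpty()) {
                pairs.add(new String[] {name, fileName});
            }
        }
        return pairs;
    }

    // Shuffle + reduce step: group the pairs by key and keep each file name once.
    static Map<String, Set<String>> reduce(List<String[]> pairs) {
        Map<String, Set<String>> index = new TreeMap<>();
        for (String[] pair : pairs) {
            index.computeIfAbsent(pair[0], k -> new TreeSet<>()).add(pair[1]);
        }
        return index;
    }

    // Run the map step on every file, then reduce all emitted pairs together.
    static Map<String, Set<String>> run(Map<String, List<String>> files) {
        List<String[]> allPairs = new ArrayList<>();
        for (Map.Entry<String, List<String>> file : files.entrySet()) {
            allPairs.addAll(map(file.getKey(), file.getValue()));
        }
        return reduce(allPairs);
    }

    public static void main(String[] args) {
        Map<String, List<String>> files = new LinkedHashMap<>();
        files.put("Page_2.csv", Arrays.asList(
                "finn,marie a,clerk iii,police,f,salary,,53076.00,"));
        files.put("Page_3.csv", Arrays.asList(
                "finn,sean p,firefighter,fire,f,salary,,87006.00,",
                "fitch,jordan m,law clerk,law,f,hourly,35,,14.51"));
        for (Map.Entry<String, Set<String>> entry : run(files).entrySet()) {
            System.out.println(entry.getKey() + "\t" + String.join(" ", entry.getValue()));
        }
    }
}
```

In a real job, Hadoop performs the grouping between the two steps itself and may run many mappers and reducers in parallel; here a sorted map stands in for that shuffle.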

Problem statement

You are given three sample files about employees working at the XYZ company. The goal of this task is to write a MapReduce program that builds the inverted index of a word (that is, the pages on which the word appears).

Input data Sample

We will work on sample data here. Let’s say we have 3 pages of data about employees:

Page_1.csv, Page_2.csv & Page_3.csv

Every page contains employee details as below:

Page_1.csv

  • dubert,tomasz ,paramedic i/c,fire,f,salary,,91080.00,
  • edwards,tim p,lieutenant,fire,f,salary,,114846.00,

Page_2.csv

  • elkins,eric j,sergeant,police,f,salary,,104628.00,
  • estrada,luis f,police officer,police,f,salary,,96060.00,
  • finn,marie a,clerk iii,police,f,salary,,53076.00,

Page_3.csv

  • finn,sean p,firefighter,fire,f,salary,,87006.00,
  • fitch,jordan m,law clerk,law,f,hourly,35,,14.51

Steps to implement

Now that we know the concept of an inverted index and the basic flow, let’s implement it. Any MapReduce program has 3 major parts: the Driver class, the Mapper class, and the Reducer class.

Mapper Class

Here we read each line and split it on commas. We retrieve the file name from the input split and the first name from the first field of the line. The mapper then writes its output with the name as the key and the file name as the value.

import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
public class InvertedIndexNameMapper extends Mapper < Object, Text, Text, Text > {
    // Reused across calls to avoid allocating new Text objects per record.
    private Text nameKey = new Text();
    private Text fileNameValue = new Text();
    @Override
    public void map(Object key, Text value, Context context) throws IOException,
    InterruptedException {
        String data = value.toString();
        // Limit -1 keeps trailing empty fields, so every record yields 9 fields.
        String[] field = data.split(",", -1);
        // Skip malformed lines and records with an empty name.
        if (field.length == 9 && field[0].length() > 0) {
            String firstName = field[0];
            // Name of the file this record was read from.
            String fileName = ((FileSplit) context.getInputSplit()).getPath().getName();
            nameKey.set(firstName);
            fileNameValue.set(fileName);
            // Emit (name, file name).
            context.write(nameKey, fileNameValue);
        }
    }
}
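The mapper accepts a line only when it splits into exactly 9 fields, which is why the limit argument of -1 matters for records that end in a comma. The short sketch below shows the difference on two of the sample lines; the class name SplitLimitSketch and its helper method are made up for this illustration.

```java
// Hypothetical demo class showing how the split limit affects the field count.
class SplitLimitSketch {

    // Number of fields produced by splitting on commas with the given limit.
    static int fieldCount(String line, int limit) {
        return line.split(",", limit).length;
    }

    public static void main(String[] args) {
        String endsWithComma = "dubert,tomasz ,paramedic i/c,fire,f,salary,,91080.00,";
        String endsWithValue = "fitch,jordan m,law clerk,law,f,hourly,35,,14.51";

        // Limit 0 (the same as split(",")) drops trailing empty strings.
        System.out.println(fieldCount(endsWithComma, 0));  // 8
        // Limit -1 keeps them, so the record has the expected 9 fields.
        System.out.println(fieldCount(endsWithComma, -1)); // 9
        // A line whose last field is non-empty has 9 fields either way.
        System.out.println(fieldCount(endsWithValue, 0));  // 9
        System.out.println(fieldCount(endsWithValue, -1)); // 9
    }
}
```

With a plain split(","), the salaried records would produce only 8 fields and fail the length check, so they would never reach the index.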

Reducer Class

After the mapper phase, the reducer reads the mapper output grouped by key. The key is the first name, and the values are the names of the files containing it. We join the distinct file names with spaces and write the result to the output file.

import java.io.IOException;
import java.util.LinkedHashSet;
import java.util.Set;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
public class InvertedIndexNameReducer extends Reducer < Text, Text, Text, Text > {
    private Text result = new Text();

    @Override
    public void reduce(Text key, Iterable < Text > values, Context context) throws IOException,
    InterruptedException {
        // Keep each file name once, in the order it is first seen.
        // Hadoop reuses the Text instance while iterating, so copy it with toString().
        Set < String > fileNames = new LinkedHashSet < > ();
        for (Text value: values) {
            fileNames.add(value.toString());
        }
        // Join with single spaces; skipped duplicates leave no stray separators.
        result.set(String.join(" ", fileNames));
        context.write(key, result);
    }
}

Driver Class

The Driver class configures the job so that the MapReduce program can run. It sets the job name, input path, output path, mapper class, reducer class, output key type, and output value type. Its exit code also tells us whether the job ran successfully.

import java.io.File;
import org.apache.commons.io.FileUtils;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
public class DriverInvertedIndex {
    public static void main(String[] args) throws Exception {

        /* when no arguments are passed (e.g. a local run), fall back to hard-coded paths */
        if (args.length == 0) {
            args = new String[] {
                "Replace this string with Input Path location",
                "Replace this string with output Path location"
            };
        }

        /* validate the arguments before they are used */
        if (args.length != 2) {
            System.err.println("Please specify the input and output path");
            System.exit(-1);
        }

        /* delete the output directory before running the job (local file system only) */
        FileUtils.deleteDirectory(new File(args[1]));

        /* set the hadoop system parameter */
        System.setProperty("hadoop.home.dir", "Replace this string with hadoop home directory location");

        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf);
        job.setJarByClass(DriverInvertedIndex.class);
        job.setJobName("Inverted_Index");

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        job.setMapperClass(InvertedIndexNameMapper.class);
        job.setReducerClass(InvertedIndexNameReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        /* exit with 0 if the job succeeds, 1 otherwise */
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Output and Explanation

The output of the above program is shown below:

  • dubert Page_1.csv
  • edwards Page_1.csv
  • elkins Page_2.csv
  • estrada Page_2.csv
  • finn Page_2.csv Page_3.csv
  • fitch Page_3.csv

Here the key is the first name and the values are the files in which that first name appears. In the first line, for example, dubert appears in Page_1.csv.

Conclusion

We have seen how to implement an inverted index using MapReduce and why it is useful. Other analyses can be built on the same problem statement. Hopefully this helps you grow further in the big data domain.
