
Creating and Implementing User Defined Functions in Apache Pig

Apache Pig is a key component of the Hadoop ecosystem, offering a high-level scripting language for data processing. One of its fundamental features is its library of built-in functions, such as SUM, AVG, and COUNT. Beyond these, Apache Pig also gives you the freedom to write user-defined functions (UDFs) of your own design. In this blog, we will delve into the creation and implementation of UDFs in Apache Pig.

Apache Pig lets users write User Defined Functions in a number of different programming languages, including Java, Python, Groovy, and JavaScript. Among them, however, Java is the language of choice for developing UDFs in Pig.

The main reason is that Apache Pig itself is implemented in Java. Because of this, Java UDFs enjoy complete support for all UDF features in Pig, while the other languages offer only partial support.

Types of User Defined Functions in Apache Pig

Apache Pig supports several types of user-defined functions. The types of UDFs are described below.

Filter Functions

Apache Pig uses filter functions to filter data based on a condition. They accept a tuple as input and return a Boolean result for each element in your dataset, which makes them ideal for keeping the relevant records and discarding the irrelevant ones.

For instance, you may use a filter function to keep entries that satisfy a given condition and reject entries with null values.

import java.io.IOException;
import org.apache.pig.FilterFunc;
import org.apache.pig.data.Tuple;

public class IsAdult extends FilterFunc {

    public Boolean exec(Tuple tuple) throws IOException {
        if (tuple == null || tuple.size() == 0) {
            return false;
        }

        Integer age = (Integer) tuple.get(0);
        return age != null && age >= 18;
    }
}

filteredData = FILTER data BY IsAdult(age);

In this example, the Java UDF IsAdult (which extends Pig's FilterFunc class) checks whether the age is 18 or above and returns a Boolean answer. Once the UDF's jar is registered, the FILTER statement uses this function to filter the data.

Eval Functions

Eval functions accept Pig values as input and produce Pig values as output. They are frequently used to perform data transformations or calculations within a FOREACH ... GENERATE statement. An eval function might carry out tasks like changing the case of strings, evaluating mathematical formulas, or performing even more intricate transformations.

import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

public class ToUpper extends EvalFunc<String> {

    public String exec(Tuple tuple) throws IOException {
        if (tuple == null || tuple.size() == 0) {
            return null;
        }

        return ((String) tuple.get(0)).toUpperCase();
    }
}

upperCaseData = FOREACH data GENERATE ToUpper(name);

In this example, the Java UDF ToUpper converts the input string to uppercase. The FOREACH ... GENERATE statement applies this function to every name in the data.

Algebraic Functions

Algebraic functions in Apache Pig are eval functions that can be computed incrementally: Pig applies them partially to subsets of an inner bag (a bag is a collection of tuples) on the map side and combines the partial results later. Because this lets Pig use the MapReduce combiner, algebraic functions can significantly enhance the performance of aggregations that can be expressed this way, such as SUM, COUNT, MIN, MAX, etc.

import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.DataBag;
import org.apache.pig.data.Tuple;

public class SumOfSquares extends EvalFunc<Double> {

    public Double exec(Tuple tuple) throws IOException {
        if (tuple == null || tuple.size() == 0) {
            return null;
        }

        DataBag bag = (DataBag) tuple.get(0);
        double sum = 0;
        for (Tuple t : bag) {
            double number = ((Number) t.get(0)).doubleValue();
            sum += number * number;
        }
        return sum;
    }
}

result = FOREACH data GENERATE group, SumOfSquares(values);

In this example, the Java UDF SumOfSquares computes the sum of squares for a bag of numbers, and a FOREACH statement applies it to each group of the dataset. To make such a function truly algebraic, the class would additionally implement Pig's Algebraic interface, which splits the computation into Initial, Intermediate, and Final stages.
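To see why the Initial/Intermediate/Final split matters, here is a plain-Java sketch of the decomposition for a sum of squares, with no Pig classes involved; the class and method names are hypothetical and only mirror the roles that Pig's Algebraic interface defines. Each record is squared once (Initial), partial sums are merged by the combiner (Intermediate), and the reducer produces the overall result (Final).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration of the three stages of an algebraic sum-of-squares.
public class AlgebraicSketch {
    // Initial: runs once per input record on the map side.
    public static double initial(double x) {
        return x * x;
    }

    // Intermediate: merges partial sums; this is the step the combiner can run.
    public static double intermed(List<Double> partials) {
        double sum = 0;
        for (double p : partials) sum += p;
        return sum;
    }

    // Final: produces the overall result from the combined partials.
    public static double finalResult(List<Double> partials) {
        return intermed(partials);
    }

    public static void main(String[] args) {
        List<Double> mapped = new ArrayList<>();
        for (double v : Arrays.asList(1.0, 2.0, 3.0)) mapped.add(initial(v));
        System.out.println(finalResult(mapped)); // 1 + 4 + 9 = 14.0
    }
}
```

Because intermed can be applied to any partition of the partial results and then reapplied to their outputs, Pig is free to combine map-side outputs early, which is exactly what makes algebraic functions fast on large datasets.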

Setting Up Your Development Environment

Here is a step-by-step guide to setting up your development environment.

  • First of all, make sure that Java, Hadoop, and Pig are installed on your system correctly.
  • In Eclipse, go to ‘File’ > ‘New’ > ‘Maven Project’. You might need to install the M2Eclipse plugin if the ‘Maven Project’ option isn’t shown.
  • Keep the default parameters selected in the ‘New Maven Project’ dialogue box and press ‘Next’.
  • On the next screen you can filter for particular project templates. Use the ‘maven-archetype-quickstart’ filter to find a simple Java application, then choose it and press ‘Next’.
  • Provide the ‘Group Id’ and ‘Artifact Id’ for your project. Typically, they are written in reverse domain name notation. Click ‘Finish’ to create the project.
  • Eclipse will generate a new Maven project with a file structure and a pom.xml file. You can declare the dependencies for Hadoop and Pig in the pom.xml file.
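As a sketch, the dependency section of the pom.xml might look like the following; the version numbers are illustrative, so pick the ones that match your Hadoop cluster and Pig installation.

```xml
<dependencies>
  <!-- Apache Pig (version is illustrative) -->
  <dependency>
    <groupId>org.apache.pig</groupId>
    <artifactId>pig</artifactId>
    <version>0.17.0</version>
  </dependency>
  <!-- Hadoop client libraries (version is illustrative) -->
  <dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-client</artifactId>
    <version>2.7.3</version>
  </dependency>
</dependencies>
```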

Writing a UDF in Java

Now, create a new class named EvalSample and copy the following content into the file.

import java.io.IOException; 
import org.apache.pig.EvalFunc; 
import org.apache.pig.data.Tuple;

public class EvalSample extends EvalFunc<String> { 

    public String exec(Tuple tuple) throws IOException {   
        if (tuple == null || tuple.size() == 0) {
            return null;      
        }

        String lowerStr = (String) tuple.get(0);      
        return lowerStr.toUpperCase();  
    } 
}

In the above code, the UDF takes a tuple whose first field is a string and returns the uppercase version of that string. If the tuple is null or empty, it simply returns null.

Using Your UDF in Pig

After writing this file, we need to compile the Java file and export it as a jar file. For example, suppose we have created a jar named evalSample.jar.
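One possible way to compile and package the class from the command line is sketched below; the Pig jar name and its location are assumptions that depend on your Pig version and installation, so adjust them accordingly.

```shell
# Compile against the Pig core jar (jar name/path are assumptions for your install)
javac -cp "$PIG_HOME/pig-0.17.0-core-h2.jar" EvalSample.java

# Package the compiled class into evalSample.jar
jar -cf evalSample.jar EvalSample.class
```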

Now, we will see how we can use this jar file in our further steps.

In order to register this jar, we first need to start Pig. The command to start Pig is shown below.

$ pig -x local

This starts Pig in local mode.

Now we will use the register command to register the jar we created.

Command:

REGISTER '$PIG_HOME/evalSample.jar';

Let’s now define an alias for the UDF using the DEFINE command, referring to the class name we compiled above.

DEFINE eval_sample EvalSample();

Now, we will use the defined function to perform the actual operation.

Suppose we have a student_detail.csv file whose columns are ID, Name, and Age, with the following content:

1,Raj,14
2,Vijay,16
3,Ajay,13
4,Riyan,16

Now, load the file into a relation named student_detail.

grunt> student_detail = LOAD 'hdfs://localhost:9000/pig_data/student_detail.csv' USING PigStorage(',') AS (id:int, name:chararray, age:int);

Let us now convert the names in student_detail to upper case using the UDF eval_sample.

grunt> upper_case = FOREACH student_detail GENERATE eval_sample(name);

Verify the result using the DUMP command.

grunt> DUMP upper_case;

The output will look like this:

(RAJ)
(VIJAY)
(AJAY)
(RIYAN)

Conclusion

This lesson has covered what UDFs are, the various kinds of UDFs, how to write UDFs in Java, and how to use UDFs with Apache Pig. It’s now up to you to start putting this information to use and writing your own UDFs. Please feel free to post any questions in the comments section below.
