Apache Pig

Pig is a high level programming language useful for analyzing large data sets.
Pig was a result of development effort at Yahoo!
Pig enables people to focus more on analyzing bulk data sets and to spend less time in writing Map-Reduce programs.
Similar to Pigs, who eat anything, the Pig programming language is designed to work upon any kind of data. That's why the name, Pig!
Pig consists of two components:
1. Pig Latin, which is a language
2. Runtime environment, for running Pig Latin programs.
A Pig Latin program consist of a series of operations or transformations which are applied to the input data to produce output.
These operations describe a data flow which is translated into an executable representation, by Pig execution environment.
Underneath, results of these transformations are series of MapReduce jobs which a programmer is unaware of.
So, in a way, Pig allows programmer to focus on data rather than the nature of execution.
PigLatin is a relatively stiffened language which uses familiar keywords from data processing e.g., Join, Group and Filter.

Apache Pig - Architecture:

To perform a particular task Programmers using Pig, programmers need to write a Pig script using the Pig Latin language, and execute them using any of the execution mechanisms (Grunt Shell, UDFs, Embedded).
After execution, these scripts will go through a series of transformations applied by the Pig Framework, to produce the desired output. Refer the following figure.
Internally, Apache Pig converts these scripts into a series of MapReduce jobs, and thus, it makes the programmer’s job easy.

Pig Architecture

Apache Pig Components

As shown in the figure, there are various components in the Apache Pig framework. Let us take a look at the major components.

Parser

Initially the Pig Scripts are handled by the Parser. It checks the syntax of the script, does type checking, and other miscellaneous checks.
The output of the parser will be a DAG (directed acyclic graph), which represents the Pig Latin statements and logical operators.
In the DAG, the logical operators of the script are represented as the nodes and the data flows are represented as edges.

Optimizer

The logical plan (DAG) is passed to the logical optimizer, which carries out the logical optimizations such as projection and pushdown.

Compiler

The compiler compiles the optimized logical plan into a series of MapReduce jobs.

Execution engine

Finally the MapReduce jobs are submitted to Hadoop in a sorted order. Finally, these MapReduce jobs are executed on Hadoop producing the desired results.

Execution modes:

Pig has two execution modes:

Local mode : In this mode, Pig runs in a single JVM and makes use of local file system. This mode is suitable only for analysis of small data sets using Pig
Map Reduce mode: In this mode, queries written in Pig Latin are translated into MapReduce jobs and are run on a Hadoop cluster (cluster may be pseudo or fully distributed). MapReduce mode with fully distributed cluster is useful of running Pig on large data sets.

How to execute a pig script in local mode?

Step 1:

$nano sample.txt

This is my first pig program

Step 2:

$nano smp.pig

A = LOAD 'sample.txt';

DUMP A;

Step 3:

$pig -x local smp.pig

Output:

This is my first pig program

This page is under construction

Google Sites

Report abuse