What is Pig Latin?

Apache Pig was developed by Yahoo in 2006 and is today an Apache open-source project and part of the Hadoop ecosystem.

The reason for its development was the complexity and difficulty of programming MapReduce jobs in Java for querying and analyzing large, distributed data sets, as well as the rigid coupling of such jobs to a fixed data flow.

Pig is an extensible scripting-language platform that includes query optimization and is easy to program. No Java knowledge is required to create MapReduce jobs and transformations.

Apache Pig - system architecture and components

Pig can be broken down into two main components: the Pig Latin script language (Pig Latin for short) and the runtime engine (see illustration). Pig Latin is the interface to the user and provides a procedural language for describing data flows, whose syntax and commands can be used to implement business logic. With Pig Latin, data can easily be loaded from files, stored in Hadoop, and queried.
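Such a data flow can be sketched as a short Pig Latin script. The file path, field names, and output directory below are illustrative assumptions, not part of the original article:

```pig
-- Load customer records from HDFS (path and schema are assumed for illustration)
customers = LOAD '/data/customers.csv' USING PigStorage(',')
            AS (first_name:chararray, last_name:chararray, age:int);

-- Keep only adult customers
adults = FILTER customers BY age >= 18;

-- Group by last name and count the members of each group
grouped = GROUP adults BY last_name;
counts  = FOREACH grouped GENERATE group AS last_name, COUNT(adults) AS n;

-- Write the result back to HDFS
STORE counts INTO '/output/adults_by_name' USING PigStorage(',');
```

Note that each statement only describes one step of the data flow; nothing is executed until a STORE (or DUMP) statement triggers the plan.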

Apache Pig - Script execution sequence

The runtime engine translates the scripts written in Pig Latin into MapReduce instructions and programs and optimizes their execution automatically. The runtime engine is also responsible for the interaction with the Hadoop infrastructure and returns the results of the execution to the user. The figure shows the subcomponents of the runtime environment:

Processing in Pig takes place in three phases:

Phase 1: Initialize the script and the execution plan

First, the data to be processed is loaded using Pig Latin and an execution script is created. After the user has written the script via the Grunt shell or the PigServer, it is handed over to the Pig runtime engine.

The script describes the logical relationship between the data to be queried and must be operationalized. To do this, a logical execution plan is created in the next step.

Phase 2: Validate the execution plan and create a MapReduce plan

The script is then broken down into individual instructions and passed to a parser to check its syntax and validity. This is done by the query parser, which structures the query for further processing.

This is followed by a semantic check and optimization of the plan. Once these steps have been completed successfully, a physical plan is created from the logical plan, which takes the storage of the data into account. From the physical plan, the MapReduce compiler generates an executable MapReduce plan, which is handed over to the execution engine.
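The plans produced in this phase can be inspected from the Grunt shell with the EXPLAIN operator, which prints the logical, physical, and MapReduce plan for a given relation. The path and field names in this sketch are assumed for illustration:

```pig
grunt> customers = LOAD '/data/customers.csv' USING PigStorage(',')
>>                 AS (name:chararray, age:int);
grunt> adults = FILTER customers BY age >= 18;
grunt> EXPLAIN adults;  -- prints the logical, physical, and MapReduce plans
```

This is also a convenient way to see which optimizations (e.g. pushing the filter toward the load) the runtime engine has applied.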

Phase 3: Execute the MapReduce plan and store the results

Depending on the data and the structure of the MapReduce plan, Hadoop then carries out the execution step by step or as a whole. The results are either returned to the user for display or saved in the HDFS file system for further processing or later analysis.
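These two result paths correspond to the DUMP and STORE operators in Pig Latin. A minimal sketch, with the input path and schema assumed for illustration:

```pig
-- Load some data (path and schema are assumed)
records = LOAD '/data/records.tsv' AS (id:int, value:chararray);

-- Return the result directly to the user's console ...
DUMP records;

-- ... or persist it in HDFS for later processing or analysis
STORE records INTO '/output/records' USING PigStorage(',');
```

DUMP is intended for interactive inspection of small results, while STORE writes the full result set back to the file system.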

Pig data model

The Pig data model defines four basic types, atom, tuple, bag, and map, in which data is stored and managed:

  • Atom: This type contains a single atomic value, e.g. "Mustermann".
  • Tuple: A tuple describes an ordered sequence of fields, each of which can hold any data type (e.g. string, integer, date). An example would be ("Max", "Mustermann", 35).
  • Bag: A bag is a collection of tuples with arbitrarily different and nested structures, such as {("Max"), ("Maxima", (35, 30))}.
  • Map: A map describes an associative array and has a key-value structure in which each key may occur only once. The key must be a chararray; the value can be of any data type. Example: [name#Max, age#35].
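All four types can be combined in a single schema declaration. The following sketch assumes a hypothetical input file and field names:

```pig
-- Each record contains an atom, a tuple, a bag, and a map
people = LOAD '/data/people.dat' AS (
    name:chararray,                            -- atom
    address:tuple(street:chararray, no:int),   -- tuple
    children:bag{t:tuple(cname:chararray)},    -- bag of tuples
    attributes:map[]                           -- map with chararray keys
);

-- Fields are accessed by name; map values are looked up with the # operator
names = FOREACH people GENERATE name, address.street, attributes#'age';
```

Declaring `map[]` without a value type leaves the values untyped (bytearray), which matches the rule that only the key type is fixed.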