Friday, October 10, 2014

Q&A

What is PIG?

Pig is a platform for analyzing large data sets. It consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating those programs. Pig's infrastructure layer consists of a compiler that produces a sequence of MapReduce programs.


Pig Data Types?

  • Atom: An atom is any single value, such as a string or a number — 'Diego', for example. Pig’s atomic values are scalar types that appear in most programming languages — int, long, float, double, chararray and bytearray, for example.
  • Tuple: A tuple is a record that consists of a sequence of fields. Each field can be of any type — 'Diego', 'Gomez', or 6, for example. Think of a tuple as a row in a table.
  • Bag: A bag is a collection of non-unique tuples. The schema of the bag is flexible — each tuple in the collection can contain an arbitrary number of fields, and each field can be of any type.
  • Map: A map is a collection of key/value pairs. The key must be unique and must be a chararray; the value can be of any type.
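A quick sketch of how these types appear in Pig Latin; the file, relation, and field names here are hypothetical:

grunt> -- students.txt (tab-separated): Diego   21      [city#Lima]
grunt> students = LOAD 'students.txt'
>>     AS (name:chararray, age:int, props:map[]);   -- name and age are atoms, props is a map
grunt> by_age = GROUP students BY age;              -- each output tuple is (age, {bag of student tuples})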

What is the difference between logical and physical plans?

Pig goes through several steps when a Pig Latin script is converted into MapReduce jobs. After performing the basic parsing and semantic checking, it produces a logical plan, which describes the logical operators that Pig has to execute. After this, Pig produces a physical plan, which describes the physical operators needed to execute the script.
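You can see these plans for yourself with the EXPLAIN operator; a minimal sketch against a hypothetical alias:

grunt> a = LOAD 'data.txt' AS (f1:int, f2:int);
grunt> b = GROUP a BY f1;
grunt> EXPLAIN b;   -- prints the logical plan, the physical plan, and the MapReduce plan for b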

http://bluewatersql.wordpress.com/tag/skew-join/
Writing Eval, Filter, Load, and Store functions:

http://chimera.labs.oreilly.com/books/1234000001811/ch11.html

Does ‘ILLUSTRATE’ run MR job?

No, ILLUSTRATE does not run any MapReduce job; it works on a small internal sample of the data. On the console, ILLUSTRATE shows the output of each stage of the script rather than only the final output.

Is the keyword ‘DEFINE’ like a function name?

Yes, the keyword 'DEFINE' works like a function name (an alias). Once you have registered your jar, you have to define the function. Whatever logic you have written in your Java program is exported as a jar and registered by you with Pig. The compiler then checks for the function: when the function is not present in the built-in library, it looks into your registered jar.
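A minimal sketch of the register-then-define flow; the jar, class, and field names here are hypothetical (a fuller UDF walkthrough appears later in these notes):

grunt> REGISTER myudfs.jar;
grunt> DEFINE toUpper com.example.pig.ToUpper();       -- short alias for the UDF class in the jar
grunt> lines = LOAD 'input.txt' AS (text:chararray);
grunt> shouted = FOREACH lines GENERATE toUpper(text);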

Is the keyword ‘FUNCTIONAL’ a User Defined Function (UDF)?

No, the keyword 'FUNCTIONAL' is not a User Defined Function (UDF). When writing a UDF, we have to override certain functions, and the job is done with the help of those overridden functions. The keyword 'FUNCTIONAL' refers to a built-in (pre-defined) function, so it does not work as a UDF.

Why do we need MapReduce during Pig programming?

Pig is a high-level platform that makes many Hadoop data analysis tasks easier to express. The language we use on this platform is Pig Latin. A program written in Pig Latin is like a query written in SQL, and we need an execution engine to execute it. So, when a program is written in Pig Latin, the Pig compiler converts the program into MapReduce jobs. Here, MapReduce acts as the execution engine.

Are there any problems which can only be solved by MapReduce and cannot be solved by Pig? In which kinds of scenarios are MapReduce jobs more useful than Pig?

Let us take a scenario where we want to count the population in two cities. We have a data set and a sensor list of different cities, and we want to count the population of two cities, say Bangalore and Noida, using a single MapReduce job. To do that, the key for Bangalore has to be treated the same as the key for Noida, so that the population data of both cities reaches one reducer. The idea is to instruct the MapReduce program: whenever you find a city named 'Bangalore' or a city named 'Noida', create a common alias key for the two cities so that both are passed to the same reducer. For this, we have to write a custom partitioner.
In MapReduce, when you make 'city' the key, the framework treats every different city as a different key. Hence, we need a customized partitioner. MapReduce provides for this: you can write your own partitioner and specify that if the city is Bangalore or Noida, the same hash code is produced. However, we cannot create a custom partitioner in Pig. As Pig is not a framework of that kind, we cannot direct the execution engine to customize the partitioner. In such scenarios, MapReduce works better than Pig.

Does Pig give any warning when there is a type mismatch or missing field?

No, Pig does not show any warning when there is a missing field or a type mismatch, and such problems are difficult to find in the log file. If any mismatch is found, Pig substitutes a null value.

What co-group does in Pig?

COGROUP groups the data sets by a particular field. It groups the elements by their common field and then returns a set of records containing two separate bags: the first bag consists of the records of the first data set that share the common field value, and the second bag consists of the records of the second data set that share the same value.

Can we say cogroup is a group of more than 1 data set?

On a single data set, COGROUP simply groups that data set. In the case of more than one data set, COGROUP groups all the data sets and joins them based on the common field. Hence, we can say that COGROUP is both a grouping of more than one data set and a join of those data sets.
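A small sketch, with made-up relations (a fuller COGROUP example appears later in these notes):

grunt> owners = LOAD 'owners.txt' AS (owner:chararray, pet:chararray);
grunt> pets = LOAD 'pets.txt' AS (pet:chararray, legs:int);
grunt> both = COGROUP owners BY pet, pets BY pet;
grunt> -- each output tuple: (pet, {bag of matching owner tuples}, {bag of matching pet tuples})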

What does FOREACH do?

FOREACH is used to apply transformations to the data and to generate new data items. The name itself indicates that for each element of a data bag, the respective action will be performed.
Syntax :  FOREACH bagname GENERATE expression1, expression2, …..
The meaning of this statement is that the expressions mentioned after GENERATE will be applied to the current record of the data bag.
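For instance, a minimal sketch with hypothetical fields:

grunt> emp = LOAD 'emp.txt' AS (name:chararray, salary:double);
grunt> raised = FOREACH emp GENERATE name, salary * 1.1;   -- one output tuple per input tuple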

What is bag?

A bag is one of the data models present in Pig. It is an unordered collection of tuples with possible duplicates. Bags are used to store collections while grouping. The size of a bag is limited only by the size of the local disk: when a bag grows too large for memory, Pig spills it to local disk and keeps only part of the bag in memory, so the complete bag never needs to fit in memory. We represent bags with "{}".
Why Pig?
i) Ease of programming
ii) Optimization opportunities
iii) Extensibility
i) Ease of programming :-
It is trivial to achieve parallel execution of simple, “embarrassingly parallel” data analysis tasks. Complex tasks comprised of multiple interrelated data transformations are explicitly encoded as data flow sequences, making them easy to write, understand, and maintain.
ii) Optimization opportunities :-
The way in which tasks are encoded permits the system to optimize their execution automatically, allowing the user to focus on semantics rather than efficiency.
iii) Extensibility :-
Users can create their own functions to do special-purpose processing.
Advantages of Using Pig?
i) Pig can be treated as a higher-level language
  a) Increases programming productivity
  b) Decreases duplication of effort
  c) Opens the M/R programming system to more users
ii) Pig insulates against Hadoop complexity
  a) Hadoop version upgrades
  b) Job configuration tuning
Pig Features?
i) Data Flow Language
  The user specifies a sequence of steps, where each step performs only a single high-level data transformation.
ii) User Defined Functions (UDF)
iii) Debugging Environment
iv) Nested Data Model




Explain Pig Structure in Brief?
[Figure: Pig structure in brief]
Explain Logical Plan, Physical Plan, and MapReduce Plan?
[Figure: Pig logical plan]




Difference Between Pig and SQL?
  • Pig is procedural; SQL is declarative.
  • Pig has a nested relational data model; SQL has a flat relational data model.
  • In Pig the schema is optional; in SQL a schema is required.
  • Pig suits OLAP workloads; SQL supports both OLAP and OLTP workloads.
  • Pig offers limited query optimization; SQL offers significant opportunity for query optimization.
What Is the Difference Between MapReduce and Pig?
  • In MR you need to write the entire logic for operations like join, group, filter, sum, etc.; in Pig, built-in functions are available.
  • In MR the number of lines of code required is too high even for simple functionality; in Pig, 10 lines of Pig Latin are roughly equal to 200 lines of Java.
  • In MR the time and effort spent coding are high; what took about 4 hours to write in Java takes roughly 15 minutes in Pig Latin.
  • MR means lower productivity; Pig means higher productivity.

Wednesday, October 8, 2014

PIG basics


In how many ways can Pig Latin commands be executed?

Three.

Through the Grunt interactive shell, through a script file, and as embedded queries inside Java programs.

Which command do you use to enter the Grunt shell?

pig

Which command is used to run a Pig script?

pig myscript.pig

What does it mean to run Pig in Hadoop mode?

Running Pig in Hadoop mode means the compiled Pig program will physically execute in a Hadoop installation.


To enter the Grunt shell in local mode:

pig -x local

To enter the Grunt shell in Hadoop (MapReduce) mode:

pig -x mapreduce

If no argument is supplied, the default is MapReduce mode.


grunt> set debug on
grunt> set job.name 'my job'


The debug parameter states whether debug-level logging is turned on or off. The job.name parameter takes a single-quoted string and uses it as the Pig program's Hadoop job name.



The exec command executes a Pig script in a separate space from the Grunt shell.


The run command executes a Pig script in the same space as Grunt (also known as interactive mode).
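For example (the script name is hypothetical):

grunt> exec myscript.pig    -- aliases defined inside the script are not visible in the shell afterwards
grunt> run myscript.pig     -- aliases defined inside the script remain available in the shell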



grunt> log = LOAD 'tutorial/data/excite-small.log' AS (user, time, query);

The above command loads the data into an alias (variable) called log.



Pig parses your statements but doesn’t physically execute them until you use a DUMP or STORE command to ask for the results

The DUMP command prints out the content of an alias whereas the STORE command stores the content to a file.


The LIMIT command allows you to specify how many tuples (rows) to return. For example, to see four tuples of log:
grunt> lmt = LIMIT log 4;
grunt> DUMP lmt;

The above statements output only four tuples.




grunt> log = LOAD 'tutorial/data/excite-small.log'
>>     AS (user:chararray, time:long, query:chararray);
grunt> grpd = GROUP log BY user;
grunt> cntd = FOREACH grpd GENERATE group, COUNT(log);
grunt> STORE cntd INTO 'output';

The above script is equivalent to the following SQL statement:

SELECT user, COUNT(*) FROM excite-small.log GROUP BY user;


Difference between SQL and Pig Latin?

Pig Latin is a data processing language: you specify a series of data processing steps instead of a complex SQL query with clauses.

In SQL, we define a relation's schema before it is populated with data. Pig takes a much looser approach to schema; in fact, you don't need to use schemas at all if you don't want to, which may be the case when handling semistructured or unstructured data.
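For example, a small sketch reusing the excite-small.log file from above but without a schema; fields are then referenced by position:

grunt> raw = LOAD 'tutorial/data/excite-small.log';
grunt> queries = FOREACH raw GENERATE $2;   -- third field by position, default type bytearray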


You can expose Pig's schema for any relation with the DESCRIBE command:

grunt> DESCRIBE log;
log: {user: chararray,time: long,query: chararray}

A GROUP BY operation on the relation log generates the relation grpd. Based on the operation and the schema for log, Pig infers a schema for grpd:

grunt> DESCRIBE grpd;

grpd: {group: chararray,log: {user: chararray,time: long,query: chararray}}

In grpd, the field log is a bag with subfields user, time, and query.


grunt> DESCRIBE cntd;
cntd: {group: chararray,long}


ILLUSTRATE does a sample run to show a step-by-step process on how Pig would compute the relation.


EXPLAIN: EXPLAIN [-out path] [-brief] [-dot] [-param ...] [-param_file ...] alias;
Displays the execution plan used to compute a relation. When used with a script name, for example EXPLAIN myscript.pig, it shows the execution plan of the script.


In order for ILLUSTRATE to work, the load command in the first step must include a schema. The subsequent transformations must not include the LIMIT or SPLIT operators, the nested FOREACH operator, or the use of the map data type.
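For example, a minimal sketch using the log and grpd relations defined above with a schema:

grunt> ILLUSTRATE grpd;   -- shows a few sample input tuples and how each step transforms them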




Data types and schemas:

Fields default to bytearray unless specified otherwise.

int           Signed 32-bit integer
long          Signed 64-bit integer
float         32-bit floating point
double        64-bit floating point
chararray     Character array (string) in Unicode UTF-8
bytearray     Byte array (binary object)

Pig’s data model from the top down

Tuple:


(12.5,hello world,-2)
A tuple is an ordered set of fields. It’s most often used as a row in a relation. It’s represented by fields separated by commas, all enclosed by parentheses.    

Bag:

{(12.5,hello world,-2),(2.87,bye world,10)}
A bag is an unordered collection of tuples. A relation is a special kind of bag, sometimes called an outer bag. An inner bag is a bag that is a field within some complex type.
A bag is represented by tuples separated by commas, all enclosed by curly brackets.
Tuples in a bag aren’t required to have the same schema or even have the same number of fields. It’s a good idea to do this though, unless you’re handling semistructured or unstructured data.

Map:

[key#value]
A map is a set of key/value pairs. Keys must be unique and be a string (chararray). The value can be any type.
You reference fields inside maps through the pound operator instead of the dot operator. For a map named m, the value associated with key k is referenced through m#k.
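A small sketch of the # operator, with hypothetical relation and key names:

grunt> requests = LOAD 'requests.txt' AS (url:chararray, headers:map[chararray]);
grunt> html = FILTER requests BY headers#'content-type' == 'text/html';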


Users can define schemas for relations using the AS keyword with the LOAD, STREAM, and FOREACH operators.





Relational operators




UNION combines multiple relations together whereas SPLIT partitions a relation into multiple ones. An example will make it clear:
grunt> a = load 'A' using PigStorage(',') as (a1:int, a2:int, a3:int);
grunt> b = load 'B' using PigStorage(',') as (b1:int, b2:int, b3:int);
grunt> DUMP a;
(0,1,2)
(1,3,4)
grunt> DUMP b;
(0,5,2)
(1,7,8)
grunt> c = UNION a, b;
grunt> DUMP c;
(0,1,2)
(0,5,2)
(1,3,4)
(1,7,8)
grunt> SPLIT c INTO d IF $0 == 0, e IF $0 == 1;
grunt> DUMP d;
(0,1,2)
(0,5,2)
grunt> DUMP e;
(1,3,4)
(1,7,8)

You can use the DISTINCT operator to remove duplicates from a relation
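For example:

grunt> u = DISTINCT c;   -- c above has no duplicate tuples, so u is identical to c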



You can simulate SPLIT with multiple FILTER operators. The FILTER operator alone trims a relation down to only the tuples that pass a certain test:
grunt> f = FILTER c BY $1 > 3;
grunt> DUMP f;
(0,5,2)
(1,7,8)


SAMPLE is an operator that randomly samples tuples in a relation according to a specified percentage.
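For example (the tuples kept vary from run to run):

grunt> s = SAMPLE c 0.5;   -- keeps roughly 50% of the tuples in c, chosen at random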



grunt> g = GROUP c BY $2;
grunt> DUMP g;
(2,{(0,1,2),(0,5,2)})
(4,{(1,3,4)})
(8,{(1,7,8)})
grunt> DESCRIBE c;
c: {a1: int,a2: int,a3: int}
grunt> DESCRIBE g;
g: {group: int,c: {a1: int,a2: int,a3: int}}



The first field of GROUP's output relation is always named group.


By grouping on ALL, one can put all tuples in a relation into one big bag. This is useful for aggregate analysis on relations, as functions work on bags but not relations. For example:


grunt> h = GROUP c ALL;
grunt> DUMP h;
(all,{(0,1,2),(0,5,2),(1,3,4),(1,7,8)})
grunt> i = FOREACH h GENERATE COUNT($1);
grunt> dump i;
(4L)





Now that you’re comfortable with GROUP, we can look at COGROUP, which groups together tuples from multiple relations. It functions much like a join. For example, let’s cogroup a and b on the third column.
grunt> j = COGROUP a BY $2, b BY $2;
grunt> DUMP j;
(2,{(0,1,2)},{(0,5,2)})
(4,{(1,3,4)},{})
(8,{},{(1,7,8)})
grunt> DESCRIBE j;
j: {group: int,a: {a1: int,a2: int,a3: int},b: {b1: int,b2: int,b3: int}}




Whereas GROUP always generates two fields in its output, COGROUP always generates three (more if cogrouping more than two relations). The first field is the group key, whereas the second and third fields are bags. These bags hold tuples from the cogrouping relations that match the grouping key.


If a grouping key matches only tuples from one relation but not the other, then the field corresponding to the nonmatching relation will have an empty bag. To ignore group keys that don't exist for a relation, one can add the INNER keyword to the operation, like this:



grunt> j = COGROUP a BY $2, b BY $2 INNER;
grunt> dump j;
(2,{(0,1,2)},{(0,5,2)})
(8,{},{(1,7,8)})
grunt> j = COGROUP a BY $2 INNER, b BY $2 INNER;
grunt> dump j;
(2,{(0,1,2)},{(0,5,2)})









grunt> k = FOREACH c GENERATE a2, a2 * a3;
grunt> DUMP k;
(1,2)
(5,10)
(3,12)
(7,56)

FOREACH is always followed by an alias (name given to a relation) followed by the keyword GENERATE. The expressions after GENERATE control the output.






FOREACH has special projection syntax, and a richer set of functions. For example,
applying nested projection to have each bag retain only the first field:
grunt> k = FOREACH g GENERATE group, c.a1;
grunt> DUMP k;
(2,{(0),(0)})
(4,{(1)})
(8,{(1)})
To get two fields in each bag:
grunt> k = FOREACH g GENERATE group, c.(a1,a2);
grunt> DUMP k;

(2,{(0,1),(0,5)})



(4,{(1,3)})
(8,{(1,7)})
Most built-in Pig functions are geared toward working on bags.
grunt> k = FOREACH g GENERATE group, COUNT(c);
grunt> DUMP k;
(2,2L)
(4,1L)
(8,1L)





The FLATTEN function is designed to flatten nested data types. Syntactically it looks like a function, such as COUNT and AVG, but it’s a special operator as it can change the structure of the output created by FOREACH...GENERATE.

For example, consider a relation with tuples of the form (a, (b, c)). The statement FOREACH... GENERATE $0, FLATTEN($1) will create one output tuple of the form (a, b, c) for each input tuple.


If a bag contains N tuples, flattening it will remove the bag and create N tuples in its place.
grunt> k = FOREACH g GENERATE group, FLATTEN(c);
grunt> DUMP k;
(2,0,1,2)
(2,0,5,2)
(4,1,3,4)
(8,1,7,8)
grunt> DESCRIBE k;
k: {group: int,c::a1: int,c::a2: int,c::a3: int}




Another way to understand FLATTEN is to see that it produces a cross-product. This view is helpful when we use FLATTEN multiple times within a single FOREACH statement. For example, let's say we've somehow created a relation l:
grunt> dump l;
(1,{(1,2)},{(3)})
(4,{(4,2),(4,3)},{(6),(9)})
(8,{(8,3),(8,4)},{(9)})
grunt> describe l;
l: {group: int,a: {a1: int,a2: int},b: {b1: int}}
The following statement that flattens two bags outputs all combinations of those two bags for each tuple:
grunt> m = FOREACH l GENERATE group, FLATTEN(a), FLATTEN(b);
grunt> dump m;


(1,1,2,3)
(4,4,2,6)
(4,4,2,9)
(4,4,3,6)
(4,4,3,9)
(8,8,3,9)
(8,8,4,9)




FOREACH also has a nested block form for working with inner bags. Let's assume you have a relation (say l) and one of its fields (say a) is a bag; a FOREACH with a nested block has this form:
alias = FOREACH l {
tmp1 = operation on a;
[more operations...]
GENERATE expr [, expr...]
}
The GENERATE statement must always be present at the end of the nested block. It will create some output for each tuple in l. The operations in the nested block can create new relations based on the bag a. For example, we can trim down the a bag in each element of l’s tuple.




grunt> m = FOREACH l {
tmp1 = FILTER a BY a1 >= a2;
GENERATE group, tmp1, b;
};
grunt> DUMP m;
(1,{},{(3)})
(4,{(4,2),(4,3)},{(6),(9)})
(8,{(8,3),(8,4)},{(9)})
You can have multiple statements in the nested block. Each one can even operate on a different bag.
grunt> m = FOREACH l {
tmp1 = FILTER a BY a1 >= a2;
tmp2 = FILTER b by $0 < 7;
GENERATE group, tmp1, tmp2;
};
grunt> DUMP m;
(1,{},{(3)})
(4,{(4,2),(4,3)},{(6)})
(8,{(8,3),(8,4)},{})









UDF - User Defined Functions:


UDFs are written in Java, and filter functions are all subclasses of FilterFunc, which
itself is a subclass of EvalFunc.

Example of a filter UDF:


package com.hadoopbook.pig;

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.pig.FilterFunc;
import org.apache.pig.backend.executionengine.ExecException;
import org.apache.pig.data.DataType;
import org.apache.pig.data.Tuple;
import org.apache.pig.impl.logicalLayer.FrontendException;

public class IsGoodQuality extends FilterFunc {

    @Override
    public Boolean exec(Tuple tuple) throws IOException {
        if (tuple == null || tuple.size() == 0) {
            return false;
        }
        try {
            Object object = tuple.get(0);
            if (object == null) {
                return false;
            }
            int i = (Integer) object;
            // Quality codes 0, 1, 4, 5 and 9 count as good readings
            return i == 0 || i == 1 || i == 4 || i == 5 || i == 9;
        } catch (ExecException e) {
            throw new IOException(e);
        }
    }
}


After the UDF class is created, package it into a jar and then register it with Pig as follows:

grunt> REGISTER pig.jar;


Finally, we can invoke the function:
grunt> filtered_records = FILTER records BY temperature != 9999 AND
>> com.hadoopbook.pig.IsGoodQuality(quality);



We can’t register our package with Pig, but we can shorten the function name by defining
an alias, using the DEFINE operator:
grunt> DEFINE isGood com.hadoopbook.pig.IsGoodQuality();
grunt> filtered_records = FILTER records BY temperature != 9999 AND isGood(quality);



Leveraging types:



The filter works when the quality field is declared to be of type int, but if the type
information is absent, then the UDF fails! This happens because the field is the default
type, bytearray, represented by the DataByteArray class. Because DataByteArray is not
an Integer, the cast fails.
The obvious way to fix this is to convert the field to an integer in the exec() method.
However, there is a better way, which is to tell Pig the types of the fields that the function
expects. The getArgToFuncMapping() method on EvalFunc is provided for precisely this
reason. We can override it to tell Pig that the first field should be an integer:
@Override
public List<FuncSpec> getArgToFuncMapping() throws FrontendException {
    List<FuncSpec> funcSpecs = new ArrayList<FuncSpec>();
    funcSpecs.add(new FuncSpec(this.getClass().getName(),
        new Schema(new Schema.FieldSchema(null, DataType.INTEGER))));
    return funcSpecs;
}





A Load UDF
We’ll demonstrate a custom load function that can read plain-text column ranges as
fields, very much like the Unix cut command. It is used as follows:
grunt> records = LOAD 'input/ncdc/micro/sample.txt'
>> USING com.hadoopbook.pig.CutLoadFunc('16-19,88-92,93-93')
>> AS (year:int, temperature:int, quality:int);
grunt> DUMP records;
(1950,0,1)
(1950,22,1)
(1950,-11,1)
(1949,111,1)
(1949,78,1)


The string passed to CutLoadFunc is the column specification; each comma-separated range defines a field, which is assigned a name and type in the AS clause. Let's examine the implementation of CutLoadFunc:



public class CutLoadFunc extends Utf8StorageConverter implements LoadFunc {

    private static final Log LOG = LogFactory.getLog(CutLoadFunc.class);

    private static final Charset UTF8 = Charset.forName("UTF-8");
    private static final byte RECORD_DELIMITER = (byte) '\n';

    private TupleFactory tupleFactory = TupleFactory.getInstance();
    private BufferedPositionedInputStream in;
    private long end = Long.MAX_VALUE;
    private List<Range> ranges;

    public CutLoadFunc(String cutPattern) {
        ranges = Range.parse(cutPattern);
    }

    @Override
    public void bindTo(String fileName, BufferedPositionedInputStream in,
            long offset, long end) throws IOException {
        this.in = in;
        this.end = end;
        // Throw away the first (partial) record - it will be picked up by another
        // instance
        if (offset != 0) {
            getNext();
        }
    }

    @Override
    public Tuple getNext() throws IOException {
        if (in == null || in.getPosition() > end) {
            return null;
        }
        String line;
        while ((line = in.readLine(UTF8, RECORD_DELIMITER)) != null) {
            Tuple tuple = tupleFactory.newTuple(ranges.size());
            for (int i = 0; i < ranges.size(); i++) {
                try {
                    Range range = ranges.get(i);
                    if (range.getEnd() > line.length()) {
                        LOG.warn(String.format(
                            "Range end (%s) is longer than line length (%s)",
                            range.getEnd(), line.length()));
                        continue;
                    }
                    tuple.set(i, new DataByteArray(range.getSubstring(line)));
                } catch (ExecException e) {
                    throw new IOException(e);
                }
            }
            return tuple;
        }
        return null;
    }

    @Override
    public void fieldsToRead(Schema schema) {
        // Can't use this information to optimize, so ignore it
    }

    @Override
    public Schema determineSchema(String fileName, ExecType execType,
            DataStorage storage) throws IOException {
        // Cannot determine schema in general
        return null;
    }
}



STREAM
The STREAM operator allows you to transform data in a relation using an external
program or script. It is named by analogy with Hadoop Streaming, which provides a
similar capability for MapReduce.
STREAM can use built-in commands with arguments. Here is an example that uses the
Unix cut command to extract the second field of each tuple in A. Note that the command
and its arguments are enclosed in backticks:
grunt> C = STREAM A THROUGH `cut -f 2`;
grunt> DUMP C;
(cherry)
(apple)
(banana)
(apple)




Parallelism



When running in Hadoop mode, you need to tell Pig how many reducers you want for
each job. You do this using a PARALLEL clause for operators that run in the reduce
phase, which includes all the grouping and joining operators (GROUP, COGROUP,
JOIN, CROSS), as well as DISTINCT and ORDER. By default the number of reducers
is one (just like for MapReduce), so it is important to set the degree of parallelism when
running on a large dataset. The following line sets the number of reducers to 30 for the
GROUP:

grouped_records = GROUP records BY year PARALLEL 30;
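
Depending on the Pig version, you can also set a default degree of parallelism once instead of repeating PARALLEL on every operator; a minimal sketch:

grunt> SET default_parallel 30;
grunt> grouped_records = GROUP records BY year;   -- reduce-side operators now default to 30 reducers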





Parameter Substitution

If you have a Pig script that you run on a regular basis, then it’s quite common to want
to be able to run the same script with different parameters. For example, a script that
runs daily may use the date to determine which input files it runs over. Pig supports
parameter substitution, where parameters in the script are substituted with values
supplied at runtime. Parameters are denoted by identifiers prefixed with a $ character;
for example $input and $output, used in the following script to specify the input and
output paths:
-- max_temp_param.pig
records = LOAD '$input' AS (year:chararray, temperature:int, quality:int);
filtered_records = FILTER records BY temperature != 9999 AND
(quality == 0 OR quality == 1 OR quality == 4 OR quality == 5 OR quality == 9);
grouped_records = GROUP filtered_records BY year;
max_temp = FOREACH grouped_records GENERATE group,
MAX(filtered_records.temperature);
STORE max_temp into '$output';
Parameters can be specified when launching Pig, using the -param option, one for each
parameter:
% pig \
-param input=/user/tom/input/ncdc/micro-tab/sample.txt \
-param output=/tmp/out \
src/main/ch11/pig/max_temp_param.pig
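
Parameters can also be kept in a file and passed with the -param_file option; a small sketch, where params.txt and its contents are hypothetical:

% cat params.txt
input=/user/tom/input/ncdc/micro-tab/sample.txt
output=/tmp/out
% pig -param_file params.txt src/main/ch11/pig/max_temp_param.pig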