Getting Started with Apache Pig

Authored by Nirmal S on Feb 17, 2017 in Technology

Apache Pig makes running operations against data in Hadoop far easier than coding them in Java, which is the most common way to work with Hadoop data without Pig. (Hadoop itself is written in Java.) Pig provides a simple, SQL-like syntax for expressing MapReduce operations: parsing data sets, filtering them, changing their format, joining sets of data, and so on. This kind of work is also called ETL (extract, transform, load).
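To give a sense of that brevity, here is a minimal sketch of the classic word count in Pig Latin (the input file lines.txt is a hypothetical example); the equivalent hand-written Java MapReduce job typically runs to dozens of lines:

lines = LOAD 'lines.txt' AS (line:chararray);
words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
grouped = GROUP words BY word;
counts = FOREACH grouped GENERATE group, COUNT(words);
dump counts;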

Below are some examples of how to use Pig to run MapReduce operations.

Run ETL on Weather Data

Suppose we have this weather data.  These are temperature readings for early 2017 from weather stations around San Francisco.  The comma-delimited data looks like this:

STATION,STATION_NAME,DATE,TAVG,TMAX,TMIN,TOBS

GHCND:USC00041967,CONCORD WASTEWATER PLANT CA US,20170101,-9999,55,42,-9999

Download this data to the file /root/Downloads/888069.csv.

Now we start Pig. (You need to have installed Hadoop first in order to use Pig.) We run it with the -x local option in this example to use local mode rather than cluster mode. That way we can read local files without first having to copy them into HDFS (the Hadoop Distributed File System).

pig -x local

Then we load the weather data into a relation named weather using the syntax below. PigStorage(',') splits each line on the commas, and the AS clause assigns a type to each field. We load DATE as a chararray, which keeps the regular-expression filtering below simple. (Recent Pig versions do offer a datetime type, populated via the ToDate built-in, but we do not need it here.)

weather = LOAD '/root/Downloads/888069.csv' USING PigStorage(',') AS (STATION:chararray,STATION_NAME:chararray,DATE:chararray,TAVG:int,TMAX:int,TMIN:int,TOBS:int);

Then to look at the data we do the following:

dump weather;

(GHCND:USC00047414,RICHMOND CA US,20170201,-9999,73,43,60)
(GHCND:USC00047414,RICHMOND CA US,20170202,-9999,59,47,57)

It has this structure, which we obtain using the word describe:

describe weather;

weather: {STATION: chararray,STATION_NAME: chararray,DATE: chararray,TAVG: int,TMAX: int,TMIN: int,TOBS: int}

Each record is a tuple, meaning an ordered set of fields separated by commas. Note that Pig is said to be "lazy": it does not actually retrieve the data in the LOAD step, only when you dump it or otherwise use it. So the dump is the point at which you would see any errors from the previous step, such as a misspelled input filename.
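To see the lazy evaluation for yourself, load a path that does not exist (the filename below is deliberately wrong):

bad = LOAD '/root/Downloads/no-such-file.csv' USING PigStorage(',') AS (line:chararray);

dump bad;

The LOAD itself succeeds silently; only the dump fails, with an error complaining that the input path does not exist.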

Now we can filter this data, pulling out the January and February readings using a regular expression.

January = FILTER weather BY (DATE matches '201701.*');

February = FILTER weather BY (DATE matches '201702.*');

January then looks like this:

(GHCND:USC00047414,RICHMOND CA US,20170123,-9999,-9999,42,48)

(GHCND:USC00047414,RICHMOND CA US,20170124,-9999,-9999,43,53)

Now we pull out just the four fields we are interested in.

janTemp = FOREACH January GENERATE (STATION_NAME,DATE,TMAX,TMIN);

febTemp = FOREACH February GENERATE (STATION_NAME,DATE,TMAX,TMIN);

Notice that we no longer have a flat tuple. Because of the parentheses in the GENERATE clause, the four fields have been wrapped into a tuple nested inside the outer tuple, which you can see by the two parentheses on either end. (A bag, by contrast, is an unordered collection of tuples, and Pig prints it with braces.)

((LAS TRAMPAS CALIFORNIA CA US,20170122,50,38))
((LAS TRAMPAS CALIFORNIA CA US,20170123,49,36))

You can also confirm this by looking at the structure, where describe reports a nested tuple.

describe janTemp;

janTemp: {org.apache.pig.builtin.totuple_DATE_669: (STATION_NAME: chararray,DATE: chararray,TMAX: int,TMIN: int)}
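As an aside, this nesting could have been avoided in the first place by dropping the parentheses from the GENERATE clause, so the four fields stay flat (shown here for January; February works the same way):

janFlat = FOREACH January GENERATE STATION_NAME, DATE, TMAX, TMIN;

Since we already have the nested form, though, the next step shows how to undo it.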

We cannot run the filter operation on that nested structure directly. (We could if it were a map structure, which has keys and values.) So we flatten it back out with FLATTEN, which works much like flatten operations in other languages: it takes the object down one level of nesting, so instead of a tuple with a tuple inside it, we get a plain tuple of fields. We use $0 to refer to the first field of each record, which is the nested tuple.

flatJan = FOREACH janTemp GENERATE flatten($0);

flatFeb = FOREACH febTemp GENERATE flatten($0);

Now it looks like a regular tuple:

(RICHMOND CA US,20170201,73,43)

(RICHMOND CA US,20170202,59,47)
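With the relations flattened, a FILTER now works against them. As a final example (the 70-degree threshold is an arbitrary choice), we can pull out the warm February days using a positional reference, since $2 is TMAX in the flattened schema:

warmFeb = FILTER flatFeb BY $2 > 70;

dump warmFeb;

Given the sample rows above, the result would include days such as (RICHMOND CA US,20170201,73,43).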
