Getting Started with Apache Pig

December 1, 2023

Technology

Apache Pig makes running operations against data in Hadoop far easier than coding them in Java, which is the most common way to work with Hadoop data without Pig. Hadoop itself is written in Java. Pig provides a simple, SQL-like syntax for writing MapReduce operations. These are operations that parse data sets, filter them, change their format, join sets of data, and so on. This kind of work is often called ETL (extract, transform, load).

Here we show some examples of how to use Pig to do MapReduce.

Run ETL on Weather Data

Suppose we have this weather data. These are temperature readings for early 2017 from weather stations around San Francisco. The comma-delimited data looks like this:

    STATION,STATION_NAME,DATE,TAVG,TMAX,TMIN,TOBS
    GHCND:USC00041967,CONCORD WASTEWATER PLANT CA US,20170101,-9999,55,42,-9999

Download this data to the file /root/Downloads/888069.csv.

Now we start Pig. You need to have installed Hadoop first in order to use Pig. We run it with the "local" option in this example so that it runs in local mode rather than cluster mode. That way we can read local files without first copying them to HDFS (the Hadoop Distributed File System).

    pig -x local

Then we load the weather data into a relation named weather using this syntax. It splits each line on the comma and assigns a type to each field. We read the date as a chararray here. Pig 0.11 and later also provide a datetime type, but the yyyyMMdd strings in this file are easy to filter as text.

    weather = LOAD '/root/Downloads/888069.csv' USING PigStorage(',')
        AS (STATION:chararray,STATION_NAME:chararray,DATE:chararray,TAVG:int,TMAX:int,TMIN:int,TOBS:int);

Then to look at the data we do the following:

    dump weather;

    (GHCND:USC00047414,RICHMOND CA US,20170201,-9999,73,43,60)
    (GHCND:USC00047414,RICHMOND CA US,20170202,-9999,59,47,57)

It has this structure, which we obtain using the word describe:

    describe weather;

    weather: {STATION: chararray,STATION_NAME: chararray,DATE: chararray,TAVG: int,TMAX: int,TMIN: int,TOBS: int}

Each record is a tuple, meaning an ordered set of fields. Note that Pig is said to be "lazy." That means it does not retrieve the data in the LOAD step. It only does that when you dump the relation or otherwise use it. So this is the point at which you would see any errors from the previous step, such as a misspelled input filename.

Now we can filter this data, pulling out the January and February weather using a regular expression.

    January = FILTER weather BY (DATE matches '201701.*');
    February = FILTER weather BY (DATE matches '201702.*');

The January rows then look like this:

    (GHCND:USC00047414,RICHMOND CA US,20170123,-9999,-9999,42,48)
    (GHCND:USC00047414,RICHMOND CA US,20170124,-9999,-9999,43,53)

Now we pull out just the 4 fields we are interested in.

    janTemp = FOREACH January GENERATE (STATION_NAME,DATE,TMAX,TMIN);
    febTemp = FOREACH February GENERATE (STATION_NAME,DATE,TMAX,TMIN);

Notice that each record is no longer a flat tuple. Because the field list is wrapped in parentheses, we now have a tuple nested inside a tuple, which you can see from the two parentheses on either end. (This is not a bag. A bag is a collection of tuples, which Pig prints in curly braces.)

    ((LAS TRAMPAS CALIFORNIA CA US,20170122,50,38))
    ((LAS TRAMPAS CALIFORNIA CA US,20170123,49,36))

You can also confirm this by looking at the structure, where the name of the single field contains totuple:

    describe janTemp;

    janTemp: {org.apache.pig.builtin.totuple_DATE_669: (STATION_NAME: chararray,DATE: chararray,TMAX: int,TMIN: int)}

The four fields are now one level down, so we can no longer filter on them by their top-level names. A map, by contrast, exposes its values by key. So we flatten the relation back out, which works much like a flatten function in other languages. It takes the object down one level in nesting. So instead of a tuple with a tuple inside it, we just have a tuple of elements. We use $0 to refer to the first field, which here is the inner tuple.

    flatJan = FOREACH janTemp GENERATE flatten($0);
    flatFeb = FOREACH febTemp GENERATE flatten($0);

Now each record looks like a regular tuple. These are the February rows:

    (RICHMOND CA US,20170201,73,43)
    (RICHMOND CA US,20170202,59,47)
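
The steps above can also be collected into one short script. The following is a minimal sketch rather than a definitive version. It assumes the same file path and schema used earlier. The alias names january and janValid are illustrative, and treating -9999 as a marker for a missing reading is an assumption based on how the sample rows look. Listing the fields in GENERATE without the enclosing parentheses should produce a flat tuple directly, so the flatten step is not needed in this variant.

```pig
-- Load the CSV with the field names and types used earlier in the article.
weather = LOAD '/root/Downloads/888069.csv' USING PigStorage(',')
    AS (STATION:chararray, STATION_NAME:chararray, DATE:chararray,
        TAVG:int, TMAX:int, TMIN:int, TOBS:int);

-- Keep only rows whose date starts with 201701 (January 2017).
-- The CSV header row does not match the pattern, so it is dropped here too.
january = FILTER weather BY DATE matches '201701.*';

-- No parentheses around the field list, so each record stays a flat tuple.
janTemp = FOREACH january GENERATE STATION_NAME, DATE, TMAX, TMIN;

-- Assumption: -9999 marks a missing reading, so those rows are dropped.
janValid = FILTER janTemp BY TMAX != -9999 AND TMIN != -9999;

-- Nothing is read or computed until this point, because Pig is lazy.
DUMP janValid;
```

The same pattern applies to February by changing the regular expression to '201702.*'.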
