Continuing with our series of teaching articles on big data, here is an introduction to using Spark command line shells.
To process data in Spark, you can write a program outside of Spark and either run it there or deploy it as a Spark job. You can also use a Spark command line shell to run code interactively, directly against Spark data.
In the first case, your code has to create the SparkContext itself. In the second, that is not necessary because the shell has already created it for you.
Spark has three interactive shells: Scala, Python, and R. The Scala shell (bin/spark-shell) and the Python shell (bin/pyspark) are installed by default. To install the R shell you need to build it (as explained below).
The good thing about a shell, from a big data analytics point of view, is that you can walk through code one line at a time, building up data sets as you go. You can run ETL (extract, transform, load), map, reduce, or whatever you want to call the process of collapsing, joining, intersecting, or counting records.
These shells differ from the regular Scala, R, or Python shells because they know about the Spark cluster you are working with. In fact, they start Spark for you. So if the data is spread across dozens of machines, Spark runs jobs across those machines to perform whatever transformation you request. That's quite sophisticated when you think about it. Put another way, it lets you do something quite complicated using a far simpler tool: an interactive programming shell that you already know.
Here are some quick examples of what you can do with these shells. We look at two of them: R (called SparkR) and Python.
To get going, you first need to install Spark. There are lots of ways to do that. Here is an easy one (change this to match whatever version you want).
Unzip and then run its build script. Much of Spark is written in Scala, so you run this Scala command to build it:
That command will take some time.
You don’t need to start Spark to use it. The shell will do that.
Then install the R shell.
Now add spark/bin to your path in .bash_profile.
Need a Dataset
Lots of Spark tutorials have you use the Spark README file or create data right in the code, with something like the two lines shown below.
data1 = [1, 2, 3, 4, 5]
data2 = sc.parallelize(data1)
But there is a better way. You can find almost any dataset that anyone has thought up on Kaggle. This is where data scientists share information. They also hold competitions there to see who can most efficiently and elegantly analyze a particular set of data.
So to have something to work with here, we will use a list of universities around the world that you can download from here.
It’s a .csv (spreadsheet format) file with a list of universities and colleges and what country they are in. You could pick something larger or more complex. But this is enough to get us started.
Assuming you have Spark installed and have set your PATH to include its bin folder, you can run SparkR by typing:
Unzip the data files you downloaded above. Then load this .csv file into a local data frame like this, changing the directory names to wherever you unzipped the data:
schools <- read.csv("/home/walker/Documents/world-university-ranking/school_and_country_table.csv", header=FALSE, sep=",")
We say it is local because the data frame you just created is not stored in Spark yet. To move it into Spark, run:
schoolDF <- createDataFrame(sqlContext, schools)
Suddenly the terminal comes alive as Spark kicks off a job across the cluster to do that. So now you can really see that you have a simple and familiar interface to do something that can be quite complex.
Now let's do something similar with Python. We will use the same data, but here we illustrate how to group data by running Spark's reduceByKey function. We will use it to count how many of these schools are in each country.
Run pyspark to open the shell.
First set the location of the data file:
Then create a Spark RDD:
As it stands now, each element of the schools RDD is a single string holding a school and the country where it is located, separated by a comma, as in "school_name,country". We need that in two fields. To put that in terms of big data, we want tuples like this:
(school name, country)
Then we will add a column with a number so that we have something to sum, meaning something from which to calculate a count. So we want to end up with key-value pairs of the form (country, 1).
So first run this Python command. This takes the first and second text strings separated by a comma and assigns them to our new RDD called pairs.
pairs = schools.map(lambda s: (s.split(",")[0], s.split(",")[1]))
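To see what that map step produces without starting Spark, here is a minimal plain-Python sketch. The sample lines and the helper name to_pair are made up for illustration; they are not taken from the downloaded file, and the built-in map stands in for the RDD's map method.

```python
# Hypothetical sample lines in the "school_name,country" layout described above.
sample_lines = [
    "Example Institute of Technology,Country A",
    "Sample State University,Country B",
]


def to_pair(line):
    """Split one comma-separated line into a (school name, country) tuple."""
    fields = line.split(",")
    return (fields[0], fields[1])


# The built-in map plays the role of RDD.map here, applied to a local list.
pairs = list(map(to_pair, sample_lines))
print(pairs)
```

A plain split on "," assumes that no school name contains a comma. If some do, Python's csv module is a safer way to parse each line.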
Now make a new RDD with tuples of (country, 1).
countries = pairs.map(lambda pair: (pair[1], 1))  # pair[1] is the country
Now use reduceByKey to collapse items onto a common key, meaning country.
counted = countries.reduceByKey(lambda x, y : x + y)
If that command looks complicated, know that the function passed to reduceByKey takes two values and returns one. In Python, the keyword lambda defines an in-line (anonymous) function that does not need a separate declaration.
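As a rough illustration of what reduceByKey computes, the following plain-Python sketch groups some made-up (country, 1) tuples by key and folds each group with the same two-argument lambda. The function name reduce_by_key and the sample data are assumptions for this example only. Spark does this work in parallel across the cluster rather than in a single local loop.

```python
from collections import defaultdict
from functools import reduce


def reduce_by_key(pairs, func):
    """Local stand-in for reduceByKey: merge the values of each key with func."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    # func takes two values and returns one, so reduce can fold a whole group.
    return {key: reduce(func, values) for key, values in grouped.items()}


# Hypothetical (country, 1) tuples, one per school.
countries = [("Country A", 1), ("Country B", 1), ("Country A", 1)]

counted = reduce_by_key(countries, lambda x, y: x + y)
print(counted)
```

Because the lambda adds its two inputs, each country ends up with the sum of its 1s, which is the number of schools listed for that country.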
Then we can use this to print out the result:
Which looks like this: