Using Spark ML Machine Learning Library with Python

Here we give an example of how to use the Apache Spark ML machine learning library. We will write a Python program to do logistic regression.In order to use this you need a copy of Apache Spark. The easy way to obtain that is to download a Docker container. Just make sure you get one that includes the Analytics libraries, like NumPy. If you will have to add them to your Python environment manually.


Spark ML is the machine learning library you should be using if you are using Apache Spark. It is better than the other well-known machine learning library for Python, scikit-learn, because it can scale with Apache Spark, as it understands RDDs etc. In other words it can run across a distributed architecture, thus letting it handle very large data sets.Here we will do logistic regression. Logistic regression gives a binary response (1=true and 0=false) given some input. For data, we will use the data from WikiPedia’s definition of logistic regression. In that example they predict how likely someone is to pass an exam based upon the number of hours studied.To be precise, they found that the linear relationship between hours studied, x, and likelihood to pass the exam, y, is:y = 1.5046x - 4.0777Thus, based on the definition of logistic regression, the probability of passing for any given number of hours is:pr = 1 / (1 + e**y)Per the rules of lr, when pr > 50% then the binary response is 1 (true), otherwise 0 (false).


You can copy the code below in pyspark or create a Jupyter or Zeppelin notebook. In the code below we explain each step. import numpy as np
from numpy import array
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.classification import LogisticRegressionWithLBFGS

# This array gives whether someone passed an exam, on the left-hand

# side, given the number of hours they studied, on the right-hand side.

# As you can see, how much you study is definitely correlated

# with how likely you are to pass.

# We put this into array(float, array(float)) format as that is the format

# expected by the LabeledPoint constructor that you plug into the

# Spark ML Logistic Regression training function.

studyHours = [

[ 0, [0.5]],
[ 0, [0.75]],
[ 0, [1.0]],
[ 0, [1.25]],
[ 0, [1.5]],
[ 0, [1.75]],
[ 1, [1.75]],
[ 0, [2.0]],
[ 1, [2.25]],
[ 0, [2.5]],
[ 1, [2.75]],
[ 0, [3.0]],
[ 1, [3.25]],
[ 0, [3.5]],
[ 1, [4.0]],
[ 1, [4.25]],
[ 1, [4.5]],
[ 1, [4.75]],
[ 1, [5.0]],
[ 1, [5.5]]

# define a function to take a float, array(float) and return a

# LabeledPoint object

def labelPt(label, points):
return LabeledPoint(label,points)

# create an array of LabeledPoint objects

a = []
for p, h in studyHours:

# train the model using the Logistic Regression Spark ML function

lrm = LogisticRegressionWithLBFGS.train(sc.parallelize(a))

# ask Spark to predict whether a student will pass the test if they

“ study 3 hours. It will respond “1” meaning “yes.”

lrm.predict([3])Everything you need to know about outsourcing technology developmentAccess a special Introduction Package with everything you want to know about outsourcing your technology development. How should you evaluate a partner? What components of your solution that are suitable to be handed off to a partner? These answers and more below.


Speak to our Experts
Lets Talk

Our Latest Blogs

With Zymr you can