Here we give an example of how to use the Apache Spark ML machine learning library. We will write a Python program to do logistic regression.

In order to use this you need a copy of Apache Spark. The easy way to obtain that is to download a Docker container. Just make sure you get one that includes the Analytics libraries, like NumPy. If you will have to add them to your Python environment manually.

[See Also: QA in DevOps is Shrinking – Here’s What You Can Do About It]


Spark ML is the machine learning library you should be using if you are using Apache Spark. It is better than the other well-known machine learning library for Python, scikit-learn, because it can scale with Apache Spark, as it understands RDDs etc. In other words it can run across a distributed architecture, thus letting it handle very large data sets.

Here we will do logistic regression. Logistic regression gives a binary response (1=true and 0=false) given some input. For data, we will use the data from WikiPedia’s definition of logistic regression. In that example they predict how likely someone is to pass an exam based upon the number of hours studied.

To be precise, they found that the linear relationship between hours studied, x, and likelihood to pass the exam, y, is:

y = 1.5046x - 4.0777

Thus, based on the definition of logistic regression, the probability of passing for any given number of hours is:

pr = 1 / (1 + e**y)

Per the rules of lr, when pr > 50% then the binary response is 1 (true), otherwise 0 (false).

[See Also: Use of ELK Stack (ElasticSearch, LogStash and Kibana)]


You can copy the code below in pyspark or create a Jupyter or Zeppelin notebook. In the code below we explain each step.

 import numpy as np
 from numpy import array
 from pyspark.mllib.regression import LabeledPoint
 from pyspark.mllib.classification import LogisticRegressionWithLBFGS

# This array gives whether someone passed an exam, on the left-hand

# side, given the number of hours they studied, on the right-hand side.

# As you can see, how much you study is definitely correlated

# with how likely you are to pass.

# We put this into array(float, array(float)) format as that is the format

# expected by the LabeledPoint constructor that you plug into the

# Spark ML Logistic Regression training function.

studyHours = [

[ 0, [0.5]],
 [ 0, [0.75]],
 [ 0, [1.0]],
 [ 0, [1.25]],
 [ 0, [1.5]],
 [ 0, [1.75]],
 [ 1, [1.75]],
 [ 0, [2.0]],
 [ 1, [2.25]],
 [ 0, [2.5]],
 [ 1, [2.75]],
 [ 0, [3.0]],
 [ 1, [3.25]],
 [ 0, [3.5]],
 [ 1, [4.0]],
 [ 1, [4.25]],
 [ 1, [4.5]],
 [ 1, [4.75]],
 [ 1, [5.0]],
 [ 1, [5.5]]

# define a function to take a float, array(float) and return a

# LabeledPoint object

def labelPt(label, points):
 return LabeledPoint(label,points)

# create an array of LabeledPoint objects

a = []
 for p, h in studyHours:

# train the model using the Logistic Regression Spark ML function

lrm = LogisticRegressionWithLBFGS.train(sc.parallelize(a))

# ask Spark to predict whether a student will pass the test if they

“ study 3 hours. It will respond “1” meaning “yes.”


Everything you need to know about outsourcing technology development

Access a special Introduction Package with everything you want to know about outsourcing your technology development. How should you evaluate a partner? What components of your solution that are suitable to be handed off to a partner? These answers and more below.

Introduction Package

Zymr blogger


Leave a Reply

© 2019, Zymr, Inc. All Rights Reserved.| LEGAL DISCLAIMER | PRIVACY POLICY | COOKIE POLICY