Here is the simplest possible overview of Storm.  By simple, I mean we will cover the bare minimum of concepts, skip the complexity, and keep it short.  That is how an introduction to a topic this large should be.

Storm and Streaming Data

The main big data processing tools are Spark, Hadoop MapReduce, and Storm.  All three take input and can sort, filter, query, or join it.  Hadoop persists its results to storage.  You can persist Spark RDDs (resilient distributed datasets) to storage too, but that is not the usual pattern, since holding all that data in memory is the whole point of Spark.  Storm processes streaming data, which usually is not worth saving: with a stream, you want to look at what is coming in now, not at what you have already seen.  That said, you can persist Storm output to permanent storage, as we will explain in another post.

Streaming data is any kind of data that you read continuously, without the open-and-close cycle you would use with a traditional data file.  Examples of streams are Twitter tweets, IoT sensor data (which can be a large stream or just a small trickle), stock and commodity prices, etc.  Storm is designed to handle up to a million events per second per node and scales horizontally by adding machines.

[See Also: Evolution of Outsourcing & How the Cloud Era is Ushering New Business Models]

Twitter acquired Storm (through its purchase of BackType, where it was created) and then made it an open source project.  They use it to process tweets: culling through them, tracking hashtags, filtering by user, and so forth.  Yahoo Finance uses it for tasks such as alerting when a stock price hits a certain threshold.  OpenSignal uses it to create cellular coverage maps for the whole globe.  It is one of my favorite websites, as it is a constant reminder of the spotty coverage we have in the country where I live, Chile.

Basic Concepts

Setting up Storm is, like many of these Apache projects, not too difficult.  You first set up a product called Zookeeper, then you install Storm.  Then you start Storm Nimbus and Storm Supervisor.  The graphic looks like this:


As you can see, Zookeeper sits in the middle.  Since Storm nodes are stateless, they use Zookeeper to share and synchronize state among themselves.  Nimbus hands off work to the Supervisors, which run the spouts and bolts.

Spouts are sources of input data.  Bolts consume that data, process it, and emit output.  The HelloWorld program of big data tutorials is WordCount, which takes a list of words and counts the occurrences of each one.  In that scenario, a spout would gather words from some input and a bolt would aggregate and count them.
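To make the bolt's job concrete, here is WordCount's core logic in plain Java, with no Storm dependency: count how many times each word appears in a list. The word list is just sample data.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class WordCount {
    // Count how many times each word appears: the aggregation a counting bolt performs.
    static Map<String, Integer> count(List<String> words) {
        Map<String, Integer> counts = new HashMap<>();
        for (String w : words) {
            counts.merge(w, 1, Integer::sum); // add 1, starting from 0 for new words
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> stream = Arrays.asList("fish", "bird", "fish");
        System.out.println(count(stream)); // fish appears twice, bird once
    }
}
```

In Storm this same logic would live inside a bolt, with each word arriving as a separate tuple rather than as one list.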

Storm uses the tuple as its data structure.  A tuple is an ordered list of values, each identified by a field name; unlike an ordered set, it may contain duplicates.  The tuple can contain any type of data, like this for example:

{ fish, bird, turtle }

Or simple integers:

{1, 2, 3}
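In the Java API a tuple is essentially an ordered list of values paired with a list of field names declared once per stream (Storm's own Values class literally extends ArrayList). A dependency-free illustration, where the field names "animal" and "legs" are made up for the example:

```java
import java.util.List;

public class TupleDemo {
    // Look up a tuple value by field name: the name's position gives the value's position.
    static Object getField(List<String> fields, List<Object> values, String name) {
        return values.get(fields.indexOf(name));
    }

    public static void main(String[] args) {
        // Field names, declared once for the whole stream.
        List<String> fields = java.util.Arrays.asList("animal", "legs");
        // One tuple: values are positional and can mix types.
        List<Object> tuple = java.util.Arrays.asList((Object) "turtle", 4);
        System.out.println(getField(fields, tuple, "legs")); // prints 4
    }
}
```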

Putting it all together, we have a topology.  I like to cite the classic definition because it sounds so mathematically precise:

“A topology is a directed graph where vertices are computation and edges are a stream of data.”

Like this:

The mathematical concept of directed graphs has come into vogue because it is so often used in analytics. For example, a directed graph can map the relationships between people on Facebook.  In the graphic above we have both the spout (a square) and the bolt (a circle): both are vertices, meaning points.  The edges (lines with an arrow on the end indicating direction) connect the vertices.
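A topology graph is easy to represent in code: vertices are the spouts and bolts, and each directed edge says where a component's stream flows next. A sketch using a hypothetical three-component word-count topology (the component names are invented for the example):

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class TopologyGraph {
    // Adjacency list: each vertex maps to the vertices its stream flows into.
    static Map<String, List<String>> edges() {
        Map<String, List<String>> g = new HashMap<>();
        g.put("word-spout", Arrays.asList("split-bolt"));   // spout feeds the splitter
        g.put("split-bolt", Arrays.asList("count-bolt"));   // splitter feeds the counter
        g.put("count-bolt", Arrays.asList());               // sink: no outgoing edges
        return g;
    }

    public static void main(String[] args) {
        System.out.println(edges().get("word-spout")); // prints [split-bolt]
    }
}
```

Storm's own TopologyBuilder builds essentially this structure: each setBolt() call adds a vertex, and each grouping declaration adds an edge.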

[See Also: Why Apache Spark Is Winning Big Data Analytics Domain]

Basic Program Structure

Here is a barebones description of the objects and methods that you would use to write a Storm program.  (If you do not know a programming language, learn one: all of these big data platforms and tools are really just APIs, and you cannot make them do anything useful unless you can program against them.  An analyst or administrator can no longer get by, as before, just by learning how to configure applications.)

The basic steps to writing a Storm program are: write a spout, write a bolt, then tie them together by writing a topology.  For the topology you can also use yet another abstraction, Trident, a Twitter-built extension to Storm that makes operations like filtering and joining easier.  In this example we stick with the plain topology objects.

Write a Spout

The code here is Java but the concepts are the same for Python and other languages.

A spout implements the backtype.storm.topology.IRichSpout interface.  (Newer Storm releases renamed the backtype.storm packages to org.apache.storm, but the interfaces are the same.)

It has these three methods:

open(Map conf, TopologyContext context,
SpoutOutputCollector collector)

nextTuple()

declareOutputFields(OutputFieldsDeclarer declarer)

These are straightforward and hardly need any definition: connect to the data source, retrieve the next record, and declare what fields are in the tuple.
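The real IRichSpout interface requires the Storm jars on the classpath, so as a dependency-free sketch here is a plain-Java stand-in with the same three-method shape. SimpleCollector is a hypothetical substitute for Storm's SpoutOutputCollector, and a fixed word list stands in for a real stream; in real Storm, declareOutputFields takes a declarer rather than returning the names.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Hypothetical stand-in for Storm's SpoutOutputCollector: just records emitted tuples.
class SimpleCollector {
    final List<List<Object>> emitted = new java.util.ArrayList<>();
    void emit(List<Object> tuple) { emitted.add(tuple); }
}

// Mirrors the shape of backtype.storm.topology.IRichSpout, simplified.
public class WordSpout {
    private Iterator<String> source;
    private SimpleCollector collector;

    // open(): connect to the data source and keep a handle to the collector.
    public void open(SimpleCollector collector) {
        this.source = Arrays.asList("fish", "bird", "turtle").iterator();
        this.collector = collector;
    }

    // nextTuple(): emit the next record from the stream, if one is available.
    public void nextTuple() {
        if (source.hasNext()) {
            collector.emit(Arrays.asList((Object) source.next()));
        }
    }

    // declareOutputFields(): names the fields of the tuples this spout emits.
    public String[] declareOutputFields() {
        return new String[] { "word" };
    }

    public static void main(String[] args) {
        WordSpout spout = new WordSpout();
        SimpleCollector c = new SimpleCollector();
        spout.open(c);
        for (int i = 0; i < 5; i++) spout.nextTuple(); // Storm calls this in a loop
        System.out.println(c.emitted.size()); // prints 3: the stream had three words
    }
}
```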

Write a Bolt

A bolt implements the IRichBolt interface.  Its logical sequence is also straightforward:

prepare(Map conf, TopologyContext context, OutputCollector collector)

execute(Tuple tuple)

declareOutputFields(OutputFieldsDeclarer declarer)
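As with the spout, here is a dependency-free sketch mirroring the IRichBolt method shape, not the real interface (which needs the Storm jars). The tuple is simplified to a plain list whose first field is the word, and the counting logic is the WordCount aggregation.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Mirrors the shape of backtype.storm.topology.IRichBolt, simplified.
public class WordCountBolt {
    private Map<String, Integer> counts;

    // prepare(): one-time setup, analogous to prepare(Map, TopologyContext, OutputCollector).
    public void prepare() {
        counts = new HashMap<>();
    }

    // execute(): called once per incoming tuple; field 0 holds the word.
    public void execute(List<Object> tuple) {
        String word = (String) tuple.get(0);
        counts.merge(word, 1, Integer::sum);
    }

    // declareOutputFields(): the fields this bolt would emit downstream.
    public String[] declareOutputFields() {
        return new String[] { "word", "count" };
    }

    public Map<String, Integer> getCounts() { return counts; }
}
```

In real Storm, execute() would also emit a new tuple and ack the incoming one through the OutputCollector handed to prepare().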

Create a Topology and Run It

You can think of the topology as the Java class with the main method in it, meaning something you can run from the command line.  You do not have to write your own loop to make it run forever: once the topology is submitted, Storm keeps calling the spout's nextTuple() in a loop, and the topology runs until you kill it.

A topology is assembled with backtype.storm.topology.TopologyBuilder, whose two main methods are setSpout() and setBolt().  You then call createTopology() and submit the result to the cluster.

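Since real TopologyBuilder wiring needs the Storm jars, here is a dependency-free sketch of what that wiring achieves: a driver repeatedly pulls the next record from a spout and hands it to a bolt, which is essentially the loop a Supervisor's worker runs for you. The word list and the counting bolt logic are the same toy example as above.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

// Sketch of spout -> bolt wiring: what setSpout()/setBolt() plus submission
// accomplish, collapsed into one loop for illustration.
public class MiniTopology {
    static Map<String, Integer> run(Iterator<String> spout) {
        Map<String, Integer> counts = new HashMap<>(); // the bolt's state
        while (spout.hasNext()) {                      // Storm loops on nextTuple()
            counts.merge(spout.next(), 1, Integer::sum); // the bolt's execute()
        }
        return counts;
    }

    public static void main(String[] args) {
        Iterator<String> spout = Arrays.asList("fish", "bird", "fish").iterator();
        System.out.println(run(spout)); // counts per word
    }
}
```

The real cluster version differs mainly in that the loop never ends and the spout and bolt run on different machines, with Storm routing the tuples between them.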
Wrapping Up

So that is the basic overview.  In the next post we will illustrate how to put all this together with an example.  Then we will explain how you can stand Storm up in front of Hadoop so that you can save the data and run analytics against it, for example against data gathered over time.


Source: zymr

Zymr blogger

