The problem with big data development is that data scientists are not part of the traditional development team. Continuous Analytics is the application of DevOps to big data development in an effort to bring data scientists and big data engineers onto the Jira and Jenkins Agile development processes. It fits data science into processes that have already been shown to greatly improve the efficiency of Java and other kinds of enterprise development projects.
Pointy Headed Scientists
The world of the data scientist is esoteric fare that only mathematicians and statisticians really understand. Linear programming, regression analysis, and classification are statistical and mathematical techniques that data scientists use to find patterns in data and create predictive models. Traditional programmers generally do not work with these techniques because data science is a specialized field.
In the past you would have found data scientists running trials at universities or pharmaceutical companies to look for a correlation of data points that would indicate cause and effect. But the advent of big data has brought all of that to the realm of other fields like advertising and product planning by providing tools like Hadoop that make it easier to deal with unstructured data.
Yet the tools of data scientists are programming languages that would fit into any kind of development paradigm. So why don’t data scientists work that way? In particular, the main languages of data science are Matlab and R. R is available as a shell for Apache Spark (SparkR), and you can call Matlab functions from Python.
Bringing Big Data Under the Development Umbrella
When a data scientist takes a request from, say, product planning, they start with a spreadsheet. They download data from different sources and then walk through it, trying out different hypotheses to see if they can find, for example, data points that correlate with product sales. R works like that, letting you literally walk through or massage data in small steps until it fits what you hope to achieve.
But the problem with working this way is twofold. First, the data scientist is working in a spreadsheet, which makes no sense when they can use Hadoop or Apache Spark directly. Second, when they finally get their algorithms working, they have to modify their R, Scala, or Python code to work with Hadoop, Spark, or Storm. That makes no sense either, as programmers should not keep two different copies of the same program.
Perhaps the most famous example of a correlation between variables, and one that might not even be true yet is fun to talk about, is that when Republicans get elected to office short skirts come back into fashion. For product planning that would mean that if a Republican is elected president, it is time to start designing and manufacturing short skirts, as women and girls will soon scoop them off the shelf.
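The kind of correlation check described in this section can be sketched in a few lines of Python. This is only an illustration: the pearson helper is written here for the example, and the two series are made-up numbers standing in for a data feed and a sales history, not real data.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Covariance and variances, left unnormalized because the factors cancel.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / sqrt(var_x * var_y)


# Made-up illustrative numbers (assumption), not real sales data.
weekly_temperature = [12, 15, 19, 23, 27, 30]
weekly_units_sold = [110, 135, 160, 210, 240, 275]

r = pearson(weekly_temperature, weekly_units_sold)
print(f"correlation = {r:.3f}")
```

A value near 1 or -1 says the two series move together. As the skirt example suggests, it does not say that one causes the other.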
Inserting the Data Scientist Into the Build Process
The big data engineer already understands how to package what they do under the DevOps model and release it with the Agile Iteration or Scrum Sprint.
They know how to package Hadoop HDFS and virtual machines with Puppet or in Docker containers, release those from Jenkins, and store all of those definitions in GitHub. That is called Infrastructure as Code, and it is how virtualization is delivered in the DevOps, Continuous Release, and Continuous Integration models.
Data scientists can be brought onto this cycle and system as well by getting them to use GitHub, Jira, and Jenkins too and adding them to the Agile team. When you fit all of that into the Continuous Integration and Continuous Release cycle you have what we call Continuous Analytics.
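One way to picture analysis code inside that cycle is a model function checked into source control next to automated tests that a build server can run on every commit. The sketch below is an assumption for illustration only: forecast_units is a hypothetical toy model, and its slope and intercept are placeholder values rather than fitted results.

```python
import unittest


def forecast_units(temperature_c, slope=9.0, intercept=0.0):
    """Hypothetical toy model: forecast units sold from temperature.

    The forecast is clamped at zero because negative sales make no sense.
    """
    return max(0.0, slope * temperature_c + intercept)


class ForecastUnitsTest(unittest.TestCase):
    def test_warmer_week_forecasts_more_units(self):
        self.assertGreater(forecast_units(25), forecast_units(15))

    def test_forecast_is_never_negative(self):
        self.assertEqual(forecast_units(-40), 0.0)
```

A Jenkins job could run such tests with the standard python -m unittest command, so that a change to the model that breaks an expectation fails the build just as a broken Java class would.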
Bringing CI and CR Efficiency to Big Data
Doing this has several benefits. First, the data scientist can more quickly stop working with spreadsheets and start working directly with the data feeds coming from the big data engineers and from the programmers who are using Twitter and other APIs.
Second, it lets data science and the business roll out their ideas incrementally, which is how such work should, and in practice does, proceed.
For example, product planning and data scientists might decide that they need lots of inputs to fine-tune their planning models, like weather data from NOAA, web browsing history data purchased from data brokers, and sales history from the Oracle POS sales system.
Each of those data feeds is implemented one at a time. That works because linear techniques such as regression by definition work with any number n of variables. Each relevant variable added to the model can make it fit the data more closely. Statisticians would say that it reduces the error term.
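The effect of adding feeds one at a time can be sketched with ordinary least squares regression, used here as a stand-in for whatever linear model the team actually runs. Everything in the block is an assumption for illustration: the solve and fit_sse helpers are written for this example, and the three feeds and the sales figures are made-up numbers.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_sse(feature_columns, target):
    """Fit least squares with an intercept and return the sum of squared errors."""
    rows = [[1.0] + [col[i] for col in feature_columns] for i in range(len(target))]
    k = len(rows[0])
    # Normal equations: (X^T X) beta = X^T y
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, target)) for i in range(k)]
    beta = solve(xtx, xty)
    return sum((y - sum(b * v for b, v in zip(beta, r))) ** 2
               for r, y in zip(rows, target))


# Made-up illustrative numbers (assumption), one value per week.
temperature = [12, 15, 19, 23, 27, 30, 22, 17]
web_visits = [300, 280, 410, 390, 520, 480, 350, 330]
prior_sales = [120, 110, 170, 180, 210, 270, 190, 160]
units_sold = [110, 135, 160, 210, 240, 275, 205, 150]

feeds = [("weather", temperature),
         ("web history", web_visits),
         ("sales history", prior_sales)]

used = []
errors = []
for name, column in feeds:
    used.append(column)  # bring one more feed into the model
    sse = fit_sse(used, units_sold)
    errors.append(sse)
    print(f"after adding {name}: sum of squared errors = {sse:.1f}")
```

With least squares, the error measured on the data used for fitting cannot go up when a variable is added. Accuracy on new data is a separate question, since too many variables can overfit, so each new feed is worth validating on data held back from the fit.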
And then there are the benefits derived from getting data scientists to code all their ideas and abstractions and use GitHub for source code control and versioning.
A couple of businesses have risen to this challenge. One is Hydrosphere. This California startup, with programmers and data scientists in Russia, has developed the open source tools Mist and Springhead to push this process along by allowing Spark jobs to be instantiated from code and by abstracting the big data infrastructure build-out.
So regardless of what tool you are using to build your big data environment, you can extend the concept of Continuous X (i.e., CI or CR) to analytics and add order where otherwise you might have chaos, or something that is less than ideal.