As cluster computing frameworks go, Apache Spark is undoubtedly a major player in the Big Data market. Its ability to interface with other major Big Data technologies such as Hadoop and Cassandra, whilst bringing in major cloud platforms like Amazon Web Services, makes it almost the go-to tech for companies looking to deploy a new data-driven application. But all this functionality comes at a price: Apache Spark is complex, very complex. So, let’s look at some best practices for developers who are new to Apache Spark.
Learn Scala
Spark itself is written in Scala, which sits at the heart of the framework. Whilst it is possible to develop basic applications without using Scala, especially now that the Spark DataFrame API has reached a decent level of maturity, you will likely need Scala to do anything more complex than the API can handle.
Tackle the DataFrame API and Spark SQL First
If you are new to Apache Spark and looking to get up to speed fast, then your starting point should be the Spark DataFrame API and Spark SQL. Mastering these two topics will give you a solid grounding in how Spark works. As a rule of thumb, use Spark SQL wherever it will work, with the DataFrame API taking up the slack for advanced data-driven functions that Spark SQL cannot handle.
Learn to Leverage the Documentation
Apache Spark comes with reams of great documentation. If you have a question, it is likely answered somewhere in this pile of information. Third-party technologies such as IBM Analytics for Apache Spark also come with lots of tutorials that cover not just the tech itself, but also many basic Spark functions.
Spark Fits the DevOps Cycle
Just because we have stepped into the Big Data arena with Apache Spark does not mean we need to change our DevOps cycle. Agile methodologies still work very well with Apache Spark development. Specifically, unit testing and integration testing fit Spark development like a glove.
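One way to make unit testing fit is to keep record-level logic in plain functions that know nothing about Spark, so they can be tested without a cluster. The sketch below is only an illustration under assumptions: the `parse_reading` function and the `sensor_id,value` line format are hypothetical and do not come from any real application.

```python
from typing import Optional, Tuple


def parse_reading(line: str) -> Optional[Tuple[str, float]]:
    """Parse a hypothetical 'sensor_id,value' line; return None if malformed."""
    parts = line.strip().split(",")
    if len(parts) != 2:
        return None
    sensor_id, raw_value = parts[0].strip(), parts[1].strip()
    if not sensor_id:
        return None
    try:
        return sensor_id, float(raw_value)
    except ValueError:
        # The value field was not a number.
        return None


# Plain test functions; a runner such as pytest could collect these,
# or they can simply be called directly.
def test_valid_line():
    assert parse_reading("s1, 21.5\n") == ("s1", 21.5)


def test_malformed_lines():
    for bad in ["", "s1", "s1,abc", ",3.0", "a,b,c"]:
        assert parse_reading(bad) is None
```

In a job, the same function could be handed to a map over an RDD of text lines (for example `lines.map(parse_reading)`, where `lines` is a hypothetical RDD), leaving integration tests to cover the Spark wiring itself.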
Apache Spark Does Not Stream Data in Real Time
Spark streams data in small chunks called micro-batches. Incoming records are collected for a short interval and then processed together, so each record waits until its batch closes before it is processed. This means that streaming is not truly real-time. In fact, it could be argued that it is not streaming at all. If you need real-time data streaming as part of your application functionality, you will need to integrate some other form of data transport into your application technology stack.
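To see why micro-batching adds latency, here is a minimal sketch in plain Python. It does not use Spark and is not how Spark is implemented; the `micro_batches` function, the event tuples, and the one-second interval are all assumptions made up for illustration.

```python
from typing import Iterable, Iterator, List, Tuple

# (arrival time in seconds, payload); a made-up event shape for this sketch
Event = Tuple[float, str]


def micro_batches(
    events: Iterable[Event], interval: float
) -> Iterator[Tuple[float, List[str]]]:
    """Group time-ordered events into fixed-interval batches.

    Yields (batch_close_time, payloads). Empty intervals are skipped.
    """
    batch: List[str] = []
    close_time = interval
    for arrival, payload in events:
        # Close every batch whose window ended before this event arrived.
        while arrival >= close_time:
            if batch:
                yield close_time, batch
                batch = []
            close_time += interval
        batch.append(payload)
    # Flush whatever is left when the input ends.
    if batch:
        yield close_time, batch


events = [(0.1, "a"), (0.4, "b"), (1.2, "c"), (3.7, "d")]
batches = list(micro_batches(events, interval=1.0))
# Event "a" arrived at 0.1s but is only handed over when its batch
# closes at 1.0s, so it waits roughly 0.9s before any processing starts.
```

The smaller the interval, the shorter the wait, but no record is processed the moment it arrives.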
The Spark UI is Useful
Many people undervalue the Apache Spark UI because it delivers limited functionality in its current version. However, the functionality it does include is excellent, and it is a very useful tool for getting an overview of how well a Spark deployment is performing whilst highlighting major bottlenecks.
Once a performance issue is uncovered, it is time to break out more in-depth tools to track the cause down to its source.
These best practices will go a long way towards maximizing the value of your investment in Apache Spark.