Apache Cassandra is one of a new breed of database systems. Aimed squarely at the Big Data market, Cassandra is a NoSQL database engine. It differs significantly from traditional relational databases in that it can store and access largely unstructured data.
Cassandra has been designed from the ground up to be massively scalable. Its development took a big leap forward in 2008, when Facebook released this previously proprietary technology as an open source project. Since then it has become a top-level Apache project, and companies such as Apple, Netflix, and DataStax have contributed to the development of the platform. As a result, Cassandra has been proven in large-scale commercial use, and it powers some of the highest-traffic websites in the world.
Cassandra's scalability comes from a fully peer-to-peer distributed architecture. Every database node in the cluster participates equally. There is no master-slave relationship in which one node is in overall control.
This delivers a platform that is highly fault tolerant. If one node goes down, the remaining nodes stay available and continue to serve the replicated copies of its data. Furthermore, developers do not need to write code to exploit this peer-to-peer architecture, as it is handled transparently in the background.
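To make the fault-tolerance argument concrete, here is a minimal sketch of a masterless ring in Python. It is a toy model under stated assumptions, not Cassandra's actual implementation: the `ToyRing` class, the node names, and the use of MD5 for tokens are all invented for illustration.

```python
import bisect
import hashlib


def token(value: str) -> int:
    """Map a string to a position on the ring (MD5 here, purely for illustration)."""
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


class ToyRing:
    """A masterless ring: every node is a peer and no node is in overall control."""

    def __init__(self, nodes, replication_factor=3):
        self.replication_factor = replication_factor
        self.ring = sorted((token(n), n) for n in nodes)
        self.storage = {n: {} for n in nodes}
        self.down = set()  # names of nodes that are currently unavailable

    def replicas(self, key):
        """Walk clockwise from the key's token and pick the next RF nodes."""
        tokens = [t for t, _ in self.ring]
        start = bisect.bisect_right(tokens, token(key))
        count = min(self.replication_factor, len(self.ring))
        return [self.ring[(start + i) % len(self.ring)][1] for i in range(count)]

    def write(self, key, value):
        """Store the value on every replica that is currently up."""
        for node in self.replicas(key):
            if node not in self.down:
                self.storage[node][key] = value

    def read(self, key):
        """Return the value from the first live replica that holds it."""
        for node in self.replicas(key):
            if node not in self.down and key in self.storage[node]:
                return self.storage[node][key]
        raise KeyError(key)


ring = ToyRing(["node-a", "node-b", "node-c", "node-d"], replication_factor=3)
ring.write("user:42", "Ada")
ring.down.add(ring.replicas("user:42")[0])  # simulate losing one replica
print(ring.read("user:42"))  # still served by a surviving replica
```

Because each key lives on several peers, the read succeeds without any failover logic in the calling code, which is the property the paragraph above describes.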
A final benefit of this peer-to-peer architecture is that the database service can be split across multiple physical sites, or even across one or more cloud services. Hosting the same database in several physical data centers adds a very high level of physical data protection.
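As a rough illustration of multi-site hosting, the Python sketch below builds a CQL `CREATE KEYSPACE` statement that uses `NetworkTopologyStrategy` to keep a set number of replicas in each data center. The helper `create_keyspace_cql`, the keyspace name `shop`, and the data center names are hypothetical; in a real cluster the data center names have to match the ones the cluster itself reports.

```python
def create_keyspace_cql(keyspace: str, replicas_per_dc: dict) -> str:
    """Build a CREATE KEYSPACE statement that replicates data into each data center."""
    # The replication map names the strategy, then one replica count per data center.
    options = {"class": "NetworkTopologyStrategy", **replicas_per_dc}
    body = ", ".join(f"'{name}': {value!r}" for name, value in options.items())
    return f"CREATE KEYSPACE IF NOT EXISTS {keyspace} WITH replication = {{{body}}};"


# Hypothetical layout: three replicas in each of two sites.
statement = create_keyspace_cql("shop", {"dc_london": 3, "dc_frankfurt": 3})
print(statement)
```

With a layout like this, losing an entire site still leaves three full replicas of every row in the other one.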
As individual Cassandra nodes serve, store and modify data, changes must be propagated to the other nodes that hold copies of that data. Administrators and developers can tune how strictly data consistency and replication are enforced across the node cluster.
This can be controlled at a very granular level, down to the individual read or write operation. For example, a write may be required to reach a majority of replicas before it is acknowledged, while a read may be allowed to return after consulting a single replica.
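The trade-off behind this granular tuning can be sketched with a little arithmetic. The Python below is an illustrative model, not driver code: the function names are invented, and only three of Cassandra's consistency levels (`ONE`, `QUORUM`, `ALL`) are covered. It relies on the usual rule that a read is guaranteed to overlap the latest write when the replicas read plus the replicas written exceed the replication factor.

```python
def replicas_required(level: str, replication_factor: int) -> int:
    """Number of replicas that must acknowledge a request at a given consistency level."""
    if level == "ONE":
        return 1
    if level == "QUORUM":
        return replication_factor // 2 + 1  # a strict majority of replicas
    if level == "ALL":
        return replication_factor
    raise ValueError(f"unsupported level in this sketch: {level}")


def reads_see_latest_write(write_level: str, read_level: str, replication_factor: int) -> bool:
    """True when every read set must overlap every write set (R + W > RF)."""
    w = replicas_required(write_level, replication_factor)
    r = replicas_required(read_level, replication_factor)
    return r + w > replication_factor


print(reads_see_latest_write("QUORUM", "QUORUM", 3))  # True: 2 + 2 > 3
print(reads_see_latest_write("ONE", "ONE", 3))        # False: 1 + 1 <= 3
print(reads_see_latest_write("ALL", "ONE", 3))        # True: 3 + 1 > 3
```

Lower levels answer faster and tolerate more failed nodes, while higher levels give stronger guarantees, which is why being able to choose per operation is useful.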
As we might expect from a NoSQL platform used by Facebook, Amazon, and Netflix, Cassandra tends to outperform the competition in large-scale deployments. In one comparison against HBase, the closest comparable technology, Cassandra was reported to outperform HBase by a factor of 8 to 10 across the types of database operation tested.
Apache Cassandra is well suited to large-scale applications that need to access huge volumes of unstructured data. That said, Cassandra is still a good choice for smaller applications, as it delivers a high level of data protection out of the box.
Developing for Cassandra is straightforward, because most of the truly clever aspects of the technology are handled transparently and developers rarely need to write platform-specific code to benefit from them. This makes Cassandra easy to adopt, as developers need little additional training before they can start creating applications.