Apache Cassandra is a leading distributed NoSQL database management system that drives many of today’s business applications. It offers continuous availability, high scalability and performance, strong security, and operational simplicity, while lowering the overall cost of ownership. In Cassandra, any node can perform any operation; this design is known as a decentralized architecture.
Cassandra has excellent single-row read performance as long as eventual consistency semantics are sufficient for the use case. Cassandra quorum reads, which are required for strict consistency, will naturally be slower than HBase reads. Cassandra also does not support range-based row scans, which may be limiting in specific use cases.
Consistency refers to how up-to-date and synchronized all replicas of a row of Cassandra data are at any given moment. To provide the proper level of consistency for its reads and writes, Cassandra extends the concept of eventual consistency by offering tunable consistency. You can tune the consistency level per operation, or set it globally for a cluster or datacenter. Varying the consistency level for individual read or write operations makes the data returned more or less consistent, as required by the client application.
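As a rough illustration of what tuning a consistency level means in practice, the sketch below maps three common levels (ONE, QUORUM, ALL) to the number of replicas that must respond, and checks the usual overlap rule that reads see the latest write when read replicas plus write replicas exceed the replication factor. The function names `replicas_required` and `reads_see_latest_write` are hypothetical and are not part of any Cassandra API; only these three levels are modeled.

```python
def replicas_required(level: str, replication_factor: int) -> int:
    """Return how many replicas must respond for a consistency level.

    Hypothetical helper for illustration; only ONE, QUORUM and ALL are modeled.
    """
    level = level.upper()
    if level == "ONE":
        return 1
    if level == "QUORUM":
        # A quorum is a strict majority of the replicas.
        return replication_factor // 2 + 1
    if level == "ALL":
        return replication_factor
    raise ValueError(f"unsupported consistency level: {level}")


def reads_see_latest_write(read_level: str, write_level: str,
                           replication_factor: int) -> bool:
    """True when every read set must overlap every write set (R + W > RF)."""
    r = replicas_required(read_level, replication_factor)
    w = replicas_required(write_level, replication_factor)
    return r + w > replication_factor


if __name__ == "__main__":
    rf = 3  # assumed replication factor for the example
    for read_cl, write_cl in [("ONE", "ONE"), ("QUORUM", "QUORUM"), ("ONE", "ALL")]:
        print(read_cl, write_cl, reads_see_latest_write(read_cl, write_cl, rf))
```

With a replication factor of 3, reading and writing at QUORUM satisfies the overlap rule, while reading and writing at ONE does not; that is the trade between latency and consistency that per-operation tuning exposes.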
Cassandra excels at write operations; reads are also fast, but writes are faster. It offers high availability with no single point of failure (SPOF), together with tunable consistency. Cassandra is very fast at writing bulk data in sequence and reading it back sequentially, and it delivers high throughput. From an operations perspective it is easy to maintain, because its architecture is reliable and robust.
When a read request for a row comes into a node in Cassandra, the row must be combined from all SSTables on that node that contain columns from the row in question, as well as from any unflushed memtables, to produce the requested data. To optimize this piecing-together process, Cassandra uses an in-memory structure called a Bloom filter. Each SSTable has a Bloom filter associated with it that checks whether any data for the requested row might exist in the SSTable before any disk I/O is performed. Cassandra performs well on reads compared to other storage systems, even for read-heavy workloads. As in any database, reads are fastest when the hot working set fits into memory. Although all modern storage systems rely on some form of caching to allow fast access to hot data, not all of them degrade gracefully when the cache capacity is exceeded and disk I/O is required. Cassandra’s read performance benefits from built-in caching, but it also does not dip dramatically when random disk seeks are required.
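The Bloom filter check described above can be sketched as follows. This is a minimal illustration, not Cassandra's implementation: the class `BloomFilter`, the helper `sstables_to_read`, the default sizes, and the use of SHA-256 as the hash function are all assumptions made for the example.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter sketch; sizes and hashing are illustrative only."""

    def __init__(self, num_bits: int = 1024, num_hashes: int = 3) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, key: str):
        # Derive num_hashes bit positions by salting the key with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key: str) -> None:
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key: str) -> bool:
        # False means "definitely absent"; True means "possibly present".
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(key))


def sstables_to_read(row_key: str, sstables):
    """Return names of SSTables whose filter does not rule out the row key.

    `sstables` is a list of (name, BloomFilter) pairs (hypothetical layout).
    """
    return [name for name, bloom in sstables if bloom.might_contain(row_key)]


if __name__ == "__main__":
    with_row, without_row = BloomFilter(), BloomFilter()
    with_row.add("user:42")
    tables = [("sstable-1", with_row), ("sstable-2", without_row)]
    print(sstables_to_read("user:42", tables))
```

A Bloom filter never reports a stored key as absent, but it can occasionally report an absent key as possibly present. That false positive costs one unnecessary disk read, which is why the false-positive rate is a tuning concern for read performance.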
Read performance in Cassandra most often degrades when one of the following is misconfigured or poorly chosen: index interval, Bloom filter false-positive rate, consistency level, read repair chance, caching, compaction, data modeling, or cluster deployment.
To improve read performance, increasing the replication factor might help, although it will make your cache less efficient, since each node will store more data. It is probably only worth doing if your read pattern is mostly random, your data is very large, you have low consistency requirements, and your access is read heavy. If you want to decrease read latency, you can use a lower consistency level. Reading at consistency level ONE (CL.ONE) gives the lowest read latency at the cost of consistency. You will only get consistent reads at CL.ONE if writes are at CL.ALL. But if consistency is required, it is a good trade-off. If you want to increase read throughput, you can decrease read_repair_chance, on versions that support this table option. The number specifies the probability that Cassandra performs a read repair on each read. Read repair involves reading from available replicas and updating any that have old values.
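The read repair behavior described above can be modeled with a small simulation. This is a simplified sketch under stated assumptions: replicas are plain dictionaries mapping a key to a (value, timestamp) pair, the function `read_with_repair` is hypothetical, and the fast path simply answers from the first replica. Real Cassandra coordinates replicas quite differently; the sketch only shows how the probability trades extra replica reads for fresher data.

```python
import random


def read_with_repair(replicas, key, read_repair_chance, rng=random):
    """Read `key`, repairing stale replicas with the given probability.

    `replicas` is a list of dicts mapping key -> (value, timestamp).
    Hypothetical model for illustration, not Cassandra's actual read path.
    """
    if rng.random() >= read_repair_chance:
        # Fast path: answer from a single replica, which may be stale.
        return replicas[0].get(key)

    # Repair path: consult every replica and keep the newest version.
    versions = [replica[key] for replica in replicas if key in replica]
    if not versions:
        return None
    newest = max(versions, key=lambda version: version[1])

    # Update any replica that is missing the key or holds an older value.
    for replica in replicas:
        if replica.get(key) != newest:
            replica[key] = newest
    return newest


if __name__ == "__main__":
    replicas = [{"k": ("old", 1)}, {"k": ("new", 2)}, {}]
    print(read_with_repair(replicas, "k", read_repair_chance=1.0))
    print(replicas)
```

Lowering the probability means fewer reads take the repair path, so each read touches fewer replicas and throughput rises, while stale replicas stay stale for longer.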