16: Q114 – Q115 CAP theorem interview Q&As

Q114. What does CAP stand for in CAP theorem?
A114. CAP stands for Consistency, Availability, and Partition-tolerance. In a distributed system with two or more nodes that maintains two or more copies of your data for fault tolerance, the CAP theorem can be depicted & explained as below:

Figure: CAP Theorem (source: https://dzone.com/articles/better-explaining-cap-theorem)

Consistency – Every read returns the most recent write. Note that this is a replica-level guarantee and is not the same as the “C” in the SQL ACID properties, which is about a transaction leaving the database in a valid state. Every node must be updated before any further reads are allowed. For example, if Client A writes 1 to node X and then updates it to 2, Client B cannot read 1 from node Y. In other words, all the clients will see the most recent copy of the data (i.e. 2).
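
The Client A / Client B example can be sketched in Java. This is a minimal in-memory sketch, not a real database client: the class names ConsistentReplica and SyncReplicatedStore are hypothetical, and replication is modelled as a plain method call that updates every node before the write returns.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory replica; stands in for nodes X and Y.
class ConsistentReplica {
    private final Map<String, Integer> data = new HashMap<>();

    void apply(String key, int value) { data.put(key, value); }

    Integer read(String key) { return data.get(key); }
}

// A write returns only after every replica has applied the update,
// so a later read from any node sees the most recent write.
class SyncReplicatedStore {
    private final List<ConsistentReplica> replicas = new ArrayList<>();

    ConsistentReplica addReplica() {
        ConsistentReplica replica = new ConsistentReplica();
        replicas.add(replica);
        return replica;
    }

    void write(String key, int value) {
        for (ConsistentReplica replica : replicas) {
            replica.apply(key, value);
        }
    }
}

class ConsistencyDemo {
    public static void main(String[] args) {
        SyncReplicatedStore store = new SyncReplicatedStore();
        ConsistentReplica nodeX = store.addReplica();
        ConsistentReplica nodeY = store.addReplica();

        store.write("k", 1); // Client A writes 1
        store.write("k", 2); // Client A updates it to 2

        // Client B reads from node Y; node X holds the same value.
        System.out.println("node X returns " + nodeX.read("k"));
        System.out.println("node Y returns " + nodeY.read("k"));
    }
}
```

In this sketch both nodes return 2, because no read can happen between the update of node X and the update of node Y.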

Availability – Every request received by a node that has not failed gets a response. Availability is achieved by replicating the data across different machines. For example, HDFS (i.e. Hadoop Distributed File System) keeps 3 copies of each data block by default. If one node fails, the data will be available from another node.

Partition-tolerance – If the network stops delivering messages between sets of nodes in two different subnets, will the system continue to work correctly? It can be achieved by replicating the data across combinations of nodes and subnets.

Partition tolerance means that the cluster continues to function even if there is a “partition” (i.e. a communication breakdown) between two nodes, even though both nodes are up and their clients can talk to either one or both of them. In order to get both availability and partition tolerance, you have to give up consistency, as your updates may not get replicated in time.

The NoSQL consistency is known as “eventual” consistency, as explained in 9 Java Transaction Management Interview Q&As.

The CAP theorem asserts that any networked shared-data system can have only two of the three desirable properties, that is C & A, A & P, or C & P. This is not completely true of modern systems, as there is a wide range of flexibility in handling partitions and recovering from them.

As an example of tolerating a network partition: when you have a number of AWS EC2 nodes in a VPC across multiple availability zones and a particular availability zone goes down, the system must favour either consistency or availability.

In systems that allow reads before updating all nodes, you will get high availability. In systems that lock all the nodes before performing any reads, you will get better consistency.
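
The trade-off between favouring consistency and favouring availability during a partition can be illustrated with a small Java simulation. This is only a sketch under simplifying assumptions: TwoNodeCluster and its Mode enum are hypothetical names, the "cluster" is two in-memory maps, and a partition is a boolean flag rather than a real network failure.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical two-node cluster used only to illustrate the trade-off.
class TwoNodeCluster {
    enum Mode { FAVOUR_CONSISTENCY, FAVOUR_AVAILABILITY }

    private final Mode mode;
    private final Map<String, String> nodeX = new HashMap<>();
    private final Map<String, String> nodeY = new HashMap<>();
    private boolean partitioned = false;

    TwoNodeCluster(Mode mode) { this.mode = mode; }

    void setPartitioned(boolean partitioned) { this.partitioned = partitioned; }

    // A client connected to node X writes a value.
    // Returns true if the write was accepted.
    boolean writeToX(String key, String value) {
        if (!partitioned) {
            nodeX.put(key, value);
            nodeY.put(key, value); // replication to node Y succeeds
            return true;
        }
        if (mode == Mode.FAVOUR_CONSISTENCY) {
            return false; // refuse the write: consistent, but not available
        }
        nodeX.put(key, value); // accept locally: available, but node Y is stale
        return true;
    }

    String readFromX(String key) { return nodeX.get(key); }

    String readFromY(String key) { return nodeY.get(key); }
}

class PartitionDemo {
    public static void main(String[] args) {
        for (TwoNodeCluster.Mode mode : TwoNodeCluster.Mode.values()) {
            TwoNodeCluster cluster = new TwoNodeCluster(mode);
            cluster.writeToX("k", "v1");   // replicated to both nodes
            cluster.setPartitioned(true);  // X and Y can no longer talk
            boolean accepted = cluster.writeToX("k", "v2");
            System.out.println(mode + ": write accepted=" + accepted
                    + ", X=" + cluster.readFromX("k")
                    + ", Y=" + cluster.readFromY("k"));
        }
    }
}
```

When consistency is favoured, the second write is rejected and both nodes still hold v1. When availability is favoured, the write is accepted, but node X holds v2 while node Y still serves v1.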

Q115. Where can you apply the CAP theorem?
A115. With SQL and NoSQL, choosing a database can be a daunting task.

Consistency & Availability hold true in traditional SQL databases like MySQL, PostgreSQL, etc.

Availability & Partition-tolerance hold true in NoSQL databases like Cassandra, CouchDB, etc.

Consistency & Partition-tolerance hold true for HBase, MongoDB, Redis, MemcacheDB, etc.

For example,

The MySQL database satisfies “Consistency & Availability” of the CAP theorem with either a monolithic single-server database or with replication where all the data is on one “failure block”.

CA: data is consistent among all nodes whilst all nodes are online, and you can read/write from any node and be sure that the data is the same. If a partition develops among the nodes, then the data will be out of sync, and won’t re-sync once the partition is resolved.

The Cassandra NoSQL database satisfies “Availability & Partition-tolerance” of the CAP theorem with eventual consistency.

AP: nodes remain online even if they can’t communicate with each other, and will re-sync data once the partition is resolved. This happens at the expense of consistency, where you aren’t guaranteed that all nodes will have the same data either during or after the partition.
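
The AP behaviour can be sketched as two nodes that keep accepting writes during a partition and reconcile afterwards. The reconciliation rule used here, last-write-wins by timestamp, is an assumption chosen for illustration, and the names Versioned, ApNode and resync are hypothetical. It is not a description of how any particular database implements re-sync.

```java
import java.util.HashMap;
import java.util.Map;

// A value tagged with the logical time at which it was written.
final class Versioned {
    final String value;
    final long timestamp;

    Versioned(String value, long timestamp) {
        this.value = value;
        this.timestamp = timestamp;
    }
}

// Hypothetical AP node: it always accepts writes locally.
class ApNode {
    private final Map<String, Versioned> data = new HashMap<>();

    void write(String key, String value, long timestamp) {
        merge(key, new Versioned(value, timestamp));
    }

    String read(String key) {
        Versioned v = data.get(key);
        return v == null ? null : v.value;
    }

    // Keep the incoming entry only if it is newer than the one held.
    private void merge(String key, Versioned incoming) {
        Versioned current = data.get(key);
        if (current == null || incoming.timestamp > current.timestamp) {
            data.put(key, incoming);
        }
    }

    // Two-way re-sync, run once the partition is resolved.
    static void resync(ApNode a, ApNode b) {
        a.data.forEach(b::merge);
        b.data.forEach(a::merge);
    }
}

class ResyncDemo {
    public static void main(String[] args) {
        ApNode x = new ApNode();
        ApNode y = new ApNode();

        // During the partition each node accepts a different write.
        x.write("k", "from-X", 1L);
        y.write("k", "from-Y", 2L);
        System.out.println("during partition: X=" + x.read("k") + ", Y=" + y.read("k"));

        ApNode.resync(x, y);
        System.out.println("after re-sync: X=" + x.read("k") + ", Y=" + y.read("k"));
    }
}
```

During the partition the two nodes disagree. After the re-sync both hold the later write, and the earlier write to node X is silently discarded, which is the kind of consistency cost described above.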

Cassandra NoSQL database

Pros:

— Cassandra runs on large clusters with no single point of failure.
— You can use an SQL-like language (CQL) for development, and it is optimized for writes and gives excellent single-row read performance, as long as eventual-consistency semantics are acceptable for the use case.
— Supports secondary indexes on column families.

Cons:

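— Limited support for range-based scans (efficient only on clustering columns within a partition).
— Atomic compare & set is available only through the comparatively expensive lightweight transactions (Cassandra 2.0 and later).
— Only basic aggregations (e.g. COUNT, MIN, MAX, SUM, AVG) are supported in recent versions.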

The HBase NoSQL database satisfies “Consistency & Partition-tolerance”.

CP: data is consistent among all nodes, and partition tolerance is maintained by becoming unavailable when a node goes down.

HBase NoSQL database

Pros:

— RDBMS-like triggers & stored procedures (via coprocessors).
— HDFS support.
— Supports range-based row scans.
— Supports atomic compare & set.
— Optimized for reads from the region servers.
— Supports aggregations, say an aggregation on a particular column.

Cons:

— Single point of failure if only one HBase Master is used. The master coordinates region assignment, whereas reads and writes are served by the region servers.
— Inter-row operations are not atomic.
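
The “atomic compare & set” operation mentioned in the lists above means: update a value only if it still holds the value you expect, as one indivisible step. The following Java sketch shows the semantics on an in-memory map using the JDK's ConcurrentMap.replace(key, oldValue, newValue). The rows map and the compareAndSet helper are hypothetical stand-ins, not the HBase or Cassandra client API.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

class CompareAndSetDemo {
    // Hypothetical in-memory "row store": row key -> status value.
    static final ConcurrentMap<String, String> rows = new ConcurrentHashMap<>();

    // Atomically change a row from expected to updated.
    // Returns false if the row no longer holds the expected value.
    static boolean compareAndSet(String rowKey, String expected, String updated) {
        return rows.replace(rowKey, expected, updated);
    }

    public static void main(String[] args) {
        rows.put("order-1", "NEW");

        // The first caller sees the expected value and wins.
        boolean first = compareAndSet("order-1", "NEW", "PAID");
        // The second caller still expects NEW, so its update is rejected.
        boolean second = compareAndSet("order-1", "NEW", "CANCELLED");

        System.out.println("first=" + first + ", second=" + second
                + ", final=" + rows.get("order-1"));
    }
}
```

Without an atomic compare & set, two clients could both read NEW and both overwrite it, and the later write would silently win.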

