00: 6 Distributed systems interview Q&As

Q1. What are the key requirements to be a distributed system?
A1. A distributed system must satisfy the following 3 characteristics.

1) The computers or nodes operate concurrently.

2) The computers or nodes fail independently, hence must be fault tolerant. Also, scale independently where you can add new nodes to the cluster.

3) The computers or nodes do not share a global clock. The nodes could be on the cloud across different availability zones & time zones. “ap-southeast-1” & “ap-southeast-2” are different availability zones in the region “asia-pacific (Sydney)”.

EC2 nodes across multiple availability zones with auto scaling

Q2. What are some of the distributed systems, and why are they so popular?
A2. HDFS “Data Node” is for storage & “YARN” node manager for the computation.

A typical Hadoop Architecture with storage & compute resources

1) Distributed storage systems like Apache Hadoop, NoSQL databases like HBase, Cassandra, etc and Cloud based data warehouse like Snowflake.

2) Distributed computing with Apache Spark, Apache Storm, AWS EMR, etc.

3) Distributed messaging systems like Apache Kafka, etc.

Q. Why distributed systems are popular?

1) Firstly, modern architectures are cloud based.

2) Modern architectures need to scale independently. Server nodes in 100s with fault-tolerance as opposed to in tens. So, be prepared to use terms like the application was deployed to “24 core 100 node cluster in AWS”, etc.

3) Data sizes in Terabytes & Petabytes as opposed to gigabytes. You need to handle not only structured data, but also semi-structured data like log files, event logs, etc and unstructured data like PDFs, Word documents, etc.

4) Distributed Architectures are often 1) event-driven and 2) asynchronous with more complexities for Solutioning & debugging.

a) Cap theorem asserts that any networked shared-data system can have only two of three desirable properties as in “Consistency & Availability”, “Availability & Partition-tolerance”, or “Consistency & Partition-tolerance”.

b) Nodes can fail independently & new nodes can be added. So, concepts like Consistent Hashing ring, coordinating services like Zookeeper where you can use distributed counters, locks etc.

c) Out of order processing as events are streamed potentially in out of order from various event sources requiring additional logic.

e) Replay considerations as the master data in the HDFS (i.e. Hadoop Distributed File System) raw zone needs to be rerun in batch modes with renewed access patterns, business logic & rules. Append only systems (E.g. Big Data on HDFS) don’t update data models, but keep appending to the existing data & evaluate the current state by 1) deleting the current data models and 2) replaying the event logs (i.e. the master data). These jobs are run as batch jobs with eventual consistency. This is known as event-sourcing.

f) The NoSQL (i.e. Not Only SQL) databases make development quicker as they are schema-less (i.e. implicit schema), scale rapidly by adding more nodes & handle large volumes of aggregated data (i.e. you store Order, Line Items, and possibly Customer details together as a JSON clob or key-value pairs), which minimizes the number of joins. The NoSQL databases don’t handle ACID (e.g. transaction boundaries, isolation levels, constraints like foreign key, etc) properties, and shift that responsibility to the application layer. This can add more challenges.

g) The data structuring, partitioning & distribution challenges – Skewed data, data shuffling, cartesian join prevention, etc. You will see potential challenges and solutions on the post 15 Apache Spark best practices & performance tuning interview FAQs.

h) Race conditions, performance issues, and out-of-memory issues when dealing with larger volumes of immutable & append-only data logs that need to be replayed into memory or indexed data models for responsiveness & real-time querying.

Q3. Can you name some of the distributed systems?

Snowflake is a fully cloud based unlimited storage and compute. Snowflake is a massively parallel processing (i.e. MPP) database that is fully relational, ACID compliant, and processes standard SQL.

Amazon Redshift is a fully managed, petabyte-scale data warehouse service in the cloud.

Delta Lake is an open source storage layer that brings reliability to data lakes. Delta Lake provides ACID transactions, scalable metadata handling, and unifies streaming and batch data processing.

Apache Kafka is a distributed streaming platform that is used to build real time streaming data pipelines and applications that adapt to data streams.

Q4. How will you be going about choosing between an SQL Vs NoSQL database?
A4. Any non-trivial application requires a persistent store. Now a days you have many choices between monolithic RDBMs databases & distributed NoSQL databases like MongoDB, HBase, Cassandra, etc.

CAP Theorem

CAP Theorem – source: https://dzone.com/articles/better-explaining-cap-theorem

What is a distributed database?

Distributed databases are logically interrelated with each other to often represent as a single logical database, but the data is physically stored across multiple nodes in multiple sites and independently managed.

In general, distributed databases have the following features:

1) Location independent: For example, across multiple AWS availability zones.

2) Distributed query processing: For example, Apache Spark, Hive, Impala, etc

3) Distributed transaction management (E,g. Paxos, Raft, Zab, etc) & hardware, operating system & network independent.

When to use distributed databases

1) When the amount of data stored & the number of reads are getting very larger & larger the distributed databases are much easier to horizontally scale by adding more nodes/computers. They can also store both structured & unstrctured data.

2) When the data and DBMS software are distributed over several nodes & sites, one site may fail whilst other sites continue to operate and you are not only able to access the data, but also access the data that exist at the failed site due to the data replication. This gives increased reliability and availability as there is no single point of failure.

3) When the queries are broken up & executed in parallel better performance can be achieved for a very large volume of data.

Q5. What are some of the basic concepts you must know when working with the distributed systems?

1. Partitioning

You need to have a way to distribute the data across multiple nodes. For example, hashing a part of every table’s primary key the partition key and assigning the hashed values (called tokens) to specific nodes in the cluster. It is important to consider:

Partition key must have enough values to spread the data for each table evenly across all the nodes in the cluster. Otherwise your data will be skewed, where some nodes will be working harder than the other nodes. The partitions which are highly loaded become the bottlenecks for the system, and this is known as hotspotting.

Typical real-world partition keys are user id, device id, account number etc. along with a time modifier like year-month or year added to the partition key to spread the data.

Key Range partitioning is form of partitioning, where you divide the entire keys into continuous ranges and assign each range to a partition. The downside of using key range partitioning is that if the range boundaries are not decided properly, it may lead to hotspots. Key Hash Partitioning is a form of partitioning to apply a hash function on the key which results in a hash value, and then mod it with the number of partitions. The same key will always return the same hash code. The real problem with hash partitioning starts when you change the number of partitions. Consistent Hashing to the rescue where the output range of a hash function is treated as a fixed circular space or “ring”. Each node in the system is assigned a random value within this space which represents its “position” on the ring.

2. Replication

The same piece of data is usually replicated to multiple machines/nodes to give fault tolerance & high availability.

3. Consistency

When you update that piece of information on one of the machine/node, it may take some time (usually milliseconds) to reach every machine/node that holds the replicated data. This creates the possibility that you might get information that hasn’t yet updated on the replica. In many scenarios, the this anomaly is acceptable. In scenarios where this type of behaviour is not acceptable, the NoSQL databases may give the the choice, per transaction, whether data to be eventually consistent or strongly consistent.

4. Partition tolerance

Tolerating partitioning in the network, when you have a number of EC2 nodes in a VPC across multiple availability zones and a particular availability zone goes down, then the system should either favour consistency or availability as per the CAP theorem diagram shown above.

HBase, MongoDB, Redis, etc favours consistency, whilst Cassandra and AWS Dynamo DB favours availability.

5. File formats & compression

Distributed systems use container data formats like Avro, Parquet, ORC, Sequence file, etc where the data can be compressed with algorithms like LZO, Snappy, LZ4, etc and data can be easily split to be distributed across multiple nodes. 4 key considerations in choosing the storage file formats are:

1) Usage patterns like accessing 5 columns out of 50 columns vs accessing most of the columns.
2) Splittability to be processed in parallel.
3) Block compression saving storage space vs read/write/transfer performance
4) Schema evolution to add fields, modify fields, and rename fields.

6. Schema-on-read

Data can be stored schema-less and a structure can applied during processing (i.e. on-read) time based on the requirements of the processing application. This is different from “Schema-On-Write”, which is used in RDBMs where schema need to be defined before the data can be loaded.

Despite the schema-less nature, schema design is an important consideration. This includes directory structures and schema of objects stored as metadata (E.g. Hive meta store).

7. Shared nothing architecture

In distributed systems each node is completely independent of other nodes. There are no shared resources like CPU, memory, and disk storage that can become a bottle-neck. For example, Hadoop’s processing frameworks like Spark, Pig, Hive, Impala, etc processes distinct subset of the data and there is no need to manage access to the shared data. “Sharing nothing” architectures are very

1. Scalable as more nodes can be added without further contention.

2. Fault tolerant as each node is independent, and there are no single points of failure, and the system can quickly recover from a failure of an individual node.

Q6. What is a Paxos protocol?
A6. Distributed systems are implemented for high availability and scalability using many commodity machines, hence these machines are not reliable and these come up and go down quite often. In such systems, there is often a need to agree upon “something” i.e. to have consensus.

Each server node keeps its own log and nodes can communicate with each other to establish the order of the events. How can all nodes agree on a consistent view of the log? Paxos is a consensus protocol to agree on the order.

Paxos/Raft/Zab are the different variation of distributed consensus algorithms. They follow 2 key principles:

1) two phase commit, and
2) quorum majority vote.

This means a value is committed only when majority nodes have gone through the above 2-phase commit process.

Zab stands for Zookeeper Atomic Broadcast. Raft algorithms simplify Paxos algorithms.

Please clear & refresh your browser history/cache if you face any issues.