Q1. What are the key requirements to be a distributed system?
A1. A distributed system must satisfy the following 3 characteristics.
1) The computers or nodes operate concurrently.
2) The computers or nodes fail independently, hence the system must be fault tolerant. The nodes also scale independently, as you can add new nodes to the cluster.
3) The computers or nodes do not share a global clock. The nodes could be on the cloud across different availability zones & time zones. For example, “ap-southeast-2a” & “ap-southeast-2b” are different availability zones in the region “ap-southeast-2”, which is Asia Pacific (Sydney).
Q2. What are some of the distributed systems, and why are they so popular?
A2. Modern architectures use distributed 1) Storage & 2) computation.
Microservices use distributed computing & storage with benefits such as:
1) Fast & parallel development & deployment.
2) Improved scalability, availability & fault tolerance.
3) More closely integrated with DevOps & aligned to business domains.
Challenges are mostly related to distributed storage & computing, such as:
a) Transactions spanning multiple databases across multiple services. “Service A” may use Amazon RDS (i.e. Relational Database Service) with Amazon ElastiCache, “Service B” may use the Amazon NoSQL database DynamoDB, and “Service C” may use just Amazon RDS with Kinesis to stream data to a machine learning system.
b) The most feasible way to handle consistency across microservices is via eventual consistency. This model doesn’t enforce distributed ACID transactions across microservices. Instead, it favours mechanisms that ensure the system becomes consistent at some point in the future. One such approach is event sourcing, which is an event-centric approach to business logic design and persistence. See also: 9 Java Transaction Management Interview Q&As.
c) Use of event sourcing for persistence, where a service persists each aggregate as a sequence of events. It reconstructs the current state of an aggregate by loading the events and replaying them. In functional programming terminology, the current state is a fold/reduce over the events.
d) Lots of moving parts requiring increased coordination & message routing.
e) Every Microservice must have its own rollback (i.e. compensating) method, which either rolls back its changes or triggers subsequent actions (e.g. sending a notification).
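The event sourcing idea in point c) can be sketched in a few lines of Python. This is only a minimal illustration: the “Event” class, the “deposited”/“withdrawn” event kinds and the account balance aggregate are hypothetical names chosen for the example, and a real event store would also deal with persistence, versioning & snapshots.

```python
from dataclasses import dataclass
from functools import reduce


@dataclass(frozen=True)
class Event:
    kind: str    # hypothetical event types: "deposited" or "withdrawn"
    amount: int


def apply(balance, event):
    """Apply one event to the current state and return the new state."""
    if event.kind == "deposited":
        return balance + event.amount
    if event.kind == "withdrawn":
        return balance - event.amount
    raise ValueError(f"unknown event kind: {event.kind}")


def replay(events, initial=0):
    """Rebuild the aggregate's current state as a fold/reduce over its event log."""
    return reduce(apply, events, initial)


# The log is append-only; the balance is never stored, only derived.
log = [Event("deposited", 100), Event("withdrawn", 30), Event("deposited", 5)]
current_balance = replay(log)
print(current_balance)
```

Because the state is derived, changing the business logic inside “apply” & replaying the same log gives a new view of the data without touching the stored events.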
In Big Data systems, the HDFS (i.e. Hadoop Distributed File System) “Data Node” is for storage & the “YARN” (i.e. Yet Another Resource Negotiator) node manager is for the computation. The data is distributed across multiple machines (aka nodes) & the computation code is also copied across multiple machines & executed (E.g. via an Apache Spark executor) against each machine’s data concurrently. Each machine returns its subset of the results to the master node, which combines them & returns the final result to the client.
So, unlike multithreading, which shares CPU, storage & memory with other threads running within the same machine, the distributed systems use a shared nothing architecture whereby each node has its own CPU, storage & memory. These nodes scale horizontally. In cloud computing this can be automated as “auto scaling”, which is enabled with a configuration change via a console.
1) Distributed storage systems like Apache Hadoop Distributed File System (i.e. HDFS), NoSQL databases like HBase, Cassandra, MongoDB, Redis, etc and data warehouses like Snowflake, Apache Hive, Amazon Redshift, etc.
2) Distributed computing with Microservices, Apache Spark, Apache Storm, AWS EMR (i.e. Elastic Map Reduce), etc.
3) Distributed messaging systems like Apache Kafka, Amazon Kinesis Data Streams, Amazon SQS (i.e. Simple Queue Service), Amazon SNS (i.e. Simple Notification Service with pub/sub model), etc. Amazon SQS & SNS are useful in building Microservices.
Q. Why are distributed systems popular?
1) Firstly, modern architectures are cloud based.
2) Modern architectures need to scale independently, with fault-tolerant server nodes in the 100s as opposed to in the tens. So, be prepared to use terms like the application was deployed to a “24 core 100 node cluster in AWS”, “3.6 Petabytes of structured & semi-structured data was stored across a 280 node cluster”, etc.
3) Data sizes in Terabytes & Petabytes as opposed to gigabytes. You need to handle not only structured data, but also semi-structured data like XML, JSON, log files, event logs, etc and unstructured data like e-mail messages, PDFs, Word documents, etc.
4) Distributed Architectures are often 1) event-driven and 2) asynchronous, with more complexities in solution design & debugging, such as:
a) The CAP theorem asserts that any networked shared-data system can have only two of three desirable properties, as in “Consistency & Availability”, “Availability & Partition-tolerance”, or “Consistency & Partition-tolerance”.
b) Nodes can fail independently & new nodes can be added. So, you need concepts like the Consistent Hashing ring & coordinating services like Zookeeper, which you can use to distribute configuration settings across nodes, have distributed counters, distributed locks, etc.
c) Out of order processing, as events are potentially streamed out of order from various event sources, requiring additional logic.
e) Replay considerations as the master data in the HDFS (i.e. Hadoop Distributed File System) raw zone needs to be rerun in batch modes with renewed access patterns, business logic & rules. Append only systems (E.g. Big Data on HDFS) don’t update data models, but keep appending to the existing data & evaluate the current state by 1) deleting the current data models and 2) replaying the event logs (i.e. the master data). These jobs are run as batch jobs with eventual consistency. This is known as event-sourcing.
f) The NoSQL (i.e. Not Only SQL) databases make development quicker as they are schema-less (i.e. implicit schema), scale rapidly by adding more nodes & handle large volumes of aggregated data (i.e. you store Order, Line Items, and possibly Customer details together as a JSON clob or key-value pairs), which minimizes the number of joins. Many NoSQL databases don’t fully handle ACID (e.g. transaction boundaries, isolation levels, constraints like foreign key, etc) properties, and shift that responsibility to the application layer. This can add more challenges.
g) The data structuring, partitioning & distribution challenges – Skewed data, data shuffling, cartesian join prevention, etc. You will see potential challenges and solutions on the post 15 Apache Spark best practices & performance tuning interview FAQs.
h) Race conditions, performance issues, and out-of-memory issues when dealing with larger volumes of immutable & append-only data logs that need to be replayed into memory or indexed data models for responsiveness & real-time querying.
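As an illustration of the “additional logic” that point c) refers to, here is a minimal sketch of a reorder buffer. It assumes each event carries its own event time & that events are allowed to arrive up to “allowed_lateness” time units late. The function name & parameters are hypothetical, and streaming frameworks offer far richer watermarking than this.

```python
import heapq


def reorder(events, allowed_lateness):
    """Yield (event_time, payload) pairs in event-time order.

    `events` is an iterable of (event_time, payload) in ARRIVAL order.
    An event is held in a min-heap until the highest event time seen so far
    is at least `allowed_lateness` ahead of it. Events arriving later than
    that are still emitted, but out of order; a real system would typically
    drop them or route them to a side output.
    """
    buffer = []
    watermark = float("-inf")
    for seq, (event_time, payload) in enumerate(events):
        # seq breaks ties so that payloads are never compared by the heap
        heapq.heappush(buffer, (event_time, seq, payload))
        watermark = max(watermark, event_time - allowed_lateness)
        while buffer and buffer[0][0] <= watermark:
            t, _, p = heapq.heappop(buffer)
            yield t, p
    # End of stream: flush whatever is still buffered, in order.
    while buffer:
        t, _, p = heapq.heappop(buffer)
        yield t, p


arrivals = [(1, "a"), (3, "c"), (2, "b"), (5, "e"), (4, "d")]
ordered = [payload for _, payload in reorder(arrivals, allowed_lateness=2)]
print(ordered)
```

The trade-off is latency versus completeness: a larger “allowed_lateness” tolerates more disorder but delays every result by that amount.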
Q3. Can you name some of the distributed systems?
Snowflake is a fully cloud based data warehouse with elastically scalable storage and compute. Snowflake is a massively parallel processing (i.e. MPP) database that is fully relational, ACID compliant, and processes standard SQL.
Amazon Redshift is a fully managed, petabyte-scale data warehouse service in the cloud.
Delta Lake is an open source storage layer that brings reliability to data lakes. Delta Lake provides ACID transactions, scalable metadata handling, and unifies streaming and batch data processing.
Apache Kafka is a distributed streaming platform that is used to build real time streaming data pipelines and applications that adapt to data streams.
Q4. How will you go about choosing between an SQL & a NoSQL database?
A4. Any non-trivial application requires a persistent store. Nowadays you have many choices between monolithic RDBMS databases & distributed NoSQL databases like MongoDB, HBase, Cassandra, etc.
What is a distributed database?
A distributed database is a collection of logically interrelated databases that is often presented as a single logical database, but the data is physically stored across multiple nodes in multiple sites and independently managed.
In general, distributed databases have the following features:
1) Location independent: For example, across multiple AWS availability zones.
2) Distributed query processing: For example, Apache Spark, Hive, Impala, etc
3) Distributed transaction management (E.g. Paxos, Raft, Zab, etc), and independence from the hardware, operating system & network.
When to use distributed databases
1) When the amount of data stored & the number of reads keep growing larger & larger, the distributed databases are much easier to horizontally scale by adding more nodes/computers. They can also store both structured & unstructured data.
2) When the data and DBMS software are distributed over several nodes & sites, one site may fail whilst other sites continue to operate, and you are able to access not only the data at the surviving sites, but also the data that exists at the failed site due to the data replication. This gives increased reliability and availability as there is no single point of failure.
3) When the queries are broken up & executed in parallel, better performance can be achieved for a very large volume of data.
Q5. What are some of the basic concepts you must know when working with the distributed systems?
You need to have a way to distribute the data across multiple nodes. For example, hashing a part of every table’s primary key, called the partition key, and assigning the hashed values (called tokens) to specific nodes in the cluster. It is important to consider:
Partition key must have enough values to spread the data for each table evenly across all the nodes in the cluster. Otherwise your data will be skewed, where some nodes will be working harder than the other nodes. The partitions which are highly loaded become the bottlenecks for the system, and this is known as hotspotting.
Typical real-world partition keys are user id, device id, account number etc. along with a time modifier like year-month or year added to the partition key to spread the data.
Key Range partitioning is a form of partitioning, where you divide the entire key space into continuous ranges and assign each range to a partition. The downside of using key range partitioning is that if the range boundaries are not decided properly, it may lead to hotspots.
Key Hash Partitioning is a form of partitioning where you apply a hash function on the key, which results in a hash value, and then mod it with the number of partitions. The same key will always return the same hash code. The real problem with hash partitioning starts when you change the number of partitions, as most keys will then map to a different partition & the data has to be moved.
Consistent Hashing comes to the rescue, where the output range of a hash function is treated as a fixed circular space or “ring”. Each node in the system is assigned a random value within this space, which represents its “position” on the ring. Each key is hashed onto the same ring & is stored on the first node found by walking the ring clockwise from the key’s position, so adding or removing a node only moves the keys next to it on the ring.
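The difference between “hash mod N” and a consistent hashing ring can be sketched as below. This is a simplified illustration, not how any particular database implements it. The node names, the use of MD5 as a stable hash & the 100 virtual positions per node (“vnodes”) are all assumptions made for the example. Node positions are derived by hashing the node name rather than drawn at random, so that the example is repeatable.

```python
import bisect
import hashlib


def stable_hash(key):
    """Deterministic hash; Python's built-in hash() of str is salted per process."""
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)


def mod_partition(key, num_partitions):
    """Key hash partitioning: hash the key, then mod by the partition count."""
    return stable_hash(key) % num_partitions


class HashRing:
    """Consistent hashing ring with several virtual positions per node."""

    def __init__(self, nodes, vnodes=100):
        self._vnodes = vnodes
        self._positions = []  # sorted positions on the ring
        self._owner = {}      # position -> node name
        for node in nodes:
            self.add_node(node)

    def add_node(self, node):
        for i in range(self._vnodes):
            pos = stable_hash(f"{node}#{i}")
            self._owner[pos] = node
            bisect.insort(self._positions, pos)

    def node_for(self, key):
        # First position clockwise from the key's hash, wrapping around at the end.
        idx = bisect.bisect_right(self._positions, stable_hash(key))
        if idx == len(self._positions):
            idx = 0
        return self._owner[self._positions[idx]]


keys = [f"user-{i}" for i in range(1000)]

# Growing from 4 to 5 partitions with "hash mod N" remaps most keys.
moved_mod = sum(mod_partition(k, 4) != mod_partition(k, 5) for k in keys)

# Growing the ring from 4 to 5 nodes only moves keys onto the new node.
ring = HashRing(["node-a", "node-b", "node-c", "node-d"])
before = {k: ring.node_for(k) for k in keys}
ring.add_node("node-e")
after = {k: ring.node_for(k) for k in keys}
moved_ring = [k for k in keys if before[k] != after[k]]

print(f"mod N moved {moved_mod} keys, the ring moved {len(moved_ring)} keys")
```

With “hash mod N” roughly N/(N+1) of the keys are expected to move when one partition is added, whereas on the ring roughly 1/(N+1) of them move, and every moved key lands on the new node.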
The same piece of data is usually replicated to multiple machines/nodes to give fault tolerance & high availability.
When you update that piece of information on one of the machines/nodes, it may take some time (usually milliseconds) to reach every machine/node that holds the replicated data. This creates the possibility that you might read information that hasn’t yet been updated on the replica. In many scenarios, this anomaly is acceptable. In scenarios where this type of behaviour is not acceptable, the NoSQL databases may give you the choice, per request, whether a read is to be eventually consistent or strongly consistent.
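The stale read described above can be simulated with a toy store that has one primary copy & one lagging replica. Everything here is a hypothetical model for illustration: the “ReplicatedStore” class, the explicit “replicate” step & the “consistent” flag are assumptions. As a real-world parallel, DynamoDB reads accept a “ConsistentRead” flag that plays a similar role.

```python
class ReplicatedStore:
    """Toy key-value store with one primary and one asynchronously updated replica."""

    def __init__(self):
        self.primary = {}
        self.replica = {}
        self._pending = []  # writes not yet applied to the replica

    def write(self, key, value):
        # The write is acknowledged once the primary has it.
        self.primary[key] = value
        self._pending.append((key, value))

    def replicate(self):
        """Models replication catching up after a short delay."""
        for key, value in self._pending:
            self.replica[key] = value
        self._pending.clear()

    def read(self, key, consistent=False):
        # Strongly consistent reads go to the primary;
        # eventually consistent reads go to the replica, which may lag.
        source = self.primary if consistent else self.replica
        return source.get(key)


store = ReplicatedStore()
store.write("balance", 100)
store.replicate()
store.write("balance", 70)                     # replica has not caught up yet

stale = store.read("balance")                  # eventually consistent read
fresh = store.read("balance", consistent=True)
store.replicate()
converged = store.read("balance")              # replica has now converged
print(stale, fresh, converged)
```

The eventually consistent read is cheaper & can be served by any replica, which is why it is usually the default; the strongly consistent read costs more latency in exchange for never being stale.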
4. Partition tolerance
Partition tolerance means tolerating partitioning in the network. When you have a number of EC2 nodes in a VPC across multiple availability zones and a particular availability zone goes down, then the system should either favour consistency or availability as per the CAP theorem diagram shown above.
HBase, MongoDB, Redis, etc favour consistency, whilst Cassandra and Amazon DynamoDB favour availability.
5. File formats & compression
Distributed systems use container data formats like Avro, Parquet, ORC, Sequence file, etc where the data can be compressed with algorithms like LZO, Snappy, LZ4, etc and data can be easily split to be distributed across multiple nodes. 4 key considerations in choosing the storage file formats are:
1) Usage patterns like accessing 5 columns out of 50 columns vs accessing most of the columns.
2) Splittability to be processed in parallel.
3) Block compression saving storage space vs read/write/transfer performance
4) Schema evolution to add fields, modify fields, and rename fields.
Data can be stored schema-less and a structure can be applied during processing (i.e. on-read) time based on the requirements of the processing application. This is different from “Schema-On-Write”, which is used in RDBMS where the schema needs to be defined before the data can be loaded.
Despite the schema-less nature, schema design is an important consideration. This includes directory structures and schema of objects stored as metadata (E.g. Hive meta store).
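Schema-on-read can be illustrated with raw JSON lines that are stored as-is & only given types when a job reads them. The field names, the “READ_SCHEMA” mapping & the “read_with_schema” helper are hypothetical, and this sketch assumes that a missing field should simply become None.

```python
import json

# Raw data as it was landed: every value is a string, and fields may be missing.
RAW_LINES = [
    '{"order_id": "1", "amount": "19.99", "country": "AU"}',
    '{"order_id": "2", "amount": "5.00"}',
]

# The schema lives with the reading application, not with the stored data.
READ_SCHEMA = {"order_id": int, "amount": float, "country": str}


def read_with_schema(lines, schema):
    """Parse each raw line and project it onto the reader's schema."""
    for line in lines:
        record = json.loads(line)
        yield {
            field: (cast(record[field]) if field in record else None)
            for field, cast in schema.items()
        }


rows = list(read_with_schema(RAW_LINES, READ_SCHEMA))
print(rows)
```

A different application could read the same files with a different schema, for example only “order_id” & “country”, without the data being rewritten.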
7. Shared nothing architecture
In distributed systems each node is completely independent of other nodes. There are no shared resources like CPU, memory, and disk storage that can become a bottleneck. For example, Hadoop’s processing frameworks like Spark, Pig, Hive, Impala, etc each process a distinct subset of the data and there is no need to manage access to the shared data. “Shared nothing” architectures are very:
1. Scalable as more nodes can be added without further contention.
2. Fault tolerant as each node is independent, and there are no single points of failure, and the system can quickly recover from a failure of an individual node.
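The shared nothing style of processing can be sketched as follows: each node works only on its own partition & returns a small partial result, and a master combines the partials. The function names are hypothetical, the round-robin split is just one possible partitioning, and the list comprehension stands in for tasks that would really run in parallel on separate machines.

```python
def partition(records, num_nodes):
    """Split the records into one local partition per node (round-robin)."""
    parts = [[] for _ in range(num_nodes)]
    for i, record in enumerate(records):
        parts[i % num_nodes].append(record)
    return parts


def node_task(local_records):
    """Runs on a single node against that node's own data only."""
    return len(local_records), sum(local_records)


def combine(partials):
    """Runs on the master: merge the per-node (count, total) pairs into an average."""
    count = sum(c for c, _ in partials)
    total = sum(t for _, t in partials)
    return total / count


records = list(range(1, 101))
parts = partition(records, num_nodes=4)
partials = [node_task(p) for p in parts]  # in a real cluster these run concurrently
average = combine(partials)
print(average)
```

Note that each node returns (count, total) rather than its local average, because averages of unevenly sized partitions cannot be combined correctly.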
Q6. What is a Paxos protocol?
A6. Distributed systems are implemented for high availability and scalability using many commodity machines. These machines are not reliable, and they come up and go down quite often. In such systems, there is often a need to agree upon “something” i.e. to have consensus. It needs to ensure that all servers in the system can agree on a single source of truth, even if some servers fail. In other words, the system must be fault-tolerant.
Each server node keeps its own log and nodes can communicate with each other to establish the order of the events. How can all nodes agree on a consistent view of the log? Paxos is a consensus protocol to agree on the order. A distributed consensus ensures a consensus of data (e.g. logs, configurations, etc) among nodes in a distributed system or reaches an agreement on a proposal. This is applicable to distributed systems such as HDFS, ZooKeeper, Kafka, Redis, and Elasticsearch.
Paxos/Raft/Zab are different variations of distributed consensus algorithms. They follow 2 key principles:
1) two phase commit, and
2) quorum majority vote.
This means a value is committed only when a majority of the nodes have gone through the above 2-phase commit process.
1) A leader proposes values to the followers.
2) The leader waits for acknowledgements from a quorum of followers before considering a proposal committed.
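The two steps above can be sketched with a toy leader & followers. This is a heavily simplified illustration and not an implementation of Paxos, Raft or Zab: there is no leader election, no terms/epochs & no recovery of values that were staged but never committed. The class names & the “alive” flag are hypothetical.

```python
class Follower:
    def __init__(self, name, alive=True):
        self.name = name
        self.alive = alive
        self.staged = None
        self.log = []

    def propose(self, value):
        """Phase 1: stage the value and acknowledge, unless this node is down."""
        if not self.alive:
            return False
        self.staged = value
        return True

    def commit(self):
        """Phase 2: append the staged value to the local log."""
        if self.alive and self.staged is not None:
            self.log.append(self.staged)
            self.staged = None


class Leader:
    def __init__(self, followers):
        self.followers = followers
        self.log = []

    def replicate(self, value):
        cluster_size = len(self.followers) + 1  # the followers plus the leader
        acked = [f for f in self.followers if f.propose(value)]
        votes = 1 + len(acked)                  # the leader counts itself
        if votes <= cluster_size // 2:
            return False                        # no majority, so not committed
        self.log.append(value)
        for follower in acked:
            follower.commit()
        return True


followers = [Follower("f1"), Follower("f2"),
             Follower("f3", alive=False), Follower("f4", alive=False)]
leader = Leader(followers)

committed = leader.replicate("x=1")  # 3 of 5 nodes acknowledge
followers[1].alive = False           # a third node goes down
rejected = leader.replicate("x=2")   # only 2 of 5 nodes acknowledge
print(committed, rejected, leader.log)
```

With 5 nodes the cluster keeps committing while any 3 are reachable, which is why a minority of failed or partitioned nodes does not stop progress.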
Q. What is a Quorum?
A. The minimum number of servers that must be up and in agreement for Zookeeper to operate is called the Quorum, which is a strict majority of the servers in the ensemble. Zookeeper replicates the whole data tree to all the quorum servers. This number is also the minimum number of servers required to store a client’s data before telling the client it is safely stored.
Zab stands for Zookeeper Atomic Broadcast. ZooKeeper itself is a coordination service, not a consensus protocol; Zab is the protocol it uses to replicate updates. Learn more at Apache Zookeeper interview Q&As
Raft simplifies Paxos. It’s equivalent to Paxos in fault-tolerance and performance. The difference is that it’s decomposed into relatively independent subproblems, and it cleanly addresses all major pieces needed for practical systems.
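The arithmetic behind a majority quorum is small enough to show directly. The helper names below are hypothetical, and the sketch assumes the default majority quorum.

```python
def quorum_size(ensemble_size):
    """Smallest number of servers that forms a strict majority."""
    return ensemble_size // 2 + 1


def tolerated_failures(ensemble_size):
    """How many servers can be lost while a quorum is still available."""
    return ensemble_size - quorum_size(ensemble_size)


table = {n: (quorum_size(n), tolerated_failures(n)) for n in (3, 4, 5, 6)}
for n, (quorum, failures) in table.items():
    print(f"{n} servers: quorum of {quorum}, tolerates {failures} failure(s)")
```

Going from 3 to 4 servers, or from 5 to 6, does not increase the number of failures tolerated, which is why ensembles are normally sized with an odd number of servers.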