A distributed system consists of multiple software components that run on multiple computers (aka nodes) but operate as a single system. These components can be stateful, stateless, or serverless, can be written in different languages, and can run in hybrid environments, leveraging open-source technologies, open standards, and interoperability.
Q01. What are the key requirements to be a distributed system?
A01. A distributed system must exhibit the following three characteristics:
1) The computers or nodes operate concurrently.
2) The computers or nodes fail independently, hence the system must be fault tolerant. They also scale independently: you can add new nodes to the cluster.
3) The computers or nodes do not share a global clock. The nodes could be on the cloud across different availability zones & time zones. For example, “ap-southeast-2a” & “ap-southeast-2b” are different availability zones in the region “ap-southeast-2 (Sydney)”.
Q02. What are some of the distributed systems, and why are they so popular?
A02. Modern architectures use:
1) distributed Storage &
2) distributed computation.
Microservices use both distributed computing & storage. The distributed nature has benefits & poses new challenges to solve. Firstly, let’s look at the advantages:
Microservices are independently deployable, independently scalable, and more fault tolerant as you can isolate a failure to a single service and prevent cascading failures that would cause the whole app to crash.
1) Faster & parallel development & frequent deployments, as each microservice can be deployed independently as needed, enabling continuous improvement and faster app updates. Specific microservices can be assigned to specific development teams, which allows them to focus solely on one service or feature. This means teams can work autonomously without worrying about what’s going on with the rest of the app.
You can have deployments every week. Microservices typically have small codebases, making them easier to maintain and deploy. Each microservice will have its own data store (e.g. an SQL or NoSQL database). Due to the independence of each microservice, teams don’t have to worry about coding conflicts, and they don’t have to wait for slower-moving projects before launching their part of the application. It’s also much easier to keep the code clean and for teams to be wholly responsible for specific services.
2) Improved scalability as the capacity of each microservice to run autonomously makes it relatively easy to add, remove, update, and scale individual microservices. This can be done without disrupting the other microservices that comprise the application. When demand increases, you only need to upgrade or add more resources to the microservice that is impacted by the increased demands.
3) More closely integrated with DevOps & aligned to business domains, as microservices support the CI/CD philosophy (continuous integration, continuous delivery & continuous deployment). This enables you to quickly deploy the core microservices of an application as a minimum viable product (MVP). Domain-driven design (DDD) advocates modelling based on the reality of the business as relevant to your use cases.
4) Reduced downtime through fault isolation. If a specific microservice fails, you can isolate that failure to that single service and prevent cascading failures that would cause the app to crash. This fault isolation means that your critical application can stay up and running even when one of its modules fails. For example, if you have an ecommerce application where the customer feedback microservice fails, customers can still place orders, track their orders, cancel orders, etc.
5) Better data security & compliance, as each microservice protects its own sensitive data. When developers establish data connections among the microservices, they use secure APIs with token-based security such as OAuth 2.0 and OpenID Connect. A secure API safeguards the data it processes by making sure it’s only accessible to specifically authorised applications, users, and servers. Secured APIs in the finance, telecommunications & health domains make it easier to achieve compliance under HIPAA, GDPR, and other data security standards.
The disadvantages of microservices are mostly due to their distributed storage & computing. Microservices have all the associated complexities of the distributed systems discussed below.
1) Distributed transaction management, as transactions span multiple databases across multiple services. “Service A” may use Amazon RDS (i.e. Relational Database Service) with Amazon ElastiCache, “Service B” may use the Amazon NoSQL database DynamoDB, and “Service C” may use just Amazon RDS with Kinesis to stream data to a machine learning system. Every microservice must have its own rollback method, which either rolls back changes or triggers subsequent actions (e.g. sending a notification).
The most feasible way to handle consistency across microservices is via eventual consistency. This model doesn’t enforce distributed ACID transactions across microservices. A better approach is to use event sourcing, an event-centric approach to business logic design and persistence. It favours mechanisms that ensure the system becomes eventually consistent at some point in the future. See 9 Java Transaction Management Interview Q&As.
With event sourcing for persistence, a service persists each aggregate as a sequence of events. It reconstructs the current state of an aggregate by loading the events and replaying them. In functional programming terms, the current state is a fold/reduce over the events.
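The fold/reduce described above can be sketched in Python. The event names, fields & the account aggregate below are illustrative assumptions, not tied to any specific event-sourcing framework:

```python
from functools import reduce

# Hypothetical event log for an account aggregate.
events = [
    {"type": "AccountOpened", "balance": 0},
    {"type": "MoneyDeposited", "amount": 100},
    {"type": "MoneyWithdrawn", "amount": 30},
]

def apply_event(state, event):
    """Apply one event to the current aggregate state."""
    if event["type"] == "AccountOpened":
        return {"balance": event["balance"]}
    if event["type"] == "MoneyDeposited":
        return {"balance": state["balance"] + event["amount"]}
    if event["type"] == "MoneyWithdrawn":
        return {"balance": state["balance"] - event["amount"]}
    return state

# Current state = fold/reduce over the stored events.
state = reduce(apply_event, events, None)
print(state)  # {'balance': 70}
```

To replay with renewed business rules, you change `apply_event` and fold over the same immutable event log again.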
2) Lots of moving parts requiring increased coordination & message routing.
If not managed & monitored properly, there is a higher chance of failure during communication between different services. Testing & debugging can be complex in a distributed system. Developers need to properly address & manage network latency and load balancing.
#2. Big Data
Big Data systems use both distributed storage & computation. In Hadoop the storage & computation are collocated on the same nodes, whereas in the cloud they are decoupled, which is often preferred. For example, Amazon S3 (i.e. Simple Storage Service) decouples storage from compute. This decoupling allows you to easily (i.e. elastically) scale the storage up or down.
1) Distributed Storage: accomplished via HDFS (i.e. Hadoop Distributed File System), where data is split & stored across hundreds of “Data Nodes”.
2) Distributed Computing: accomplished via YARN (i.e. Yet Another Resource Negotiator), a cluster resource manager responsible for distributing the code to the nodes where the data resides. The distributed code is then executed against the subset of the data on its respective node.
The data is distributed across multiple machines (aka nodes), & the computation code is also copied to multiple machines & executed (e.g. via Apache Spark executors) against each machine’s data concurrently. Each machine returns a partial result to the master node, which combines them & returns the final result to the client. So, unlike multithreading, where threads share CPU, storage & memory within the same machine, distributed systems use a share-nothing architecture whereby each node has its own CPU, storage & memory. These nodes scale horizontally. In cloud computing this is known as “auto scaling”, enabled with a flick of a configuration change via a console.
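The scatter/gather flow above can be illustrated with a toy Python sketch, where plain lists stand in for data partitions held on separate nodes (the data and the sum computation are made-up examples):

```python
# Data split across 3 "nodes" (here, just Python lists).
partitions = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]

def local_compute(partition):
    """Runs on each node against only its own subset of the data."""
    return sum(partition)

# Each node returns a partial result...
partial_results = [local_compute(p) for p in partitions]

# ...and the master node combines the partials into the final answer.
final_result = sum(partial_results)
print(final_result)  # 45
```

In a real cluster each `local_compute` runs concurrently on a different machine with its own CPU, memory & disk, and only the small partial results travel over the network.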
#3. More Examples of Distributed Systems
1) Distributed storage systems like the Hadoop Distributed File System (i.e. HDFS), NoSQL databases like HBase, Cassandra, MongoDB, Redis, etc., and cloud-based data warehouses like Snowflake, Apache Hive, Amazon Redshift, etc.
2) Distributed computing with Microservices, Apache Spark, Apache Storm, AWS EMR (i.e. Elastic Map Reduce), etc.
3) Distributed messaging systems like Apache Kafka, Amazon Kinesis Data Streams, Amazon SQS (i.e. Simple Queue Service), Amazon SNS (i.e. Simple Notification Service with pub/sub model), etc. Amazon SQS & SNS are useful in building Microservices.
Q03. Why are distributed systems popular?
1) Firstly, modern architectures are cloud based.
2) Modern architectures need to scale independently, with server nodes in the hundreds with fault tolerance, as opposed to in the tens. So, be prepared to use terms like “the application was deployed to a 24-core, 100-node cluster in AWS”, “3.6 petabytes of structured & semi-structured data was stored across a 280-node cluster”, etc.
3) Data sizes in Terabytes & Petabytes as opposed to gigabytes. You need to handle not only structured data, but also semi-structured data like XML, JSON, log files, event logs, etc and unstructured data like e-mail messages, PDFs, Word documents, etc.
4) Distributed architectures are often 1) event-driven and 2) asynchronous, with more complexity to solution & debug.
Q04. What are some of the key considerations when designing distributed systems?
a) The CAP theorem asserts that any networked shared-data system can have only two of three desirable properties: “Consistency & Availability”, “Availability & Partition tolerance”, or “Consistency & Partition tolerance”.
b) Nodes can fail independently & new nodes can be added. So concepts like the consistent hashing ring, and coordination services like ZooKeeper, where you can distribute configuration settings across nodes, maintain distributed counters, distributed locks, etc., become important.
c) Out-of-order processing, as events are streamed potentially out of order from various event sources, requiring additional logic.
d) Replay considerations, as the master data in the HDFS (i.e. Hadoop Distributed File System) raw zone needs to be rerun in batch mode with renewed access patterns, business logic & rules. Append-only systems (e.g. Big Data on HDFS) don’t update data models, but keep appending to the existing data & evaluate the current state by 1) deleting the current data models and 2) replaying the event logs (i.e. the master data). These jobs run as batch jobs with eventual consistency. This is known as event sourcing.
e) The NoSQL (i.e. Not Only SQL) databases make development quicker as they are schema-less (i.e. implicit schema), scale rapidly by adding more nodes & handle large volumes of aggregated data (i.e. you store the Order, Line Items, and possibly Customer details together as a JSON blob or key-value pairs), which minimises the number of joins. The NoSQL databases typically don’t enforce ACID properties (e.g. transaction boundaries, isolation levels, constraints like foreign keys, etc.), and shift that responsibility to the application layer. This can add more challenges.
f) The data structuring, partitioning & distribution challenges – skewed data, data shuffling, cartesian join prevention, etc. You will see potential challenges and solutions on the post 15 Apache Spark best practices & performance tuning interview FAQs.
g) Race conditions, performance issues, and out-of-memory issues when dealing with larger volumes of immutable & append-only data logs that need to be replayed into memory or indexed data models for responsiveness & real-time querying.
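The aggregate-oriented NoSQL point above (storing an Order, its Line Items & Customer details together as one document) can be sketched in Python. The field names & values below are made up for illustration:

```python
import json

# One aggregate: order, line items & customer details stored together,
# so reading an order requires no joins.
order_doc = {
    "orderId": "ORD-1001",
    "customer": {"name": "Alice", "email": "alice@example.com"},
    "lineItems": [
        {"sku": "BOOK-1", "qty": 2, "price": 15.0},
        {"sku": "PEN-7", "qty": 1, "price": 2.5},
    ],
}

# Serialise for storage as a JSON blob / value in a key-value store.
blob = json.dumps(order_doc)

# One read returns the whole aggregate; derived values like the total are
# computed in the application layer, since the database enforces no
# relational constraints here.
loaded = json.loads(blob)
total = sum(item["qty"] * item["price"] for item in loaded["lineItems"])
print(total)  # 32.5
```

The trade-off is visible in the last step: joins disappear, but integrity checks and computations that an RDBMS would do shift into application code.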
Q05. Can you name some of the distributed systems?
A05. Snowflake is a fully cloud-based data warehouse with virtually unlimited storage and compute. Snowflake is a massively parallel processing (i.e. MPP) database that is fully relational, ACID compliant, and processes standard SQL.
Amazon Redshift is a fully managed, petabyte-scale data warehouse service in the cloud.
Delta Lake is an open source storage layer that brings reliability to data lakes. Delta Lake provides ACID transactions, scalable metadata handling, and unifies streaming and batch data processing.
Apache Kafka is a distributed streaming platform that is used to build real time streaming data pipelines and applications that adapt to data streams.
Q06. How will you go about choosing between an SQL vs a NoSQL database?
A06. Any non-trivial application requires a persistent store. Nowadays you have many choices between monolithic RDBMS databases & distributed NoSQL databases like MongoDB, HBase, Cassandra, etc.
Q07. What is a distributed database?
A07. Distributed databases are logically interrelated so that they often present as a single logical database, but the data is physically stored across multiple nodes at multiple sites and independently managed. In general, distributed databases have the following features:
1) Location independence: for example, across multiple AWS availability zones.
2) Distributed query processing: for example, Apache Spark, Hive, Impala, etc.
3) Distributed transaction management (e.g. Paxos, Raft, Zab, etc.) & hardware, operating system & network independence.
Q08. When will you consider using a distributed database?
1) When the amount of data stored & the number of reads keep growing larger, distributed databases are much easier to scale horizontally by adding more nodes/computers. They can also store structured (e.g. relational data), semi-structured (e.g. JSON, XML) & unstructured (e.g. documents, log files) data.
2) When the data and DBMS software are distributed over several nodes & sites, one site may fail whilst the other sites continue to operate, and you can still access the data that existed at the failed site thanks to data replication. This gives increased reliability and availability, as there is no single point of failure.
3) When the queries are broken up & executed in parallel, better performance can be achieved for very large volumes of data.
Q09. What are some of the basic concepts you must know when working with distributed systems?
1. Data partitioning
You need a way to distribute the data across multiple nodes, for example by hashing a part of every table’s primary key (i.e. the partition key) and assigning the hashed values (called tokens) to specific nodes in the cluster. It is important to consider:
The partition key must have enough distinct values to spread the data for each table evenly across all the nodes in the cluster. Otherwise your data will be skewed, where some nodes work harder than others. Highly loaded partitions become bottlenecks for the system; this is known as hotspotting.
Typical real-world partition keys are user id, device id, account number, etc., often with a time modifier like year-month or year added to the partition key to spread the data further.
Key range partitioning is a form of partitioning where you divide the entire key space into continuous ranges and assign each range to a partition. The downside is that if the range boundaries are not chosen properly, it may lead to hotspots.
Key hash partitioning is a form of partitioning where you apply a hash function to the key, which results in a hash value, and then mod it with the number of partitions. The same key will always return the same hash code. The real problem with hash partitioning starts when you change the number of partitions, as most keys then map to a different partition.
Consistent hashing comes to the rescue: the output range of a hash function is treated as a fixed circular space or “ring”. Each node in the system is assigned a random value within this space which represents its “position” on the ring.
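A minimal consistent-hashing ring can be sketched in Python. This sketch assumes MD5-derived positions and a single position per node; production systems (e.g. Cassandra or Dynamo-style stores) add virtual nodes & replication on top:

```python
import bisect
import hashlib

def ring_hash(value: str) -> int:
    """Position a string on the ring via MD5 (an illustrative choice)."""
    return int(hashlib.md5(value.encode()).hexdigest(), 16)

class ConsistentHashRing:
    def __init__(self, nodes):
        # Each node gets a position on the ring based on its name.
        self.positions = sorted((ring_hash(n), n) for n in nodes)

    def node_for(self, key: str) -> str:
        """Walk clockwise from the key's position to the next node."""
        keys = [pos for pos, _ in self.positions]
        idx = bisect.bisect(keys, ring_hash(key)) % len(self.positions)
        return self.positions[idx][1]

ring = ConsistentHashRing(["node-a", "node-b", "node-c"])
owner_before = ring.node_for("user-42")

# Adding a node only remaps the keys that fall in the new node's arc;
# all other keys keep their previous owner (unlike hash-mod partitioning).
ring2 = ConsistentHashRing(["node-a", "node-b", "node-c", "node-d"])
```

Compare this with hash-mod partitioning, where changing the partition count from 3 to 4 would remap most keys and force a near-total data reshuffle.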
2. Replication
The same piece of data is usually replicated to multiple machines/nodes to give fault tolerance & high availability.
3. Consistency
When you update that piece of data on one machine/node, it may take some time (usually milliseconds) to reach every machine/node that holds a replica. This creates the possibility that you might read data that hasn’t yet been updated on the replica. In many scenarios this anomaly is acceptable. Where it is not acceptable, NoSQL databases may give you the choice, per operation, of whether the data should be eventually consistent or strongly consistent.
4. Partition tolerance
Tolerating partitioning in the network: when you have a number of EC2 nodes in a VPC across multiple availability zones and a particular availability zone goes down, the system must favour either consistency or availability, as per the CAP theorem.
HBase, MongoDB, Redis, etc. favour consistency, whilst Cassandra and AWS DynamoDB favour availability.
5. File formats & compression
Distributed systems use container data formats like Avro, Parquet, ORC, SequenceFile, etc., where the data can be compressed with algorithms like LZO, Snappy, LZ4, etc., and can be easily split to be distributed across multiple nodes. Four key considerations in choosing a storage file format are:
1) Usage patterns like accessing 5 columns out of 50 columns vs accessing most of the columns.
2) Splittability to be processed in parallel.
3) Block compression saving storage space vs read/write/transfer performance
4) Schema evolution to add fields, modify fields, and rename fields.
6. Schema-on-read
Data can be stored schema-less and a structure can be applied at processing (i.e. read) time based on the requirements of the processing application. This differs from “schema-on-write”, used in RDBMSs, where the schema needs to be defined before the data can be loaded.
Despite the schema-less nature, schema design is an important consideration. This includes directory structures and the schema of objects stored as metadata (e.g. the Hive metastore).
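Schema-on-read can be sketched as follows; the raw JSON-lines records & field names are hypothetical:

```python
import json

# Raw, schema-less records stored as JSON lines; no schema was enforced
# when they were written.
raw_lines = [
    '{"user": "u1", "amount": "19.99", "ts": "2021-01-01"}',
    '{"user": "u2", "amount": "5.00"}',  # "ts" missing: no write-time schema
]

def read_with_schema(line):
    """Impose the structure this application needs at read time."""
    rec = json.loads(line)
    return {
        "user": rec["user"],
        "amount": float(rec["amount"]),   # cast on read
        "ts": rec.get("ts", "unknown"),   # tolerate missing fields
    }

records = [read_with_schema(line) for line in raw_lines]
print(records[1]["ts"])  # unknown
```

A different application could read the same raw files with a different `read_with_schema`, which is exactly what schema-on-write in an RDBMS cannot offer.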
7. Shared nothing architecture
In distributed systems each node is completely independent of the other nodes. There are no shared resources like CPU, memory, or disk storage that can become a bottleneck. For example, Hadoop processing frameworks like Spark, Pig, Hive, Impala, etc. process distinct subsets of the data, so there is no need to manage access to shared data. “Share nothing” architectures are very:
1. Scalable as more nodes can be added without further contention.
2. Fault tolerant as each node is independent, and there are no single points of failure, and the system can quickly recover from a failure of an individual node.
Q10. What are the needs of distributed applications?
A10. Distributed applications need proper lifecycle management involving:
1) Deployment & rollback of the apps.
2) Scheduling the app.
3) Distributed config management.
4) Resource failure isolation.
5) Auto or manual scaling.
6) Handling hybrid workloads as in stateful, stateless, serverless, etc.
7) Service discovery & failover.
8) Dynamic routing.
9) Service retry, timeout & circuit breaking.
10) Security & data privacy
11) Rate limiting or throttling the API calls.
12) Monitoring & distributed logging.
13) Distributed caching.
14) Protocol conversions.
15) Message transformation.
16) Message routing with point-to-point or pub/sub.
17) Application state management, transaction management (e.g. the Saga pattern, event sourcing, etc.), etc.
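Item 9 above (service retry, timeout & circuit breaking) can be sketched with a minimal circuit breaker; the thresholds and the failing downstream service are illustrative assumptions:

```python
import time

class CircuitBreaker:
    """Fail fast after repeated failures instead of hammering a dead service."""

    def __init__(self, max_failures=3, reset_after_secs=30.0):
        self.max_failures = max_failures
        self.reset_after_secs = reset_after_secs
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, fn):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after_secs:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None  # half-open: allow one trial call
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()  # trip the breaker
            raise
        self.failures = 0  # a success closes the circuit again
        return result

breaker = CircuitBreaker(max_failures=2)

def flaky_service():
    raise ConnectionError("downstream service unavailable")

for _ in range(2):  # two consecutive failures trip the breaker
    try:
        breaker.call(flaky_service)
    except ConnectionError:
        pass
# Further calls now fail fast with RuntimeError until the reset window passes.
```

Libraries like Resilience4j (Java) or service meshes like Istio provide production-grade versions of this pattern, combined with retries and timeouts.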
Q11. What is a Paxos protocol?
A11. Distributed systems are implemented for high availability and scalability using many commodity machines; these machines are not individually reliable and come up and go down quite often. In such systems there is often a need to agree upon “something”, i.e. to have consensus. The system needs to ensure that all servers can agree on a single source of truth, even if some servers fail. In other words, the system must be fault-tolerant.
Each server node keeps its own log, and nodes communicate with each other to establish the order of events. How can all nodes agree on a consistent view of the log? Paxos is a consensus protocol to agree on that order. Distributed consensus ensures agreement on data (e.g. logs, configurations, etc.) among nodes in a distributed system, or reaches agreement on a proposal. This is applicable to distributed systems such as HDFS, ZooKeeper, Kafka, Redis, and Elasticsearch.
Paxos, Raft & Zab are different variations of distributed consensus algorithms. They follow 2 key principles:
1) two phase commit, and
2) quorum majority vote.
1) A leader proposes values to the followers.
2) The leader waits for acknowledgements from a quorum of followers before considering a proposal committed.
This means a value is committed only when a majority of nodes have gone through this two-phase commit process.
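The quorum-majority check above can be sketched as a simple calculation; the node counts & acknowledgement numbers are made-up examples:

```python
def majority(n_nodes: int) -> int:
    """Smallest number of nodes that forms a majority quorum."""
    return n_nodes // 2 + 1

def is_committed(acks: int, n_nodes: int) -> bool:
    """A proposal commits only once a majority has acknowledged it."""
    return acks >= majority(n_nodes)

print(majority(5))         # 3: a 5-node cluster tolerates 2 failures
print(is_committed(3, 5))  # True
print(is_committed(2, 5))  # False: minority acks, not committed
```

Any two majority quorums overlap in at least one node, which is why a committed value survives leader changes: the new leader's quorum must include a node that saw the commit.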
Q12. What is a Quorum?
A12. The minimum number of servers required to run ZooKeeper is called a quorum. ZooKeeper replicates the whole data tree to all the quorum servers. This is also the minimum number of servers that must store a client’s data before telling the client it is safely stored.
Zab stands for ZooKeeper Atomic Broadcast. Strictly speaking, Zab is an atomic broadcast protocol rather than a consensus protocol. Learn more at Apache Zookeeper interview Q&As.
Raft simplifies Paxos. It is equivalent to Paxos in fault-tolerance and performance, but it is decomposed into relatively independent subproblems and cleanly addresses all the major pieces needed for practical systems.