Abstract
- Designing robust and efficient systems that reliably meet the Client's needs
- Kahoot! Quizzes to learn about Availability (可用性) vs Scalability (可扩展性) vs Fault Tolerance (容错性) vs Reliability (可靠性)
Availability (可用性)
- Refers to the percentage of time that a system is operational and available for use
- It ensures the system is accessible when needed, minimizing downtime and maintaining a consistent user experience
- Can be achieved directly with Database Replication or a Multi Data Center Setup, or indirectly through good Fault Tolerance (容错性) and good Reliability (可靠性)
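As a rough illustration of the percentage definition above, this sketch converts an availability target into allowed downtime per year and estimates the effect of redundant copies. The function names are hypothetical, and the replica formula assumes failures are independent, which real deployments only approximate.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years


def allowed_downtime_hours(availability: float) -> float:
    """Hours per year a system may be down at the given availability (0..1)."""
    return (1.0 - availability) * HOURS_PER_YEAR


def availability_with_replicas(single: float, replicas: int) -> float:
    """Availability when any one of `replicas` identical copies is enough.

    Assumption: copies fail independently of each other (an idealization).
    """
    return 1.0 - (1.0 - single) ** replicas


if __name__ == "__main__":
    for target in (0.99, 0.999, 0.9999):
        print(f"{target:.2%} -> {allowed_downtime_hours(target):.2f} hours/year")
    print("two 99% replicas:", availability_with_replicas(0.99, 2))
```

Under these assumptions, a 99.9% target leaves roughly 8.76 hours of downtime per year, and two independent 99% replicas together reach about 99.99%.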
Scalability (可扩展性)
- Refers to the capability of a system to handle a growing amount of work, or its potential to be enlarged to accommodate that growth
- It ensures the system can handle increased load efficiently, whether from a larger user base or a greater data volume, by adding resources or optimizing existing ones. This in turn helps maintain Availability
- Can be achieved with Cache Server, Stateless Compute Server, Message Queue (消息队列) & Database Scaling
Vertical Scaling
- Adding more resources, such as CPU and Main Memory, to a single Server
- Simple to implement, great option when traffic is low
Vertical Scaling Limitations
Hard Limit
- A single Server can only hold a finite amount of CPU, Main Memory, Disk, etc.
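No Failover
- With only one Server there is no redundancy; if it goes down, the whole service goes down with it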
Expensive
- Powerful servers are much more expensive
Horizontal Scaling
- Adding more Servers and using Parallelism (并行性) to handle the traffic
- More desirable for large-scale applications because of the Vertical Scaling Limitations
- Usually a Load Balancer sits between the clients and the servers to distribute traffic evenly across the servers
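One simple way a Load Balancer can spread traffic evenly is round-robin rotation. The sketch below is a minimal in-memory illustration, not a real load balancer; the class name `RoundRobinBalancer` and the server names are hypothetical.

```python
from itertools import cycle


class RoundRobinBalancer:
    """Hands out servers in a fixed rotation (hypothetical, in-memory only)."""

    def __init__(self, servers: list[str]) -> None:
        if not servers:
            raise ValueError("at least one server is required")
        self._rotation = cycle(servers)

    def pick(self) -> str:
        """Return the server that should handle the next request."""
        return next(self._rotation)


if __name__ == "__main__":
    balancer = RoundRobinBalancer(["app-1", "app-2", "app-3"])
    for request_id in range(6):
        print(request_id, "->", balancer.pick())
```

Round-robin ignores how busy each server is; production load balancers often also weigh current connections or health checks.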
CAP Theorem in distributed systems
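A distributed system can guarantee at most two of the following three properties at the same time. Because network partitions cannot be ruled out in practice, the real choice during a partition is between Consistency and Availability.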
Consistency
- All nodes display identical data, guaranteeing that reads always reflect the most recent write.
Availability
- Every request to a working node receives a non-error response, without a guarantee that it contains the most recent write.
Partition Tolerance
- The system continues to operate even when network failures drop or delay messages between nodes.
If a system prioritizes consistency (CP), it may become unavailable during a partition to ensure that data remains consistent across nodes.
If a system prioritizes availability (AP), it may sacrifice consistency during a partition, allowing nodes in different partitions to respond even though they might not have the latest data.
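The CP and AP behaviours described above can be sketched with a toy replica that has lost contact with its primary. This is a deliberately simplified model, assuming a single stored value and a manually set `partitioned` flag; the `Replica` class is hypothetical.

```python
class Replica:
    """A toy replica that may be cut off from the primary by a partition."""

    def __init__(self, mode: str) -> None:
        if mode not in ("CP", "AP"):
            raise ValueError("mode must be 'CP' or 'AP'")
        self.mode = mode
        self.value = "v1"         # last value replicated from the primary
        self.partitioned = False  # True when the primary is unreachable

    def read(self) -> str:
        if self.partitioned and self.mode == "CP":
            # CP: refuse to answer rather than risk returning stale data.
            raise RuntimeError("unavailable: cannot confirm the latest write")
        # AP (or no partition): answer with whatever is stored locally.
        return self.value


if __name__ == "__main__":
    cp, ap = Replica("CP"), Replica("AP")
    # Simulate a partition: the primary may now hold newer data than "v1".
    cp.partitioned = ap.partitioned = True
    print("AP read:", ap.read())  # answers, but the value may be stale
    try:
        cp.read()
    except RuntimeError as err:
        print("CP read failed:", err)
```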
Fault Tolerance (容错性)
- Refers to a system’s ability to continue operating and providing its intended services even in the presence of hardware or software faults
- It ensures that a system can recover from failures, keeping disruptions minimal and maintaining the Availability of services
- For stateless systems, Fault Tolerance can be achieved by pairing a Load Balancer's Failover Capability with Stateless Compute Servers, etc.
- For stateful systems, Fault Tolerance can be achieved with Database Replication, a Replicated State Machine, etc.
- Both can be covered with a Multi Data Center Setup
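Failover for stateless servers can be sketched as "try the next backend when one fails". The example below is a minimal illustration under the assumption that a failed server raises `ConnectionError`; `call_with_failover`, `broken_server` and `healthy_server` are hypothetical names.

```python
from typing import Callable, Optional, Sequence


def call_with_failover(backends: Sequence[Callable[[str], str]], request: str) -> str:
    """Try each backend in order and return the first successful response."""
    last_error: Optional[Exception] = None
    for backend in backends:
        try:
            return backend(request)
        except ConnectionError as err:
            last_error = err  # this backend is down; fall through to the next one
    raise RuntimeError("all backends failed") from last_error


def broken_server(request: str) -> str:
    raise ConnectionError("server down")


def healthy_server(request: str) -> str:
    return f"handled {request}"


if __name__ == "__main__":
    print(call_with_failover([broken_server, healthy_server], "GET /"))
```

This only works cleanly because the servers are stateless: any backend can serve any request, so no session data is lost when traffic moves to another one.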
Single Point of Failure
- A part of a system that, if it fails, will stop the entire system from working
Reliability (可靠性)
- Refers to the ability of a system to perform a specified function without failure over a specified period
- It ensures consistent and predictable behavior of a system. It involves minimizing the chances of failures and, in case of failures, having mechanisms in place for quick recovery
- Can be achieved with Monitoring and automation such as a CI/CD pipeline
Efficiency
Latency
The delay between sending a request and receiving the first response.
Throughput
The number of operations completed per unit of time.
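The two metrics can be computed from recorded request timestamps. The sketch below assumes timestamps in seconds and a non-empty list of observations; `RequestTiming`, `average_latency` and `throughput` are hypothetical names for illustration.

```python
from dataclasses import dataclass


@dataclass
class RequestTiming:
    sent_at: float            # seconds, when the request was sent
    first_response_at: float  # seconds, when the first response arrived


def average_latency(timings: list[RequestTiming]) -> float:
    """Mean delay between sending a request and its first response."""
    return sum(t.first_response_at - t.sent_at for t in timings) / len(timings)


def throughput(timings: list[RequestTiming]) -> float:
    """Operations per second over the whole observed window."""
    start = min(t.sent_at for t in timings)
    end = max(t.first_response_at for t in timings)
    return len(timings) / (end - start)


if __name__ == "__main__":
    observed = [
        RequestTiming(0.0, 0.5),
        RequestTiming(0.0, 0.5),
        RequestTiming(1.0, 1.5),
        RequestTiming(1.0, 2.0),
    ]
    print("average latency (s):", average_latency(observed))
    print("throughput (ops/s):", throughput(observed))
```

The two numbers move independently: handling requests in parallel can raise Throughput without making any single request's Latency shorter.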