System Design

Abstract

Designing robust and efficient systems that meet the needs of clients smoothly
Kahoot! Quizzes to learn about Availability (可用性) vs Scalability (可扩展性) vs Fault Tolerance (容错性) vs Reliability (可靠性)

Availability (可用性)

Refers to the percentage of time that a system is operational and available for use
It ensures the system is accessible when needed, minimizing downtime and maintaining a consistent user experience
Can be achieved directly with Database Replication or a Multi Data Center Setup, or indirectly through good Fault Tolerance (容错性) and good Reliability (可靠性)
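To make "percentage of time" concrete, the sketch below turns an availability target into a yearly downtime budget. The function names and the 365-day year are assumptions made for illustration, not part of any standard library.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # assumption: non-leap year, for simplicity


def availability(uptime_seconds: float, downtime_seconds: float) -> float:
    """Fraction of the observed time during which the system was operational."""
    total = uptime_seconds + downtime_seconds
    if total <= 0:
        raise ValueError("total observed time must be positive")
    return uptime_seconds / total


def allowed_downtime_per_year(target: float) -> float:
    """Seconds of downtime per year permitted by an availability target in [0, 1]."""
    if not 0.0 <= target <= 1.0:
        raise ValueError("target must be between 0 and 1")
    return (1.0 - target) * SECONDS_PER_YEAR


if __name__ == "__main__":
    for target in (0.99, 0.999, 0.9999):
        hours = allowed_downtime_per_year(target) / 3600
        print(f"{target:.2%} availability allows about {hours:.2f} hours of downtime per year")
```

Each additional "nine" in the target cuts the downtime budget by a factor of ten, which is why high targets usually need replication rather than faster manual recovery.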

Scalability (可扩展性)

Refers to the capability of a system to handle a growing amount of work, or its potential to be enlarged to accommodate that growth
It ensures the system can handle increased load efficiently by adding resources or optimizing existing ones, so that it can grow with a larger user base or data volume while staying available
Can be achieved with Cache Server, Stateless Compute Server, Message Queue (消息队列) & Database Scaling

Vertical Scaling

Adding more resources, such as CPU and Main Memory, to a single Server
Simple to implement, great option when traffic is low

Vertical Scaling Limitations

Hard Limit

It is impossible to keep adding CPU, Main Memory, Disk and other resources to a single Server indefinitely

No Failover

The single Server is a Single Point of Failure: if it goes down, the whole system goes down, so there is no Fault Tolerance (容错性)

Expensive

Powerful servers are much more expensive

Horizontal Scaling

Adding more Servers and handling the traffic in parallel (Parallelism, 并行性)
More desirable for large-scale applications because of the Vertical Scaling Limitations
Usually a Load Balancer sits between the clients and the servers to distribute the traffic evenly across the servers
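A minimal sketch of how a Load Balancer could spread requests evenly, assuming a simple round-robin policy. `RoundRobinBalancer` and the server names are hypothetical; real load balancers also use other policies, such as least connections or hashing.

```python
from collections import Counter
from itertools import cycle


class RoundRobinBalancer:
    """Hands each incoming request to the next server in a fixed rotation."""

    def __init__(self, servers):
        if not servers:
            raise ValueError("at least one server is required")
        self._rotation = cycle(list(servers))

    def route(self, request):
        # Return the chosen server together with the request it should handle.
        return next(self._rotation), request


if __name__ == "__main__":
    balancer = RoundRobinBalancer(["server-1", "server-2", "server-3"])
    hits = Counter(balancer.route(f"request-{i}")[0] for i in range(9))
    print(hits)  # each server receives the same number of requests
```

Adding capacity then means adding another entry to the server list, which is the core idea of Horizontal Scaling.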

CAP Theorem in distributed systems

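A distributed system can guarantee at most two of the following three properties at the same time. Because network partitions cannot be ruled out in practice, the real choice during a partition is between Consistency and Availability.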

Consistency

All nodes display identical data, guaranteeing that reads always reflect the most recent write.

Availability

Every request receives a response, without a guarantee that it reflects the most recent write.

Partition Tolerance

The system continues to operate even when network failures drop or delay messages between nodes.

If a system prioritizes consistency (CP), it may become unavailable during a partition to ensure that data remains consistent across nodes.

If a system prioritizes availability (AP), it may sacrifice consistency during a partition, allowing nodes in different partitions to respond, even though they might not have the latest data.
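The CP and AP behaviours described above can be illustrated with a toy model of two replicas. Everything here (`Replica`, `write`, `Unavailable`, the `partitioned` flag) is a hypothetical simplification; real systems use quorums, leader election and conflict resolution that this sketch does not attempt.

```python
class Unavailable(Exception):
    """Raised when a CP replica refuses to answer during a partition."""


class Replica:
    def __init__(self, mode):
        if mode not in ("CP", "AP"):
            raise ValueError("mode must be 'CP' or 'AP'")
        self.mode = mode
        self.value = None
        self.partitioned = False  # True when this replica is cut off from its peer

    def read(self):
        if self.mode == "CP" and self.partitioned:
            # CP: refuse rather than risk returning stale data.
            raise Unavailable("cannot confirm the latest write during a partition")
        # AP (or no partition): always answer, possibly with stale data.
        return self.value


def write(primary, follower, value):
    """Write to the primary and replicate to the follower when it is reachable."""
    if follower.partitioned:
        if primary.mode == "CP":
            raise Unavailable("cannot replicate the write during a partition")
        primary.value = value  # AP: accept locally, the follower becomes stale
        return
    primary.value = value
    follower.value = value


if __name__ == "__main__":
    a, b = Replica("AP"), Replica("AP")
    write(a, b, "v1")
    b.partitioned = True
    write(a, b, "v2")
    print(a.read(), b.read())  # the two replicas now disagree
```

In AP mode both replicas keep answering but disagree until the partition heals. In CP mode the same sequence raises `Unavailable` instead of returning stale data.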

Fault Tolerance (容错性)

Refers to a system’s ability to continue operating and providing its intended services even in the presence of hardware or software faults
It ensures that a system can recover from failures, keeping disruptions minimal and maintaining the Availability of services
For stateless systems, Fault Tolerance can be achieved by combining a Load Balancer’s Failover Capability with Stateless Compute Servers
For stateful systems, Fault Tolerance can be achieved with Database Replication or a Replicated State Machine
A Multi Data Center Setup can provide Fault Tolerance for both
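A minimal sketch of failover for the stateless case, assuming any server can handle any request. `Server` and `send_with_failover` are hypothetical names; a real Load Balancer would usually rely on periodic health checks rather than trying servers one by one.

```python
class Server:
    """Stand-in for a stateless compute server."""

    def __init__(self, name):
        self.name = name
        self.healthy = True

    def handle(self, request):
        if not self.healthy:
            raise ConnectionError(f"{self.name} is down")
        return f"{self.name} handled {request}"


def send_with_failover(servers, request):
    """Try each server in order and return the first successful response."""
    last_error = None
    for server in servers:
        try:
            return server.handle(request)
        except ConnectionError as error:
            last_error = error  # this server failed, fall through to the next one
    raise RuntimeError("all servers failed") from last_error


if __name__ == "__main__":
    primary, backup = Server("primary"), Server("backup")
    primary.healthy = False  # simulate a fault on the primary
    print(send_with_failover([primary, backup], "GET /"))
```

This only works because the servers hold no state of their own. With stateful servers the backup would also need a replicated copy of the data, which is where Database Replication comes in.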

Single Point of Failure

A part of a system that, if it fails, will stop the entire system from working
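The effect of a Single Point of Failure on Availability can be estimated with basic probability, assuming components fail independently (an assumption that often does not hold exactly in practice). The function names are illustrative.

```python
from math import prod


def serial_availability(components):
    """All components must be up, so each one is a Single Point of Failure."""
    return prod(components)


def redundant_availability(replica_availability, replicas):
    """At least one of N identical replicas must be up (independent failures assumed)."""
    if replicas < 1:
        raise ValueError("at least one replica is required")
    return 1.0 - (1.0 - replica_availability) ** replicas


if __name__ == "__main__":
    # Two 99% components in a chain are less available than either one alone.
    print(f"chain of two:  {serial_availability([0.99, 0.99]):.4f}")
    # Two 99% replicas behind a failover mechanism are more available than one.
    print(f"two replicas:  {redundant_availability(0.99, 2):.4f}")
```

Chaining components multiplies their availabilities and lowers the total, while replicating a component multiplies the failure probabilities and raises it.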

Reliability (可靠性)

Refers to the ability of a system to perform a specified function without failure over a specified period
It ensures consistent and predictable behavior of a system. It involves minimizing the chances of failures and, in case of failures, having mechanisms in place for quick recovery
Can be achieved with Monitoring and automation, such as a CI/CD pipeline

Efficiency

Latency

The delay between sending a request and receiving the first response.

Throughput

The number of operations completed per unit of time.
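A rough sketch of how both metrics could be measured for a synchronous operation. `measure` is a hypothetical helper, and it treats latency as the full duration of each call, which matches "first response" only when the operation returns a single response.

```python
import time


def measure(operation, runs=1000):
    """Run `operation` repeatedly and report latency and throughput."""
    if runs < 1:
        raise ValueError("runs must be at least 1")
    latencies = []
    start = time.perf_counter()
    for _ in range(runs):
        began = time.perf_counter()
        operation()
        latencies.append(time.perf_counter() - began)
    elapsed = time.perf_counter() - start
    return {
        "avg_latency_s": sum(latencies) / runs,   # mean time per operation
        "max_latency_s": max(latencies),          # worst single operation
        "throughput_ops_per_s": runs / elapsed,   # operations per second overall
    }


if __name__ == "__main__":
    stats = measure(lambda: sum(range(10_000)), runs=200)
    for name, value in stats.items():
        print(f"{name}: {value:.6f}")
```

The two metrics are related but not interchangeable: running operations in parallel can raise throughput without reducing the latency of any single operation.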
