Monitoring

Abstract

Monitoring means collecting information from the different components of a system so that we can manage it better. Tools such as Datadog can be used for this task
The goal is to improve the system's reliability

A metric is a time-bound measurement of a system, captured at regular intervals such as once per second or once per minute
Collecting different types of metrics helps us gain business insights and understand the health of the system

A work metric indicates the top-level health of the system by measuring its useful output
Examples are success rate and error rate
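
A minimal sketch of how success rate and error rate could be derived from request counts collected over one interval. The RequestCounts type, the function names and the sample numbers are assumptions made for illustration; they are not part of Datadog or any other tool.

```python
from dataclasses import dataclass


@dataclass
class RequestCounts:
    """Hypothetical tally of request outcomes within one reporting interval."""
    succeeded: int
    failed: int


def success_rate(counts: RequestCounts) -> float:
    """Fraction of requests in the interval that succeeded."""
    total = counts.succeeded + counts.failed
    if total == 0:
        return 1.0  # assumption: an interval with no traffic counts as healthy
    return counts.succeeded / total


def error_rate(counts: RequestCounts) -> float:
    """Fraction of requests in the interval that failed."""
    total = counts.succeeded + counts.failed
    if total == 0:
        return 0.0
    return counts.failed / total


# Made-up counts for a single one-minute interval.
last_minute = RequestCounts(succeeded=980, failed=20)
print(f"success rate: {success_rate(last_minute):.1%}")
print(f"error rate: {error_rate(last_minute):.1%}")
```

Treating an empty interval as healthy is a design choice; a real setup might report "no data" instead so that a silent service is not mistaken for a working one.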

A resource metric gives timely information about physical resources such as CPU and main memory
An example is utilisation, the share of a resource's capacity that is currently in use
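
Utilisation can be illustrated as used capacity divided by total capacity. The sketch below assumes the raw readings have already been collected; the resource names and numbers are hypothetical samples, not real measurements.

```python
def utilisation(used: float, capacity: float) -> float:
    """Return the fraction of a resource's capacity that is in use."""
    if capacity <= 0:
        raise ValueError("capacity must be positive")
    return used / capacity


# Hypothetical readings: (resource name, amount in use, total capacity).
samples = [
    ("cpu_cores", 3.0, 4.0),
    ("memory_gib", 4.0, 16.0),
]

for name, used, capacity in samples:
    print(f"{name}: {utilisation(used, capacity):.0%} utilised")
```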

A log is a detailed record of the events that happen within the system or application
An example is a web server log, which contains the IP address, date and time of each HTTP request
Monitoring error logs is important because it helps identify errors and problems in the system
We can use Datadog to aggregate logs for easy searching and viewing
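
To make the web server log example concrete, here is a small sketch that pulls the IP address, date and time, and status code out of log lines and keeps only the server errors. It assumes the lines follow the Common Log Format; the sample lines, the LOG_PATTERN regex and the parse_line helper are made up for illustration.

```python
import re
from datetime import datetime

# Assumed layout: Common Log Format, e.g.
# 203.0.113.7 - - [10/Oct/2023:13:55:36 +0000] "GET / HTTP/1.1" 200 2326
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) \S+'
)


def parse_line(line):
    """Return the fields of one log line as a dict, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    fields = match.groupdict()
    fields["time"] = datetime.strptime(fields["time"], "%d/%b/%Y:%H:%M:%S %z")
    fields["status"] = int(fields["status"])
    return fields


# Hypothetical log lines using documentation-only IP addresses.
sample_lines = [
    '203.0.113.7 - - [10/Oct/2023:13:55:36 +0000] "GET /index.html HTTP/1.1" 200 2326',
    '198.51.100.23 - - [10/Oct/2023:13:55:41 +0000] "POST /checkout HTTP/1.1" 500 512',
]

events = [parse_line(line) for line in sample_lines]
# Keep only server-side errors (HTTP status 500 and above).
errors = [event for event in events if event and event["status"] >= 500]

for event in errors:
    print(event["time"].isoformat(), event["ip"], event["path"], event["status"])
```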

A log router (also called a log shipper) is a tool or service that collects log data from various sources and forwards or routes it to one or more destinations
Log routers play a crucial role in centralized logging architectures, especially in environments where multiple applications, services, or systems generate logs
Examples are Fluentd, Fluent Bit (a lightweight, high-performance log shipper that suits containerized or edge environments), Logstash (part of the ELK Stack), and AWS FireLens (for Amazon ECS)
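
The collect-and-forward idea can be sketched in a few lines. This is a toy in-process illustration only: the LogRouter class, its add_route and route methods, and the sample records are all hypothetical and do not reflect how Fluentd, Fluent Bit, Logstash or FireLens are configured.

```python
from typing import Callable, Dict, List, Tuple

LogRecord = Dict[str, str]
Predicate = Callable[[LogRecord], bool]
Destination = Callable[[LogRecord], None]


class LogRouter:
    """Toy router: forwards each record to every destination whose rule matches."""

    def __init__(self) -> None:
        self._routes: List[Tuple[Predicate, Destination]] = []

    def add_route(self, matches: Predicate, destination: Destination) -> None:
        self._routes.append((matches, destination))

    def route(self, record: LogRecord) -> int:
        """Deliver the record and return how many destinations received it."""
        delivered = 0
        for matches, destination in self._routes:
            if matches(record):
                destination(record)
                delivered += 1
        return delivered


# Two in-memory lists stand in for real destinations such as storage or alerting.
archive: List[LogRecord] = []
alerts: List[LogRecord] = []

router = LogRouter()
router.add_route(lambda record: True, archive.append)
router.add_route(lambda record: record["level"] == "ERROR", alerts.append)

# Hypothetical records arriving from two different services.
incoming = [
    {"source": "web", "level": "INFO", "message": "GET /index.html 200"},
    {"source": "payments", "level": "ERROR", "message": "card processor timeout"},
]
for record in incoming:
    router.route(record)

print(f"archived: {len(archive)}, alerted: {len(alerts)}")
```

Keeping the matching rule separate from the destination mirrors the general idea that one stream of logs can fan out to several places, for example everything to an archive and only errors to an alerting system.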