domain platform Commons: 3/5

Health Check API Pattern

Also known as: Health Endpoint Monitoring

1. Overview

The Health Check API pattern is a fundamental design pattern for building resilient and observable distributed systems. It involves exposing an endpoint from a service that external tools can query to determine the service’s health. This pattern is crucial in microservices architectures, where applications are composed of numerous interacting services, making it essential to have a standardized way to monitor their status [1]. The concept of health checks has been a long-standing practice in system administration and network monitoring, evolving from simple ICMP “ping” requests to more sophisticated application-level checks.

2. Core Principles

The Health Check API pattern is based on a few core principles:

  • Separation of Concerns: The health check logic is encapsulated within a dedicated endpoint, separating it from the main business logic of the service.
  • Standardization: The pattern promotes a standardized way of reporting health, making it easier for monitoring systems to consume and interpret the health status of different services.
  • Automation: Health checks are designed to be automated, allowing for continuous and periodic monitoring of services without manual intervention.
  • Actionability: The health status reported by the endpoint should be actionable, enabling automated systems to take corrective actions, such as restarting a service or redirecting traffic.

3. Key Practices

In a distributed system, a service instance can be running from a process perspective but may not be able to handle requests correctly. This can happen for various reasons, such as:

  • Loss of connectivity to a database or another critical dependency.
  • Exhaustion of system resources like memory, CPU, or disk space.
  • Application-specific errors or deadlocks.

When a service is in such a state, it is considered “unhealthy.” Sending traffic to an unhealthy service instance can lead to cascading failures and impact the overall availability and reliability of the system. Therefore, a mechanism is needed to detect unhealthy service instances so that they can be taken out of service and the issue can be investigated.

4. Implementation

The Health Check API pattern provides a solution to this problem by having each service expose an API endpoint (e.g., /health or /status) that returns the health of the service. A monitoring system, load balancer, or service registry can then periodically invoke this endpoint to check the health of the service instance.

The health check endpoint should perform a series of checks to determine the health of the service, which can include:

  • Internal State: Checking the internal state of the service to ensure it is functioning correctly.
  • Dependency Checks: Verifying the connectivity and health of external dependencies such as databases, caches, and other services.
  • Resource Checks: Monitoring the availability of system resources like disk space, memory, and CPU.

The response from the health check endpoint typically includes a status code (e.g., HTTP 200 for healthy, HTTP 503 for unhealthy) and a body containing more detailed information about the health of the service and its dependencies.

5. 7 Pillars Assessment

Pillar Score (1-5) Rationale
Purpose 3 Serves a clear technical purpose in system design
Governance 3 Can be governed through standard engineering practices
Culture 3 Supports engineering culture of reliability and quality
Incentives 3 Aligns incentives toward system stability
Knowledge 4 Well-documented pattern with extensive community knowledge
Technology 4 Directly applicable to modern technology stacks
Resilience 4 Contributes to overall system resilience
Overall 3.4 A valuable technical pattern that supports commons infrastructure
Aspect Pro Con
Reliability Improves system reliability by enabling the detection and removal of unhealthy service instances from the load balancer’s rotation. A poorly implemented health check can itself become a point of failure or provide misleading information.
Observability Enhances the observability of the system by providing a clear and standardized way to monitor the health of services. Health checks can add extra traffic to the network and services, which needs to be considered in high-traffic environments.
Complexity The basic implementation of a health check is relatively simple. A comprehensive health check that covers all dependencies and potential failure modes can be complex to implement and maintain.

6. When to Use

  • Kubernetes: Kubernetes uses liveness and readiness probes, which are essentially health checks, to determine the health of containers. If a liveness probe fails, Kubernetes will restart the container. If a readiness probe fails, Kubernetes will stop sending traffic to the container.
  • Spring Boot Actuator: The Spring Boot Actuator module provides a /health endpoint out of the box, which can be customized to include checks for various dependencies and system resources.
  • Consul: Consul, a popular service discovery and configuration tool, uses health checks to monitor the health of services and update its service registry accordingly.

7. Anti-Patterns & Gotchas

In the cognitive era, where AI and machine learning are increasingly integrated into applications, the Health Check API pattern remains highly relevant. Health checks can be enhanced to monitor the health of AI/ML models and their associated infrastructure. For example, a health check could verify that a model is loaded correctly, that it can make predictions within an acceptable time frame, and that the data it is receiving is valid. Furthermore, the data collected from health checks can be used to train machine learning models to predict service failures before they occur, enabling proactive and predictive maintenance.

8. References

The Health Check API pattern aligns with the principles of the Commons in several ways:

  • Shared Resource: The health status of a service is a shared resource that is made available to other parts of the system, such as monitoring tools and load balancers.
  • Democratic Governance: The pattern promotes a standardized and democratic approach to health monitoring, where any authorized component can query the health of a service.
  • Equitable Access: By providing a well-defined API, the pattern ensures that all components have equitable access to the health information of a service.
  • Sustainability: By enabling the early detection of issues and promoting the resilience of the system, the pattern contributes to the long-term sustainability of the platform.
  • Community Benefit: The improved reliability and observability provided by the pattern benefit the entire community of users and developers who rely on the platform.

Overall, the Health Check API pattern is a valuable tool for building robust and sustainable platforms that align with the principles of the Commons.

8. References

[1] Microservices.io. Pattern: Health Check API. https://microservices.io/patterns/observability/health-check-api.html [2] GeeksforGeeks. Health Endpoint Monitoring Pattern. https://www.geeksforgeeks.org/system-design/health-endpoint-monitoring-pattern/ [3] API7.ai. Top Tips for Implementing Health Check Best Practices. https://api7.ai/blog/tips-for-health-check-best-practices