Monitoring & Observability
Summary Overview
A monitoring stack built on Prometheus and Grafana that tracks system performance and health.
Monitoring Stack
Visibility is critical for any production system. The homelab runs an observability stack that tracks resource usage, container health, and traffic patterns: Prometheus collects the metrics and Grafana visualizes them.
Data Collection (Prometheus)
Prometheus serves as the time-series database, actively scraping metrics from various targets every 15 seconds.
Scrape Configuration
The prometheus.yml file defines what to monitor. Each layer of the stack gets its own scrape job.
# monitoring-stack/prometheus/prometheus.yml
scrape_configs:
  # Node Exporter (System Metrics)
  - job_name: 'node-exporter'
    static_configs:
      - targets: ['node-exporter:9100']
  # Traefik (Edge Router Metrics)
  - job_name: 'traefik'
    static_configs:
      - targets: ['traefik:8080'] # Internal Traefik API port
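The 15-second scrape interval mentioned earlier does not appear in the excerpt above. In Prometheus it is usually set in the `global` block of prometheus.yml. A minimal sketch, assuming the interval is configured globally rather than per job (the `evaluation_interval` value is an assumption, not taken from the homelab's file):

```yaml
# Sketch only: assumes the 15s interval is set globally, not per job.
global:
  scrape_interval: 15s      # how often Prometheus scrapes each target
  evaluation_interval: 15s  # assumed value; how often rules are evaluated
```

A job can still override the global value with its own `scrape_interval` if one layer needs finer or coarser sampling.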
Targets Explained
- Node Exporter:
  - Role: Monitors the host Operating System (Azure VPS).
  - Metrics: CPU usage, Memory consumption, Disk I/O, Network traffic.
  - Why it matters: Alerts us if the underlying server is under stress or if we are running out of disk space.
- cAdvisor (Container Advisor):
  - Role: Monitors Docker containers.
  - Metrics: Per-container RAM/CPU usage, network bandwidth.
  - Why it matters: Identifies "noisy neighbor" containers or memory leaks in specific applications.
- Traefik:
  - Role: Monitors the edge router.
  - Metrics: Request count, response codes (200 vs 404 vs 500), latency.
  - Why it matters: Provides a real-time view of traffic health and potential attacks (spikes in 4xx/5xx errors).
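The scrape configuration shown earlier lists jobs for Node Exporter and Traefik only. A cAdvisor job would follow the same pattern. The sketch below assumes a container reachable as `cadvisor` on port 8080 (cAdvisor's default port); neither detail is confirmed by the configuration shown in this document.

```yaml
scrape_configs:
  # cAdvisor (Container Metrics): hypothetical job, not part of the excerpt shown earlier
  - job_name: 'cadvisor'
    static_configs:
      - targets: ['cadvisor:8080'] # assumed hostname; 8080 is cAdvisor's default port
```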
Visualization (Grafana)
Grafana connects to Prometheus to render actionable dashboards.
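One way to set up this connection is Grafana's data source provisioning. The sketch below is an assumption about how it could be done here: the file path and the `prometheus:9090` address (an assumed container name plus the Prometheus default port) are illustrative and not taken from the homelab's actual setup.

```yaml
# Hypothetical path: monitoring-stack/grafana/provisioning/datasources/prometheus.yml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy                # Grafana's backend proxies queries to Prometheus
    url: http://prometheus:9090  # assumed container name; 9090 is the Prometheus default
    isDefault: true
```

Provisioning the data source from a file keeps it reproducible if the Grafana container is ever rebuilt from scratch.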
Key Dashboards
- "Mission Control": High-level view of system uptime and critical alerts.
- Host Performance: Detailed breakdown of per-core CPU usage and RAM consumption.
- Container Health: Status of every running Docker service.
Data Persistence
To ensure we don't lose historical data during updates, we use Docker named volumes:
volumes:
  prometheus-data: # Persists the TSDB (Time Series Database)
  grafana-data: # Persists dashboards, users, and settings
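For the named volumes to take effect, each service has to mount its volume at the path where the application writes its data. A sketch, assuming the official `prom/prometheus` and `grafana/grafana` images with their default data directories; the service names are illustrative:

```yaml
services:
  prometheus:
    image: prom/prometheus             # assumed image
    volumes:
      - prometheus-data:/prometheus    # default TSDB path in the official image
  grafana:
    image: grafana/grafana             # assumed image
    volumes:
      - grafana-data:/var/lib/grafana  # default Grafana data directory
```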
Alerting Strategy
(Future Implementation) We plan to add Alertmanager to route critical notifications to communication channels, starting with the following alert conditions:
- High Latency: Trigger if Traefik p99 latency > 500ms for 5 minutes.
- Disk Space: Trigger if storage usage > 85%.
- Container Crash: Trigger if critical services (like Traefik) are down.
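The three planned conditions could be expressed as Prometheus alerting rules along these lines. This is a sketch under several assumptions: the rule file name, the alert names, the `mountpoint="/"` filter, the `for` durations on the disk and availability rules, and the severity labels are all illustrative. The metric names are the ones exposed by recent Traefik and Node Exporter releases and may differ by version.

```yaml
# Hypothetical rules file (e.g. alerts.yml), loaded through rule_files in prometheus.yml
groups:
  - name: homelab-alerts
    rules:
      # High Latency: p99 above 500ms, sustained for 5 minutes
      - alert: TraefikHighLatency
        expr: histogram_quantile(0.99, sum by (le) (rate(traefik_entrypoint_request_duration_seconds_bucket[5m]))) > 0.5
        for: 5m
        labels:
          severity: warning
      # Disk Space: more than 85% of the root filesystem in use
      - alert: DiskSpaceLow
        expr: (1 - node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) > 0.85
        for: 10m  # assumed duration, to avoid alerting on brief spikes
        labels:
          severity: warning
      # Container Crash: Prometheus can no longer scrape Traefik
      - alert: TraefikDown
        expr: up{job="traefik"} == 0
        for: 1m   # assumed duration
        labels:
          severity: critical
```

The `up` metric is generated by Prometheus itself for every scrape target, so the last rule works for any job name defined in the scrape configuration.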