Monitoring & Observability

Overview

A monitoring stack built on Prometheus and Grafana that tracks system performance and health.

Tags: Monitoring, Prometheus, Grafana

Monitoring Stack

Visibility is critical for any production system. The homelab runs an observability stack that tracks resource usage, container health, and traffic patterns: Prometheus collects the metrics and Grafana visualizes them.

Data Collection (Prometheus)

Prometheus is the time-series database. It pulls (scrapes) metrics from each configured target every 15 seconds.

Scrape Configuration

The prometheus.yml file defines which targets to scrape, with one job per layer of the stack.

# monitoring-stack/prometheus/prometheus.yml

scrape_configs:
  # Node Exporter (System Metrics)
  - job_name: 'node-exporter'
    static_configs:
      - targets: ['node-exporter:9100']

  # Traefik (Edge Router Metrics)
  - job_name: 'traefik'
    static_configs:
      - targets: ['traefik:8080'] # Internal Traefik API port
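
The excerpt above shows neither the scrape interval nor a job for cAdvisor. Prometheus scrapes once per minute unless scrape_interval is set, so a 15-second interval needs an explicit global block. The following is a minimal sketch of those two pieces, not a copy of the real file: the service name cadvisor is an assumption, and 8080 is cAdvisor's default port.

```yaml
# Sketch only: possible additions to prometheus.yml, not taken from the real file.
global:
  scrape_interval: 15s # Prometheus defaults to 1m when this is unset

scrape_configs:
  # cAdvisor (Container Metrics)
  - job_name: 'cadvisor'
    static_configs:
      - targets: ['cadvisor:8080'] # Assumed service name; 8080 is cAdvisor's default port
```

If this matches the actual setup, the cadvisor job would sit alongside the node-exporter and traefik jobs in the same scrape_configs list.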

Targets Explained

  1. Node Exporter:

    • Role: Monitors the host operating system (Azure VPS).
    • Metrics: CPU usage, memory consumption, disk I/O, network traffic.
    • Why it matters: Alerts us if the underlying server is under stress or if we are running out of disk space.
  2. cAdvisor (Container Advisor):

    • Role: Monitors Docker containers. Its scrape job does not appear in the configuration excerpt above.
    • Metrics: Per-container RAM/CPU usage, network bandwidth.
    • Why it matters: Identifies "noisy neighbor" containers or memory leaks in specific applications.
  3. Traefik:

    • Role: Monitors the edge router.
    • Metrics: Request count, response codes (200 vs 404 vs 500), latency.
    • Why it matters: Provides a real-time view of traffic health and potential attacks (spikes in 4xx/5xx errors).
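
As an illustration of how the three targets translate into queries, here is a hedged sketch of a Prometheus recording-rules file with one rule per target. The metric names are the standard ones exposed by Node Exporter, cAdvisor, and Traefik (with its Prometheus metrics enabled); the group name and the recorded series names are hypothetical and not part of the homelab configuration.

```yaml
# Sketch only: hypothetical recording rules, one per target described above.
groups:
  - name: homelab-overview # hypothetical group name
    rules:
      # Node Exporter: fraction of CPU time spent non-idle, per host
      - record: instance:node_cpu_utilisation:ratio
        expr: 1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))

      # cAdvisor: working-set memory per named container
      - record: name:container_memory_working_set_bytes:sum
        expr: sum by (name) (container_memory_working_set_bytes{name!=""})

      # Traefik: share of requests answered with a 5xx status code
      - record: job:traefik_requests_5xx:ratio_rate5m
        expr: |
          sum(rate(traefik_entrypoint_requests_total{code=~"5.."}[5m]))
          /
          sum(rate(traefik_entrypoint_requests_total[5m]))
```

A file like this would only be loaded if it were listed under rule_files in prometheus.yml. The same expressions can also be pasted directly into a Grafana panel without recording them.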

Visualization (Grafana)

Grafana uses Prometheus as its data source and renders the metrics as dashboards.
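
One common way to wire up this connection is Grafana's file-based provisioning. The sketch below assumes Prometheus is reachable on the Docker network as prometheus on its default port 9090; that hostname, and the use of provisioning at all, are assumptions, since the source does not say how the data source was configured.

```yaml
# Sketch only: a Grafana data source provisioning file, for example
# mounted under /etc/grafana/provisioning/datasources/ in the Grafana container.
apiVersion: 1

datasources:
  - name: Prometheus
    type: prometheus
    access: proxy                # Grafana's backend queries Prometheus, not the browser
    url: http://prometheus:9090  # Assumed service name; 9090 is the Prometheus default port
    isDefault: true
```

Provisioning keeps the data source definition in version control, so it survives a rebuild even if the Grafana volume is lost.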

Key Dashboards

  • "Mission Control": High-level view of system uptime and critical alerts.
  • Host Performance: Detailed breakdown of per-core CPU usage and memory consumption.
  • Container Health: Status of every running Docker service.

Data Persistence

To ensure we don't lose historical data during updates, we use Docker named volumes:

volumes:
  prometheus-data: # Persists the TSDB (Time Series Database)
  grafana-data:    # Persists dashboards, users, and settings
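
For these volumes to take effect, each one has to be mounted at the path where the service writes its data. A minimal Compose sketch follows; the service names, image names, and the relative path to prometheus.yml are assumptions, while /prometheus and /var/lib/grafana are the default data directories of the official prom/prometheus and grafana/grafana images.

```yaml
# Sketch only: how the named volumes might be attached in docker-compose.yml.
services:
  prometheus:
    image: prom/prometheus
    volumes:
      - ./prometheus/prometheus.yml:/etc/prometheus/prometheus.yml:ro # assumed relative path
      - prometheus-data:/prometheus      # TSDB blocks and write-ahead log
  grafana:
    image: grafana/grafana
    volumes:
      - grafana-data:/var/lib/grafana    # dashboards, users, and settings

volumes:
  prometheus-data:
  grafana-data:
```

Named volumes outlive the containers, so recreating a service with a newer image keeps its data, whereas docker compose down -v would delete it.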

Alerting Strategy

(Future Implementation) Alerting is not in place yet. We plan to add Alertmanager to route critical notifications to communication channels. The planned alert conditions are:

  • High Latency: Trigger if Traefik p99 latency > 500ms for 5 minutes.
  • Disk Space: Trigger if storage usage > 85%.
  • Container Crash: Trigger if critical services (like Traefik) are down.
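
Because alerting is still planned, the following is only a sketch of how the three conditions above could be written as Prometheus alerting rules. The alert names, severity labels, the root mountpoint filter, and the 1-minute wait on the Traefik check are assumptions. The Traefik rule relies on the built-in up metric, which reports whether the last scrape of the traefik job succeeded.

```yaml
# Sketch only: hypothetical alerting rules for the planned conditions.
groups:
  - name: homelab-alerts # hypothetical group name
    rules:
      - alert: TraefikHighLatency
        # p99 request duration across all entrypoints, in seconds
        expr: |
          histogram_quantile(0.99,
            sum by (le) (rate(traefik_entrypoint_request_duration_seconds_bucket[5m]))
          ) > 0.5
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Traefik p99 latency above 500ms for 5 minutes"

      - alert: DiskSpaceLow
        # Used fraction of the root filesystem (mountpoint is an assumption)
        expr: |
          (1 - node_filesystem_avail_bytes{mountpoint="/"}
             / node_filesystem_size_bytes{mountpoint="/"}) > 0.85
        labels:
          severity: warning
        annotations:
          summary: "Root filesystem usage above 85%"

      - alert: TraefikDown
        # up is 0 when Prometheus fails to scrape the target
        expr: up{job="traefik"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Traefik target is down or unreachable"
```

Rules like these are evaluated by Prometheus itself. Alertmanager only receives the firing alerts and routes them, so prometheus.yml would also need rule_files and alerting sections pointing at this file and at the Alertmanager instance.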