Monitoring Stack

The cluster uses kube-prometheus-stack for metrics collection, dashboards, and alerting.

Components

Component    | URL                           | Purpose
Prometheus   | https://prometheus.ragas.cc   | Metrics collection & storage
Grafana      | https://grafana.ragas.cc      | Visualization & dashboards
Alertmanager | https://alertmanager.ragas.cc | Alert routing & notification

Access

Grafana admin credentials are managed via a Kubernetes Secret (referenced by the kube-prometheus-stack HelmRelease) and are not the chart defaults.
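
To read the admin credentials locally, decode them from the Secret with kubectl. A minimal sketch, assuming the chart's default Secret name kube-prometheus-stack-grafana and the standard admin-user / admin-password keys (check the HelmRelease values if yours differ):

# Read the Grafana admin credentials (Secret name and keys are assumptions)
kubectl get secret -n monitoring kube-prometheus-stack-grafana \
  -o jsonpath='{.data.admin-user}' | base64 -d; echo
kubectl get secret -n monitoring kube-prometheus-stack-grafana \
  -o jsonpath='{.data.admin-password}' | base64 -d; echo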

Architecture

flowchart TB
  subgraph "Monitoring Stack"
    direction TB
    sm["ServiceMonitors / scrape targets"] --> prom["Prometheus<br>metrics"]
    prom --> graf["Grafana<br>dashboards"]
    graf -->|"query"| prom
    prom -->|"alerts"| am["Alertmanager<br>alerts"]
  end

Pre-built Dashboards

kube-prometheus-stack ships with pre-built Grafana dashboards, including:

  • Kubernetes / Compute Resources / Cluster
  • Kubernetes / Compute Resources / Namespace (Pods)
  • Kubernetes / Compute Resources / Node (Pods)
  • Kubernetes / Networking / Cluster
  • Node Exporter / Nodes
  • CoreDNS
  • etcd

Adding Custom Dashboards

Method 1: ConfigMap

apiVersion: v1
kind: ConfigMap
metadata:
  name: my-dashboard
  namespace: monitoring
  labels:
    grafana_dashboard: "1"
data:
  my-dashboard.json: |
    {
      "title": "My Dashboard",
      ...
    }
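
The Grafana dashboard sidecar watches for ConfigMaps carrying the grafana_dashboard label and loads their JSON automatically. A quick check after applying the ConfigMap above, assuming the default Deployment name kube-prometheus-stack-grafana and sidecar container name grafana-sc-dashboard:

# Apply the dashboard ConfigMap (filename is illustrative)
kubectl apply -f my-dashboard.yaml

# Confirm the sidecar picked it up
kubectl logs -n monitoring deploy/kube-prometheus-stack-grafana \
  -c grafana-sc-dashboard | grep my-dashboard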

Method 2: Grafana UI

  1. Login to Grafana
  2. Create dashboard
  3. Save dashboard
  4. Export the dashboard JSON and add it to a ConfigMap for persistence (see the sketch below)
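
A sketch of the last step, assuming the exported JSON was saved as my-dashboard.json:

# Wrap the exported JSON in a ConfigMap and add the label the sidecar watches for
kubectl create configmap my-dashboard -n monitoring --from-file=my-dashboard.json
kubectl label configmap my-dashboard -n monitoring grafana_dashboard="1"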

Adding ServiceMonitors

To monitor a new service, create a ServiceMonitor (the release: kube-prometheus-stack label lets Prometheus' default ServiceMonitor selector discover it):

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: my-app
  namespace: monitoring
  labels:
    release: kube-prometheus-stack
spec:
  selector:
    matchLabels:
      app: my-app
  namespaceSelector:
    matchNames:
      - default
  endpoints:
    - port: metrics
      interval: 30s
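
Once applied, the new target should appear in Prometheus within a scrape interval or two. A hedged check, assuming the backing Service is also named my-app (the job label defaults to the Service name):

# Confirm the ServiceMonitor exists and the target is healthy
kubectl get servicemonitor my-app -n monitoring
curl -s https://prometheus.ragas.cc/api/v1/targets | \
  jq '.data.activeTargets[] | select(.labels.job == "my-app")'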

Alerting

View Alerts

# Current alerts in Prometheus
curl -s https://prometheus.ragas.cc/api/v1/alerts | jq

# Alertmanager status
curl -s https://alertmanager.ragas.cc/api/v2/alerts | jq

Custom Alert Rules (PrometheusRule)

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: my-alerts
  namespace: monitoring
  labels:
    release: kube-prometheus-stack
spec:
  groups:
    - name: my-app
      rules:
        - alert: MyAppDown
          expr: up{job="my-app"} == 0
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "MyApp is down"
            description: "MyApp has been down for 5 minutes"
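
After the operator reloads Prometheus, the new rule group should be visible via the rules API. For example:

# Confirm the my-app rule group was loaded
curl -s https://prometheus.ragas.cc/api/v1/rules | \
  jq '.data.groups[] | select(.name == "my-app")'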

Storage

Component    | Storage          | Size
Prometheus   | PVC (ceph-block) | 50Gi
Alertmanager | PVC (ceph-block) | 5Gi
Grafana      | PVC (ceph-block) | 10Gi
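
To check the PVCs backing these components and their bound capacity:

# List monitoring PVCs with storage class and size
kubectl get pvc -n monitoring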

Retention

Default retention settings:

  • Time: 7 days
  • Size: 45GB

Adjust these in the HelmRelease values:

prometheus:
  prometheusSpec:
    retention: 14d
    retentionSize: 20GB
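
Flux applies the new values on the next reconciliation, and the operator propagates them to the Prometheus custom resource. A quick verification, assuming the default resource name kube-prometheus-stack-prometheus:

# Check the retention settings actually applied by the operator
kubectl get prometheus -n monitoring kube-prometheus-stack-prometheus \
  -o jsonpath='{.spec.retention}{" "}{.spec.retentionSize}{"\n"}'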

Troubleshooting

Prometheus not scraping

# Check targets
curl -s https://prometheus.ragas.cc/api/v1/targets | jq '.data.activeTargets[] | select(.health != "up")'

# Check ServiceMonitor
kubectl get servicemonitor -A
kubectl describe servicemonitor <name> -n monitoring

Grafana datasource issues

# Check datasource config
kubectl get secret -n monitoring kube-prometheus-stack-grafana -o jsonpath='{.data.datasources\.yaml}' | base64 -d

High memory usage

# Check Prometheus memory
kubectl top pod -n monitoring -l app.kubernetes.io/name=prometheus

# Check cardinality
curl -s https://prometheus.ragas.cc/api/v1/status/tsdb | jq

Files

  • HelmRelease: kubernetes/apps/monitoring/kube-prometheus-stack/app/helmrelease.yaml
  • HTTPRoutes: kubernetes/apps/monitoring/kube-prometheus-stack/app/httproutes.yaml
  • Kustomization: kubernetes/apps/monitoring/kube-prometheus-stack/ks.yaml