# Monitoring Stack
The cluster uses kube-prometheus-stack for comprehensive monitoring.
## Components
| Component | URL | Purpose |
|---|---|---|
| Prometheus | https://prometheus.ragas.cc | Metrics collection & storage |
| Grafana | https://grafana.ragas.cc | Visualization & dashboards |
| Alertmanager | https://alertmanager.ragas.cc | Alert routing & notification |
## Access
Grafana admin credentials are managed via a Kubernetes Secret (referenced by the kube-prometheus-stack HelmRelease) and are not the chart defaults.
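To retrieve the credentials, read them from the Secret. The Secret name and key names below are the chart defaults (the same Secret is referenced in the troubleshooting section); adjust if the HelmRelease points at a different Secret:

```sh
# Assumes the chart-default Secret name and keys (admin-user / admin-password).
kubectl get secret -n monitoring kube-prometheus-stack-grafana \
  -o jsonpath='{.data.admin-user}' | base64 -d; echo
kubectl get secret -n monitoring kube-prometheus-stack-grafana \
  -o jsonpath='{.data.admin-password}' | base64 -d; echo
```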
## Architecture

```mermaid
flowchart TB
  subgraph "Monitoring Stack"
    direction TB
    sm["ServiceMonitors / scrape targets"] --> prom["Prometheus<br>metrics"]
    prom --> graf["Grafana<br>dashboards"]
    graf -->|"query"| prom
    prom -->|"alerts"| am["Alertmanager<br>alerts"]
  end
```
## Pre-built Dashboards
kube-prometheus-stack includes these dashboards:
- Kubernetes / Compute Resources / Cluster
- Kubernetes / Compute Resources / Namespace (Pods)
- Kubernetes / Compute Resources / Node (Pods)
- Kubernetes / Networking / Cluster
- Node Exporter / Nodes
- CoreDNS
- etcd
## Adding Custom Dashboards

### Method 1: ConfigMap
```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: my-dashboard
  namespace: monitoring
  labels:
    grafana_dashboard: "1"
data:
  my-dashboard.json: |
    {
      "title": "My Dashboard",
      ...
    }
```
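The Grafana sidecar in kube-prometheus-stack watches for ConfigMaps carrying the `grafana_dashboard` label and loads them automatically. A quick way to apply and verify (the filename `my-dashboard.yaml` is just an example for the manifest above saved to disk):

```sh
# Apply the dashboard ConfigMap, then confirm the label the
# Grafana sidecar watches is present.
kubectl apply -f my-dashboard.yaml
kubectl get configmap -n monitoring -l grafana_dashboard=1
```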
### Method 2: Grafana UI

1. Log in to Grafana
2. Create the dashboard
3. Save the dashboard
4. Export the JSON and add it to a ConfigMap for persistence
## Adding ServiceMonitors
To monitor a new service, create a ServiceMonitor:
```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: my-app
  namespace: monitoring
  labels:
    release: kube-prometheus-stack
spec:
  selector:
    matchLabels:
      app: my-app
  namespaceSelector:
    matchNames:
      - default
  endpoints:
    - port: metrics
      interval: 30s
```
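The ServiceMonitor selects a Service by label and references a *named* port on it. For illustration, a matching Service might look like the sketch below (the port number 9090 is an assumption; use whatever port your app exposes metrics on):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: my-app
  namespace: default
  labels:
    app: my-app          # must match spec.selector.matchLabels in the ServiceMonitor
spec:
  selector:
    app: my-app
  ports:
    - name: metrics      # the port *name* the ServiceMonitor endpoint references
      port: 9090
      targetPort: 9090
```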
## Alerting

### View Alerts
```sh
# Current alerts in Prometheus
curl -s https://prometheus.ragas.cc/api/v1/alerts | jq

# Alertmanager status
curl -s https://alertmanager.ragas.cc/api/v2/alerts | jq
```
### Custom PrometheusRule
```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: my-alerts
  namespace: monitoring
  labels:
    release: kube-prometheus-stack
spec:
  groups:
    - name: my-app
      rules:
        - alert: MyAppDown
          expr: up{job="my-app"} == 0
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "MyApp is down"
            description: "MyApp has been down for 5 minutes"
```
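Before committing, the manifest can be validated against the PrometheusRule CRD schema with a server-side dry run (the filename `my-alerts.yaml` is just an example for the manifest above saved to disk):

```sh
# Validates the rule against the CRD schema without creating it
kubectl apply -f my-alerts.yaml --dry-run=server
```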
## Storage

| Component | Storage | Size |
|---|---|---|
| Prometheus | PVC (ceph-block) | 50Gi |
| Alertmanager | PVC (ceph-block) | 5Gi |
| Grafana | PVC (ceph-block) | 10Gi |
## Retention

Default retention settings:

- Time: 7 days
- Size: 45GB
Adjust in HelmRelease:
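A sketch of the relevant values, assuming the standard kube-prometheus-stack chart value paths (`prometheus.prometheusSpec.retention` / `retentionSize`); the surrounding HelmRelease structure is abbreviated:

```yaml
spec:
  values:
    prometheus:
      prometheusSpec:
        retention: 7d        # time-based retention
        retentionSize: 45GB  # size-based retention
```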
## Troubleshooting

### Prometheus not scraping
```sh
# Check targets
curl https://prometheus.ragas.cc/api/v1/targets | jq '.data.activeTargets[] | select(.health != "up")'

# Check ServiceMonitor
kubectl get servicemonitor -A
kubectl describe servicemonitor <name> -n monitoring
```
### Grafana datasource issues
```sh
# Check datasource config
kubectl get secret -n monitoring kube-prometheus-stack-grafana -o jsonpath='{.data.datasources\.yaml}' | base64 -d
```
### High memory usage
```sh
# Check Prometheus memory
kubectl top pod -n monitoring -l app.kubernetes.io/name=prometheus

# Check cardinality
curl https://prometheus.ragas.cc/api/v1/status/tsdb | jq
```
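High cardinality is the usual driver of Prometheus memory growth. The TSDB status endpoint reports per-metric series counts, which can be narrowed down with jq:

```sh
# Top metric names by series count, from the TSDB status endpoint
curl -s https://prometheus.ragas.cc/api/v1/status/tsdb | \
  jq '.data.seriesCountByMetricName[:10]'
```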
## Files

- HelmRelease: `kubernetes/apps/monitoring/kube-prometheus-stack/app/helmrelease.yaml`
- HTTPRoutes: `kubernetes/apps/monitoring/kube-prometheus-stack/app/httproutes.yaml`
- Kustomization: `kubernetes/apps/monitoring/kube-prometheus-stack/ks.yaml`