# Monitoring Stack
The cluster uses kube-prometheus-stack for comprehensive monitoring.
## Components
| Component | URL | Purpose |
|---|---|---|
| Prometheus | https://prometheus.ragas.cc | Metrics collection & storage |
| Grafana | https://grafana.ragas.cc | Visualization & dashboards |
| Alertmanager | https://alertmanager.ragas.cc | Alert routing & notification |
## Access
Grafana admin credentials are managed via a Kubernetes Secret (referenced by the kube-prometheus-stack HelmRelease) and are not the chart defaults.
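To retrieve the credentials, read them from the Secret. The Secret name and key names below are the chart defaults (the same Secret is referenced in the troubleshooting section); adjust if the HelmRelease points at a different Secret:

```sh
# Assumes the chart-default Secret name and keys (admin-user / admin-password).
kubectl get secret -n monitoring kube-prometheus-stack-grafana \
  -o jsonpath='{.data.admin-user}' | base64 -d; echo
kubectl get secret -n monitoring kube-prometheus-stack-grafana \
  -o jsonpath='{.data.admin-password}' | base64 -d; echo
```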
## Architecture

```mermaid
flowchart TB
  subgraph "Monitoring Stack"
    direction TB
    sm["ServiceMonitors / scrape targets"] --> prom["Prometheus<br>metrics"]
    prom --> graf["Grafana<br>dashboards"]
    graf -->|"query"| prom
    prom -->|"alerts"| am["Alertmanager<br>alerts"]
  end
```
## Pre-built Dashboards
kube-prometheus-stack includes these dashboards:
- Kubernetes / Compute Resources / Cluster
- Kubernetes / Compute Resources / Namespace (Pods)
- Kubernetes / Compute Resources / Node (Pods)
- Kubernetes / Networking / Cluster
- Node Exporter / Nodes
- CoreDNS
- etcd
## Adding Custom Dashboards

### Method 1: ConfigMap
```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: my-dashboard
  namespace: monitoring
  labels:
    grafana_dashboard: "1"
data:
  my-dashboard.json: |
    {
      "title": "My Dashboard",
      ...
    }
```
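The Grafana sidecar in kube-prometheus-stack watches for ConfigMaps carrying the `grafana_dashboard` label and loads them automatically. A quick way to apply and verify (the filename `my-dashboard.yaml` is just an example for the manifest above saved to disk):

```sh
# Apply the dashboard ConfigMap, then confirm the label the
# Grafana sidecar watches is present.
kubectl apply -f my-dashboard.yaml
kubectl get configmap -n monitoring -l grafana_dashboard=1
```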
### Method 2: Grafana UI

1. Log in to Grafana
2. Create the dashboard
3. Save the dashboard
4. Export the JSON and add it to a ConfigMap for persistence
## Adding ServiceMonitors
To monitor a new service, create a ServiceMonitor:
```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: my-app
  namespace: monitoring
  labels:
    release: kube-prometheus-stack
spec:
  selector:
    matchLabels:
      app: my-app
  namespaceSelector:
    matchNames:
      - default
  endpoints:
    - port: metrics
      interval: 30s
```
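The ServiceMonitor selects a Service by label and references a *named* port on it. For illustration, a matching Service might look like the sketch below (the port number 9090 is an assumption; use whatever port your app exposes metrics on):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: my-app
  namespace: default
  labels:
    app: my-app          # must match spec.selector.matchLabels in the ServiceMonitor
spec:
  selector:
    app: my-app
  ports:
    - name: metrics      # the port *name* the ServiceMonitor endpoint references
      port: 9090
      targetPort: 9090
```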
## Alerting

### View Alerts
```sh
# Current alerts in Prometheus
curl -s https://prometheus.ragas.cc/api/v1/alerts | jq

# Alertmanager status
curl -s https://alertmanager.ragas.cc/api/v2/alerts | jq
```
### Custom PrometheusRule
```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: my-alerts
  namespace: monitoring
  labels:
    release: kube-prometheus-stack
spec:
  groups:
    - name: my-app
      rules:
        - alert: MyAppDown
          expr: up{job="my-app"} == 0
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "MyApp is down"
            description: "MyApp has been down for 5 minutes"
```
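Before committing, the manifest can be validated against the PrometheusRule CRD schema with a server-side dry run (the filename `my-alerts.yaml` is just an example for the manifest above saved to disk):

```sh
# Validates the rule against the CRD schema without creating it
kubectl apply -f my-alerts.yaml --dry-run=server
```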
## Storage

| Component | Storage | Size |
|---|---|---|
| Prometheus | PVC (ceph-block) | 50Gi |
| Alertmanager | PVC (ceph-block) | 5Gi |
| Grafana | PVC (ceph-block) | 10Gi |
## Retention

Default retention settings:

- Time: 7 days
- Size: 45GB
Adjust in HelmRelease:
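A sketch of the relevant values, assuming the standard kube-prometheus-stack chart value paths (`prometheus.prometheusSpec.retention` / `retentionSize`); the surrounding HelmRelease structure is abbreviated:

```yaml
spec:
  values:
    prometheus:
      prometheusSpec:
        retention: 7d        # time-based retention
        retentionSize: 45GB  # size-based retention
```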
## Troubleshooting

### Prometheus not scraping
```sh
# Check targets
curl https://prometheus.ragas.cc/api/v1/targets | jq '.data.activeTargets[] | select(.health != "up")'

# Check ServiceMonitor
kubectl get servicemonitor -A
kubectl describe servicemonitor <name> -n monitoring
```
### Grafana datasource issues
```sh
# Check datasource config
kubectl get secret -n monitoring kube-prometheus-stack-grafana -o jsonpath='{.data.datasources\.yaml}' | base64 -d
```
### High memory usage
```sh
# Check Prometheus memory
kubectl top pod -n monitoring -l app.kubernetes.io/name=prometheus

# Check cardinality
curl https://prometheus.ragas.cc/api/v1/status/tsdb | jq
```
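High cardinality is the usual driver of Prometheus memory growth. The TSDB status endpoint reports per-metric series counts, which can be narrowed down with jq:

```sh
# Top metric names by series count, from the TSDB status endpoint
curl -s https://prometheus.ragas.cc/api/v1/status/tsdb | \
  jq '.data.seriesCountByMetricName[:10]'
```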
## Files

- HelmRelease: `kubernetes/apps/monitoring/kube-prometheus-stack/app/helmrelease.yaml`
- HTTPRoutes: `kubernetes/apps/monitoring/kube-prometheus-stack/app/httproutes.yaml`
- Kustomization: `kubernetes/apps/monitoring/kube-prometheus-stack/ks.yaml`