Runbook: Node Failure
Symptoms
- Node shows `NotReady` in `kubectl get nodes`
- Pods on the node are `Pending` or `Unknown`
- Alerts firing for node down
Quick Check
# Check node status
kubectl get nodes
kubectl describe node <node-name>
# Check Talos status
talosctl -n <node-ip> health
talosctl -n <node-ip> services
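When several nodes are listed, it can help to filter the output down to the affected ones. The following is a minimal sketch, assuming the default `kubectl get nodes` column layout; the function names `not_ready_nodes` and `pods_on_node` are illustrative and not part of any tool.

```shell
# List nodes whose STATUS column does not start with "Ready".
# Assumes the default layout: NAME STATUS ROLES AGE VERSION.
# "Ready,SchedulingDisabled" still counts as Ready here.
not_ready_nodes() {
    kubectl get nodes --no-headers | awk '$2 !~ /^Ready/ { print $1 }'
}

# List every pod scheduled on the given node, across all namespaces.
# Usage: pods_on_node <node-name>
pods_on_node() {
    kubectl get pods --all-namespaces -o wide --field-selector "spec.nodeName=$1"
}
```

Running `not_ready_nodes` first and then `pods_on_node <node-name>` for each result gives a quick view of which workloads are affected.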
Diagnosis
1. Network Connectivity
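A basic reachability check could look like the following. This is only a sketch: it assumes ICMP is permitted on the network, that `nc` is installed on the workstation, and that the Talos API listens on its default TCP port 50000. The function name `check_node_network` is hypothetical.

```shell
# Check basic reachability of a node: ICMP first, then the Talos API port.
# Assumptions: ICMP is allowed, `nc` is available, Talos API is on port 50000.
# Usage: check_node_network <node-ip>
check_node_network() {
    node_ip=$1
    if ! ping -c 3 "$node_ip" >/dev/null 2>&1; then
        echo "ping failed: $node_ip"
        return 1
    fi
    if ! nc -z -w 3 "$node_ip" 50000 >/dev/null 2>&1; then
        echo "Talos API port 50000 unreachable: $node_ip"
        return 2
    fi
    echo "network ok: $node_ip"
}
```

A failed ping points at the VM or the network path, while a reachable host with a closed API port suggests the node booted but Talos is not serving requests.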
2. VM Status (if virtualized)
# Check VM status
ssh root@<pve-host> "qm status <vmid>"
# Check VM console
ssh root@<pve-host> "qm terminal <vmid>"
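The status check can be combined with a conditional start. The sketch below assumes `qm status` prints a line of the form `status: running` or `status: stopped`, and that key-based SSH access to the Proxmox host is set up; `ensure_vm_running` is an illustrative name.

```shell
# Start the VM only if Proxmox reports it as stopped.
# Assumes `qm status <vmid>` prints "status: <state>".
# Usage: ensure_vm_running <pve-host> <vmid>
ensure_vm_running() {
    pve_host=$1
    vmid=$2
    state=$(ssh "root@$pve_host" "qm status $vmid" | awk '/^status:/ { print $2 }')
    case "$state" in
        running)
            echo "VM $vmid already running"
            ;;
        stopped)
            ssh "root@$pve_host" "qm start $vmid" && echo "VM $vmid started"
            ;;
        *)
            # Any other state (or no output at all) needs a human to look.
            echo "VM $vmid in unexpected state: ${state:-unknown}" >&2
            return 1
            ;;
    esac
}
```

Any state other than `running` or `stopped` is treated as an error on purpose, so the function never acts on a VM it does not understand.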
3. Talos Health
# Check all services
talosctl -n <node-ip> services
# Check specific service
talosctl -n <node-ip> service kubelet
talosctl -n <node-ip> service etcd # control plane only
# Check logs
talosctl -n <node-ip> logs kubelet
Recovery Procedures
Scenario 1: VM Not Running
# Start the VM
ssh root@<pve-host> "qm start <vmid>"
# Wait for boot
sleep 60
# Verify
talosctl -n <node-ip> health
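A fixed `sleep 60` may be too short on a slow host and needlessly long on a fast one. One alternative is to poll until the node answers. This is a minimal sketch; `wait_until` is a hypothetical helper, and it uses plain global variables because POSIX `sh` has no `local`.

```shell
# Retry a command until it succeeds or the timeout (in seconds) expires.
# Usage: wait_until <timeout-seconds> <interval-seconds> <command> [args...]
wait_until() {
    timeout_s=$1
    interval_s=$2
    shift 2
    elapsed=0
    until "$@" >/dev/null 2>&1; do
        if [ "$elapsed" -ge "$timeout_s" ]; then
            echo "timed out after ${elapsed}s: $*" >&2
            return 1
        fi
        sleep "$interval_s"
        elapsed=$((elapsed + interval_s))
    done
}
```

For example, `wait_until 300 10 talosctl -n <node-ip> version` would probe the Talos API every 10 seconds for up to 5 minutes, assuming that command exits non-zero while the API is unreachable. The full `talosctl -n <node-ip> health` check can then run once the probe succeeds.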
Scenario 2: Talos Service Crashed
# Restart kubelet
talosctl -n <node-ip> service kubelet restart
# If etcd is unhealthy (control plane)
talosctl -n <node-ip> service etcd restart
Scenario 3: Node Unresponsive
# Hard reset via Proxmox
ssh root@<pve-host> "qm reset <vmid>"
# Or stop and start
ssh root@<pve-host> "qm stop <vmid> && sleep 5 && qm start <vmid>"
Scenario 4: etcd Quorum Loss
If 2+ control plane nodes are down:
# Check etcd status
talosctl -n 172.16.1.50 etcd members
# If quorum is lost, save a snapshot from a healthy control plane node
talosctl -n <healthy-node> etcd snapshot db.snapshot
# Then restore from db.snapshot by following the Talos disaster recovery docs
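The "2+ nodes down" threshold follows from etcd's majority rule if the control plane has three members, which is an assumption here since the cluster size is not stated. The arithmetic can be sketched as follows; both function names are illustrative.

```shell
# Quorum size for an etcd cluster of N members: floor(N / 2) + 1.
# Usage: etcd_quorum <member-count>
etcd_quorum() {
    echo $(( $1 / 2 + 1 ))
}

# Number of members that can fail while the cluster still has quorum.
# Usage: etcd_fault_tolerance <member-count>
etcd_fault_tolerance() {
    echo $(( $1 - ($1 / 2 + 1) ))
}
```

With three members, quorum is 2 and only one failure is tolerated, so losing a second control plane node stops etcd from accepting writes. With five members, two failures are tolerated.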
Scenario 5: Node Needs Replacement
# Drain the node
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data
# Remove from cluster
kubectl delete node <node-name>
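# If control plane, remove its etcd member (take the member ID from `talosctl etcd members`)
talosctl -n <other-cp> etcd remove-member <node-id>
# Recreate the VM, then apply the config (a fresh node in maintenance mode requires --insecure)
talosctl apply-config --insecure --nodes <new-ip> --file <config.yaml>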
Post-Recovery
- Verify node is `Ready`
- Check pods rescheduled
- Verify cluster health
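The three checks above could be scripted roughly as follows. This is a sketch rather than a definitive procedure: `post_recovery_check` is a hypothetical name, the 300-second timeout is an arbitrary choice, and the health check reuses the `talosctl -n <node-ip> health` form from the Quick Check section.

```shell
# Run the three post-recovery checks for one node.
# Usage: post_recovery_check <node-name> <node-ip>
post_recovery_check() {
    node_name=$1
    node_ip=$2

    # 1. Wait for the node to report the Ready condition.
    kubectl wait --for=condition=Ready "node/$node_name" --timeout=300s || return 1

    # 2. Show pods that are neither Running nor Completed so stragglers stand out.
    #    `|| true` keeps an empty result from counting as a failure.
    kubectl get pods --all-namespaces -o wide | grep -Ev 'Running|Completed' || true

    # 3. Run the Talos health check.
    talosctl -n "$node_ip" health || return 1
}
```

The pod listing is informational only, so the function's exit status reflects just the node readiness and health checks.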
Prevention
- Enable HA for critical workloads
- Use PodDisruptionBudgets
- Take regular etcd snapshots
- Monitor node health with Prometheus
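Regular snapshots are easier to keep up when each one gets a unique file name. The following is a minimal sketch built on the `talosctl etcd snapshot` command used in Scenario 4; the function name `etcd_backup` and the file naming scheme are assumptions.

```shell
# Take a timestamped etcd snapshot from a control plane node.
# Prints the snapshot path on stdout; talosctl's own output goes to stderr.
# Usage: etcd_backup <control-plane-ip> <backup-dir>
etcd_backup() {
    cp_ip=$1
    backup_dir=$2
    snapshot="$backup_dir/etcd-$(date +%Y%m%d-%H%M%S).snapshot"
    mkdir -p "$backup_dir" || return 1
    talosctl -n "$cp_ip" etcd snapshot "$snapshot" >&2 || return 1
    echo "$snapshot"
}
```

A scheduler such as cron could call `etcd_backup <control-plane-ip> <backup-dir>` on a fixed interval, with old files pruned separately.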