Runbook: Certificate Issues

Symptoms

  • Browser shows certificate errors
  • NET::ERR_CERT_AUTHORITY_INVALID
  • Certificate has expired
  • Services unreachable via HTTPS

Quick Check

# Check certificate status
kubectl get certificates -A

# Check certificate details
kubectl describe certificate <name> -n <namespace>

# Check cert-manager logs
kubectl logs -n cert-manager -l app=cert-manager --tail=50

Diagnosis

1. Certificate Not Ready

# List certificates in all namespaces; READY=False marks a failing one
kubectl get certificate -A

# Check events
kubectl describe certificate <name> -n <namespace>
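
On a cluster with many certificates, the listing is easier to scan once the healthy rows are filtered out. The sketch below assumes the default column order printed by kubectl get certificates -A (NAMESPACE, NAME, READY, ...). filter_not_ready is a hypothetical helper name, not part of kubectl or cert-manager.

```shell
# filter_not_ready: read `kubectl get certificates -A --no-headers` output on
# stdin and print namespace/name for every row whose READY column is not "True".
# Assumes the third column is READY (the default layout with -A).
filter_not_ready() {
  awk 'NF && $3 != "True" { print $1 "/" $2 }'
}

# Usage against a live cluster:
#   kubectl get certificates -A --no-headers | filter_not_ready
```

Each line of output can be passed to kubectl describe certificate to read the failure events.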

2. Challenge Failed

# Check challenges
kubectl get challenges -A

# Check challenge details
kubectl describe challenge <name> -n <namespace>

# No challenges listed? Check the parent orders for the failure reason
kubectl get orders -A

3. Issuer Problems

# Check cluster issuers
kubectl get clusterissuer

# Check issuer status
kubectl describe clusterissuer letsencrypt-production

Recovery Procedures

Scenario 1: DNS Challenge Failing

# Check that the Cloudflare API token secret exists (values under data are base64-encoded)
kubectl get secret -n cert-manager cloudflare-api-token -o yaml

# Verify token has correct permissions:
# - Zone:DNS:Edit
# - Zone:Zone:Read

# Test DNS propagation
dig _acme-challenge.myapp.ragas.cc TXT
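
The TXT record name follows from the certificate hostname, and a wildcard is validated at its base domain, which is easy to get wrong when typing the dig query by hand. A minimal sketch: acme_record_name and check_acme_txt are hypothetical helper names, and the default resolver 1.1.1.1 is an arbitrary choice for illustration.

```shell
# acme_record_name: print the DNS-01 TXT record name for a certificate hostname.
# A wildcard such as "*.ragas.cc" is validated at the base domain.
acme_record_name() {
  host="${1#\*.}"   # drop a leading "*." if present
  host="${host%.}"  # drop a trailing dot if present
  printf '_acme-challenge.%s\n' "$host"
}

# check_acme_txt: query the TXT record at a given resolver
# (default 1.1.1.1, an assumption made for this example).
check_acme_txt() {
  dig +short TXT "$(acme_record_name "$1")" "@${2:-1.1.1.1}"
}

# Usage:
#   check_acme_txt myapp.ragas.cc
#   check_acme_txt '*.ragas.cc' 8.8.8.8
```

Querying a public resolver rather than the local one helps separate a record that was never created from one that has not propagated yet.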

Scenario 2: Rate Limited

Let's Encrypt rate limits:

  • 50 certificates per registered domain per week
  • 5 failed validations per hostname, per account, per hour

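# Check for rate limit errors (matches "rateLimited", "rate limit" and "rate-limits")
# --tail=-1 is needed because kubectl shows only the last 10 lines per pod when -l is used
kubectl logs -n cert-manager -l app=cert-manager --tail=-1 | grep -iE "rate.?limit"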

# Wait and retry, or use staging issuer

Scenario 3: Certificate Expired

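# Confirm the expiry date and look for failed renewal attempts
kubectl describe certificate <name> -n <namespace>

# Trigger a renewal (requires the cmctl CLI)
cmctl renew <name> -n <namespace>

# Last resort: delete the certificate. It is recreated automatically only if
# something else manages it (ingress/gateway annotations or GitOps);
# otherwise re-apply its manifest afterwards.
kubectl delete certificate <name> -n <namespace>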

Scenario 4: Secret Missing

# Check if secret exists
kubectl get secret <cert-secret-name> -n <namespace>

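# If missing, cert-manager normally re-issues on its own; check the events first
kubectl describe certificate <name> -n <namespace>

# Last resort: delete the certificate, but only if it will be recreated
# (ingress/gateway annotations or GitOps)
kubectl delete certificate <name> -n <namespace>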

Scenario 5: Wrong Certificate Served

# Check which cert the gateway is using
kubectl get gateway -n network -o yaml | grep secretName

# Verify certificate matches hostname
openssl s_client -connect myapp.ragas.cc:443 -servername myapp.ragas.cc 2>/dev/null | openssl x509 -noout -text | grep -A1 "Subject Alternative Name"
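
When a certificate carries many names, the grep output above is a single comma-separated line that is hard to compare against a hostname. The sketch below splits it into one name per line. list_sans is a hypothetical helper name, and it assumes the text layout produced by openssl x509 -noout -text.

```shell
# list_sans: read `openssl x509 -noout -text` output on stdin and print each
# DNS Subject Alternative Name on its own line.
list_sans() {
  grep -A1 'Subject Alternative Name' | tail -n 1 | tr ',' '\n' | sed -n 's/^ *DNS://p'
}

# Usage (</dev/null makes s_client exit once the handshake completes):
#   openssl s_client -connect myapp.ragas.cc:443 -servername myapp.ragas.cc </dev/null 2>/dev/null \
#     | openssl x509 -noout -text | list_sans
```

If the expected hostname (or a wildcard covering it) is absent from the output, the gateway is serving a certificate from a different secret.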

Manual Certificate Renewal

# Force renewal by deleting the secret
kubectl delete secret <cert-secret-name> -n <namespace>

# cert-manager will issue a new certificate
kubectl get certificate -n <namespace> -w

Using Staging Issuer

For testing, use Let's Encrypt staging. It has much higher rate limits, but its certificates are not trusted by browsers:

apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-staging
spec:
  acme:
    server: https://acme-staging-v02.api.letsencrypt.org/directory
    email: your@email.com
    privateKeySecretRef:
      name: letsencrypt-staging-key
    solvers:
      - dns01:
          cloudflare:
            apiTokenSecretRef:
              name: cloudflare-api-token
              key: api-token

Prevention

  • Monitor certificate expiry with Prometheus
  • Set up alerts for expiring certificates
  • Use wildcard certificates to reduce rate limit risk
  • Test with staging before production
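
Until Prometheus alerting is in place, expiry can be spot-checked from the certificate's status.notAfter field. This is a minimal sketch, not a replacement for monitoring: days_until_expiry is a hypothetical helper name, and the usage lines assume GNU date for parsing the timestamp.

```shell
# days_until_expiry NOT_AFTER_EPOCH [NOW_EPOCH]
# Print the whole days left until expiry (negative once expired by a day or more).
# NOW_EPOCH defaults to the current time.
days_until_expiry() {
  not_after="$1"
  now="${2:-$(date +%s)}"
  echo $(( (not_after - now) / 86400 ))
}

# Usage (replace NAME and NAMESPACE; `date -d` is GNU-specific):
#   not_after="$(kubectl get certificate NAME -n NAMESPACE -o jsonpath='{.status.notAfter}')"
#   days_until_expiry "$(date -d "$not_after" +%s)"
```

With 90-day Let's Encrypt certificates, cert-manager by default renews roughly 30 days before expiry, so a value well below 30 suggests that renewal is stuck.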