Kubernetes OOM Killer: Root Cause & Fix Guide

P1 (High Impact) · Category: Kubernetes / Memory · Exit code: 137 · Avg diagnosis time: 35 min manual → seconds with OM

01 What is the Kubernetes OOM Killer?

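The Linux OOM (Out of Memory) Killer is a kernel mechanism that terminates processes when memory runs out, either on the whole node or inside a single cgroup. Kubernetes enforces container memory limits via cgroups: when a container reaches its limit and the kernel cannot reclaim enough memory, the kernel kills a process in that container with SIGKILL, and the container exits with code 137 (128 + 9, where 9 is the signal number of SIGKILL).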

This is not an application crash. It's an intentional termination by the kernel. The container is killed, Kubernetes restarts it according to the pod's restartPolicy, and if the root cause isn't fixed, the repeated kills end in CrashLoopBackOff.
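The 137 follows the common convention of reporting death by signal as 128 plus the signal number. A minimal sketch of that arithmetic is below; the helper name signal_from_exit_code is made up for illustration, and an exit code above 128 is only a strong hint of a signal, since a process can also choose to exit with that value itself.

```shell
# Hypothetical helper: map an exit code above 128 to the signal number behind it.
signal_from_exit_code() {
  code=$1
  if [ "$code" -gt 128 ]; then
    echo $((code - 128))
  else
    echo "not-a-signal"
  fi
}

signal_from_exit_code 137   # prints 9, the number of SIGKILL

# Reproduce it locally: a child shell sends SIGKILL to itself,
# and the parent shell reports the 128 + 9 exit status.
sh -c 'kill -9 $$' || echo "child exit code: $?"
```

The local reproduction involves no memory pressure at all, which is the point: the exit code tells you the signal, not who sent it.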

02 Symptoms to look for

🔴
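Exit code 137 in the container status
A strong indicator, but not proof on its own. Exit 137 means the process received SIGKILL (128 + 9). The OOM killer is the most common sender, but any SIGKILL produces the same code, for example when a container ignores SIGTERM and is force-killed after its termination grace period. Confirm with the OOMKilled reason in kubectl describe pod.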
🔴
CrashLoopBackOff status
Pod repeatedly killed and restarted. kubectl get pods shows CrashLoopBackOff.
🟡
OOMKilled in kubectl describe pod
Last State shows Reason: OOMKilled in the container status section.
🟡
Memory usage at or near limit
Grafana/Datadog shows container memory at 95-100% of limit before each kill.
🟠
Kernel log: oom_kill_process
Node-level: dmesg or /var/log/kern.log shows oom_kill_process for the container PID.

03 Exact diagnostic commands

Step 1 — Confirm OOMKilled

# Check pod status and last termination reason
kubectl describe pod <pod-name> -n <namespace>

# Look for this in the output:
Last State: Terminated
  Reason: OOMKilled
  Exit Code: 137

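# Quick check across all pods in namespace (prints the names of pods with an OOMKilled container)
# The []? guard skips pods that have no containerStatuses yet (e.g. Pending)
kubectl get pods -n <namespace> -o json | \
  jq -r '.items[] | select(any(.status.containerStatuses[]?; .lastState.terminated.reason == "OOMKilled")) | .metadata.name'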
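When you only have saved describe output, for example pasted into an incident ticket, the same check can be scripted without cluster access. The sketch below assumes the Last State section looks like the excerpt above; the sample text and the helper name last_termination are illustrative, not captured from a real cluster, and a multi-container pod would report only the last container's values.

```shell
# Hypothetical helper: read `kubectl describe pod` text on stdin and print
# the Reason and Exit Code found under the "Last State" section.
last_termination() {
  awk -F': *' '
    /^ *Last State:/           { in_last = 1; next }
    in_last && /^ *Reason:/    { reason = $2 }
    in_last && /^ *Exit Code:/ { code = $2; in_last = 0 }
    END { print reason, code }
  '
}

# Illustrative sample shaped like the excerpt above.
sample='    Last State:     Terminated
      Reason:       OOMKilled
      Exit Code:    137'

printf '%s\n' "$sample" | last_termination   # prints: OOMKilled 137
```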

Step 2 — Check current memory limits vs usage

# See current limits
kubectl get pod <pod-name> -o jsonpath='{.spec.containers[*].resources}'

# See live memory usage
kubectl top pod <pod-name> --containers

# Watch memory in real-time (if it's happening now)
watch -n 2 kubectl top pod <pod-name>
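Comparing the two outputs by eye gets tedious when the units differ (Mi in one place, Gi in another). A small sketch of the comparison follows; the helper names to_mib and usage_percent are hypothetical, only the Ki, Mi and Gi suffixes are handled, and the figures are examples rather than measurements.

```shell
# Hypothetical helper: convert a memory quantity with a Ki, Mi or Gi suffix to whole MiB.
to_mib() {
  case $1 in
    *Gi) echo $(( ${1%Gi} * 1024 )) ;;
    *Mi) echo "${1%Mi}" ;;
    *Ki) echo $(( ${1%Ki} / 1024 )) ;;
    *)   echo "unsupported quantity: $1" >&2; return 1 ;;
  esac
}

# Usage as a whole-number percentage of the limit (rounded down).
usage_percent() {
  used=$(to_mib "$1") || return 1
  limit=$(to_mib "$2") || return 1
  echo $(( used * 100 / limit ))
}

usage_percent 511Mi 512Mi   # prints 99
usage_percent 300Mi 1Gi     # prints 29
```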

Step 3 — Check node-level OOM events

# Get the node the pod ran on
kubectl get pod <pod-name> -o jsonpath='{.spec.nodeName}'

# Check kernel OOM logs on that node
kubectl debug node/<node-name> -it --image=ubuntu -- dmesg | grep -i oom

# Or via node shell
ssh <node-ip> "dmesg | grep -i 'oom_kill\|Out of memory'"

04 Root causes — ranked by frequency

1. Memory limit set too low (most common — ~60%)

The container's memory limit was set conservatively during initial deployment and never updated as the application's actual memory footprint grew.

# Your configured requests and limits
kubectl get pod <pod-name> -o yaml | grep -A6 resources:
# Compare with kubectl top: if usage regularly exceeds 80% of the limit, the limit is too low

# Fix: increase memory limit
kubectl set resources deployment/<name> \
  --limits=memory=1Gi \
  --requests=memory=512Mi
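To pick a new value rather than guess, you can work backwards from the 80% rule of thumb above. This is a sketch under that assumption only: min_limit_mib is a hypothetical helper, the 80% target is treated as a ceiling, and the usage figures are examples. Real sizing should use observed peak usage, not a single reading.

```shell
# Hypothetical helper: smallest limit (MiB) that keeps the given usage (MiB)
# at or below a target percentage of the limit, rounding up.
min_limit_mib() {
  usage_mib=$1
  target_pct=${2:-80}
  echo $(( (usage_mib * 100 + target_pct - 1) / target_pct ))
}

min_limit_mib 511       # prints 639
min_limit_mib 800 80    # prints 1000
```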

2. Memory leak in application (~25%)

The application allocates memory and never releases it, so usage grows over time until the container is killed. Pattern: memory climbs steadily, the pod restarts, and memory climbs again.

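# Check memory trend over time in Prometheus/Grafana
# (container_memory_usage_bytes also counts reclaimable page cache, so the working set tracks OOM risk more closely)
container_memory_working_set_bytes{pod=~"<pod-name>.*"}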

# If memory grows linearly over hours → leak
# Take heap dump while running (Java example)
kubectl exec <pod-name> -- \
  jmap -dump:format=b,file=/tmp/heap.hprof <java-pid>

kubectl cp <pod-name>:/tmp/heap.hprof ./heap.hprof
# Analyze with Eclipse MAT or VisualVM
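If the growth does look linear, a rough projection tells you how long you have before the next kill. The sketch below assumes straight-line growth between the first and last sample; the helper name minutes_until_limit and the sample values are made up for illustration.

```shell
# Hypothetical helper: read "minute mib" samples on stdin and estimate the
# minutes left until the limit (first argument, in MiB) is reached, assuming
# linear growth between the first and last sample.
minutes_until_limit() {
  awk -v limit="$1" '
    NR == 1 { t0 = $1; m0 = $2 }
            { t1 = $1; m1 = $2 }
    END {
      if (NR < 2 || t1 == t0) { print "need-more-samples"; exit }
      rate = (m1 - m0) / (t1 - t0)   # MiB per minute
      if (rate <= 0) { print "no-growth"; exit }
      printf "%d\n", (limit - m1) / rate
    }
  '
}

# Illustrative samples: 200 MiB at the start, 320 MiB two hours later.
printf '%s\n' '0 200' '60 260' '120 320' | minutes_until_limit 512   # prints 192
```

A flat or falling series prints no-growth, which points away from a leak and toward one of the other causes in this list.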

3. Sudden traffic spike (~10%)

Memory usage is normal at baseline, but a traffic surge created more concurrent requests than the container's memory could handle. The kill often correlates with a traffic spike visible in your APM.

# Check if traffic spiked before the kill
# Datadog query:
sum:nginx.net.connections{*} by {pod}

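# Fix: HPA to scale out before memory is exhausted
# Note: --cpu-percent scales on CPU utilization, which helps only if CPU rises with the same traffic.
# To scale on memory directly, define an autoscaling/v2 HorizontalPodAutoscaler with a memory metric.
kubectl autoscale deployment <name> \
  --min=2 --max=10 \
  --cpu-percent=70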
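To see what that autoscaler would do during a spike, it helps to run the numbers. The sketch below applies the replica formula from the Kubernetes HPA documentation, ceil(currentReplicas × currentMetric / targetMetric), clamped to the min and max; it ignores the controller's tolerance band and stabilization window, and desired_replicas is a hypothetical helper name.

```shell
# Hypothetical helper: ceil(current_replicas * current_util / target_util),
# clamped to the [min, max] replica range.
desired_replicas() {
  current=$1; util=$2; target=$3; min=$4; max=$5
  want=$(( (current * util + target - 1) / target ))
  if [ "$want" -lt "$min" ]; then want=$min; fi
  if [ "$want" -gt "$max" ]; then want=$max; fi
  echo "$want"
}

desired_replicas 2 90 70 2 10    # prints 3
desired_replicas 4 210 70 2 10   # prints 10 (12 before clamping to --max)
```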

4. Node memory pressure eviction (~5%)

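The node itself is running low on memory. The kubelet evicts pods to free node memory even if a pod hasn't hit its own limit. Evicted pods show the reason Evicted rather than OOMKilled; if memory runs out faster than the kubelet can react, the kernel's node-level OOM killer can also kill container processes directly.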

# Check node conditions
kubectl describe node <node-name> | grep -A5 Conditions

# Look for:
MemoryPressure   True

# Check node memory
kubectl top nodes
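For a script or alert, you may want just the status value instead of the whole table. A minimal sketch, assuming the Conditions table has the condition type in the first column and the status in the second; the sample is trimmed to those two columns and is illustrative, and memory_pressure_status is a hypothetical helper name.

```shell
# Hypothetical helper: print the MemoryPressure status from
# `kubectl describe node` text on stdin, or "unknown" if the row is absent.
memory_pressure_status() {
  awk '$1 == "MemoryPressure" { print $2; found = 1 }
       END { if (!found) print "unknown" }'
}

# Illustrative sample, trimmed to the Type and Status columns.
sample='Conditions:
  Type             Status
  ----             ------
  MemoryPressure   True
  DiskPressure     False'

printf '%s\n' "$sample" | memory_pressure_status   # prints: True
```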

05 OperatorMesh analysis — real example

⚡ OperatorMesh Triage Output

Input:

service: auth-service
error: CrashLoopBackOff — pod restarting every 30s
logs: OOMKilled exit code 137, memory limit 512Mi, usage 511Mi
recent changes: no recent deploys, increased traffic 3x this morning

Root cause identified:

OOMKilled — container memory limit too low for current traffic load. Memory limit 512Mi was set for baseline traffic but 3x traffic surge tripled concurrent request memory.

Diagnosis confidence: 85%
Fix confidence: 76%

Ranked actions:

Investigate first: Memory consumption pattern since traffic spike — is it linear (memory leak) or proportional to traffic (limit too low)?

Rejected hypotheses:

06 Immediate fix checklist

# 1. Increase memory limit (applies immediately via a rolling update; no downtime only if other replicas keep serving)
kubectl patch deployment <name> -p \
  '{"spec":{"template":{"spec":{"containers":[{"name":"<container>","resources":{"limits":{"memory":"2Gi"},"requests":{"memory":"1Gi"}}}]}}}}'

# 2. Verify pod comes up cleanly
kubectl rollout status deployment/<name>
kubectl get pods -w

# 3. Monitor memory after fix
watch kubectl top pod -l app=<name>

# 4. Set VPA for automatic management (long-term; requires the Vertical Pod Autoscaler components to be installed in the cluster)
kubectl apply -f - <<EOF
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: <name>-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: <name>
  updatePolicy:
    updateMode: "Auto"
EOF

07 Prevention

