Debugging a 500 Internal Server Error in Kubernetes: A Detective Story
Sometimes the best debugging sessions are the ones that take you down unexpected paths. What started as a simple “500 ISE” turned into a deep dive through Kubernetes probes, node resource allocation, and graceful shutdown mechanics. Here’s how it unfolded.
The Problem
Our self-hosted Langfuse instance on Google Kubernetes Engine was throwing 500 Internal Server Errors. Users couldn’t access the platform, and the pod was stuck in a crash loop.
langfuse-web-767f9598c-dsxfl 0/1 CrashLoopBackOff 19 (5m ago) 55m
19 restarts in under an hour. Something was very wrong.
Following the Trail
Step 1: Check the Obvious
First, let’s see what the pod is telling us:
kubectl logs -n langfuse langfuse-web-767f9598c-dsxfl --tail=50
The logs showed the app starting up fine:
✓ Ready in 14.7s
2025-12-23T04:38:52.197Z info MCP feature registered: prompts
Signal: SIGTERM
SIGTERM / SIGINT received. Shutting down in 110 seconds.
Wait - the app starts successfully, then immediately receives SIGTERM? That’s Kubernetes killing the container. But why?
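One aside before digging further: for a crash-looping pod, the --previous flag pulls the logs of the last terminated container rather than the current one, which is handy when the live container hasn't produced anything useful yet:
# Logs from the previous (crashed) container instance
kubectl logs -n langfuse langfuse-web-767f9598c-dsxfl --previous --tail=50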
Step 2: Dig into the Events
kubectl describe pod -n langfuse langfuse-web-767f9598c-dsxfl
The events revealed the culprit:
Warning Unhealthy Liveness probe failed: context deadline exceeded
Warning Unhealthy Readiness probe failed: HTTP probe failed with statuscode: 500
Two different failures:
- Liveness probe timing out (taking longer than 5 seconds)
- Readiness probe returning 500
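It also helps to see exactly how those probes are configured. The jsonpath below is a generic way to dump them from the deployment (assuming the web container is the first one in the pod spec):
# Print the liveness and readiness probe configuration for container 0
kubectl get deployment -n langfuse langfuse-web \
  -o jsonpath='{.spec.template.spec.containers[0].livenessProbe}{"\n"}{.spec.template.spec.containers[0].readinessProbe}{"\n"}'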
Step 3: Test the Endpoints Directly
Let’s see what these endpoints actually return:
kubectl exec -n langfuse langfuse-web-767f9598c-dsxfl -- \
wget -qO- http://localhost:3000/api/public/health
# Output: {"status":"OK","version":"3.137.0"}
Health endpoint works! What about ready?
kubectl exec -n langfuse langfuse-web-767f9598c-dsxfl -- \
wget -qO- http://localhost:3000/api/public/ready
# Output: HTTP/1.1 500 Internal Server Error
The health check passes but the readiness check fails with 500. Why would they behave differently?
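(Side note: wget -qO- swallows the status line, so to see the 500 directly you can ask for the response headers, assuming the image's wget build supports -S:)
# -S prints the server response headers, including the status line
kubectl exec -n langfuse langfuse-web-767f9598c-dsxfl -- \
  wget -S -O- http://localhost:3000/api/public/ready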
Step 4: Read the Source Code
A quick look at Langfuse’s ready.ts revealed the answer:
if (isSigtermReceived()) {
  logger.info("Readiness check failed: SIGTERM / SIGINT received, shutting down.");
  return res.status(500).json({
    status: "SIGTERM / SIGINT received, shutting down",
    version: VERSION.replace("v", ""),
  });
}
The ready endpoint checks if the app has received a termination signal. If yes, it returns 500 to tell Kubernetes “don’t send me traffic, I’m shutting down.”
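The pattern behind this is worth knowing even outside Langfuse: a module-level flag flipped by a signal handler, consulted by the readiness handler. A minimal sketch of the idea (illustrative only, not Langfuse's actual implementation):
// Minimal sketch of the shutdown-flag pattern (illustrative, not Langfuse's real code)
let sigtermReceived = false;

for (const signal of ["SIGTERM", "SIGINT"] as const) {
  process.on(signal, () => {
    sigtermReceived = true;
    // a real app would also stop accepting new work and drain in-flight requests here
  });
}

export const isSigtermReceived = (): boolean => sigtermReceived;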
This created a vicious cycle:
- Liveness probe times out
- Kubernetes sends SIGTERM
- App sets sigtermReceived = true
- Ready endpoint returns 500
- Container gets killed and restarted
- Repeat
The root cause wasn’t the 500 - it was the liveness probe timing out in the first place.
Step 5: Why is the Liveness Probe Timing Out?
Time to check resource allocation:
kubectl describe node gk3-langfuse-pool-2-2b69afef-8sqn | grep -A 5 "Allocated resources"
Allocated resources:
Resource Requests Limits
-------- -------- ------
cpu 15327m (96%) 23960m (150%)
memory 61251344256 (99%) 60655043072 (99%)
The node was 99% memory allocated and 150% CPU overcommitted. The langfuse-web pod was being starved of resources, causing it to respond slowly to health checks.
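Requests and limits tell you what is reserved; kubectl top shows what is actually being consumed (it needs metrics-server, which GKE ships by default):
# Live resource usage, to compare against the allocations above
kubectl top nodes
kubectl top pods -n langfuse --sort-by=memory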
The Fix
Immediate Relief: Move the Pod
First, I needed to get the pod onto a less loaded node. This is where I discovered two useful kubectl commands:
kubectl cordon - Mark a node as unschedulable. New pods won’t be placed here, but existing pods keep running.
kubectl cordon gk3-langfuse-pool-2-2b69afef-8sqn
# node/gk3-langfuse-pool-2-2b69afef-8sqn cordoned
kubectl uncordon - Remove the unschedulable mark. The node can accept new pods again.
kubectl uncordon gk3-langfuse-pool-2-2b69afef-8sqn
# node/gk3-langfuse-pool-2-2b69afef-8sqn uncordoned
After cordoning the overloaded node, I deleted the pod to force rescheduling:
kubectl delete pod -n langfuse langfuse-web-767f9598c-dsxfl
GKE’s autoscaler kicked in and created a fresh node with plenty of resources.
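The -o wide flag adds a NODE column, which makes it easy to confirm the pod actually landed somewhere else:
# Verify the rescheduled pod ended up on the new node
kubectl get pods -n langfuse -o wide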
But the Problem Persisted
Even on the new node, probes were still timing out. The 5-second timeout simply wasn’t enough for a Next.js app that takes ~17 seconds to fully initialize.
The original probe configuration:
- initialDelaySeconds: 20 (barely enough time)
- timeoutSeconds: 5 (too aggressive)
- failureThreshold: 3 (not enough buffer)
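Put together, that configuration gives the app very little slack: a container is restarted after failureThreshold consecutive failures, probed every periodSeconds (10s by default), each with timeoutSeconds to answer. Roughly reconstructed (the httpGet path and port are assumptions based on the endpoints tested earlier):
# Approximate original liveness probe, reconstructed from the values above.
# path/port are assumptions based on the endpoints tested earlier.
livenessProbe:
  httpGet:
    path: /api/public/health
    port: 3000
  initialDelaySeconds: 20
  timeoutSeconds: 5      # any response slower than 5s counts as a failure
  failureThreshold: 3    # 3 failures at the default 10s period ≈ ~30s before a restart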
The Real Fix: Adjust Probe Settings
kubectl patch deployment -n langfuse langfuse-web --type='json' -p='[
{"op": "replace", "path": "/spec/template/spec/containers/0/livenessProbe/initialDelaySeconds", "value": 60},
{"op": "replace", "path": "/spec/template/spec/containers/0/livenessProbe/timeoutSeconds", "value": 15},
{"op": "replace", "path": "/spec/template/spec/containers/0/livenessProbe/failureThreshold", "value": 5},
{"op": "replace", "path": "/spec/template/spec/containers/0/readinessProbe/initialDelaySeconds", "value": 30},
{"op": "replace", "path": "/spec/template/spec/containers/0/readinessProbe/timeoutSeconds", "value": 15},
{"op": "replace", "path": "/spec/template/spec/containers/0/readinessProbe/failureThreshold", "value": 5}
]'
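Patching the deployment triggers a rolling update; kubectl rollout status waits until the new pod is up:
# Block until the new ReplicaSet is fully rolled out
kubectl rollout status -n langfuse deployment/langfuse-web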
After the rollout:
kubectl get pods -n langfuse -l app=web
# NAME READY STATUS RESTARTS AGE
# langfuse-web-5b5db7d8bb-wb78t 1/1 Running 0 69s
Finally, 1/1 Ready!
Making it Permanent
To persist this fix through future Terraform applies, I added the probe settings to the Helm values:
langfuse:
  livenessProbe:
    initialDelaySeconds: 60
    timeoutSeconds: 15
    failureThreshold: 5
  readinessProbe:
    initialDelaySeconds: 30
    timeoutSeconds: 15
    failureThreshold: 5
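If you ever need to apply the same values outside Terraform, the equivalent Helm invocation looks roughly like this (release, chart, and values-file names are assumptions; adjust to your setup):
# Hypothetical direct Helm equivalent of what Terraform applies
helm upgrade langfuse langfuse/langfuse -n langfuse -f values.yaml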
Key Takeaways
1. The 500 Error Was a Symptom, Not the Cause
The actual error message was misleading. The 500 from /api/public/ready was the app correctly reporting “I’m shutting down” - the real problem was why it was shutting down in the first place.
2. Probe Timeouts Need to Match Your App’s Reality
Default probe settings (5s timeout) work for lightweight apps, but heavier frameworks like Next.js need more breathing room, especially during startup.
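Kubernetes also has a dedicated tool for exactly this situation: a startupProbe holds off the liveness probe until it first succeeds, so a slow boot doesn't eat into the steady-state failure budget. A sketch (not what I used for this fix; path and port again assumed):
# Alternative approach: give slow-starting apps their own startup budget
startupProbe:
  httpGet:
    path: /api/public/health   # assumed, as above
    port: 3000
  periodSeconds: 10
  failureThreshold: 12         # up to 12 x 10s = 2 minutes to come up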
3. Node Resource Pressure is Silent but Deadly
The pod wasn’t OOMKilled or showing obvious resource errors. It was just… slow. Slow enough that health checks failed. Always check kubectl describe node when debugging mysterious slowness.
4. Cordon and Uncordon are Your Friends
These commands let you control pod placement without disrupting running workloads:
# Stop new pods from being scheduled on a node
kubectl cordon <node-name>
# Allow scheduling again
kubectl uncordon <node-name>
# Drain a node (cordon + evict all pods)
kubectl drain <node-name> --ignore-daemonsets
5. Always Check the Source Code
When debugging why an endpoint returns an unexpected status code, reading the actual implementation beats guessing every time.
The Debugging Checklist
For future reference, when a Kubernetes pod is crash-looping:
- Check logs: kubectl logs <pod> --previous
- Check events: kubectl describe pod <pod>
- Test endpoints directly: kubectl exec <pod> -- curl localhost:port/endpoint
- Check node resources: kubectl describe node <node-name>
- Check pod resource requests/limits: Are they reasonable?
- Review probe settings: Are timeouts appropriate for your app?
Sometimes the fix is simple, but finding it requires following the breadcrumbs wherever they lead.