Diagnosing container crashes in Kubernetes can be challenging, especially when .NET applications fail silently or enter crash loops. This comprehensive guide walks you through real-world debugging scenarios, covering container lifecycle, health probes, and systematic troubleshooting approaches.
You'll learn how to leverage Kubernetes events, container logs, health probes, and advanced diagnostic tools to quickly identify and resolve issues in your .NET containerized applications.
🔍 1. Introduction
⚠️ The Problem
.NET containers crashing silently in Kubernetes environments is a common challenge that can be difficult to diagnose. Unlike traditional application debugging, containerized environments require a different approach to troubleshooting.
Why traditional logging isn't enough:
- Logs may be lost when containers restart
- Startup failures occur before logging is configured
- Kubernetes events provide crucial context that logs alone don't capture
- Health probe failures can mask underlying application issues
🎯 What You'll Learn
This guide provides an end-to-end diagnosis workflow:
- Interpreting Kubernetes pod states and events
- Extracting meaningful information from container logs
- Configuring and troubleshooting health probes
- Resolving CrashLoopBackOff scenarios
- Using advanced diagnostic tools for live inspection
- Implementing preventive measures and best practices
💥 2. Common Crash Scenarios in .NET Containers
Understanding common failure patterns helps you quickly identify root causes. Here are the most frequent scenarios:
📚 Native Library Mismatches
Missing or incorrect native dependencies can cause immediate container failures:
- Missing .so files: Linux native libraries not included in the container image
- Incorrect RID targeting: Runtime Identifier (RID) mismatch between build and runtime environments
- Architecture mismatches: x64 vs ARM64 incompatibilities
Example Error:
Unhandled exception. System.DllNotFoundException:
Unable to load DLL 'libgdiplus.so' or one of its dependencies
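If a native dependency is suspect, probing for it explicitly at startup turns a confusing first-use crash into an immediate, clearly logged failure. A minimal sketch using NativeLibrary.TryLoad; the library name libgdiplus simply mirrors the example error above:
using System.Runtime.InteropServices;
// Fail fast with an explicit message if the native library is missing from the image.
if (!NativeLibrary.TryLoad("libgdiplus", out var handle))
{
    Console.Error.WriteLine("FATAL: native library 'libgdiplus' was not found in the container image.");
    Environment.Exit(1);
}
NativeLibrary.Free(handle);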
🚀 Startup Exceptions
Configuration errors often manifest during application startup (a fail-fast validation sketch follows this list):
- Misconfigured environment variables: Missing or incorrect values
- Missing secrets: Kubernetes secrets not mounted or accessible
- Database connection failures: Connection strings or network issues
- Invalid configuration files: JSON or XML parsing errors
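Validating required settings before the host starts turns these into explicit, searchable failures in kubectl logs. A minimal sketch, assuming a hypothetical connection string named Default and environment variable API_KEY:
var builder = WebApplication.CreateBuilder(args);
// Fail fast on missing configuration instead of surfacing a vague exception later.
var connectionString = builder.Configuration.GetConnectionString("Default");
if (string.IsNullOrWhiteSpace(connectionString))
    throw new InvalidOperationException("Missing required connection string 'Default'.");
var apiKey = Environment.GetEnvironmentVariable("API_KEY");
if (string.IsNullOrWhiteSpace(apiKey))
    throw new InvalidOperationException("Missing required environment variable 'API_KEY'.");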
💾 Memory Issues
Memory-related problems can cause containers to be terminated (a sketch for observing memory pressure in-process follows this list):
- Memory leaks: Gradual memory consumption increase
- OOMKilled events: Containers exceeding memory limits
- Insufficient memory requests: Containers not allocated enough memory
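You can often see an OOMKilled event coming by logging what the GC observes against the container limit. A minimal sketch using GC.GetGCMemoryInfo (inside a container, TotalAvailableMemoryBytes reflects the cgroup memory limit); the 30-second interval is an arbitrary choice:
// Periodically log managed heap size versus the memory available to the runtime.
var memoryTimer = new System.Threading.Timer(_ =>
{
    var info = GC.GetGCMemoryInfo();
    Console.WriteLine(
        $"Heap: {info.HeapSizeBytes / (1024 * 1024)} MB, " +
        $"Available: {info.TotalAvailableMemoryBytes / (1024 * 1024)} MB, " +
        $"Load: {info.MemoryLoadBytes / (1024 * 1024)} MB");
}, null, TimeSpan.Zero, TimeSpan.FromSeconds(30));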
🏥 Health Probe Misconfigurations
Improperly configured probes can cause false failures:
- Readiness probe failures: Pods marked as not ready, blocking traffic
- Liveness probe failures: Pods being restarted unnecessarily
- Wrong probe endpoints: Incorrect paths or ports
- Timeout issues: Probes timing out before application is ready
🌐 Network and DNS Failures
Network-related issues during service bootstrapping (a DNS check sketch follows this list):
- DNS resolution failures: Service names not resolving
- Network policies: NetworkPolicy rules blocking pod-to-pod or egress connections
- Service discovery issues: Services or endpoints not yet registered when the app starts
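An in-process DNS lookup at startup quickly separates resolution problems from connectivity or NetworkPolicy problems. A minimal sketch, assuming a hypothetical dependency named my-backend-service:
using System.Linq;
using System.Net;
using System.Net.Sockets;
// Resolve the dependency's service name and log the outcome so DNS failures are obvious in the logs.
try
{
    var addresses = await Dns.GetHostAddressesAsync("my-backend-service");
    Console.WriteLine($"Resolved my-backend-service to {string.Join(", ", addresses.Select(a => a.ToString()))}");
}
catch (SocketException ex)
{
    Console.Error.WriteLine($"DNS resolution failed for my-backend-service: {ex.Message}");
}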
📊 3. Initial Triage: Kubernetes Events and Pod Status
Start your diagnosis by examining pod status and Kubernetes events. These provide the highest-level view of what's happening.
🔎 Inspecting Pod Status
Use kubectl describe to get detailed information about a pod:
kubectl describe pod <pod-name> -n <namespace>
💡 What to Look For:
- Exit Code 0: Normal termination (the process exited on its own; with restartPolicy: Always the pod is still restarted, so check why the app shut down)
- Exit Code 1-255: Application error (check logs); returning deliberate exit codes, as sketched after this list, makes these easier to map
- Exit Code 137: Killed by SIGKILL, most often an OOMKilled container
- Exit Code 139: Killed by SIGSEGV (segmentation fault, frequently a native library issue)
- OOMKilled: Out of memory (increase limits or fix memory leak)
- Error: Container runtime error
- CrashLoopBackOff: Pod restarting repeatedly
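Exit codes are easiest to interpret when the application sets them deliberately. A minimal sketch for a top-level Program.cs; the value 64 is an arbitrary choice for this illustration:
var builder = WebApplication.CreateBuilder(args);
var app = builder.Build();
try
{
    app.Run();
    return 0; // clean shutdown
}
catch (Exception ex)
{
    // Write to stderr so the failure is visible even if logging never got configured.
    Console.Error.WriteLine($"Fatal startup error: {ex}");
    return 64; // surfaces as the container's exit code in kubectl describe pod
}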
📅 Examining Events Timeline
Events provide a chronological view of pod lifecycle:
kubectl get events --sort-by=.metadata.creationTimestamp -n <namespace>
kubectl get events --field-selector involvedObject.name=<pod-name> \
--sort-by=.metadata.creationTimestamp -n <namespace>
kubectl get events --sort-by=.metadata.creationTimestamp -n <namespace> \
> events-$(date +%Y%m%d-%H%M%S).log
⚠️ Audit Tip: Always redirect output to timestamped log files for postmortem analysis. Events are ephemeral (the API server retains them for about an hour by default) and are lost once pruned or after a cluster restart.
📝 4. Deep Dive into Container Logs
Container logs are your primary source of application-level errors. Here's how to extract maximum value from them.
📋 Retrieving Logs
Basic log retrieval commands:
kubectl logs <pod-name> -n <namespace>
kubectl logs <pod-name> -c <container-name> -n <namespace>
kubectl logs -f <pod-name> -n <namespace>
kubectl logs <pod-name> --previous -n <namespace>
kubectl logs --tail=100 <pod-name> -n <namespace>
🐳 Handling Multi-Container Pods
When pods contain multiple containers, you need to specify which container's logs to retrieve:
kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.spec.containers[*].name}'
for container in $(kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.spec.containers[*].name}'); do
  echo "=== Logs from $container ==="
  kubectl logs <pod-name> -c $container -n <namespace>
done
📊 Using .NET Structured Logging
Leverage structured logging for better diagnostics. Configure Serilog or Microsoft.Extensions.Logging to output JSON:
using Microsoft.Extensions.Logging;
var builder = WebApplication.CreateBuilder(args);
builder.Logging.ClearProviders();
// JSON console output; keep Indented = false in production so each log entry stays on a single
// line for log collectors, and enable indentation only for local readability.
builder.Logging.AddJsonConsole(options =>
{
    options.JsonWriterOptions = new System.Text.Json.JsonWriterOptions
    {
        Indented = false
    };
});
var app = builder.Build();
app.Run();
🔍 Decoding Common .NET Exceptions
Understanding exception types helps identify root causes (a sketch for capturing them globally follows this list):
System.TypeLoadException:
- Missing assembly or version mismatch
- Check NuGet package versions and dependencies
- Verify all DLLs are included in the container image
System.DllNotFoundException:
- Missing native library (.so on Linux, .dll on Windows)
- Verify RID targeting matches container architecture
- Check if native dependencies are included in the image
System.Net.Http.HttpRequestException:
- Network connectivity issues
- DNS resolution problems
- Service endpoint not available
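Exceptions that escape to the runtime normally reach stderr, but failures on background threads and unobserved tasks can vanish. Hooking the global handlers keeps their stack traces in the container log; a minimal sketch that writes to stderr (adapt to your logging setup):
using System.Threading.Tasks;
// Ensure crashes and swallowed task failures leave a trace in kubectl logs --previous.
AppDomain.CurrentDomain.UnhandledException += (_, e) =>
    Console.Error.WriteLine($"Unhandled exception: {e.ExceptionObject}");
TaskScheduler.UnobservedTaskException += (_, e) =>
{
    Console.Error.WriteLine($"Unobserved task exception: {e.Exception}");
    e.SetObserved();
};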
🏥 5. Probes: Readiness vs Liveness
Health probes are critical for Kubernetes to understand your application's state. Misconfiguration can cause unnecessary restarts or traffic routing issues.
📖 Definitions and Differences
Readiness Probe:
- Determines if a pod is ready to receive traffic
- If it fails, the pod is removed from Service endpoints
- Does not restart the pod
- Use when the app needs time to initialize (database connections, cache warming, etc.)
Liveness Probe:
- Determines if the application is running correctly
- If it fails, Kubernetes restarts the pod
- Use to detect deadlocks or hung applications
- Should be more lenient than readiness probe
⚙️ How Misconfigured Probes Cause Issues
- Too aggressive liveness probe: Restarts healthy pods unnecessarily
- Too strict readiness probe: Pods never become ready, blocking all traffic
- Wrong timeout values: Probes fail even when the app is healthy
- Incorrect endpoint paths: Probes always fail
📝 YAML Configuration Examples
Proper probe configuration in a Kubernetes deployment:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-net-app
spec:
  selector:
    matchLabels:
      app: my-net-app
  template:
    metadata:
      labels:
        app: my-net-app
    spec:
      containers:
      - name: my-app
        image: myregistry/my-net-app:latest
        ports:
        - containerPort: 8080
        readinessProbe:
          httpGet:
            path: /healthz/ready
            port: 8080
          initialDelaySeconds: 5
          periodSeconds: 10
          timeoutSeconds: 3
          failureThreshold: 3
        livenessProbe:
          httpGet:
            path: /healthz/live
            port: 8080
          initialDelaySeconds: 10
          periodSeconds: 5
          timeoutSeconds: 3
          failureThreshold: 3
💻 Implementing Health Check Endpoints in .NET
Create health check endpoints in your .NET application:
using Microsoft.AspNetCore.Diagnostics.HealthChecks;
using Microsoft.Extensions.Diagnostics.HealthChecks;
var builder = WebApplication.CreateBuilder(args);
// Tag the checks that gate readiness so the /healthz/ready predicate below actually selects them.
// AddDbContextCheck and AddUrlGroup come from the EntityFrameworkCore and AspNetCore.HealthChecks.Uris health-check packages.
builder.Services.AddHealthChecks()
    .AddCheck("self", () => HealthCheckResult.Healthy())
    .AddDbContextCheck<MyDbContext>(tags: new[] { "ready" })
    .AddUrlGroup(new Uri("http://external-service/health"), "external-api", tags: new[] { "ready" });
var app = builder.Build();
// Readiness: run only the checks tagged "ready" (dependencies required before accepting traffic).
app.MapHealthChecks("/healthz/ready", new HealthCheckOptions
{
    Predicate = check => check.Tags.Contains("ready")
});
// Liveness: run no registered checks; responding at all proves the process is alive.
app.MapHealthChecks("/healthz/live", new HealthCheckOptions
{
    Predicate = _ => false
});
app.Run();
💡 Audit Strategy: Log probe responses and status codes to understand probe behavior. Add middleware to log health check requests:
app.Use(async (context, next) =>
{
    if (context.Request.Path.StartsWithSegments("/healthz"))
    {
        var logger = context.RequestServices.GetRequiredService<ILogger<Program>>();
        logger.LogInformation("Health check: {Path} from {Ip}",
            context.Request.Path, context.Connection.RemoteIpAddress);
    }
    await next();
});
🔄 6. CrashLoopBackOff: Root Cause and Recovery
CrashLoopBackOff is a common state indicating a pod is restarting repeatedly. Understanding its mechanics helps you resolve issues quickly.
📚 What CrashLoopBackOff Means
When a pod fails repeatedly, Kubernetes implements an exponential backoff strategy:
- Initial restart: Immediate
- First backoff: 10 seconds
- Subsequent backoffs: 20s, 40s, 80s, 160s (capped at 300s)
- Maximum wait: 5 minutes between restart attempts
⚠️ Interpreting Backoff Timings: Longer backoff periods indicate the pod has been failing for an extended period. Check the restart count and recent events to understand the failure pattern.
🔧 Strategies for Resolution
1. Increase initialDelaySeconds
If your application needs more time to start, increase the initial delay:
livenessProbe:
  httpGet:
    path: /healthz/live
    port: 8080
  initialDelaySeconds: 30
  periodSeconds: 10
2. Add Retry Logic in App Startup
Implement retry logic for external dependencies:
using Microsoft.Data.SqlClient;   // SqlException (System.Data.SqlClient in older apps)
using Polly;
// Exponential backoff for transient SQL failures; logger and dbContext are assumed to come from DI.
var retryPolicy = Policy
    .Handle<SqlException>()
    .WaitAndRetryAsync(
        retryCount: 5,
        sleepDurationProvider: retryAttempt => TimeSpan.FromSeconds(Math.Pow(2, retryAttempt)),
        onRetry: (exception, timeSpan, retryCount, context) =>
        {
            logger.LogWarning("Retry {RetryCount} after {Delay}s", retryCount, timeSpan.TotalSeconds);
        });
await retryPolicy.ExecuteAsync(async () =>
{
    await dbContext.Database.EnsureCreatedAsync();
});
3. Use postStart Lifecycle Hooks
Perform initialization tasks after container start:
lifecycle:
  postStart:
    exec:
      command:
      - /bin/sh
      - -c
      - sleep 10 && echo "Initialization complete"
🔍 Auto-Detect CrashLoopBackOff Pods
Use this script to quickly identify all pods in CrashLoopBackOff state:
kubectl get pods --all-namespaces | grep CrashLoopBackOff
kubectl get pods --all-namespaces --no-headers | grep CrashLoopBackOff |
while read -r namespace name rest; do
  echo "=== $namespace/$name ==="
  kubectl describe pod "$name" -n "$namespace" | grep -A 5 "State:"
  kubectl logs "$name" -n "$namespace" --tail=20
  echo ""
done
🔬 7. Advanced Diagnostics
When standard logs and events aren't enough, use advanced diagnostic techniques to inspect running containers.
⚡ Using kubectl exec for Live Inspection
Execute commands inside running containers:
kubectl exec -it <pod-name> -n <namespace> -- /bin/sh
kubectl exec -it <pod-name> -n <namespace> -- /bin/bash
kubectl exec <pod-name> -n <namespace> -- ps aux
kubectl exec <pod-name> -n <namespace> -- env
kubectl exec <pod-name> -n <namespace> -- ls -la /app
🐳 Mounting Debug Sidecars
Add a debug container to your pod for troubleshooting:
spec:
  containers:
  - name: my-app
    image: my-app:latest
  - name: debug
    image: busybox:latest
    command: ["sleep", "3600"]
    volumeMounts:
    - name: app-volume   # assumes app-volume is declared under spec.volumes and also mounted by my-app
      mountPath: /shared
📁 Inspecting System Directories
Examine system files for additional context:
kubectl exec <pod-name> -- cat /proc/1/status
kubectl exec <pod-name> -- env | sort
kubectl exec <pod-name> -- mount
kubectl exec <pod-name> -- cat /etc/resolv.conf
🛠️ .NET-Specific Diagnostic Tools
Use .NET diagnostic tools inside containers. Note that dotnet tool install requires the .NET SDK in the image; on runtime-only images, copy the tool binaries into the container instead. Global tools install to ~/.dotnet/tools, so call them by full path if that directory is not on PATH.
dotnet-dump
Capture memory dumps for analysis:
kubectl exec <pod-name> -- dotnet tool install -g dotnet-dump
kubectl exec <pod-name> -- dotnet-dump collect -p 1 -o /tmp/core.dmp
kubectl cp <namespace>/<pod-name>:/tmp/core.dmp ./core.dmp
dotnet-trace
Collect tracing information:
kubectl exec <pod-name> -- dotnet tool install -g dotnet-trace
kubectl exec <pod-name> -- dotnet-trace collect -p 1 --format speedscope
dotnet-counters
Monitor live performance counters:
kubectl exec <pod-name> -- dotnet tool install -g dotnet-counters
kubectl exec <pod-name> -- dotnet-counters monitor -p 1 \
  --counters System.Runtime,Microsoft.AspNetCore.Hosting
🛡️ 8. Preventive Measures and Best Practices
Preventing issues is better than diagnosing them. Implement these practices to reduce container crash incidents.
🏥 Use Health Check Endpoints in .NET
Always implement comprehensive health checks:
using Microsoft.Extensions.Diagnostics.HealthChecks;
builder.Services.AddHealthChecks()
    .AddCheck("self", () => HealthCheckResult.Healthy())
    .AddCheck<DatabaseHealthCheck>("database")
    .AddCheck<CacheHealthCheck>("cache")
    .AddCheck<ExternalApiHealthCheck>("external-api");
public class DatabaseHealthCheck : IHealthCheck
{
    private readonly MyDbContext _dbContext;
    public DatabaseHealthCheck(MyDbContext dbContext)
    {
        _dbContext = dbContext;
    }
    public async Task<HealthCheckResult> CheckHealthAsync(
        HealthCheckContext context,
        CancellationToken cancellationToken = default)
    {
        try
        {
            // CanConnectAsync returns false rather than throwing when the database is unreachable.
            var canConnect = await _dbContext.Database.CanConnectAsync(cancellationToken);
            return canConnect
                ? HealthCheckResult.Healthy()
                : HealthCheckResult.Unhealthy("Database connection check returned false");
        }
        catch (Exception ex)
        {
            return HealthCheckResult.Unhealthy("Database connection failed", ex);
        }
    }
}
✅ Validate Native Dependencies During CI
Add checks to your CI/CD pipeline:
name: Validate Native Dependencies
on: [push, pull_request]
jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Build and test
        run: |
          dotnet build
          dotnet test
      - name: Check RID targeting
        run: |
          dotnet publish -r linux-x64 --self-contained
          ldd ./bin/Release/net9.0/linux-x64/publish/MyApp | grep "not found" && exit 1 || exit 0
🧪 Container Startup Smoke Tests
Test containers before deployment:
#!/bin/bash
set -e
IMAGE=$1
PORT=8080
CONTAINER_ID=$(docker run -d -p $PORT:8080 $IMAGE)
sleep 10
curl -f http://localhost:$PORT/healthz/ready || exit 1
curl -f http://localhost:$PORT/healthz/live || exit 1
docker stop $CONTAINER_ID
docker rm $CONTAINER_ID
echo "Smoke tests passed!"
☁️ Infrastructure as Code Integration
Use Terraform or Helm to inject probe configurations:
resource "kubernetes_deployment" "app" {
metadata {
name = "my-app"
}
spec {
template {
spec {
container {
name = "my-app"
image = "my-app:${var.image_tag}"
liveness_probe {
http_get {
path = "/healthz/live"
port = 8080
}
initial_delay_seconds = var.liveness_initial_delay
period_seconds = var.liveness_period
}
readiness_probe {
http_get {
path = "/healthz/ready"
port = 8080
}
initial_delay_seconds = var.readiness_initial_delay
period_seconds = var.readiness_period
}
}
}
}
}
}
💡 Best Practices Summary:
- Always implement health check endpoints
- Test native dependencies in CI/CD
- Run smoke tests before deployment
- Use structured logging for better observability
- Configure appropriate resource limits and requests
- Monitor probe response times and success rates
- Document troubleshooting procedures for common issues
📚 9. Conclusion
Diagnosing .NET container crashes in Kubernetes requires a systematic approach that combines multiple diagnostic techniques. By following the workflow outlined in this guide, you can quickly identify and resolve issues.
🔄 Recap of Diagnostic Flow
- Initial Triage: Check pod status and Kubernetes events
- Container Logs: Examine logs for application-level errors
- Health Probes: Verify probe configuration and endpoints
- CrashLoopBackOff: Understand backoff mechanics and apply appropriate fixes
- Advanced Diagnostics: Use exec and diagnostic tools for deeper inspection
- Prevention: Implement health checks, validation, and monitoring
💡 Key Takeaways:
- Always start with kubectl describe pod and events
- Use structured logging for better diagnostics
- Configure health probes appropriately; avoid overly aggressive liveness settings
- Implement retry logic for external dependencies
- Test containers locally before deployment
- Monitor and log probe responses
🚀 Next Steps
Continue improving your container debugging skills:
- Set up comprehensive monitoring and alerting
- Create diagnostic scripts for common issues
- Document your troubleshooting procedures
- Share knowledge with your team
- Contribute diagnostic tools and scripts to open-source projects