Diagnosing .NET Container Crashes in Kubernetes
From Logs to Liveness Probes: A Complete Troubleshooting Guide
Diagnosing container crashes in Kubernetes can be challenging, especially when .NET applications fail silently or enter crash loops. This comprehensive guide walks you through real-world debugging scenarios, covering container lifecycle, health probes, and systematic troubleshooting approaches.
You'll learn how to leverage Kubernetes events, container logs, health probes, and advanced diagnostic tools to quickly identify and resolve issues in your .NET containerized applications.
1. Introduction
⚠️ The Problem
.NET containers that crash silently in Kubernetes are a common and frustrating problem. Unlike traditional application debugging, containerized environments demand a different troubleshooting approach.
Why traditional logging isn't enough:
- Logs may be lost when containers restart
- Startup failures occur before logging is configured
- Kubernetes events provide crucial context that logs alone don't capture
- Health probe failures can mask underlying application issues
What You'll Learn
This guide provides an end-to-end diagnosis workflow:
- Interpreting Kubernetes pod states and events
- Extracting meaningful information from container logs
- Configuring and troubleshooting health probes
- Resolving CrashLoopBackOff scenarios
- Using advanced diagnostic tools for live inspection
- Implementing preventive measures and best practices
2. Common Crash Scenarios in .NET Containers
Understanding common failure patterns helps you quickly identify root causes. Here are the most frequent scenarios:
Native Library Mismatches
Missing or incorrect native dependencies can cause immediate container failures:
- Missing .so files: Linux native libraries not included in the container image
- Incorrect RID targeting: Runtime Identifier (RID) mismatch between build and runtime environments
- Architecture mismatches: x64 vs ARM64 incompatibilities
Example Error:
// System.DllNotFoundException or System.TypeLoadException
Unhandled exception. System.DllNotFoundException:
Unable to load DLL 'libgdiplus.so' or one of its dependencies
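A quick way to catch this class of failure before deploying is to run ldd against the native libraries inside the built image. A minimal sketch, assuming the image contains a shell and ldd and that native libraries are published under /app (adjust the image name and path to your own layout):
# Check for unresolved native dependencies inside the built image
docker run --rm --entrypoint /bin/sh myregistry/my-net-app:latest \
  -c 'ldd /app/*.so 2>/dev/null | grep "not found" || echo "no missing native dependencies detected"'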
Startup Exceptions
Configuration errors often manifest during application startup:
- Misconfigured environment variables: Missing or incorrect values
- Missing secrets: Kubernetes secrets not mounted or accessible
- Database connection failures: Connection strings or network issues
- Invalid configuration files: JSON or XML parsing errors
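A sketch of a fail-fast pattern for this class of error: validate required settings before building the host, so the failure is a single clear message in the container logs rather than a stack trace from deep inside startup (the setting names below are placeholders for your own configuration):
// Program.cs - fail fast on missing configuration (setting names are placeholders)
var builder = WebApplication.CreateBuilder(args);

var requiredSettings = new[] { "ConnectionStrings:Default", "ApiKey" };
var missing = requiredSettings
    .Where(key => string.IsNullOrWhiteSpace(builder.Configuration[key]))
    .ToList();

if (missing.Count > 0)
{
    // Write to stderr so the message is captured even before logging is configured
    Console.Error.WriteLine($"Missing required configuration: {string.Join(", ", missing)}");
    Environment.Exit(1);
}

var app = builder.Build();
app.Run();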
Memory Issues
Memory-related problems can cause containers to be terminated:
- Memory leaks: Gradual memory consumption increase
- OOMKilled events: Containers exceeding memory limits
- Insufficient memory requests: Containers not allocated enough memory
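Setting explicit memory requests and limits makes OOMKilled behavior predictable and gives the scheduler accurate information. A minimal container-spec excerpt (the values are illustrative; size them from real measurements of your application):
resources:
  requests:
    memory: "256Mi"
    cpu: "250m"
  limits:
    memory: "512Mi"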
Health Probe Misconfigurations
Improperly configured probes can cause false failures:
- Readiness probe failures: Pods marked as not ready, blocking traffic
- Liveness probe failures: Pods being restarted unnecessarily
- Wrong probe endpoints: Incorrect paths or ports
- Timeout issues: Probes timing out before application is ready
Network and DNS Failures
Network-related issues during service bootstrapping:
- DNS resolution failures: Service names not resolving
- Network policy blocking connections
- Service discovery issues
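To confirm whether DNS or connectivity is the culprit, run resolution and HTTP checks from inside the failing pod. A sketch, assuming the image ships a shell plus nslookup/wget (many minimal images do not; use a debug sidecar or kubectl debug as described in section 7) and using placeholder service names:
# Try to resolve the service name from inside the pod
kubectl exec <pod-name> -n <namespace> -- nslookup my-service.my-namespace.svc.cluster.local
# Try to reach the service port directly
kubectl exec <pod-name> -n <namespace> -- wget -qO- http://my-service.my-namespace.svc.cluster.local:8080/healthz/ready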
3. Initial Triage: Kubernetes Events and Pod Status
Start your diagnosis by examining pod status and Kubernetes events. These provide the highest-level view of what's happening.
Inspecting Pod Status
Use kubectl describe to get detailed information about a pod:
# Describe a specific pod
kubectl describe pod <pod-name> -n <namespace>
# Key fields to examine:
# - State: Current container state
# - Last State: Previous container state
# - Exit Code: Process exit code
# - Reason: Termination reason
# - Restart Count: Number of restarts
What to Look For (a jsonpath shortcut for pulling these fields follows the list):
- Exit Code 0: The process exited cleanly - if the pod keeps restarting, the app is probably terminating instead of running as a long-lived host
- Exit Code 1-255: Application or runtime error (check the logs; 139 usually means a segmentation fault, 134 an abort)
- OOMKilled: Out of memory (increase limits or fix memory leak)
- Error: Container runtime error
- CrashLoopBackOff: Pod restarting repeatedly
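To pull the exit code and termination reason without scanning the full describe output, you can query the pod status directly (a sketch that assumes the first container in the pod is the one of interest):
kubectl get pod <pod-name> -n <namespace> \
  -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}{" "}{.status.containerStatuses[0].lastState.terminated.reason}{"\n"}'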
Examining Events Timeline
Events provide a chronological view of pod lifecycle:
# Get events sorted by creation timestamp
kubectl get events --sort-by=.metadata.creationTimestamp -n <namespace>
# Filter events for a specific pod
kubectl get events --field-selector involvedObject.name=<pod-name> \
--sort-by=.metadata.creationTimestamp -n <namespace>
# Save events to a file for postmortem analysis
kubectl get events --sort-by=.metadata.creationTimestamp -n <namespace> \
> events-$(date +%Y%m%d-%H%M%S).log
⚠️ Audit Tip: Always pipe output to timestamped logs for postmortem analysis. Events are ephemeral and may be lost if the cluster is restarted or events are pruned.
4. Deep Dive into Container Logs
Container logs are your primary source of application-level errors. Here's how to extract maximum value from them.
Retrieving Logs
Basic log retrieval commands:
# Get logs from a pod
kubectl logs <pod-name> -n <namespace>
# Get logs from a specific container in a multi-container pod
kubectl logs <pod-name> -c <container-name> -n <namespace>
# Follow logs in real-time
kubectl logs -f <pod-name> -n <namespace>
# Get logs from previous container instance
kubectl logs <pod-name> --previous -n <namespace>
# Get last 100 lines
kubectl logs --tail=100 <pod-name> -n <namespace>
Handling Multi-Container Pods
When pods contain multiple containers, you need to specify which container's logs to retrieve:
# List containers in a pod
kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.spec.containers[*].name}'
# Get logs from each container
for container in $(kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.spec.containers[*].name}'); do
  echo "=== Logs from $container ==="
  kubectl logs <pod-name> -c $container -n <namespace>
done
Using .NET Structured Logging
Leverage structured logging for better diagnostics. Configure Serilog or Microsoft.Extensions.Logging to output JSON:
// Program.cs - Configure JSON logging
using Microsoft.Extensions.Logging;

var builder = WebApplication.CreateBuilder(args);

// Configure structured JSON logging
builder.Logging.ClearProviders();
builder.Logging.AddConsole();
builder.Logging.AddJsonConsole(options =>
{
    options.JsonWriterOptions = new System.Text.Json.JsonWriterOptions
    {
        Indented = true
    };
});

// Or with Serilog
// builder.Host.UseSerilog((context, config) =>
// {
//     config.WriteTo.Console(new JsonFormatter());
// });

var app = builder.Build();
app.Run();
Decoding Common .NET Exceptions
Understanding exception types helps identify root causes:
System.TypeLoadException:
- Missing assembly or version mismatch
- Check NuGet package versions and dependencies
- Verify all DLLs are included in the container image
System.DllNotFoundException:
- Missing native library (.so on Linux, .dll on Windows)
- Verify RID targeting matches container architecture
- Check if native dependencies are included in the image
System.Net.Http.HttpRequestException:
- Network connectivity issues
- DNS resolution problems
- Service endpoint not available
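Because many of these exceptions are thrown before the logging pipeline exists, it helps to wrap host startup in a top-level try/catch that writes to stderr, so the failure always reaches kubectl logs. A minimal sketch:
// Program.cs - make startup failures visible in container logs (sketch)
try
{
    var builder = WebApplication.CreateBuilder(args);
    // ... register services, logging, health checks ...
    var app = builder.Build();
    app.Run();
}
catch (Exception ex)
{
    // The logging pipeline may not exist yet, so write directly to stderr before exiting
    Console.Error.WriteLine($"Fatal startup error: {ex}");
    Environment.Exit(1);
}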
5. Probes: Readiness vs Liveness
Health probes are critical for Kubernetes to understand your application's state. Misconfiguration can cause unnecessary restarts or traffic routing issues.
Definitions and Differences
Readiness Probe:
- Determines if a pod is ready to receive traffic
- If it fails, the pod is removed from Service endpoints
- Does not restart the pod
- Use when the app needs time to initialize (database connections, cache warming, etc.)
Liveness Probe:
- Determines if the application is running correctly
- If it fails, Kubernetes restarts the pod
- Use to detect deadlocks or hung applications
- Should be more lenient than readiness probe
⚙️ How Misconfigured Probes Cause Issues
- Too aggressive liveness probe: Restarts healthy pods unnecessarily
- Too strict readiness probe: Pods never become ready, blocking all traffic
- Wrong timeout values: Probes fail even when the app is healthy
- Incorrect endpoint paths: Probes always fail
YAML Configuration Examples
Proper probe configuration in a Kubernetes deployment:
# deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-net-app
spec:
  selector:
    matchLabels:
      app: my-net-app
  template:
    metadata:
      labels:
        app: my-net-app
    spec:
      containers:
        - name: my-app
          image: myregistry/my-net-app:latest
          ports:
            - containerPort: 8080
          # Readiness probe - checks if app is ready for traffic
          readinessProbe:
            httpGet:
              path: /healthz/ready
              port: 8080
            initialDelaySeconds: 5
            periodSeconds: 10
            timeoutSeconds: 3
            failureThreshold: 3
          # Liveness probe - checks if app is alive
          livenessProbe:
            httpGet:
              path: /healthz/live
              port: 8080
            initialDelaySeconds: 10
            periodSeconds: 5
            timeoutSeconds: 3
            failureThreshold: 3
Implementing Health Check Endpoints in .NET
Create health check endpoints in your .NET application:
// Program.cs
using Microsoft.AspNetCore.Diagnostics.HealthChecks;
using Microsoft.Extensions.Diagnostics.HealthChecks;

var builder = WebApplication.CreateBuilder(args);

// Add health checks; AddDbContextCheck and AddUrlGroup come from the
// EF Core health checks package and AspNetCore.HealthChecks.Uris respectively
builder.Services.AddHealthChecks()
    .AddCheck("self", () => HealthCheckResult.Healthy())
    .AddDbContextCheck<MyDbContext>(tags: new[] { "ready" })
    .AddUrlGroup(new Uri("http://external-service/health"), name: "external-api", tags: new[] { "ready" });

var app = builder.Build();

// Readiness endpoint - runs only the checks tagged "ready" (external dependencies)
app.MapHealthChecks("/healthz/ready", new HealthCheckOptions
{
    Predicate = check => check.Tags.Contains("ready")
});

// Liveness endpoint - excludes all checks; it only verifies the app can serve requests
app.MapHealthChecks("/healthz/live", new HealthCheckOptions
{
    Predicate = _ => false
});

app.Run();
Audit Strategy: Log probe responses and status codes to understand probe behavior. Add middleware to log health check requests:
// Log health check requests
app.Use(async (context, next) =>
{
    if (context.Request.Path.StartsWithSegments("/healthz"))
    {
        var logger = context.RequestServices.GetRequiredService<ILogger<Program>>();
        logger.LogInformation("Health check: {Path} from {Ip}",
            context.Request.Path, context.Connection.RemoteIpAddress);
    }
    await next();
});
6. CrashLoopBackOff: Root Cause and Recovery
CrashLoopBackOff is a common state indicating a pod is restarting repeatedly. Understanding its mechanics helps you resolve issues quickly.
What CrashLoopBackOff Means
When a pod fails repeatedly, Kubernetes implements an exponential backoff strategy:
- Initial restart: Immediate
- First backoff: 10 seconds
- Subsequent backoffs: 20s, 40s, 80s, 160s (capped at 300s)
- Maximum wait: 5 minutes between restart attempts
⚠️ Interpreting Backoff Timings: Longer backoff periods indicate the pod has been failing for an extended period. Check the restart count and recent events to understand the failure pattern.
Strategies for Resolution
1. Increase initialDelaySeconds
If your application needs more time to start, increase the initial delay:
livenessProbe:
  httpGet:
    path: /healthz/live
    port: 8080
  initialDelaySeconds: 30  # Increased from 10
  periodSeconds: 10
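If startup time varies widely, a startup probe is often a better fit than a large initialDelaySeconds: liveness and readiness checks are held off until the startup probe succeeds, so slow starts don't trigger restarts. A sketch reusing the same endpoint:
startupProbe:
  httpGet:
    path: /healthz/live
    port: 8080
  periodSeconds: 5
  failureThreshold: 30  # allows up to ~150 seconds for startup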
2. Add Retry Logic in App Startup
Implement retry logic for external dependencies:
// Program.cs - Retry logic for database connection (uses the Polly package)
// At the top of the file:
// using Microsoft.Data.SqlClient;
// using Polly;

// 'logger' and 'dbContext' are assumed to be available from the DI container
var retryPolicy = Policy
    .Handle<SqlException>()
    .WaitAndRetryAsync(
        retryCount: 5,
        sleepDurationProvider: retryAttempt => TimeSpan.FromSeconds(Math.Pow(2, retryAttempt)),
        onRetry: (exception, timeSpan, retryCount, context) =>
        {
            logger.LogWarning("Retry {RetryCount} after {Delay}s", retryCount, timeSpan.TotalSeconds);
        });

await retryPolicy.ExecuteAsync(async () =>
{
    // Initialize database connection
    await dbContext.Database.EnsureCreatedAsync();
});
3. Use postStart Lifecycle Hooks
Perform initialization tasks after container start:
lifecycle:
  postStart:
    exec:
      command:
        - /bin/sh
        - -c
        - sleep 10 && echo "Initialization complete"
Auto-Detect CrashLoopBackOff Pods
Use this script to quickly identify all pods in CrashLoopBackOff state:
#!/bin/bash
# Find all CrashLoopBackOff pods
kubectl get pods --all-namespaces | grep CrashLoopBackOff

# Get detailed information for each (namespace and name arrive as tab-separated pairs)
kubectl get pods --all-namespaces \
  -o jsonpath='{range .items[?(@.status.containerStatuses[*].state.waiting.reason=="CrashLoopBackOff")]}{.metadata.namespace}{"\t"}{.metadata.name}{"\n"}{end}' |
while read -r namespace name; do
  echo "=== $namespace/$name ==="
  kubectl describe pod "$name" -n "$namespace" | grep -A 5 "State:"
  kubectl logs "$name" -n "$namespace" --tail=20
  echo ""
done
7. Advanced Diagnostics
When standard logs and events aren't enough, use advanced diagnostic techniques to inspect running containers.
⚡ Using kubectl exec for Live Inspection
Execute commands inside running containers:
# Open an interactive shell
kubectl exec -it <pod-name> -n <namespace> -- /bin/sh
# Or for bash
kubectl exec -it <pod-name> -n <namespace> -- /bin/bash
# Run a specific command
kubectl exec <pod-name> -n <namespace> -- ps aux
kubectl exec <pod-name> -n <namespace> -- env
kubectl exec <pod-name> -n <namespace> -- ls -la /app
Mounting Debug Sidecars
Add a debug container to your pod for troubleshooting:
spec:
  containers:
    - name: my-app
      image: my-app:latest
    - name: debug
      image: busybox:latest
      command: ["sleep", "3600"]
      volumeMounts:
        - name: app-volume
          mountPath: /shared
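If you can't (or don't want to) modify the pod spec, kubectl debug can attach an ephemeral debug container to the running pod instead (available on reasonably recent Kubernetes versions; the target container name my-app is an assumption):
# Attach an ephemeral busybox container that shares the target container's process namespace
kubectl debug -it <pod-name> -n <namespace> --image=busybox:latest --target=my-app -- /bin/sh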
Inspecting System Directories
Examine system files for additional context:
# Check process information
kubectl exec <pod-name> -- cat /proc/1/status
# View environment variables
kubectl exec <pod-name> -- env | sort
# Check mounted volumes
kubectl exec <pod-name> -- mount
# Inspect network configuration
kubectl exec <pod-name> -- cat /etc/resolv.conf
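For suspected memory issues it's also worth checking the cgroup limit the container actually sees; the path differs between cgroup v1 and v2 nodes:
# cgroup v2
kubectl exec <pod-name> -- cat /sys/fs/cgroup/memory.max
# cgroup v1
kubectl exec <pod-name> -- cat /sys/fs/cgroup/memory/memory.limit_in_bytes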
.NET-Specific Diagnostic Tools
Use .NET diagnostic tools inside containers. Note that runtime images normally don't include the .NET SDK or these tools, so you'll need an SDK-based image, a sidecar that carries the tools, or to copy the tool binaries into the container first:
dotnet-dump
Capture memory dumps for analysis:
# Install dotnet-dump in the container (requires the .NET SDK in the image; global tools land in ~/.dotnet/tools)
kubectl exec <pod-name> -- dotnet tool install -g dotnet-dump
# Capture a dump of the app process (PID 1); use the tool's full path if ~/.dotnet/tools is not on PATH
kubectl exec <pod-name> -- dotnet-dump collect -p 1
# Copy the dump out of the container (kubectl cp does not expand wildcards - use the exact file name reported by collect)
kubectl cp <namespace>/<pod-name>:<dump-file> ./core.dmp
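Once the dump has been copied out, analyze it locally with dotnet-dump; no cluster access is needed at this point:
# Open the dump in the interactive analyzer
dotnet-dump analyze ./core.dmp
# Useful commands at the analyze prompt:
#   clrthreads      - list managed threads
#   pe              - print the exception on the current thread
#   dumpheap -stat  - summarize the managed heap (useful for leak hunting)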
dotnet-trace
Collect tracing information:
# Install dotnet-trace
kubectl exec <pod-name> -- dotnet tool install -g dotnet-trace
# Collect trace
kubectl exec <pod-name> -- dotnet-trace collect -p 1 --format speedscope
dotnet-counters
Monitor performance counters:
# Monitor counters in real-time
kubectl exec <pod-name> -- dotnet-counters monitor -p 1 \
--counters System.Runtime,Microsoft.AspNetCore.Hosting
8. Preventive Measures and Best Practices
Preventing issues is better than diagnosing them. Implement these practices to reduce container crash incidents.
Use Health Check Endpoints in .NET
Always implement comprehensive health checks:
// Program.cs - Comprehensive health checks
// At the top of the file: using Microsoft.Extensions.Diagnostics.HealthChecks;

builder.Services.AddHealthChecks()
    .AddCheck("self", () => HealthCheckResult.Healthy())
    .AddCheck<DatabaseHealthCheck>("database")
    .AddCheck<CacheHealthCheck>("cache")
    .AddCheck<ExternalApiHealthCheck>("external-api");

// Custom health check implementation
public class DatabaseHealthCheck : IHealthCheck
{
    private readonly MyDbContext _dbContext;

    public DatabaseHealthCheck(MyDbContext dbContext)
    {
        _dbContext = dbContext;
    }

    public async Task<HealthCheckResult> CheckHealthAsync(
        HealthCheckContext context,
        CancellationToken cancellationToken = default)
    {
        try
        {
            // CanConnectAsync returns false (rather than throwing) for some failure modes
            return await _dbContext.Database.CanConnectAsync(cancellationToken)
                ? HealthCheckResult.Healthy()
                : HealthCheckResult.Unhealthy("Database is not reachable");
        }
        catch (Exception ex)
        {
            return HealthCheckResult.Unhealthy("Database connection failed", ex);
        }
    }
}
✅ Validate Native Dependencies During CI
Add checks to your CI/CD pipeline:
# .github/workflows/validate-native-deps.yml
name: Validate Native Dependencies
on: [push, pull_request]

jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Build and test
        run: |
          dotnet build
          dotnet test
      - name: Check RID targeting
        run: |
          dotnet publish -r linux-x64 --self-contained
          ldd ./bin/Release/net9.0/linux-x64/publish/MyApp | grep "not found" && exit 1 || exit 0
Container Startup Smoke Tests
Test containers before deployment:
#!/bin/bash
# smoke-test.sh
set -e

IMAGE=$1
PORT=8080

# Start container
CONTAINER_ID=$(docker run -d -p $PORT:8080 $IMAGE)

# Always clean up the container, even if a check fails
trap 'docker stop $CONTAINER_ID >/dev/null; docker rm $CONTAINER_ID >/dev/null' EXIT

# Wait for startup
sleep 10

# Test health endpoint
curl -f http://localhost:$PORT/healthz/ready || exit 1

# Test liveness endpoint
curl -f http://localhost:$PORT/healthz/live || exit 1

echo "Smoke tests passed!"
☁️ Infrastructure as Code Integration
Use Terraform or Helm to inject probe configurations:
# terraform - kubernetes_deployment.tf
resource "kubernetes_deployment" "app" {
  metadata {
    name = "my-app"
  }

  spec {
    selector {
      match_labels = {
        app = "my-app"
      }
    }

    template {
      metadata {
        labels = {
          app = "my-app"
        }
      }

      spec {
        container {
          name  = "my-app"
          image = "my-app:${var.image_tag}"

          liveness_probe {
            http_get {
              path = "/healthz/live"
              port = 8080
            }
            initial_delay_seconds = var.liveness_initial_delay
            period_seconds        = var.liveness_period
          }

          readiness_probe {
            http_get {
              path = "/healthz/ready"
              port = 8080
            }
            initial_delay_seconds = var.readiness_initial_delay
            period_seconds        = var.readiness_period
          }
        }
      }
    }
  }
}
Best Practices Summary:
- Always implement health check endpoints
- Test native dependencies in CI/CD
- Run smoke tests before deployment
- Use structured logging for better observability
- Configure appropriate resource limits and requests
- Monitor probe response times and success rates
- Document troubleshooting procedures for common issues
9. Conclusion
Diagnosing .NET container crashes in Kubernetes requires a systematic approach that combines multiple diagnostic techniques. By following the workflow outlined in this guide, you can quickly identify and resolve issues.
Recap of Diagnostic Flow
- Initial Triage: Check pod status and Kubernetes events
- Container Logs: Examine logs for application-level errors
- Health Probes: Verify probe configuration and endpoints
- CrashLoopBackOff: Understand backoff mechanics and apply appropriate fixes
- Advanced Diagnostics: Use exec and diagnostic tools for deeper inspection
- Prevention: Implement health checks, validation, and monitoring
Key Takeaways:
- Always start with kubectl describe pod and events
- Use structured logging for better diagnostics
- Configure health probes appropriately - don't be too aggressive
- Implement retry logic for external dependencies
- Test containers locally before deployment
- Monitor and log probe responses
Next Steps
Continue improving your container debugging skills:
- Set up comprehensive monitoring and alerting
- Create diagnostic scripts for common issues
- Document your troubleshooting procedures
- Share knowledge with your team
- Contribute diagnostic tools and scripts to open-source projects
Resources:
- Kubernetes Documentation: Configure Liveness, Readiness and Startup Probes
- .NET Health Checks: ASP.NET Core Health Checks
- .NET Diagnostics: .NET Diagnostics Documentation