Diagnosing .NET Container Crashes in Kubernetes
From Logs to Liveness Probes: A Complete Troubleshooting Guide
Diagnosing container crashes in Kubernetes can be challenging, especially when .NET applications fail silently or enter crash loops. This comprehensive guide walks you through real-world debugging scenarios, covering container lifecycle, health probes, and systematic troubleshooting approaches.
You'll learn how to leverage Kubernetes events, container logs, health probes, and advanced diagnostic tools to quickly identify and resolve issues in your .NET containerized applications.
1. Introduction
⚠️ The Problem
.NET containers that crash silently in Kubernetes are a common and frustrating problem. Unlike traditional application debugging, containerized environments demand a different troubleshooting approach.
Why traditional logging isn't enough:
- Logs may be lost when containers restart
- Startup failures occur before logging is configured
- Kubernetes events provide crucial context that logs alone don't capture
- Health probe failures can mask underlying application issues
What You'll Learn
This guide provides an end-to-end diagnosis workflow:
- Interpreting Kubernetes pod states and events
- Extracting meaningful information from container logs
- Configuring and troubleshooting health probes
- Resolving CrashLoopBackOff scenarios
- Using advanced diagnostic tools for live inspection
- Implementing preventive measures and best practices
2. Common Crash Scenarios in .NET Containers
Understanding common failure patterns helps you quickly identify root causes. Here are the most frequent scenarios:
Native Library Mismatches
Missing or incorrect native dependencies can cause immediate container failures:
- Missing .so files: Linux native libraries not included in the container image
- Incorrect RID targeting: Runtime Identifier (RID) mismatch between build and runtime environments
- Architecture mismatches: x64 vs ARM64 incompatibilities
Example Error:
// System.DllNotFoundException or System.TypeLoadException
Unhandled exception. System.DllNotFoundException:
Unable to load DLL 'libgdiplus.so' or one of its dependencies
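A quick way to catch this class of failure before deploying is to run ldd against the native libraries inside the built image. A minimal sketch, assuming the image contains a shell and ldd and that native libraries are published under /app (adjust the image name and path to your own layout):
# Check for unresolved native dependencies inside the built image
docker run --rm --entrypoint /bin/sh myregistry/my-net-app:latest \
  -c 'ldd /app/*.so 2>/dev/null | grep "not found" || echo "no missing native dependencies detected"'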
Startup Exceptions
Configuration errors often manifest during application startup:
- Misconfigured environment variables: Missing or incorrect values
- Missing secrets: Kubernetes secrets not mounted or accessible
- Database connection failures: Connection strings or network issues
- Invalid configuration files: JSON or XML parsing errors
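A sketch of a fail-fast pattern for this class of error: validate required settings before building the host, so the failure is a single clear message in the container logs rather than a stack trace from deep inside startup (the setting names below are placeholders for your own configuration):
// Program.cs - fail fast on missing configuration (setting names are placeholders)
var builder = WebApplication.CreateBuilder(args);

var requiredSettings = new[] { "ConnectionStrings:Default", "ApiKey" };
var missing = requiredSettings
    .Where(key => string.IsNullOrWhiteSpace(builder.Configuration[key]))
    .ToList();

if (missing.Count > 0)
{
    // Write to stderr so the message is captured even before logging is configured
    Console.Error.WriteLine($"Missing required configuration: {string.Join(", ", missing)}");
    Environment.Exit(1);
}

var app = builder.Build();
app.Run();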
Memory Issues
Memory-related problems can cause containers to be terminated:
- Memory leaks: Gradual memory consumption increase
- OOMKilled events: Containers exceeding memory limits
- Insufficient memory requests: Containers not allocated enough memory
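Setting explicit memory requests and limits makes OOMKilled behavior predictable and gives the scheduler accurate information. A minimal container-spec excerpt (the values are illustrative; size them from real measurements of your application):
resources:
  requests:
    memory: "256Mi"
    cpu: "250m"
  limits:
    memory: "512Mi"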
Health Probe Misconfigurations
Improperly configured probes can cause false failures:
- Readiness probe failures: Pods marked as not ready, blocking traffic
- Liveness probe failures: Pods being restarted unnecessarily
- Wrong probe endpoints: Incorrect paths or ports
- Timeout issues: Probes timing out before application is ready
Network and DNS Failures
Network-related issues during service bootstrapping:
- DNS resolution failures: Service names not resolving
- Network policy blocking connections
- Service discovery issues
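To confirm whether DNS or connectivity is the culprit, run resolution and HTTP checks from inside the failing pod. A sketch, assuming the image ships a shell plus nslookup/wget (many minimal images do not; use a debug sidecar or kubectl debug as described in section 7) and using placeholder service names:
# Try to resolve the service name from inside the pod
kubectl exec <pod-name> -n <namespace> -- nslookup my-service.my-namespace.svc.cluster.local
# Try to reach the service port directly
kubectl exec <pod-name> -n <namespace> -- wget -qO- http://my-service.my-namespace.svc.cluster.local:8080/healthz/ready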
3. Initial Triage: Kubernetes Events and Pod Status
Start your diagnosis by examining pod status and Kubernetes events. These provide the highest-level view of what's happening.
Inspecting Pod Status
Use kubectl describe to get detailed information about a pod:
# Describe a specific pod
kubectl describe pod <pod-name> -n <namespace>
# Key fields to examine:
# - State: Current container state
# - Last State: Previous container state
# - Exit Code: Process exit code
# - Reason: Termination reason
# - Restart Count: Number of restarts
What to Look For (a jsonpath shortcut for pulling these fields follows the list):
- Exit Code 0: The process exited cleanly - if the pod keeps restarting, the app is probably terminating instead of running as a long-lived host
- Exit Code 1-255: Application or runtime error (check the logs; 139 usually means a segmentation fault, 134 an abort)
- OOMKilled: Out of memory (increase limits or fix memory leak)
- Error: Container runtime error
- CrashLoopBackOff: Pod restarting repeatedly
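To pull the exit code and termination reason without scanning the full describe output, you can query the pod status directly (a sketch that assumes the first container in the pod is the one of interest):
kubectl get pod <pod-name> -n <namespace> \
  -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}{" "}{.status.containerStatuses[0].lastState.terminated.reason}{"\n"}'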
Examining Events Timeline
Events provide a chronological view of pod lifecycle:
# Get events sorted by creation timestamp
kubectl get events --sort-by=.metadata.creationTimestamp -n <namespace>
# Filter events for a specific pod
kubectl get events --field-selector involvedObject.name=<pod-name> \
--sort-by=.metadata.creationTimestamp -n <namespace>
# Save events to a file for postmortem analysis
kubectl get events --sort-by=.metadata.creationTimestamp -n <namespace> \
> events-$(date +%Y%m%d-%H%M%S).log
⚠️ Audit Tip: Always pipe output to timestamped logs for postmortem analysis. Events are ephemeral and may be lost if the cluster is restarted or events are pruned.
4. Deep Dive into Container Logs
Container logs are your primary source of application-level errors. Here's how to extract maximum value from them.
Retrieving Logs
Basic log retrieval commands:
# Get logs from a pod
kubectl logs <pod-name> -n <namespace>
# Get logs from a specific container in a multi-container pod
kubectl logs <pod-name> -c <container-name> -n <namespace>
# Follow logs in real-time
kubectl logs -f <pod-name> -n <namespace>
# Get logs from previous container instance
kubectl logs <pod-name> --previous -n <namespace>
# Get last 100 lines
kubectl logs --tail=100 <pod-name> -n <namespace>
Handling Multi-Container Pods
When pods contain multiple containers, you need to specify which container's logs to retrieve:
# List containers in a pod
kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.spec.containers[*].name}'
# Get logs from each container
for container in $(kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.spec.containers[*].name}'); do
  echo "=== Logs from $container ==="
  kubectl logs <pod-name> -c $container -n <namespace>
done
Using .NET Structured Logging
Leverage structured logging for better diagnostics. Configure Serilog or Microsoft.Extensions.Logging to output JSON:
// Program.cs - Configure JSON logging
using Microsoft.Extensions.Logging;

var builder = WebApplication.CreateBuilder(args);

// Configure structured JSON logging
builder.Logging.ClearProviders();
builder.Logging.AddConsole();
builder.Logging.AddJsonConsole(options =>
{
    options.JsonWriterOptions = new System.Text.Json.JsonWriterOptions
    {
        Indented = true
    };
});

// Or with Serilog
// builder.Host.UseSerilog((context, config) =>
// {
//     config.WriteTo.Console(new JsonFormatter());
// });

var app = builder.Build();
app.Run();
Decoding Common .NET Exceptions
Understanding exception types helps identify root causes:
System.TypeLoadException:
- Missing assembly or version mismatch
- Check NuGet package versions and dependencies
- Verify all DLLs are included in the container image
System.DllNotFoundException:
- Missing native library (.so on Linux, .dll on Windows)
- Verify RID targeting matches container architecture
- Check if native dependencies are included in the image
System.Net.Http.HttpRequestException:
- Network connectivity issues
- DNS resolution problems
- Service endpoint not available
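Because many of these exceptions are thrown before the logging pipeline exists, it helps to wrap host startup in a top-level try/catch that writes to stderr, so the failure always reaches kubectl logs. A minimal sketch:
// Program.cs - make startup failures visible in container logs (sketch)
try
{
    var builder = WebApplication.CreateBuilder(args);
    // ... register services, logging, health checks ...
    var app = builder.Build();
    app.Run();
}
catch (Exception ex)
{
    // The logging pipeline may not exist yet, so write directly to stderr before exiting
    Console.Error.WriteLine($"Fatal startup error: {ex}");
    Environment.Exit(1);
}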
5. Probes: Readiness vs Liveness
Health probes are critical for Kubernetes to understand your application's state. Misconfiguration can cause unnecessary restarts or traffic routing issues.
Definitions and Differences
Readiness Probe:
- Determines if a pod is ready to receive traffic
- If it fails, the pod is removed from Service endpoints
- Does not restart the pod
- Use when the app needs time to initialize (database connections, cache warming, etc.)
Liveness Probe:
- Determines if the application is running correctly
- If it fails, Kubernetes restarts the pod
- Use to detect deadlocks or hung applications
- Should be more lenient than readiness probe
⚙️ How Misconfigured Probes Cause Issues
- Too aggressive liveness probe: Restarts healthy pods unnecessarily
- Too strict readiness probe: Pods never become ready, blocking all traffic
- Wrong timeout values: Probes fail even when the app is healthy
- Incorrect endpoint paths: Probes always fail
YAML Configuration Examples
Proper probe configuration in a Kubernetes deployment:
# deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-net-app
spec:
  selector:
    matchLabels:
      app: my-net-app
  template:
    metadata:
      labels:
        app: my-net-app
    spec:
      containers:
        - name: my-app
          image: myregistry/my-net-app:latest
          ports:
            - containerPort: 8080
          # Readiness probe - checks if app is ready for traffic
          readinessProbe:
            httpGet:
              path: /healthz/ready
              port: 8080
            initialDelaySeconds: 5
            periodSeconds: 10
            timeoutSeconds: 3
            failureThreshold: 3
          # Liveness probe - checks if app is alive
          livenessProbe:
            httpGet:
              path: /healthz/live
              port: 8080
            initialDelaySeconds: 10
            periodSeconds: 5
            timeoutSeconds: 3
            failureThreshold: 3
Implementing Health Check Endpoints in .NET
Create health check endpoints in your .NET application:
// Program.cs
using Microsoft.AspNetCore.Diagnostics.HealthChecks;
using Microsoft.Extensions.Diagnostics.HealthChecks;

var builder = WebApplication.CreateBuilder(args);

// Add health checks; AddDbContextCheck and AddUrlGroup come from the
// EF Core health checks package and AspNetCore.HealthChecks.Uris respectively
builder.Services.AddHealthChecks()
    .AddCheck("self", () => HealthCheckResult.Healthy())
    .AddDbContextCheck<MyDbContext>(tags: new[] { "ready" })
    .AddUrlGroup(new Uri("http://external-service/health"), name: "external-api", tags: new[] { "ready" });

var app = builder.Build();

// Readiness endpoint - runs only the checks tagged "ready" (external dependencies)
app.MapHealthChecks("/healthz/ready", new HealthCheckOptions
{
    Predicate = check => check.Tags.Contains("ready")
});

// Liveness endpoint - excludes all checks; it only verifies the app can serve requests
app.MapHealthChecks("/healthz/live", new HealthCheckOptions
{
    Predicate = _ => false
});

app.Run();
Audit Strategy: Log probe responses and status codes to understand probe behavior. Add middleware to log health check requests:
// Log health check requests
app.Use(async (context, next) =>
{
    if (context.Request.Path.StartsWithSegments("/healthz"))
    {
        var logger = context.RequestServices.GetRequiredService<ILogger<Program>>();
        logger.LogInformation("Health check: {Path} from {Ip}",
            context.Request.Path, context.Connection.RemoteIpAddress);
    }
    await next();
});
6. CrashLoopBackOff: Root Cause and Recovery
CrashLoopBackOff is a common state indicating a pod is restarting repeatedly. Understanding its mechanics helps you resolve issues quickly.
What CrashLoopBackOff Means
When a pod fails repeatedly, Kubernetes implements an exponential backoff strategy:
- Initial restart: Immediate
- First backoff: 10 seconds
- Subsequent backoffs: 20s, 40s, 80s, 160s (capped at 300s)
- Maximum wait: 5 minutes between restart attempts
⚠️ Interpreting Backoff Timings: Longer backoff periods indicate the pod has been failing for an extended period. Check the restart count and recent events to understand the failure pattern.
Strategies for Resolution
1. Increase initialDelaySeconds
If your application needs more time to start, increase the initial delay:
livenessProbe:
  httpGet:
    path: /healthz/live
    port: 8080
  initialDelaySeconds: 30  # Increased from 10
  periodSeconds: 10
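If startup time varies widely, a startup probe is often a better fit than a large initialDelaySeconds: liveness and readiness checks are held off until the startup probe succeeds, so slow starts don't trigger restarts. A sketch reusing the same endpoint:
startupProbe:
  httpGet:
    path: /healthz/live
    port: 8080
  periodSeconds: 5
  failureThreshold: 30  # allows up to ~150 seconds for startup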
2. Add Retry Logic in App Startup
Implement retry logic for external dependencies:
// Program.cs - Retry logic for database connection (uses the Polly package)
// At the top of the file:
// using Microsoft.Data.SqlClient;
// using Polly;

// 'logger' and 'dbContext' are assumed to be available from the DI container
var retryPolicy = Policy
    .Handle<SqlException>()
    .WaitAndRetryAsync(
        retryCount: 5,
        sleepDurationProvider: retryAttempt => TimeSpan.FromSeconds(Math.Pow(2, retryAttempt)),
        onRetry: (exception, timeSpan, retryCount, context) =>
        {
            logger.LogWarning("Retry {RetryCount} after {Delay}s", retryCount, timeSpan.TotalSeconds);
        });

await retryPolicy.ExecuteAsync(async () =>
{
    // Initialize database connection
    await dbContext.Database.EnsureCreatedAsync();
});
3. Use postStart Lifecycle Hooks
Perform initialization tasks after container start:
lifecycle:
  postStart:
    exec:
      command:
        - /bin/sh
        - -c
        - sleep 10 && echo "Initialization complete"
Auto-Detect CrashLoopBackOff Pods
Use this script to quickly identify all pods in CrashLoopBackOff state:
#!/bin/bash
# Find all CrashLoopBackOff pods
kubectl get pods --all-namespaces | grep CrashLoopBackOff

# Get detailed information for each (namespace and name arrive as tab-separated pairs)
kubectl get pods --all-namespaces \
  -o jsonpath='{range .items[?(@.status.containerStatuses[*].state.waiting.reason=="CrashLoopBackOff")]}{.metadata.namespace}{"\t"}{.metadata.name}{"\n"}{end}' |
while read -r namespace name; do
  echo "=== $namespace/$name ==="
  kubectl describe pod "$name" -n "$namespace" | grep -A 5 "State:"
  kubectl logs "$name" -n "$namespace" --tail=20
  echo ""
done
7. Advanced Diagnostics
When standard logs and events aren't enough, use advanced diagnostic techniques to inspect running containers.
⚡ Using kubectl exec for Live Inspection
Execute commands inside running containers:
# Open an interactive shell
kubectl exec -it <pod-name> -n <namespace> -- /bin/sh
# Or for bash
kubectl exec -it <pod-name> -n <namespace> -- /bin/bash
# Run a specific command
kubectl exec <pod-name> -n <namespace> -- ps aux
kubectl exec <pod-name> -n <namespace> -- env
kubectl exec <pod-name> -n <namespace> -- ls -la /app
Mounting Debug Sidecars
Add a debug container to your pod for troubleshooting:
spec:
  containers:
    - name: my-app
      image: my-app:latest
    - name: debug
      image: busybox:latest
      command: ["sleep", "3600"]
      volumeMounts:
        - name: app-volume
          mountPath: /shared
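If you can't (or don't want to) modify the pod spec, kubectl debug can attach an ephemeral debug container to the running pod instead (available on reasonably recent Kubernetes versions; the target container name my-app is an assumption):
# Attach an ephemeral busybox container that shares the target container's process namespace
kubectl debug -it <pod-name> -n <namespace> --image=busybox:latest --target=my-app -- /bin/sh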
Inspecting System Directories
Examine system files for additional context:
# Check process information
kubectl exec <pod-name> -- cat /proc/1/status
# View environment variables
kubectl exec <pod-name> -- env | sort
# Check mounted volumes
kubectl exec <pod-name> -- mount
# Inspect network configuration
kubectl exec <pod-name> -- cat /etc/resolv.conf
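For suspected memory issues it's also worth checking the cgroup limit the container actually sees; the path differs between cgroup v1 and v2 nodes:
# cgroup v2
kubectl exec <pod-name> -- cat /sys/fs/cgroup/memory.max
# cgroup v1
kubectl exec <pod-name> -- cat /sys/fs/cgroup/memory/memory.limit_in_bytes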
.NET-Specific Diagnostic Tools
Use .NET diagnostic tools inside containers. Note that runtime images normally don't include the .NET SDK or these tools, so you'll need an SDK-based image, a sidecar that carries the tools, or to copy the tool binaries into the container first:
dotnet-dump
Capture memory dumps for analysis:
# Install dotnet-dump in the container (requires the .NET SDK in the image; global tools land in ~/.dotnet/tools)
kubectl exec <pod-name> -- dotnet tool install -g dotnet-dump
# Capture a dump of the app process (PID 1); use the tool's full path if ~/.dotnet/tools is not on PATH
kubectl exec <pod-name> -- dotnet-dump collect -p 1
# Copy the dump out of the container (kubectl cp does not expand wildcards - use the exact file name reported by collect)
kubectl cp <namespace>/<pod-name>:<dump-file> ./core.dmp
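Once the dump has been copied out, analyze it locally with dotnet-dump; no cluster access is needed at this point:
# Open the dump in the interactive analyzer
dotnet-dump analyze ./core.dmp
# Useful commands at the analyze prompt:
#   clrthreads      - list managed threads
#   pe              - print the exception on the current thread
#   dumpheap -stat  - summarize the managed heap (useful for leak hunting)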
dotnet-trace
Collect tracing information:
# Install dotnet-trace
kubectl exec <pod-name> -- dotnet tool install -g dotnet-trace
# Collect trace
kubectl exec <pod-name> -- dotnet-trace collect -p 1 --format speedscope
dotnet-counters
Monitor performance counters:
# Monitor counters in real-time
kubectl exec <pod-name> -- dotnet-counters monitor -p 1 \
--counters System.Runtime,Microsoft.AspNetCore.Hosting
8. Preventive Measures and Best Practices
Preventing issues is better than diagnosing them. Implement these practices to reduce container crash incidents.
Use Health Check Endpoints in .NET
Always implement comprehensive health checks:
// Program.cs - Comprehensive health checks
// At the top of the file: using Microsoft.Extensions.Diagnostics.HealthChecks;

builder.Services.AddHealthChecks()
    .AddCheck("self", () => HealthCheckResult.Healthy())
    .AddCheck<DatabaseHealthCheck>("database")
    .AddCheck<CacheHealthCheck>("cache")
    .AddCheck<ExternalApiHealthCheck>("external-api");

// Custom health check implementation
public class DatabaseHealthCheck : IHealthCheck
{
    private readonly MyDbContext _dbContext;

    public DatabaseHealthCheck(MyDbContext dbContext)
    {
        _dbContext = dbContext;
    }

    public async Task<HealthCheckResult> CheckHealthAsync(
        HealthCheckContext context,
        CancellationToken cancellationToken = default)
    {
        try
        {
            // CanConnectAsync returns false (rather than throwing) for some failure modes
            return await _dbContext.Database.CanConnectAsync(cancellationToken)
                ? HealthCheckResult.Healthy()
                : HealthCheckResult.Unhealthy("Database is not reachable");
        }
        catch (Exception ex)
        {
            return HealthCheckResult.Unhealthy("Database connection failed", ex);
        }
    }
}
✅ Validate Native Dependencies During CI
Add checks to your CI/CD pipeline:
# .github/workflows/validate-native-deps.yml
name: Validate Native Dependencies
on: [push, pull_request]

jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Build and test
        run: |
          dotnet build
          dotnet test
      - name: Check RID targeting
        run: |
          dotnet publish -r linux-x64 --self-contained
          ldd ./bin/Release/net9.0/linux-x64/publish/MyApp | grep "not found" && exit 1 || exit 0
Container Startup Smoke Tests
Test containers before deployment:
#!/bin/bash
# smoke-test.sh
set -e

IMAGE=$1
PORT=8080

# Start container
CONTAINER_ID=$(docker run -d -p $PORT:8080 $IMAGE)

# Always clean up the container, even if a check fails
trap 'docker stop $CONTAINER_ID >/dev/null; docker rm $CONTAINER_ID >/dev/null' EXIT

# Wait for startup
sleep 10

# Test health endpoint
curl -f http://localhost:$PORT/healthz/ready || exit 1

# Test liveness endpoint
curl -f http://localhost:$PORT/healthz/live || exit 1

echo "Smoke tests passed!"
☁️ Infrastructure as Code Integration
Use Terraform or Helm to inject probe configurations:
# terraform - kubernetes_deployment.tf
resource "kubernetes_deployment" "app" {
  metadata {
    name = "my-app"
  }

  spec {
    selector {
      match_labels = {
        app = "my-app"
      }
    }

    template {
      metadata {
        labels = {
          app = "my-app"
        }
      }

      spec {
        container {
          name  = "my-app"
          image = "my-app:${var.image_tag}"

          liveness_probe {
            http_get {
              path = "/healthz/live"
              port = 8080
            }
            initial_delay_seconds = var.liveness_initial_delay
            period_seconds        = var.liveness_period
          }

          readiness_probe {
            http_get {
              path = "/healthz/ready"
              port = 8080
            }
            initial_delay_seconds = var.readiness_initial_delay
            period_seconds        = var.readiness_period
          }
        }
      }
    }
  }
}
Best Practices Summary:
- Always implement health check endpoints
- Test native dependencies in CI/CD
- Run smoke tests before deployment
- Use structured logging for better observability
- Configure appropriate resource limits and requests
- Monitor probe response times and success rates
- Document troubleshooting procedures for common issues
9. Conclusion
Diagnosing .NET container crashes in Kubernetes requires a systematic approach that combines multiple diagnostic techniques. By following the workflow outlined in this guide, you can quickly identify and resolve issues.
Recap of Diagnostic Flow
- Initial Triage: Check pod status and Kubernetes events
- Container Logs: Examine logs for application-level errors
- Health Probes: Verify probe configuration and endpoints
- CrashLoopBackOff: Understand backoff mechanics and apply appropriate fixes
- Advanced Diagnostics: Use exec and diagnostic tools for deeper inspection
- Prevention: Implement health checks, validation, and monitoring
Key Takeaways:
- Always start with kubectl describe pod and events
- Use structured logging for better diagnostics
- Configure health probes appropriately - don't be too aggressive
- Implement retry logic for external dependencies
- Test containers locally before deployment
- Monitor and log probe responses
Next Steps
Continue improving your container debugging skills:
- Set up comprehensive monitoring and alerting
- Create diagnostic scripts for common issues
- Document your troubleshooting procedures
- Share knowledge with your team
- Contribute diagnostic tools and scripts to open-source projects
Resources:
- Kubernetes Documentation: Configure Liveness, Readiness and Startup Probes
- .NET Health Checks: ASP.NET Core Health Checks
- .NET Diagnostics: .NET Diagnostics Documentation