Concurrent Probing and Pprof Profiling
How LZStock prevents Kubernetes false-positive restarts using concurrent health checks, while exposing Prometheus metrics and conditional pprof endpoints.
- Eradicate False-Positive K8s Restarts: Replaced sequential, blocking health checks with a concurrent background probing engine bounded by strict context.WithTimeout budgets, guaranteeing instant Kubernetes probe responses regardless of downstream network degradation.
- Lock-Free State Management: Avoided the fatal "Network I/O Mutex Trap" by isolating dependency pings in background goroutines, acquiring the global lock for mere microseconds only to swap state pointers, ensuring the HTTP /health handler never hangs.
- Secure Production Profiling: Decoupled observability traffic from core business routing by mounting Prometheus metrics and Go pprof endpoints on a dedicated, internal-only HTTP Mux, enabling deep CPU/Memory tracing on demand without exposing critical attack vectors.
The Objective
Kubernetes relies on Liveness and Readiness probes to determine if a Pod should receive traffic or be restarted. If a microservice has 4 dependencies (NATS, Redis, Mongo, Postgres), probing them sequentially could take 10+ seconds during network degradation. This exceeds the Kubernetes probe timeout, causing K8s to falsely assume the Pod is dead and repeatedly restart it (CrashLoopBackOff).
The objective is to build a Concurrent Health Probing Engine that validates all infrastructure dependencies in parallel within a strict 3-second absolute timeout. Furthermore, this module must serve as the central Observability Hub, exposing Prometheus metrics and Go pprof endpoints for real-time CPU/Memory profiling during production incidents.
The Mental Model & Non-Blocking Probes
To prevent blocking the HTTP /health endpoint, the background probing engine writes its results into a localized tempStatus struct. It acquires the global Mutex only for the microseconds needed to swap in the new state, so K8s probes are always answered instantly.
Codebase Anatomy
shared/go/infra
├── healthCheck
│ └── index.go
├── livenessServer
│ └── server.go
└── serviceRegister
├── nats
│ └── nats.go
└── registry
└── registry.go
Core Implementation
The implementation is split into the background Concurrent Checker and the HTTP Observability Router.
The Concurrent, Lock-Free Probing Engine
Notice how the global Mutex (h.mu.Lock()) is never held during network calls. We use a sync.WaitGroup and a strict context.WithTimeout to cap the execution time.
// shared/go/infra/healthCheck/index.go
func (h *HealthCheck) performCheck() {
	// 1. Absolute timeout budget: never exceed K8s probe timeouts
	ctx, cancel := context.WithTimeout(context.Background(), 3*time.Second)
	defer cancel()

	var wg sync.WaitGroup
	var localMu sync.Mutex // Protects tempStatus from concurrent goroutine writes

	// 2. Localized state (prevents locking the global state during network I/O)
	tempStatus := HealthStatus{Status: "healthy", Postgres: true, Redis: true, NATS: true}

	// 3. Concurrent pings
	if h.dependencies.PostgresClient != nil {
		wg.Add(1)
		go func() {
			defer wg.Done()
			sqlDB, err := h.dependencies.PostgresClient.DB()
			if err != nil || sqlDB.PingContext(ctx) != nil {
				localMu.Lock()
				tempStatus.Postgres = false
				tempStatus.Status = "unhealthy"
				localMu.Unlock()
			}
		}()
	}

	// ... (Redis and NATS checks follow the exact same goroutine pattern) ...

	// 4. Wait for all checks; PingContext honors the 3-second deadline,
	// so wg.Wait() cannot block past the budget
	wg.Wait()

	// 5. Atomic global update: lock held for < 1 microsecond
	h.mu.Lock()
	h.status = tempStatus
	h.healthy = (tempStatus.Status == "healthy")
	h.mu.Unlock()
}
The Observability Mux & pprof Profiling
The Liveness Server acts as a dedicated HTTP server running on a separate port (e.g., 8081), ensuring observability traffic does not interfere with the main gRPC/FastHTTP business traffic.
// shared/go/infra/livenessServer/server.go
func (s *LivenessServer) Start() {
	defer s.Stop()
	mux := http.NewServeMux()

	// 1. Core observability
	mux.Handle("/health", http.HandlerFunc(s.healthCheck.HealthHandler))
	mux.Handle("/metrics", promhttp.Handler()) // Expose Prometheus metrics

	// 2. Conditional pprof profiling
	if s.enableDebug {
		log.Printf("Debug mode enabled: mounting pprof endpoints on %s\n", s.Server.Addr)
		mux.HandleFunc("/debug/pprof/", pprof.Index)
		mux.HandleFunc("/debug/pprof/cmdline", pprof.Cmdline)
		mux.HandleFunc("/debug/pprof/profile", pprof.Profile)
		mux.HandleFunc("/debug/pprof/trace", pprof.Trace)
		// Enable deep profiling for block & mutex contention
		runtime.SetBlockProfileRate(1)
		runtime.SetMutexProfileFraction(1)
	} else {
		// Disable profiling in standard production to save CPU cycles
		runtime.SetBlockProfileRate(0)
		runtime.SetMutexProfileFraction(0)
	}

	s.Server.Handler = mux
	if err := s.Server.ListenAndServe(); err != nil && err != http.ErrServerClosed {
		log.Printf("liveness server error: %v", err)
	}
}
Edge Cases & Trade-offs
- Mutex Scope & Network I/O (The Rookie Trap): A common mistake in Go is locking a sync.RWMutex, making 3 sequential database pings, and then unlocking it. If a database hangs, the lock is held indefinitely. When K8s queries /health, the HTTP handler blocks waiting for RLock(), causing the Liveness Probe to timeout and K8s to mercilessly kill the Pod. By isolating the network calls inside goroutines and only locking to swap the final tempStatus, our /health endpoint always returns in constant time, completely immune to DB latency.
- Timeout Budgets (context.WithTimeout): Kubernetes livenessProbe.timeoutSeconds is typically set to 5 seconds. If our internal health check takes 6 seconds, we fail the K8s check. Wrapping all concurrent probes in a strict 3-second context.WithTimeout ensures that even in the worst-case network partition, the internal check concludes and marks the component as "unhealthy" fast enough for K8s to receive the JSON report rather than a dropped connection.
- Pprof Security vs. Observability: Go's net/http/pprof is a godsend for debugging memory leaks and CPU bottlenecks in production. However, exposing it to the public internet is a critical security vulnerability. The Trade-off: We run the Liveness Server on an internal-only port (e.g., 8081) that is not exposed by the K8s Ingress Controller. Furthermore, we wrap the pprof handlers in an enableDebug environment variable toggle, ensuring that trace profiling (which adds overhead) is only activated when SREs actively need to debug the system.
The Outcome
By separating Observability into its own dedicated internal port and enforcing strict, lock-free concurrent health checks, LZStock achieves deep system visibility (Metrics + Pprof) while completely eliminating Kubernetes false-positive restarts caused by cascading database latency.