Concurrent Probing and Pprof Profiling
How LZStock prevents Kubernetes false-positive restarts using concurrent health checks, while exposing Prometheus metrics and conditional pprof endpoints.
- Eradicate False-Positive K8s Restarts: Replaced sequential, blocking health checks with a concurrent background probing engine bounded by strict context.WithTimeout budgets, guaranteeing instant Kubernetes probe responses regardless of downstream network degradation.
- Lock-Free State Management: Avoided the fatal "Network I/O Mutex Trap" by isolating dependency pings in background goroutines, acquiring the global lock for mere microseconds only to swap state pointers, ensuring the HTTP /health handler never hangs.
- Secure Production Profiling: Decoupled observability traffic from core business routing by mounting Prometheus metrics and Go pprof endpoints on a dedicated, internal-only HTTP Mux, enabling deep CPU/Memory tracing on demand without exposing critical attack vectors.
The Objective
Kubernetes relies on Liveness and Readiness probes to determine if a Pod should receive traffic or be restarted. If a microservice has 4 dependencies (NATS, Redis, Mongo, Postgres), probing them sequentially could take 10+ seconds during network degradation. This exceeds the Kubernetes probe timeout, causing K8s to falsely assume the Pod is dead and repeatedly restart it (CrashLoopBackOff).
The objective is to build a Concurrent Health Probing Engine that validates all infrastructure dependencies in parallel within a strict 3-second absolute timeout. Furthermore, this module must serve as the central Observability Hub, exposing Prometheus metrics and Go pprof endpoints for real-time CPU/Memory profiling during production incidents.
The Mental Model & Non-Blocking Probes
To prevent blocking the HTTP /health endpoint, the background probing engine writes its results into a localized tempStatus struct. It acquires the global Mutex only for the microseconds needed to swap in the new state, so K8s probes are always answered instantly.
Codebase Anatomy
shared/go/infra
├── healthCheck
│ └── index.go
├── livenessServer
│ └── server.go
└── serviceRegister
├── nats
│ └── nats.go
└── registry
└── registry.go
Core Implementation
The implementation is split into the background Concurrent Checker and the HTTP Observability Router.
The Concurrent, Lock-Free Probing Engine
Notice how the global Mutex (h.mu.Lock()) is never held during network calls. We use a sync.WaitGroup and a strict context.WithTimeout to cap the execution time.
// shared/go/infra/healthCheck/index.go
func (h *HealthCheck) performCheck() {
	// 1. Absolute timeout budget: never exceed K8s probe timeouts
	ctx, cancel := context.WithTimeout(context.Background(), 3*time.Second)
	defer cancel()

	var wg sync.WaitGroup
	var localMu sync.Mutex // Protects tempStatus from concurrent goroutine writes

	// 2. Localized state (prevents locking the global state during network I/O)
	tempStatus := HealthStatus{Status: "healthy", Postgres: true, Redis: true, NATS: true}

	// 3. Concurrent pings
	if h.dependencies.PostgresClient != nil {
		wg.Add(1)
		go func() {
			defer wg.Done()
			sqlDB, err := h.dependencies.PostgresClient.DB()
			if err != nil || sqlDB.PingContext(ctx) != nil {
				localMu.Lock()
				tempStatus.Postgres = false
				tempStatus.Status = "unhealthy"
				localMu.Unlock()
			}
		}()
	}

	// ... (Redis and NATS checks follow the exact same goroutine pattern) ...

	// 4. Wait for all checks; PingContext honors the 3-second deadline,
	// so wg.Wait() cannot block past the budget
	wg.Wait()

	// 5. Atomic global update: lock held for < 1 microsecond
	h.mu.Lock()
	h.status = tempStatus
	h.healthy = (tempStatus.Status == "healthy")
	h.mu.Unlock()
}
The Observability Mux & pprof Profiling
The Liveness Server acts as a dedicated HTTP server running on a separate port (e.g., 8081), ensuring observability traffic does not interfere with the main gRPC/FastHTTP business traffic.
// shared/go/infra/livenessServer/server.go
func (s *LivenessServer) Start() {
	defer s.Stop()
	mux := http.NewServeMux()

	// 1. Core observability
	mux.Handle("/health", http.HandlerFunc(s.healthCheck.HealthHandler))
	mux.Handle("/metrics", promhttp.Handler()) // Expose Prometheus metrics

	// 2. Conditional pprof profiling
	if s.enableDebug {
		log.Printf("Debug mode enabled: mounting pprof endpoints on %s\n", s.Server.Addr)
		mux.HandleFunc("/debug/pprof/", pprof.Index)
		mux.HandleFunc("/debug/pprof/cmdline", pprof.Cmdline)
		mux.HandleFunc("/debug/pprof/profile", pprof.Profile)
		mux.HandleFunc("/debug/pprof/trace", pprof.Trace)
		// Enable deep profiling for block & mutex contention
		runtime.SetBlockProfileRate(1)
		runtime.SetMutexProfileFraction(1)
	} else {
		// Disable profiling in standard production to save CPU cycles
		runtime.SetBlockProfileRate(0)
		runtime.SetMutexProfileFraction(0)
	}

	s.Server.Handler = mux
	if err := s.Server.ListenAndServe(); err != nil && err != http.ErrServerClosed {
		log.Printf("liveness server error: %v", err)
	}
}
Edge Cases & Trade-offs
- Mutex Scope & Network I/O (The Rookie Trap): A common mistake in Go is locking a sync.RWMutex, making 3 sequential database pings, and then unlocking it. If a database hangs, the lock is held indefinitely. When K8s queries /health, the HTTP handler blocks waiting for RLock(), causing the Liveness Probe to timeout and K8s to mercilessly kill the Pod. By isolating the network calls inside goroutines and only locking to swap the final tempStatus, our /health endpoint always returns in constant time, completely immune to DB latency.
- Timeout Budgets (context.WithTimeout): Kubernetes livenessProbe.timeoutSeconds is typically set to 5 seconds. If our internal health check takes 6 seconds, we fail the K8s check. Wrapping all concurrent probes in a strict 3-second context.WithTimeout ensures that even in the worst-case network partition, the internal check concludes and marks the component as "unhealthy" fast enough for K8s to receive the JSON report rather than a dropped connection.
- Pprof Security vs. Observability: Go's net/http/pprof is a godsend for debugging memory leaks and CPU bottlenecks in production. However, exposing it to the public internet is a critical security vulnerability. The Trade-off: We run the Liveness Server on an internal-only port (e.g., 8081) that is not exposed by the K8s Ingress Controller. Furthermore, we wrap the pprof handlers in an enableDebug environment variable toggle, ensuring that trace profiling (which adds overhead) is only activated when SREs actively need to debug the system.
The Outcome
By separating Observability into its own dedicated internal port and enforcing strict, lock-free concurrent health checks, LZStock achieves deep system visibility (Metrics + Pprof) while completely eliminating Kubernetes false-positive restarts caused by cascading database latency.