At-Least-Once Delivery with NATS JetStream
How LZStock guarantees zero message loss and prevents 'Poison Pill' queue blockages using NATS JetStream explicit acknowledgments.
- Eradicate Temporal Coupling: Replaced fragile synchronous HTTP chains with NATS JetStream, leveraging durable consumers and asynchronous event streaming to guarantee zero message loss during downstream database failovers or microservice restarts.
- The Poison Pill Defense: Mitigated infinite queue starvation by intelligently bifurcating error handling: transient errors (e.g., DB timeouts) trigger
Nak()for backoff redelivery, while fatal payload errors (Poison Pills) are explicitlyAck()ed and discarded to keep the pipeline flowing. - Enforcing Idempotent Consumers: Embraced the architectural trade-off of "At-Least-Once" delivery by strictly requiring idempotent downstream database operations (e.g., SQL UPSERTs), immunizing the system against duplicate processing caused by network-induced
Ack()timeouts.
The Objective
In a microservices architecture, synchronous HTTP calls create temporal coupling. If the MarketService calls the NotificationService via HTTP, and the notification database is temporarily down, the market service request fails.
The objective is to decouple these services using an Event Streaming Platform. By leveraging NATS JetStream, we aim to achieve an At-Least-Once delivery guarantee. The messaging system must hold the event safely until the consumer explicitly confirms successful processing, ensuring zero data loss during microservice restarts or database failovers.
The Mental Model & Message Lifecycle
To achieve true reliability, we abandon "Auto-Ack" mechanisms in favor of Explicit Acknowledgments. The consumer must decide the fate of the message based on the exact nature of the processing error.
shared/go/infra
└── jetStream
├── client.go
├── errors.go
├── interceptor.go
├── publish.go
├── stream.go
├── subscribe.go
└── test.go
Core Implementation
The implementation acts as a centralized wrapper around the nats-io/nats.go SDK, enforcing our strict delivery policies across all microservices.
Stream Topology Definition
We define the stream dynamically on startup. Notice the use of InterestPolicy; NATS will only retain messages if there is an active consumer interested in them, optimizing memory usage.
// internal/infra/natsclient/stream.go
func CreateOrUpdateStream(ctx context.Context, jsc jetstream.JetStream, streamName string) {
config := jetstream.StreamConfig{
Name: streamName,
MaxAge: 1 * time.Hour,
MaxBytes: 30 * 1024 * 1024, // 30MB Cap
Replicas: 3, // High Availability across NATS nodes
Subjects: []string{fmt.Sprintf("lzstock.%s.*", streamName)},
Storage: jetstream.MemoryStorage, // Blazing fast, replicated across cluster
Retention: jetstream.InterestPolicy,
}
// Creates or updates the stream definition idempotently
_, err := jsc.CreateOrUpdateStream(ctx, config)
if err != nil {
apperr.ThrowPanic(fmt.Errorf("failed to provision JetStream %s: %w", streamName, err))
}
}
The Resilient Consumer (Poison Pill Defense)
This is the core of our message processing strategy. It enforces jetstream.AckExplicitPolicy and intelligently categorizes errors to prevent queue starvation.
// internal/infra/natsclient/consumer.go
func SubscribeReliable(jsc jetstream.JetStream, subject, streamName, consumerName string, callback func(*nats.Msg) error) {
ctx := context.Background()
// 1. Define a Durable Consumer
// "Durable" means if the pod crashes, NATS remembers exactly which messages it hasn't Acked yet.
consumer, _ := jsc.CreateOrUpdateConsumer(ctx, streamName, jetstream.ConsumerConfig{
Durable: consumerName,
FilterSubject: subject,
AckPolicy: jetstream.AckExplicitPolicy,
DeliverPolicy: jetstream.DeliverAllPolicy,
AckWait: 10 * time.Second, // Timeout before NATS assumes the consumer died
MaxDeliver: 5, // Max retry attempts before moving to Dead Letter Queue
})
// 2. Consume with explicit lifecycle control
consumer.Consume(func(msg jetstream.Msg) {
natsMsg := &nats.Msg{Subject: msg.Subject(), Data: msg.Data()}
err := callback(natsMsg)
if err == nil {
msg.Ack() // Success
return
}
// 3. The Poison Pill Defense
if isTransientError(err) {
log.Printf("[Transient] DB down, Nak() to redeliver: %s", msg.Subject())
msg.Nak() // Ask NATS to send it again
} else {
// If the JSON is malformed, retrying won't fix it.
// We Ack() it to remove it from the queue and send it to an error log.
log.Printf("[Fatal] Poison Pill detected, Ack() to discard: %s", msg.Subject())
msg.Ack()
}
})
}
// isTransientError determines if a retry is mathematically possible
func isTransientError(err error) bool {
// e.g., Network timeouts, DB deadlocks, Rate limits
return errors.Is(err, context.DeadlineExceeded) || errors.Is(err, sql.ErrConnDone)
}
Edge Cases & Trade-offs
- The Poison Pill Problem: A "Poison Pill" is a message that can never be processed successfully (e.g., corrupted JSON payload or invalid schema). If the consumer uses a naive "Nak on any error" strategy, NATS will redeliver the Poison Pill infinitely (or up to MaxDeliver). This blocks the single-threaded consumer from processing all subsequent healthy messages. By explicitly checking isTransientError(err), we ensure that fatal errors are Acked (discarded/logged) to keep the pipeline flowing.
- At-Least-Once vs. Exactly-Once (Idempotency requirement): With explicit Acks, what happens if the consumer successfully writes to the PostgreSQL database, but the network drops right before it calls msg.Ack()? After 10 seconds (AckWait), NATS will assume the consumer died and redeliver the message to another pod. The new pod will process it again. Therefore, the architectural trade-off of At-Least-Once delivery is that every downstream database operation must be Idempotent (using UPSERTs or unique constraint checks) to prevent duplicate data insertion.
- Kafka vs. NATS JetStream: Why not use Kafka? Kafka is the industry standard for event streaming, but it requires massive operational overhead (JVM tuning, Zookeeper/KRaft clusters). NATS is a single, statically compiled Go binary that delivers extremely low latency and built-in Key-Value stores (which LZStock reuses for its Service Registry). For a tightly-coupled Go microservices ecosystem, NATS provides 95% of Kafka's features with 5% of the operational complexity.
The Outcome
By utilizing NATS JetStream with strict, explicit acknowledgment policies and intelligent error categorization, LZStock's microservices can survive database outages and gracefully resume processing millions of asynchronous events without losing a single byte of data or suffering from queue starvation.