Streaming Engine to 5k+ Users (7) - Debugging Long-tail Resource Leaks
Test Goal
There is a long-tail resource leak issue that may affect the performance. It happened when the server was idle for a while after a load test without restarting the server.
Conclusion
I fixed the long-tail resource leak issue. No more error logs are shown after the load test.
If you are interested in the testing process, please continue reading.
Observations
Long-tail Resource Leaks
goroutine 23228 [chan receive]:
lzstock/mods/bc11-market-monitor/useCases.(*RedisSubscriptionManager).listenToRedis(0xc000463e90, {0xc00030c350, 0x3}, 0xc00098f7a0?)
/build/mods/bc11-market-monitor/useCases/RedisSubscriptionManager.go:88 +0x65
created by lzstock/mods/bc11-market-monitor/useCases.(*RedisSubscriptionManager).AddSubscription in goroutine 23223
/build/mods/bc11-market-monitor/useCases/RedisSubscriptionManager.go:54 +0x345
goroutine 23206 [select]:
lzstock/mods/bc11-market-monitor/useCases.(*StreamStockPriceUseCase).StreamPriceUpdates(0xc00013eee0, 0xc0004c19e0, {0x114bc30, 0xc0028285f0})
/build/mods/bc11-market-monitor/useCases/StreamStockPrice.go:49 +0x4d0
lzstock/mods/bc11-market-monitor/controllers.(*Controller).StreamPriceUpdates(0xc00304c780?, 0xf4cb40?, {0x114bc30?, 0xc0028285f0?})
/build/mods/bc11-market-monitor/controllers/StockPrice.go:52 +0xd7
lzstock/shared/go/protos/bc11_market_monitor._RealTimePriceMonitoringService_StreamPriceUpdates_Handler({0xea5ce0, 0xc00011a2d8}, {0x114a8b0, 0xc00304c780})
/build/shared/go/protos/bc11_market_monitor/bc11_market_monitor.service.pb.go:474 +0x110
google.golang.org/grpc.(*Server).processStreamingRPC(0xc000478000, {0x1147190, 0xc00295d0b0}, 0xc00125c600, 0xc00047a000, 0x1837b00, 0x0)
/go/pkg/mod/google.golang.org/[email protected]/server.go:1722 +0x12e8
google.golang.org/grpc.(*Server).handleStream(0xc000478000, {0x1147830, 0xc000326000}, 0xc00125c600)
/go/pkg/mod/google.golang.org/[email protected]/server.go:1846 +0xb47
google.golang.org/grpc.(*Server).serveStreams.func2.1()
/go/pkg/mod/google.golang.org/[email protected]/server.go:1061 +0x7f
created by google.golang.org/grpc.(*Server).serveStreams.func2 in goroutine 1582
/go/pkg/mod/google.golang.org/[email protected]/server.go:1072 +0x11d
Root Cause Analysis: Long-tail Resource Leaks
Sender: RedisSubscriptionManager's broadcast loop.
Receiver: gRPC's StreamPriceUpdates.
Owner: sessionManager (because it creates the channel using make(chan)).
I got a mixed suggestions that I should close the channel actively or rely on GC to close it. I decided to leave it out there. Unfortunately, it is not a good idea. That was the reason for this issue.
// Session Manager
// 1. Detach this session's channel from the global Redis listener
for _, ticker := range session.Tickers {
sm.redisSubManager.RemoveSubscription(ticker, sessionID)
}
// 2. Close the channel
// close(session.PriceStream)
The easiest way to fix this issue is to uncomment the line that closes the channel. However, it is a bad smell. Relying on execution order to safely close a channel is a very fragile design
Solution
Never Close Data Channel, use context instead
// SubscribeStockPriceUseCase
func (u *SubscribeStockPriceUseCase) Executes(dto *dtos.StartPriceMonitoring) (err error) {
defer appErrs.Recover(&err)
if len(dto.Req.Tickers) == 0 {
return appErrs.ErrEmptyTickers
}
// Create a monitoring session
session := &session{
ID: uuid.New().String(),
...
PriceStream: make(chan *schema.MarketPrice, 1000),
+ ctx context.Context
+ cancel context.CancelFunc
}
return nil
}
func (sm *sessionManager) deactivateSessionLocked(sessionID string) {
session := sm.sessions[sessionID]
...
+ session.cancel()
...
}
func (u *StreamStockPriceUseCase) StreamPriceUpdates(dto *dtos.StreamPriceUpdates, stream pb.RealTimePriceMonitoringService_StreamPriceUpdatesServer) (err error) {
...
for {
select {
case price := <-session.PriceStream:
...
case <-flushTicker.C:
...
+ case <-session.ctx.Done():
+ return nil
case <-stream.Context().Done():
return return stream.Context().Err()
}
}
Result
Rule: Receiver and Manager should not close the data channel.
Solution: Use RemoveSubscription to block the data source, let PriceStream be collected by GC gracefully. Use context.CancelFunc to replace close, wake up and end the zombie Goroutine stuck in the select.
