Summary
On March 19, 2026, a Delay step bug caused flows to loop forever instead of resuming, flooding Redis with jobs. A simultaneous redeployment of all app servers made things worse — queues backed up and workers threw “No handler for job” errors. We fixed it by deleting the duplicate runs, patching the Delay step, and redeploying servers one at a time.
Impact
- Flows with a Delay step looped forever without completing.
- Other flows failed or were delayed due to the Redis overload.
- All affected executions were replayed once service was restored — no data was lost.
Timeline
All times are in UTC.
- Mar 18 — A code change to the Delay step is deployed, introducing the infinite-loop bug.
- Mar 19, ~12:59 AM — Jobs become stuck in queues. All app servers had been redeployed simultaneously, overloading an already-stressed Redis. Investigation begins.
- Mar 19, ~1:00 AM — “No handler for job” errors identified — workers consuming jobs before handlers are registered. System job concurrency set to zero. Duplicate scheduled runs from the runaway Delay loop deleted. Decision made to deploy one app server at a time.
- Mar 19, ~1:24 AM — A single app server brought up successfully. Processes the entire runs metadata backlog in ~2 minutes. Redis load returns to normal.
- Mar 19, ~1:28 AM — Remaining servers deployed incrementally. Service fully restored.
- Mar 19, ~10:45 AM — Root cause identified and fixed (Delay step bug & Redis overload).
Root Cause
- Delay step infinite loop (primary cause): When a flow hits a Delay step, the system puts the job on hold via BullMQ’s moveToDelayed(). The bug was that the job still carried executionType: BEGIN instead of RESUME. When the delay expired, the worker re-ran the entire flow from the first step, hit the Delay again, paused again, and looped forever — flooding Redis with jobs.
- Simultaneous server startup: All app servers redeployed at once, creating a thundering herd on an already-overloaded Redis.
- Heavy rate limiter Redis usage: The rate limiter added further load.
- System job handler race condition: Workers consumed jobs before handlers were registered, producing “No handler for job” errors.
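The primary cause above can be illustrated with a small sketch. Only `executionType`, the `BEGIN`/`RESUME` values, and BullMQ's `moveToDelayed()` come from this report; the job data shape and the `prepareForDelay` helper are hypothetical names for illustration.

```typescript
// Sketch of the Delay step bug and its fix (job shape is illustrative).
type ExecutionType = 'BEGIN' | 'RESUME';

interface FlowJobData {
  flowId: string;
  executionType: ExecutionType;
  // ...a snapshot of execution state would live here
}

// Buggy behavior: the job was parked with executionType still BEGIN, so the
// worker restarted the flow from the first step when the delay expired.
// The fix: rewrite the job data to RESUME *before* parking the job.
function prepareForDelay(data: FlowJobData): FlowJobData {
  return { ...data, executionType: 'RESUME' };
}

// In the real worker this would be followed by something like:
//   await job.updateData(prepareForDelay(job.data));
//   await job.moveToDelayed(Date.now() + delayMs, token);
// (BullMQ's Job.moveToDelayed takes a resume timestamp and the worker token.)

const parked = prepareForDelay({ flowId: 'flow-1', executionType: 'BEGIN' });
console.log(parked.executionType); // RESUME — the worker continues instead of restarting
```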
Action Items
| Action Item | Status |
|---|---|
| Update job data to RESUME before calling moveToDelayed() | Done |
| Register system job handlers before worker initialization | Done |
| Add test coverage for delay step resume behavior | Done |
| Optimize rate limiter Redis usage (ENG-319) | To do |
| Lower concurrency for runs metadata processing (ENG-318) | To do |
| Prevent a flow from entering infinite state in any way (ENG-320) | Done |
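The handler-registration ordering behind the “No handler for job” errors can be sketched as follows. The registry, `registerHandler`, and `consume` names are hypothetical; the report only establishes that workers must not consume jobs before handlers are registered.

```typescript
// Illustrative sketch of the handler-registration race and the fixed ordering.
type JobHandler = (payload: unknown) => void;

const handlers = new Map<string, JobHandler>();

function registerHandler(jobName: string, handler: JobHandler): void {
  handlers.set(jobName, handler);
}

// Simulates the worker consuming one job. Before the fix, this could run
// while `handlers` was still empty, producing "No handler for job".
function consume(jobName: string): string {
  const handler = handlers.get(jobName);
  if (!handler) throw new Error(`No handler for job: ${jobName}`);
  handler({});
  return 'handled';
}

// Fixed ordering: every module registers its handlers first,
// and the worker starts consuming only after registration completes.
registerHandler('trigger-poll', () => {});
console.log(consume('trigger-poll')); // handled
```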
Improvements Done
- Delay step fix — Updated job data to executionType: RESUME before calling moveToDelayed(), so the worker continues from where the flow left off instead of restarting.
- System job handler registration fix — Split systemJobsSchedule.init() into two phases: init() (creates the queue) and startWorker() (starts consuming jobs). init() runs early so modules can call upsertJob during registration, while startWorker() runs after all handlers are registered (PR #12048).
- Defense-in-depth: RESUME empty state guard — Worker validates that RESUME operations have non-empty execution state. An empty state with RESUME is the exact signature of the original bug and is rejected with a VALIDATION error.
- Defense-in-depth: BEGIN non-empty state assertion — Engine asserts that BEGIN operations have empty execution state. A BEGIN with pre-existing steps would indicate a code regression.
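The two defense-in-depth guards can be sketched together. Only the invariants (RESUME requires non-empty state, BEGIN requires empty state) and the VALIDATION error class come from this report; the function name, state shape, and error messages are illustrative.

```typescript
// Sketch of the two state guards (names and shapes are hypothetical).
type ExecutionType = 'BEGIN' | 'RESUME';

interface ExecutionState {
  steps: Record<string, unknown>;
}

function validateExecutionState(type: ExecutionType, state: ExecutionState): void {
  const stepCount = Object.keys(state.steps).length;
  if (type === 'RESUME' && stepCount === 0) {
    // Exact signature of the original bug: a resume with nothing to resume.
    throw new Error('VALIDATION: RESUME operation requires non-empty execution state');
  }
  if (type === 'BEGIN' && stepCount > 0) {
    // A fresh run that already carries steps points to a code regression.
    throw new Error('VALIDATION: BEGIN operation requires empty execution state');
  }
}

// Legitimate cases pass silently:
validateExecutionState('BEGIN', { steps: {} });
validateExecutionState('RESUME', { steps: { delay_1: { status: 'PAUSED' } } });
```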