Durable Execution

Figure: Replay and skip. On the first attempt, the trigger and step 2 are logged before the worker dies at step 3. On resume, a fresh worker skips the logged trigger and step 2, re-runs the in-flight step 3, then continues fresh from step 4.

Without durable execution, a worker that dies mid-flow would restart the whole run from the trigger and re-send emails, re-charge cards, and re-call APIs. With it, a fresh worker reuses the saved output of every finished step and resumes at the first step that had not completed. The same mechanism covers crashes, deploys, long pauses, and retries.

The run log

Every flow run has a run log: one compressed checkpoint file with everything needed to resume the run on a fresh worker. What is in it:

One entry per finished step, keyed by step name: input (secrets hidden), output, status, duration, and the error message for failed steps.
Loop iterations and router branches use the same shape, nested under their parent step.
Run-level tags.
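
As a rough illustration of that shape, the sketch below models a run log in Python. The names StepEntry, RunLog, children, hide_secrets, and REDACTED are hypothetical and chosen only to mirror the list above; the actual on-disk schema is not specified here.

```python
from dataclasses import dataclass, field, asdict
from typing import Any, Optional

REDACTED = "**REDACTED**"  # hypothetical placeholder for hidden secrets


@dataclass
class StepEntry:
    """One finished step, mirroring the list above (field names are illustrative)."""
    input: dict[str, Any]
    output: Any
    status: str                           # e.g. "SUCCEEDED", "FAILED", "PAUSED"
    duration_ms: int
    error_message: Optional[str] = None   # only set for failed steps
    # Loop iterations and router branches nest here, using the same shape.
    children: dict[str, "StepEntry"] = field(default_factory=dict)


@dataclass
class RunLog:
    steps: dict[str, StepEntry] = field(default_factory=dict)  # keyed by step name
    tags: list[str] = field(default_factory=list)              # run-level tags


def hide_secrets(step_input: dict[str, Any], secret_keys: set[str]) -> dict[str, Any]:
    """Return a copy of the input with secret values replaced before logging."""
    return {k: (REDACTED if k in secret_keys else v) for k, v in step_input.items()}


log = RunLog(tags=["billing"])
log.steps["send_email"] = StepEntry(
    input=hide_secrets({"to": "a@example.com", "api_key": "sk-123"}, {"api_key"}),
    output={"message_id": "m-1"},
    status="SUCCEEDED",
    duration_ms=420,
)

# asdict() recurses through nested entries, giving a JSON-ready dictionary.
serializable = asdict(log)
```

Keying entries by step name is what makes the later lookup ("is this step's output already in the log?") a single dictionary access.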

When it is written:

Once at the start, before the first step runs.
Every 15 seconds while the run executes, from a background loop that snapshots whatever finished since the last write.
Once on the final state (success, failure, or pause).

Each write overwrites the previous copy. Only the latest checkpoint is kept, and the file is compressed before upload.
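
The write schedule above can be sketched as follows. This is a minimal illustration under stated assumptions, not the engine's implementation: a local file stands in for the uploaded checkpoint, gzip over JSON stands in for whatever compression and encoding the engine uses, and write_checkpoint, read_checkpoint, and CheckpointWriter are hypothetical names. Only the 15-second default comes from the text.

```python
import gzip
import json
import os
import threading


def write_checkpoint(path: str, log: dict) -> None:
    """Compress the log and overwrite the previous copy (only the latest is kept)."""
    data = gzip.compress(json.dumps(log).encode("utf-8"))
    tmp = path + ".tmp"
    with open(tmp, "wb") as f:
        f.write(data)
    # Assumption: rename-over-old so a reader never sees a half-written file.
    os.replace(tmp, path)


def read_checkpoint(path: str) -> dict:
    with open(path, "rb") as f:
        return json.loads(gzip.decompress(f.read()).decode("utf-8"))


class CheckpointWriter:
    """Writes at start, every `interval` seconds while running, and on the final state."""

    def __init__(self, path: str, log: dict, interval: float = 15.0):
        self.path, self.log, self.interval = path, log, interval
        self._lock = threading.Lock()   # guards the log against concurrent mutation
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._loop, daemon=True)

    def _flush(self) -> None:
        with self._lock:
            write_checkpoint(self.path, self.log)

    def _loop(self) -> None:
        # Event.wait returns False on timeout and True once stop() sets the event.
        while not self._stop.wait(self.interval):
            self._flush()

    def start(self) -> None:
        self._flush()                   # once at the start, before the first step
        self._thread.start()

    def record(self, step_name: str, entry: dict) -> None:
        with self._lock:
            self.log.setdefault("steps", {})[step_name] = entry

    def stop(self, final_status: str) -> None:
        self._stop.set()
        self._thread.join()
        with self._lock:
            self.log["status"] = final_status
        self._flush()                   # once on the final state
```

A run would call start() before its first step, record() as each step finishes, and stop() with the final state (success, failure, or pause).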

Replay and skip

Resume is not a special path. Each time a worker starts a run, it walks the flow graph from the trigger and asks at every step: is this step’s output already in the log?

If yes, and the step finished (SUCCEEDED or PAUSED), the engine returns the saved output and moves on.
If no, the engine runs the step, records its output, and continues.

On the first run the log is empty, so every step runs. After a resume the log is full up to the interruption, so the engine skips through all of it and runs only what came next. The most a crash can lose is the single step that was running when the worker died. That step runs again from the last checkpoint, and everything before it is skipped.
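
The replay rule can be sketched in a few lines. This is a simplified illustration, assuming a linear list of steps in place of the full flow graph and an in-memory dictionary in place of the checkpoint file; replay, WorkerDied, and the step functions are hypothetical names. It reproduces the scenario where the worker dies during step 3.

```python
from typing import Any, Callable

FINISHED = {"SUCCEEDED", "PAUSED"}  # statuses the text treats as finished

# A step is (name, function of the outputs of earlier steps).
Step = tuple[str, Callable[[dict[str, Any]], Any]]


def replay(steps: list[Step], log: dict[str, dict], executed: list[str]) -> dict[str, Any]:
    """Walk the flow from the trigger; reuse logged outputs, run everything else."""
    outputs: dict[str, Any] = {}
    for name, fn in steps:
        entry = log.get(name)
        if entry is not None and entry["status"] in FINISHED:
            outputs[name] = entry["output"]   # skip: return the saved output
            continue
        result = fn(outputs)                  # not in the log: run the step
        log[name] = {"status": "SUCCEEDED", "output": result}
        outputs[name] = result
        executed.append(name)                 # track what actually ran, for the demo
    return outputs


class WorkerDied(Exception):
    """Stands in for the worker process being killed mid-step."""


crash_once = {"armed": True}


def step_3(prev: dict[str, Any]) -> int:
    if crash_once["armed"]:
        crash_once["armed"] = False
        raise WorkerDied("worker died mid-step")
    return prev["step_2"] + 1


flow: list[Step] = [
    ("trigger", lambda prev: 1),
    ("step_2", lambda prev: prev["trigger"] + 1),
    ("step_3", step_3),
    ("step_4", lambda prev: prev["step_3"] * 10),
]

log: dict[str, dict] = {}
first_attempt: list[str] = []
second_attempt: list[str] = []

try:
    replay(flow, log, first_attempt)          # dies at step 3
except WorkerDied:
    pass

result = replay(flow, log, second_attempt)    # fresh worker, same log
print(first_attempt)    # ['trigger', 'step_2']
print(second_attempt)   # ['step_3', 'step_4']
```

Both attempts call the same function with the same arguments; only the contents of the log differ, which is why resume needs no special path.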

What triggers a resume

Every interruption resolves through the same replay path. Only the trigger differs.

Worker crash or deployment. The queue reassigns the run to another worker, which loads the log and replays.
Paused step. The piece creates a waitpoint. When the waitpoint fires, a resume job is queued and a worker replays the run.
Retry from a failed step. The same log is reused. The run is re-queued and a worker replays it, skipping every finished step before the failure and re-running the failed step.
Normal progression within one worker. Same replay model, without leaving the process.
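
One way to picture "only the trigger differs" is a single job queue fed by several event handlers. The sketch below is an assumption-laden illustration: ResumeJob, the on_* handlers, and drain are hypothetical names, and an in-process deque stands in for the real queue.

```python
from collections import deque
from dataclasses import dataclass
from typing import Callable


@dataclass
class ResumeJob:
    run_id: str
    reason: str  # kept for observability only; the worker does not branch on it


queue: deque[ResumeJob] = deque()


def on_worker_lost(run_id: str) -> None:
    queue.append(ResumeJob(run_id, "worker_crash_or_deploy"))


def on_waitpoint_fired(run_id: str) -> None:
    queue.append(ResumeJob(run_id, "waitpoint"))


def on_retry_requested(run_id: str) -> None:
    queue.append(ResumeJob(run_id, "retry"))


def drain(load_log: Callable[[str], dict], replay: Callable[[str, dict], None]) -> None:
    """Every job, whatever its reason, takes the same path: load the log, replay."""
    while queue:
        job = queue.popleft()
        replay(job.run_id, load_log(job.run_id))
```

Because the worker never inspects the reason, adding a new kind of interruption only means adding another producer for the queue.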
