Architecture Overview
canaryRoutingMiddleware runs as a Fastify preHandler hook on every request. If the resolved platformId is in the canary list, the request is proxied in-process to the canary app and the response is returned directly to the caller — the primary app never processes it further.
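The routing decision can be sketched as a pure function. This is only an illustration, not the actual middleware: `decideRoute`, `RoutingDecision`, the example URL, and the handling of a request with no resolved platform are all assumptions made for the sketch.

```typescript
// Hypothetical sketch of the routing decision; not the actual middleware code.
type RoutingDecision =
  | { target: "primary" }
  | { target: "canary"; proxyUrl: string };

// canaryAppUrl stands in for the value of AP_CANARY_APP_URL.
function decideRoute(
  platformId: string | undefined,
  canaryPlatformIds: ReadonlySet<string>,
  canaryAppUrl: string,
  requestPath: string,
): RoutingDecision {
  // Assumption: requests with no resolved platform stay on the primary app,
  // as do requests for platforms outside the canary list.
  if (platformId === undefined || !canaryPlatformIds.has(platformId)) {
    return { target: "primary" };
  }
  // Canary platforms: forward the same path to the canary app.
  const base = canaryAppUrl.replace(/\/+$/, "");
  return { target: "canary", proxyUrl: `${base}${requestPath}` };
}

// Example values are made up for illustration.
const canaryIds = new Set(["platform-a"]);
const routed = decideRoute("platform-a", canaryIds, "http://canary.internal:3000/", "/v1/flows");
const notRouted = decideRoute("platform-b", canaryIds, "http://canary.internal:3000/", "/v1/flows");
```

Keeping the decision separate from the proxying makes it easy to unit test without a running server.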
Canary membership: DB-backed with Redis cache
Canary platform membership is stored in the platform_plan.canary boolean column. On each lookup, the list of canary platform IDs is fetched from Redis (canary-platform-ids key). On a cache miss, the list is read from the database and cached. When the canary flag is changed via the API, the cache is invalidated immediately.
This removes the need to keep AP_CANARY_PLATFORM_IDS in sync across services and allows runtime changes without a redeploy.
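The lookup and invalidation flow follows the cache-aside pattern. A minimal sketch, using in-memory stand-ins for Redis and the database: the `StringCache` interface, the `CanaryMembership` class, and the JSON encoding of the cached list are assumptions, and real Redis and database clients would be asynchronous.

```typescript
// Minimal cache-aside sketch with in-memory stand-ins; all interfaces are hypothetical.
interface StringCache {
  get(key: string): string | undefined;
  set(key: string, value: string): void;
  del(key: string): void;
}

const CACHE_KEY = "canary-platform-ids"; // key name taken from the text

class CanaryMembership {
  dbReads = 0; // counts "database" reads so the cache effect is observable

  // canaryFlags stands in for the platform_plan.canary column, keyed by platform ID.
  constructor(
    private readonly cache: StringCache,
    private readonly canaryFlags: Map<string, boolean>,
  ) {}

  getCanaryPlatformIds(): string[] {
    const cached = this.cache.get(CACHE_KEY);
    if (cached !== undefined) {
      return JSON.parse(cached) as string[];
    }
    // Cache miss: read from the "database" and populate the cache.
    this.dbReads += 1;
    const ids = Array.from(this.canaryFlags.entries())
      .filter(([, canary]) => canary)
      .map(([id]) => id);
    this.cache.set(CACHE_KEY, JSON.stringify(ids));
    return ids;
  }

  setCanary(platformId: string, canary: boolean): void {
    this.canaryFlags.set(platformId, canary);
    // Invalidate immediately so the next lookup sees the change.
    this.cache.del(CACHE_KEY);
  }
}

function createMemoryCache(): StringCache {
  const store = new Map<string, string>();
  return {
    get: (key) => store.get(key),
    set: (key, value) => { store.set(key, value); },
    del: (key) => { store.delete(key); },
  };
}

const membership = new CanaryMembership(
  createMemoryCache(),
  new Map([["p1", true], ["p2", false]]),
);
```

Deleting the key on write, rather than rewriting it, keeps the write path simple: the next read repopulates the cache from the database.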
Components
Canary App
A second instance of the server API running a different image tag. No special env vars are needed for canary membership; it is driven entirely by the DB. Configure it with:

| Env var | Value |
|---|---|
| AP_FRONTEND_URL | canary app’s public URL |
The shared queue consumers (runsMetadata, system-job-queue) run on both the primary and canary app. This is safe because BullMQ ensures each job is processed by exactly one consumer.
Canary Workers
Dedicated workers that connect to the canary app instead of the primary app:

| Env var | Value |
|---|---|
| AP_FRONTEND_URL | canary app’s URL |
| AP_IS_CANARY_WORKER | true |
Canary workers use AP_FRONTEND_URL for their Socket.IO RPC channel and for posting engine results back to the app. Because canary workers are registered with the canary app, the RPC path is fully isolated.
Primary App Config
The primary app only needs to know where to proxy canary requests:

| Env var | Description |
|---|---|
| AP_CANARY_APP_URL | Internal URL of the canary app |
WebSocket / Real-time Updates
Socket.IO is configured with a Redis adapter (@socket.io/redis-adapter). Events emitted on any app instance (primary or canary) are broadcast through Redis pub/sub to all connected instances. This means:
- Users connected to the primary app receive real-time flow run updates even when the execution happened on the canary app.
- No WebSocket proxying is required.
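The fan-out behaviour can be pictured with a toy in-memory model. This is not the Socket.IO or Redis adapter API: `PubSubBus`, `AppInstance`, and the `flow-run-updated` event name are hypothetical and exist only to show why an event emitted on one instance reaches clients of another.

```typescript
// Toy in-memory model of adapter fan-out; this is not the Socket.IO API.
type Listener = (event: string, payload: string) => void;

// Stands in for Redis pub/sub shared by all app instances.
class PubSubBus {
  private readonly subscribers: Listener[] = [];

  subscribe(listener: Listener): void {
    this.subscribers.push(listener);
  }

  publish(event: string, payload: string): void {
    for (const listener of this.subscribers) {
      listener(event, payload);
    }
  }
}

class AppInstance {
  // Events that this instance would deliver to its own connected clients.
  readonly delivered: string[] = [];

  constructor(readonly name: string, private readonly bus: PubSubBus) {
    // Every instance subscribes, so it hears events emitted by any instance.
    bus.subscribe((event, payload) => {
      this.delivered.push(`${event}:${payload}`);
    });
  }

  emit(event: string, payload: string): void {
    this.bus.publish(event, payload);
  }
}

const bus = new PubSubBus();
const primaryApp = new AppInstance("primary", bus);
const canaryApp = new AppInstance("canary", bus);

// A run executed on the canary app emits an update there...
canaryApp.emit("flow-run-updated", "run-1");
// ...and the primary app also receives it for its own connected users.
```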
Queue Isolation
| Queue | Primary App | Canary App |
|---|---|---|
| workerJobs | ✅ Consumed by primary workers | Not consumed |
| runsMetadata | ✅ Consumed | ✅ Consumed |
| system-job-queue | ✅ Consumed | ✅ Consumed |
| canaryWorkerJobs | Not consumed | ✅ Consumed by canary workers |
Jobs for canary platforms go to canaryWorkerJobs, which only canary workers poll, so they are fully isolated from the primary worker fleet.
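The queue split can be expressed as two small selection functions. This is a sketch under assumptions: the function names are hypothetical, and tying the polled queue to AP_IS_CANARY_WORKER and the dispatch queue to canary membership is inferred from the table rather than taken from the actual code.

```typescript
type QueueName = "workerJobs" | "canaryWorkerJobs";

// Which job queue a worker polls.
// Assumption: AP_IS_CANARY_WORKER=true is what marks a worker as canary.
function workerQueueFor(env: Record<string, string | undefined>): QueueName {
  return env["AP_IS_CANARY_WORKER"] === "true" ? "canaryWorkerJobs" : "workerJobs";
}

// Which queue a job is dispatched to.
// Assumption: dispatch follows the platform's canary membership.
function dispatchQueueFor(
  platformId: string,
  canaryPlatformIds: ReadonlySet<string>,
): QueueName {
  return canaryPlatformIds.has(platformId) ? "canaryWorkerJobs" : "workerJobs";
}

// Example values are made up for illustration.
const canaryPlatforms = new Set(["platform-a"]);
const canaryWorkerQueue = workerQueueFor({ AP_IS_CANARY_WORKER: "true" });
const primaryWorkerQueue = workerQueueFor({});
const canaryJobQueue = dispatchQueueFor("platform-a", canaryPlatforms);
const primaryJobQueue = dispatchQueueFor("platform-b", canaryPlatforms);
```

Because producers and consumers agree on the same two names, a canary job is never visible to a primary worker and vice versa.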
Deploying a New Canary Build
The Continuous Delivery — Canary workflow (continuous-delivery-canary.yml) runs automatically every day at 9 AM UTC, building from the latest main.
The workflow:
- Builds and pushes a new image tagged <version>.<sha>.canary
- Checks if any new migrations are breaking, and fails the workflow if they are (no override)
- Deploys the canary app (config/app-canary.yml) and canary workers
The workflow requires AP_API_KEY (the primary app’s AP_API_KEY value).
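The image tag format <version>.<sha>.canary can be illustrated with a small build and parse helper. The helper names are hypothetical, and the assumption that the version has three numeric parts and the SHA is lowercase hexadecimal comes from the example tag 0.51.0.abc1234.canary, not from the workflow itself.

```typescript
interface CanaryTag {
  version: string;
  sha: string;
}

function buildCanaryTag(version: string, sha: string): string {
  return `${version}.${sha}.canary`;
}

function parseCanaryTag(tag: string): CanaryTag | null {
  // Assumes a three-part numeric version and a lowercase hexadecimal short SHA.
  const match = /^(\d+\.\d+\.\d+)\.([0-9a-f]+)\.canary$/.exec(tag);
  return match ? { version: match[1], sha: match[2] } : null;
}

const exampleTag = buildCanaryTag("0.51.0", "abc1234");
const parsedTag = parseCanaryTag(exampleTag);
const notACanaryTag = parseCanaryTag("0.51.0");
```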
Rolling Back a Canary Deployment
Trigger the Continuous Delivery — Rollback Canary workflow (continuous-delivery-rollback-canary.yml) with the image tag you want to roll back to. The workflow:
- Extracts the migration manifest from the target image
- Rolls back DB migrations not present in that manifest
- Redeploys the canary app and workers to the target image
| Input | Description |
|---|---|
| rollback_to_image_tag | Image tag to roll back to (e.g. 0.51.0.abc1234.canary) |
| force | Force rollback even if breaking migrations exist. Default: false. |
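The migration step amounts to a set difference between what is applied and what the target image's manifest contains. A minimal sketch, assuming migrations are identified by name, that each one carries a `breaking` flag, and that the applied list is ordered oldest to newest. `Migration`, `planRollback`, and the migration names are hypothetical, and the real workflow may detect breaking migrations differently.

```typescript
interface Migration {
  name: string;
  breaking: boolean; // assumption: breaking status is known per migration
}

// Returns the names of migrations to revert, newest first.
function planRollback(
  applied: Migration[],
  targetManifest: ReadonlySet<string>,
  force = false,
): string[] {
  // Anything applied but absent from the target manifest must be rolled back.
  const toRevert = applied.filter((m) => !targetManifest.has(m.name));
  const breaking = toRevert.filter((m) => m.breaking);
  if (breaking.length > 0 && !force) {
    const names = breaking.map((m) => m.name).join(", ");
    throw new Error(`Breaking migrations require force: ${names}`);
  }
  // Revert newest first (assumes `applied` is ordered oldest to newest).
  return toRevert.map((m) => m.name).reverse();
}

// Example values are made up for illustration.
const appliedMigrations: Migration[] = [
  { name: "AddCanaryColumn", breaking: false },
  { name: "AddIndex", breaking: false },
  { name: "DropLegacyTable", breaking: true },
];
const targetManifest = new Set(["AddCanaryColumn"]);
const forcedPlan = planRollback(appliedMigrations, targetManifest, true);
```

With `force` left at its default of false, the same call would throw because DropLegacyTable is marked breaking.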
Promoting canary build to production
Canary is a validation environment, not a promotion path. Full promotion happens via the Sunday scheduled cloud workflow, which deploys release-candidate to production. If a canary build has been validated and the corresponding commit has been tagged as release-candidate, it will automatically reach production on the next Sunday.
Managing Platform Routing
Enable canary routing for a platform
Disable canary routing for a platform
Both operations update platform_plan.canary in the database and invalidate the Redis cache (canary-platform-ids). The change takes effect on the next request, with no restart required.