Blue Green Deployment in the context of real-time applications
How we implemented Blue Green deployment for a real-time collaborative application.
WorkAdventure is a real-time collaboration platform where users join virtual spaces to interact. Unlike in a classic web app, users can stay connected for hours, with continuous audio and video streams and a shared real-time state (notably player positions). This changes what “safe deployment” means: even short interruptions are immediately visible, and split-brain situations (two versions of the same world running at the same time) can break the experience.
This article explains how we used to deploy, why common Kubernetes rollout patterns did not work for us, which alternatives we explored (including one we abandoned), and the Blue Green strategy we built to achieve deployments without deliberate downtime.
How we performed deployments before
Historically, deploying a new version meant taking the service offline briefly, applying the update, then bringing everything back up. All connected users experienced a disconnection during that window.
We invested heavily in reducing the duration of this downtime. We moved from Docker-based deployments to Kubernetes and optimized our update procedure. In particular, we reduced the downtime from about 5 minutes to about 20 seconds by ensuring container images were already present on every Kubernetes node before the switch. Kubernetes does not provide native image preloading, but we achieved a similar effect using a DaemonSet pattern that pulls the images onto every node ahead of time.
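As an illustration, here is a minimal sketch of that DaemonSet pattern. The resource names, the pause image, and the no-op command are assumptions for the example, not our actual manifests; the point is simply that scheduling a pod referencing the new images on every node forces the kubelet to pull them in advance.

```typescript
// Sketch only: build a DaemonSet manifest that pre-pulls a list of images on
// every node. Names, labels and the pause image are placeholders.
type K8sManifest = Record<string, unknown>;

function buildPrepullDaemonSet(images: string[]): K8sManifest {
  return {
    apiVersion: "apps/v1",
    kind: "DaemonSet",
    metadata: { name: "image-prepull" },
    spec: {
      selector: { matchLabels: { app: "image-prepull" } },
      template: {
        metadata: { labels: { app: "image-prepull" } },
        spec: {
          // One init container per image: running it (even as a no-op) makes
          // the kubelet pull and cache the image locally on the node.
          initContainers: images.map((image, i) => ({
            name: `prepull-${i}`,
            image,
            command: ["sh", "-c", "exit 0"], // assumes a shell exists in the image
          })),
          // A tiny long-running container keeps the pod alive once the pulls are done.
          containers: [
            {
              name: "pause",
              image: "registry.k8s.io/pause:3.9",
              resources: { requests: { cpu: "1m", memory: "8Mi" } },
            },
          ],
        },
      },
    },
  };
}
```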
However, for a real-time product like WorkAdventure, even 20 seconds is disruptive: users are disconnected, audio and video streams drop, and meetings or events are interrupted. This is also why uptime percentage alone is a poor SLA proxy for real-time systems. See our article for details: https://workadventu.re/tech/uptime-percentage-is-bad-sla-for-real-time-apps/
To limit user impact, we scheduled deployments during off-peak hours (nights or weekends). Over the past few years, as WorkAdventure usage became global, this approach stopped scaling: there is no longer a quiet period. When users in Europe sleep, users in the Americas are active, and when the Americas wind down, Asia starts its workday.
Why Blue Green is needed and why it is hard for us
Blue Green Deployment is a common strategy to reduce downtime by maintaining two production environments:
- Blue, currently serving traffic
- Green, idle and ready to receive the next version
A new version is deployed to the idle environment, validated, then traffic is switched from the live environment to the updated one.
For many web apps, this traffic switch can be gradual (a percentage at a time) using cookies or headers. That pattern is risky for WorkAdventure because users must remain in a consistent shared state. If two users are “in the same world” but end up routed to different versions, they are no longer sharing the same real-time backend state and will not see the same reality.
So the core technical constraint is:
A world must be served by exactly one version at a time.
That requirement conflicts with most rollout mechanisms that split traffic per user session.
Blue Green in typical web applications
In a typical web app:
- Deploy the new version to the idle environment
- Validate it (health checks, smoke tests, functional tests)
- Switch traffic gradually (for example 10 percent, then 50 percent, then 100 percent)
- Roll back quickly if needed
This is usually implemented by tagging a user session (often via a cookie) and letting the load balancer route requests accordingly.
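For comparison, here is a rough sketch of that session-tagging logic (illustrative only, not something we run): each new session gets a color via a weighted coin flip, the color is stored in a cookie, and the load balancer routes on it.

```typescript
// Illustrative only: the classic sticky-cookie split used for gradual rollouts.
type Color = "blue" | "green";

function assignColor(existingCookie: string | undefined, greenPercent: number): Color {
  if (existingCookie === "blue" || existingCookie === "green") {
    return existingCookie; // stick to the color already assigned to this session
  }
  // Weighted coin flip for new sessions; ramp greenPercent from 10 to 50 to 100.
  // Note that the split happens per browser session, not per world.
  return Math.random() * 100 < greenPercent ? "green" : "blue";
}
```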

This is not a fit for WorkAdventure because we cannot safely have users inside the same world split across two versions.
Blue Green in session-based game servers and why it does not map to WorkAdventure
WorkAdventure is closer to a game than to a classic web app, so we also looked at how game servers handle upgrades.
In many non-MMO games, sessions are short-lived. Players join a match, the match ends, everyone disconnects, and the server becomes empty. Blue Green is simple there:
- stop sending new sessions to the old server
- wait until the server is empty
- update it and put it back in rotation
Kubernetes-based platforms like Agones exist specifically for this match lifecycle model.
WorkAdventure does not match that pattern:
- users can stay connected for hours
- a “world” is long-lived, not a short match
- running a dedicated process per world would be too resource-intensive and complex in our architecture
So we needed a different approach that preserves consistency while still allowing progressive rollout.
Abandoned approach: sharing world state between Blue and Green
Our biggest obstacle is that WorkAdventure maintains the real-time state of a world inside a single backend service. Conceptually, this service is the single source of truth for the world state (notably player positions). If we simply duplicated the environment, we would end up with two separate sources of truth, which is exactly what we must avoid.
One idea was to externalize the world state into a datastore suited for real time workloads, such as Redis, and have both Blue and Green read and write the same shared state.

This would have relaxed the “single backend instance” constraint, but it raised two major issues for WorkAdventure:
- Compatibility constraints between versions: during rollout, both versions would need to read and write the same state structure safely, similarly to how database schemas must remain compatible during classic Blue Green transitions.
- Performance risk: WorkAdventure shares position updates multiple times per second and can produce tens of thousands of writes per second in a large world. Introducing a network hop to a datastore for each update (plus the corresponding reads) would add latency and operational complexity. Even with a fast datastore, that latency budget is extremely tight for a smooth real time experience.
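For illustration only, here is a rough sketch of the hot path this idea would have created, assuming Redis as the shared datastore (hypothetical code; we did not build this):

```typescript
import Redis from "ioredis";

// Hypothetical: the shared-state design we abandoned. Every position update
// becomes at least one network round trip to the datastore.
const redis = new Redis(); // assumes a reachable Redis instance with default settings

async function onPlayerMove(worldId: string, userId: string, x: number, y: number): Promise<void> {
  // One write per movement update, per player, several times per second...
  await redis.hset(`world:${worldId}:positions`, userId, JSON.stringify({ x, y }));
  // ...plus a publish so the backend in the other environment can react.
  await redis.publish(`world:${worldId}:events`, JSON.stringify({ userId, x, y }));
}
```

At tens of thousands of writes per second, each of those round trips eats into an already tight latency budget.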
Because of these risks, we abandoned this option.
Chosen approach: define scopes and split traffic by scope
Instead of splitting traffic by user session, we split traffic by interaction scope.
A scope is a domain where users can interact with each other in real time and therefore must share the same version.
We first considered each room as a scope, because users in the same room share real-time state (positions, proximity audio groups, meeting rooms). Rooms also have distinct URLs, which would make routing straightforward.
In practice, scopes had to be larger:
- users can move between rooms seamlessly within the same world
- users can see who is online across the entire world via a shared state structure (WorkAdventure “spaces”) that spans multiple rooms
Because of that, we treat the entire world as the scope.
Why URLs instead of cookies
Cookies are tied to a browser session. A single user can open two tabs connected to two different worlds. Cookie-based routing would couple those tabs together in a way that does not match our scope model.
URLs are already organized by scope.
A typical room URL looks like:
https://play.workadventu.re/@/[organization]/[world]/[room]
For white-labeled deployments:
https://[custom-domain]/@/[room]
In white-labeled mode, a custom domain maps to one and only one world. That gives us a direct scope key to route on:
- either the [organization]/[world] segment
- or the custom domain
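As a sketch of what extracting that key can look like (simplified and hypothetical, with the main domain hardcoded for the example):

```typescript
// Minimal sketch of a scope-key helper. On the main domain the key is the
// [organization]/[world] segment; on a white-labeled custom domain the key is
// the domain itself.
const MAIN_DOMAIN = "play.workadventu.re"; // assumption for the example

function scopeKey(rawUrl: string): string {
  const url = new URL(rawUrl);
  if (url.hostname !== MAIN_DOMAIN) {
    // White-labeled mode: one custom domain == one world.
    return url.hostname;
  }
  // Main domain: /@/[organization]/[world]/[room]/...
  const parts = url.pathname.split("/").filter(Boolean); // ["@", org, world, room, ...]
  if (parts[0] !== "@" || parts.length < 3) {
    throw new Error(`Unexpected room URL: ${rawUrl}`);
  }
  return `${parts[1]}/${parts[2]}`; // organization/world
}

// scopeKey("https://play.workadventu.re/@/acme/hq/lobby") -> "acme/hq"
// scopeKey("https://town.example.com/@/lobby")            -> "town.example.com"
```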
Resulting architecture
At a high level, we duplicate only the components that are directly involved in real time traffic, and we keep the shared storage and administration components common.

Duplicated per environment (Blue and Green):
- the frontend serving the client
- the real-time backend services handling worlds
- the map storage services needed by the running version
Shared between environments:
- components not directly on the real-time path, such as the admin API and the storage layer
This split reduces cost while preserving strict consistency inside a world: at any moment, one world is fully on Blue or fully on Green.
Implementation details
Managing internal URLs
A web application is not only composed of the URLs that users access directly.
It also references many internal URLs:
- API calls
- WebSocket endpoints
- static assets (JavaScript, CSS, images)
If a world is routed to Blue, all internal URLs must also resolve to the Blue environment. We accomplish this by having environment-specific internal hostnames and ensuring the served page points to the correct ones for its color.
In other words, the public entry URL stays stable, but the application can target environment-specific internal endpoints once routing has chosen a color for that scope.
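A simplified sketch of the idea, with made-up internal hostnames (the real endpoint list and naming scheme differ):

```typescript
// Sketch: once routing has chosen a color for a scope, the page we serve
// references that color's internal endpoints. Hostnames below are invented.
type Color = "blue" | "green";

interface EnvironmentEndpoints {
  apiUrl: string;
  websocketUrl: string;
  assetsUrl: string;
}

function endpointsFor(color: Color): EnvironmentEndpoints {
  return {
    apiUrl: `https://api-${color}.internal.workadventu.re`,
    websocketUrl: `wss://ws-${color}.internal.workadventu.re`,
    assetsUrl: `https://front-${color}.internal.workadventu.re/assets`,
  };
}

// The HTML served for a world routed to Green embeds endpointsFor("green"),
// while the public entry URL (https://play.workadventu.re/...) never changes.
```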
Managing OIDC authentication callbacks
WorkAdventure supports authentication via OpenID Connect.
With OIDC, users are redirected to the identity provider, then sent back to a predefined redirect URI that must be registered ahead of time. In many setups, that redirect URI must be on the same main domain as the app, which prevents using an environment-specific hostname for the callback.
We solved this by encoding the environment in the callback path on the main domain:
- https://play.workadventu.re/blue/callback
- https://play.workadventu.re/green/callback
This creates a routing special case: callbacks land on the stable main domain, but the path determines which environment must complete the login flow.
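The resulting check is small; here is a hypothetical sketch of how the environment can be recovered from the callback path:

```typescript
// Sketch of the routing special case: the OIDC redirect URI stays on the main
// domain, and the first path segment tells us which environment must finish
// the login flow. Paths follow the examples above.
type Color = "blue" | "green";

function callbackColor(pathname: string): Color | undefined {
  const match = pathname.match(/^\/(blue|green)\/callback$/);
  return match ? (match[1] as Color) : undefined;
}

// callbackColor("/green/callback")  -> "green"
// callbackColor("/blue/callback")   -> "blue"
// callbackColor("/@/acme/hq/lobby") -> undefined (normal scope-based routing applies)
```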
Performing the switch safely
We want the default path to move to the new environment, while keeping each existing world on the old environment until it is safe to switch.
Our switch procedure is:
- Deploy the new version to the idle environment (example: Green).
- Ensure every existing world is explicitly routed to the current live environment (example: Blue).
- Change the default routing so that any world without an explicit rule goes to Green.
- For each world, once it is safe to switch, remove its explicit Blue rule so it falls back to the new default (Green).
This requires an orchestration component that understands worlds, activity, and routing rules.
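Below is a sketch of that sequence. The RoutingControl interface and its methods are placeholders standing in for the orchestration component's real primitives, not actual WorkAdventure APIs.

```typescript
// Hypothetical sketch of the rollout sequence described above.
type Color = "blue" | "green";

interface RoutingControl {
  listActiveWorlds(): Promise<string[]>;
  pinWorld(world: string, color: Color): Promise<void>;  // explicit per-world route
  unpinWorld(world: string): Promise<void>;              // fall back to the default route
  setDefaultColor(color: Color): Promise<void>;
  isSafeToSwitch(world: string): Promise<boolean>;       // the activity heuristics described below
}

async function rollout(ctl: RoutingControl, live: Color, next: Color): Promise<void> {
  // The new version is assumed to be already deployed and validated on `next`.
  // 1. Pin every existing world to the current live environment.
  const worlds = await ctl.listActiveWorlds();
  for (const world of worlds) {
    await ctl.pinWorld(world, live);
  }
  // 2. New or unpinned worlds now land on the new environment by default.
  await ctl.setDefaultColor(next);
  // 3. Unpin each world once it is considered safe, so it falls back to `next`.
  //    In reality this step runs continuously until every world has migrated.
  for (const world of worlds) {
    if (await ctl.isSafeToSwitch(world)) {
      await ctl.unpinWorld(world);
    }
  }
}
```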
Why we moved away from Ingress and adopted the Kubernetes Gateway API
Initially, routing was managed via Kubernetes Ingress resources. Ingress is convenient for simple setups, but it tends to concentrate a lot of routing logic into a small number of large resources, often one per domain.
At our scale, switching worlds one by one would mean updating a very large Ingress resource repeatedly: we typically have thousands of worlds running at the same time. That creates two problems:
- the resource becomes huge and operationally heavy to manage
- updates can cause routing reloads that briefly impact unrelated worlds, which is exactly what we want to avoid
To address this, we stopped relying on Ingress for this workflow and moved routing to the Kubernetes Gateway API. With Gateway API, routing rules can be defined in a more granular way, as separate route resources. That means switching one world can be reduced to updating or deleting one small route resource, without reloading everything.
This change was necessary for fine-grained, low-impact routing orchestration.
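For example, pinning one world to Blue can be expressed as a single small HTTPRoute (resource and Service names below are assumptions for the sketch, not our production objects):

```typescript
// Sketch: one HTTPRoute per pinned world. Deleting this object makes the world
// fall back to the default route, i.e. the new environment.
type Color = "blue" | "green";

function worldHttpRoute(org: string, world: string, color: Color) {
  return {
    apiVersion: "gateway.networking.k8s.io/v1",
    kind: "HTTPRoute",
    metadata: { name: `world-${org}-${world}`, namespace: "workadventure" },
    spec: {
      parentRefs: [{ name: "play-gateway" }], // the shared Gateway, assumed name
      hostnames: ["play.workadventu.re"],
      rules: [
        {
          matches: [{ path: { type: "PathPrefix", value: `/@/${org}/${world}` } }],
          backendRefs: [{ name: `play-${color}`, port: 80 }], // per-color Service, assumed name
        },
      ],
    },
  };
}
```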
The orchestration component and activity heuristics
The orchestration component manages routing rules and decides when a world can be switched. Its responsibilities are:
- pin every world to the currently live environment at the start of a rollout
- switch the default route to the new environment
- monitor world activity and unpin worlds when they are considered safe to migrate
The hard part is determining “safe to migrate.”
In WorkAdventure, a world can remain non-empty for a very long time because people leave tabs open and keep their avatar connected for days. From a routing point of view, that world is never empty, yet switching it is usually low risk if nobody is actively interacting.
We therefore implemented heuristics to approximate “effective emptiness”:
- If a world has only one user, it is treated as empty (no realtime interaction with others is possible).
- If a world has multiple users, but none is in a bubble conversation or meeting room, and the new environment has been running for more than 4 hours, the world is treated as empty.
- If a world has multiple users, but the current user count is below 10 percent of the maximum observed since the new environment started, the world is treated as empty.
- Finally, if the new environment has been running for more than 24 hours, the world is treated as empty.
With these rules, every world eventually migrates within 24 hours without requiring a platform-wide interruption.
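A compact sketch of these heuristics (illustrative; field names and the exact wiring are simplified compared to our orchestration component):

```typescript
// "Effective emptiness" check mirroring the rules above.
interface WorldActivity {
  connectedUsers: number;       // users currently connected to the world
  usersInConversation: number;  // users in a bubble conversation or meeting room
  maxUsersSinceRollout: number; // peak user count since the new environment started
}

const FOUR_HOURS_MS = 4 * 60 * 60 * 1000;
const ONE_DAY_MS = 24 * 60 * 60 * 1000;

function isEffectivelyEmpty(world: WorldActivity, rolloutAgeMs: number): boolean {
  // A single user cannot interact with anyone else in real time.
  if (world.connectedUsers <= 1) return true;
  // Nobody is actively talking and the new environment has soaked long enough.
  if (world.usersInConversation === 0 && rolloutAgeMs > FOUR_HOURS_MS) return true;
  // Activity has dropped below 10 percent of its peak since the rollout started.
  if (world.connectedUsers < 0.1 * world.maxUsersSinceRollout) return true;
  // Hard deadline: after 24 hours, the world migrates regardless.
  return rolloutAgeMs > ONE_DAY_MS;
}
```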
Limitations and future work
This strategy allows deployments without deliberate downtime at the platform level, while ensuring that each world stays consistent on a single version at any point in time. It also improves operational flexibility: we no longer need to schedule releases only during nights or weekends.
However, there are limitations:
- With only two environments, we cannot start rolling out a second new version until the previous rollout is fully completed.
- Emergency fixes are harder once the default has moved, because reverting or patching can require either waiting for the rollout window or forcing switches that may disconnect some users.
A natural next step would be exploring Canary-style deployments. Adapting Canary to a real-time, shared-state system would raise challenges similar to Blue Green, especially around scope definition, consistency, and safe migration triggers.