David Négrier
CTO & Founder

WorkAdventure outage on March 19th, 2026

On Thursday, March 19th, 2026, between 9:30 and 10:00 CEST, the WorkAdventure platform was down or unstable. This affected all our customers. We know how critical WorkAdventure can become in day-to-day communication with your remote team, students, or users. In this article, we share what happened, why the platform collapsed, what we learned along the way, and how we improved WorkAdventure to make it even more resilient.

TL;DR: unnecessary requests to Livekit and a hidden race condition

On March 19th, one of our clients held a big meeting on WorkAdventure. 900 people gathered to attend a meeting that started with a speaker addressing all the attendees, using the “speaker zone” (aka podium) mechanism.

The WorkAdventure platform is built on two core components: our own servers for managing participant connections and position sharing, and dedicated Livekit servers (both our internal cluster and Livekit Cloud) for handling high-scale audio/video streams.

The stability issue was triggered by two independent software bugs:

  1. A Bug in Livekit API Calls: Unnecessary requests to Livekit to forcibly disconnect participants (a security best practice) rapidly hit a poorly documented Livekit API rate limit. This prevented the main event podium from starting.
  2. A Cascading Race Condition: An attempt to resolve the first issue by deleting and recreating the Livekit room triggered a massive spike in API calls (estimated at 45,000 requests). This saturated one of our back servers, which subsequently timed out its connection with the play (front-end) servers. This disconnect exposed a previously unseen race condition in our Node.js code, causing the play servers to crash in a cascade, leading to mass disconnections.

Details on the bug with Livekit API

Users connect to WorkAdventure, and when they enter a meeting room or when a podium starts, the WorkAdventure server starts a Livekit “room” that browsers then connect to.

We use the “Livekit Server API” to create Livekit “rooms”, but also to forcibly disconnect participants once they have left a room in WorkAdventure. We could leave that responsibility to the client, meaning the browser, and let it disconnect on its own, but a good security practice is not to trust the client. A malicious user could keep their connection token for a Livekit room and use it to spy on what is happening there after they have left. That is why WorkAdventure called the Livekit server to remove participants when they leave a WorkAdventure room.

The root cause of the issue was a runaway number of participant removal calls sent to Livekit. A bug in our code was triggering unnecessary calls every time someone left an audience area when no one was on the podium. Furthermore, too many calls to delete the room were also performed when a speaker was leaving the podium.

These unnecessary calls were not catastrophic in themselves, but they piled up. When a speaker stepped off the podium and back on shortly after, we sent many “delete Livekit room” messages (instead of only one), and one “create Livekit room” message. We believe a race condition on the Livekit side (maybe due to the fact that Livekit rooms can start on any of their servers in the cloud) reordered the messages. Instead of seeing “delete-delete-delete-create”, Livekit executed “delete-delete-create-delete”. So the podium was destroyed right after being created.
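One way to see why the burst of redundant messages was dangerous: only the last operation in such a burst determines the room's end state, so everything before it is pure reordering risk. A minimal sketch (`coalesceOps` is a hypothetical helper, not our actual code) of collapsing a burst before it ever reaches Livekit:

```typescript
// Hypothetical sketch: collapse a burst of room lifecycle operations so
// only the last intended state is sent to Livekit. With this, a burst of
// "delete, delete, delete, create" becomes a single "create", and no
// reordering on the remote side can destroy a freshly created room.
type RoomOp = "create" | "delete";

function coalesceOps(ops: RoomOp[]): RoomOp[] {
  // Only the final operation determines the room's end state;
  // the intermediate ones are exactly what exposed the reordering.
  return ops.length === 0 ? [] : [ops[ops.length - 1]];
}

const burst: RoomOp[] = ["delete", "delete", "delete", "create"];
const toSend = coalesceOps(burst); // ["create"]
```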

This reordering was visible in the Livekit Cloud logs:

Why we had never encountered the issue before

The problem occurs if:

  • A large number of people are in the “audience” area
  • There is no one in the “speaker” area or the speaker goes back and forth quickly
  • A large number of people move back and forth between the “audience” area and the outside
  • Livekit Cloud is used rather than our internal Livekit cluster

This specific case had never happened to us before. Usually, speakers step onto the podium quite soon after the event begins.

The Chronology of the Incident

Before the crash: Load tests

We continuously probe our platform's limits, and most components of the WorkAdventure platform can auto-scale: they adapt automatically to the load by adding more servers when needed.

The week before the crash, we conducted a load test with 800 simulated users, using real Chrome browsers to accurately mimic user behavior. If you are curious, we use Artillery plus Playwright deployed on AWS Fargate. This test was designed to validate scaling, but it missed the critical bug: in our test, the speaker was always the first person to enter the podium, and more generally, users entered the auditorium, but never left, meaning the specific conditions required to trigger the bug were never met.

The day of the event

9:05 AM UTC: First Warning

At 9:05 AM, we received an email from Livekit stating that we had reached 80% of a 10,000 requests-per-minute API limit. This was the first time we learned of this quota. The limit is documented in the Quotas & Limits page (it claims 1,000 requests/minute, but there is actually a larger, undocumented limit for some accounts). However, it is not visible in the “Plan quotas” dashboard (where you would look for limits in the first place):


Although the limit was not fully documented or actionable (no button to request an increase), we immediately began planning a migration to our self-hosted Livekit cluster, which does not have this API limit.

The first problem became apparent shortly after: when the speakers attempted to start the podium, the room creation request failed because the rate limit had been exhausted, resulting in a persistent “loader” on the client side.

Root Cause 1: The API Rate Limit Bug

WorkAdventure uses the Livekit Server API to securely create rooms and, importantly, to forcibly disconnect users for security when they leave a designated room area. This prevents malicious users from keeping their connection token active.

The bug was triggered when a participant exited the audience zone while no speaker was present on the podium. In this specific scenario, our application incorrectly initiated the participant deletion process.

Due to the large event map and long speaker breaks during the event, many users were constantly moving between the audience zone and the entrance for a 20-minute period. Every time a single user left the area, it generated a large number of unnecessary Livekit API deletion requests, leading to rate limit exhaustion. Once the limit was reached, Livekit Cloud stopped processing requests, including the critical request to create the podium room when the speakers finally stepped up.

Amplification and Cascading Failure

When the podium failed to start, the team attempted to fix the issue by switching to our internal Livekit cluster. This required deleting and recreating the podium/audience zone.

At that moment, approximately 300 users were in the zone, and the podium was still empty. The act of deleting the zone caused every single one of those 300 users to “exit” simultaneously, re-triggering the bug:

  • The first exiting user generated 300 Livekit API requests.
  • The second user generated 299 requests.
  • The total surge amounted to approximately 45,000 requests (300*300/2).
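The arithmetic behind that estimate is the classic triangular-number sum: the exact total is 300 + 299 + … + 1 = 300 × 301 / 2 = 45,150, which the 300*300/2 figure rounds down to 45,000. In code:

```typescript
// Each of the n exiting users triggers one removal call per user still
// considered present, so the total is n + (n-1) + ... + 1 = n*(n+1)/2.
function totalRemovalCalls(n: number): number {
  return (n * (n + 1)) / 2;
}

const surge = totalRemovalCalls(300); // 45150, i.e. roughly 45,000
```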

This extreme load saturated the relevant back server (workadventure-back-5).

Because the requests were being rejected by Livekit, additional error logs were emitted and sent to our log servers, putting even more strain on the servers. We observed the P99 of the Node.js event loop lag (typically a few milliseconds) spike to over 12 seconds.

Root Cause 2: The Race Condition Crash

The server saturation led directly to mass disconnections due to a second, more critical bug related to inter-server communication timeouts.

Our play (front) and back (data) servers communicate via a gRPC stream, with a watchdog mechanism: the play server monitors for a ping from the back every 60 seconds. Since the back-5 server was overloaded, it stopped responding within this limit.

The Crash Path
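The ping watchdog can be sketched roughly as follows (class and method names are illustrative, not the actual WorkAdventure code; timestamps are passed explicitly to keep the logic testable):

```typescript
// Illustrative watchdog: the play server records the last ping time from
// the back server and treats a 60-second silence as a dead link.
class PingWatchdog {
  private lastPing: number;

  constructor(private readonly timeoutMs: number, now: number) {
    this.lastPing = now;
  }

  // Called whenever a ping arrives from the back server.
  ping(now: number): void {
    this.lastPing = now;
  }

  // Checked periodically; true means the back server is considered dead
  // and cleanup of its spaces should begin.
  isTimedOut(now: number): boolean {
    return now - this.lastPing > this.timeoutMs;
  }
}

const watchdog = new PingWatchdog(60_000, 0);
watchdog.ping(30_000);
const dead = watchdog.isTimedOut(95_000); // true: 65s without a ping
```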

When the play server detected the 60-second timeout on back-5, it initiated a cleanup process to destroy all spaces attached to that back server. This cleanup process exposed a race condition in the code (SpaceConnection.ts):

  1. The play component detects a 60-second timeout for the back server.
  2. The cleanup process starts.
  3. The connection is destroyed, and the primary error event handler is unregistered.
  4. Crucially, the code simultaneously attempts to send a final leaveSpace() message to the already-closed connection.
  5. Attempting to write to the terminated stream generates an ERR_STREAM_WRITE_AFTER_END error.
  6. In a normal scenario, the error is logged and nothing wrong happens. But since the error handler had just been unregistered, this error was emitted as an unhandled exception, causing the entire Node.js process to crash and terminate, disconnecting all attached users.
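The crash mechanism in step 6 is standard Node.js behavior: emitting an `"error"` event on an emitter with no `"error"` listener throws synchronously, which (when uncaught) kills the process. A simplified illustration of the sequence above, using a bare `EventEmitter` rather than the real gRPC stream:

```typescript
import { EventEmitter } from "events";

// Simplified stand-in for the gRPC connection. In Node.js, emitting
// "error" with no listener attached throws the error synchronously --
// the same mechanism that turned ERR_STREAM_WRITE_AFTER_END into a
// process-killing exception once the handler had been unregistered.
const conn = new EventEmitter();
const onError = (_err: Error) => {
  /* in the normal scenario, the error is just logged here */
};

conn.on("error", onError);
conn.emit("error", new Error("write after end")); // handled: nothing wrong

conn.removeListener("error", onError); // the cleanup step unregisters it

let crashed = false;
try {
  conn.emit("error", new Error("write after end")); // now throws
} catch {
  crashed = true; // in production this was an uncaught exception
}
```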

This failure propagated, causing the Kubernetes play containers to crash repeatedly, leading to the loop of disconnections and reconnections experienced by users.

Corrective Actions and Future Hardening

We have implemented comprehensive fixes and made architectural changes to ensure this specific incident cannot recur, even under unexpected load patterns.

1. Reproducing the crash

We evolved our load tests to reproduce more faithfully what happens in a real meeting. Test users now randomly enter and leave the auditorium.

2. Eliminating the unnecessary requests to the Livekit API

We immediately deployed a patch that removes the unnecessary participant disconnection messages, ensuring that Livekit API calls only happen when a speaker leaves the podium, not when an audience member moves.
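Conceptually, the patch boils down to a guard on who triggers the API call. A minimal sketch (names are hypothetical, not WorkAdventure's actual code):

```typescript
// Hypothetical guard: only a speaker leaving the podium should reach the
// Livekit API; audience members moving in and out of the zone are handled
// purely on the WorkAdventure side and generate no Livekit requests.
type Role = "speaker" | "audience";

function shouldCallLivekitOnLeave(role: Role): boolean {
  return role === "speaker";
}

const audienceLeaving = shouldCallLivekitOnLeave("audience"); // false: no API call
const speakerLeaving = shouldCallLivekitOnLeave("speaker"); // true: disconnect them
```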

Validation: New performance tests were developed that explicitly reproduce the original bug (users moving in an empty auditorium). The tests showed that with the patch, a simulation of 800 users maintains full podium functionality in all situations.

3. Resolving the race condition

We addressed the cascading crash by implementing a patch that correctly handles the cleanup sequence: a simple “error” handler is temporarily added after the main handler is removed. This ensures that when the ERR_STREAM_WRITE_AFTER_END error is triggered during cleanup, it is caught and handled gracefully instead of causing a process crash.
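The pattern can be sketched like this (a simplified illustration of the idea, not the actual SpaceConnection.ts code):

```typescript
import { EventEmitter } from "events";

// Before tearing down the connection, swap in a temporary no-op "error"
// handler so that a late ERR_STREAM_WRITE_AFTER_END emitted during the
// cleanup window is absorbed instead of escalating into an uncaught
// exception that kills the Node.js process.
function safeTeardown(conn: EventEmitter, mainHandler: (err: Error) => void): void {
  conn.removeListener("error", mainHandler);
  // Temporary handler: swallow any error raised during cleanup.
  conn.once("error", () => {
    /* logged gracefully in the real code */
  });
  // ... destroy the connection and send the final leaveSpace() here ...
}

// Demo: after safeTeardown, a late error no longer crashes the process.
const conn = new EventEmitter();
const mainHandler = (_err: Error) => {};
conn.on("error", mainHandler);
safeTeardown(conn, mainHandler);

let survived = true;
try {
  conn.emit("error", new Error("write after end")); // absorbed by the no-op
} catch {
  survived = false;
}
```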

To validate the fix for the race condition, we implemented a new debug feature to artificially slow down a back server, simulating the exact event loop lag observed during the incident. This allowed us to successfully reproduce the 60-second timeout and subsequent play container crash in a test environment, validating our race condition patch.

4. Sending feedback to the Livekit team

We opened a formal ticket with Livekit to request that they update their platform dashboard and documentation to clearly display the existing Server API rate limit. Livekit has confirmed this will be taken into account by their development team.

A final word

We deeply regret the disruption experienced by all our users. Despite the extra care we took verifying and load-testing the application, we were unable to detect these two bugs before they occurred in production.

We have of course applied immediate and targeted fixes. More importantly, we now have better load tests that can catch this type of issue, so should such bugs occur again, we will see them before the code goes to production. No automated load test will ever reproduce the exact behaviour of thousands of users, but what we have today is much closer to reality.
