Why Uptime Percentage Alone Is a Bad SLA for Real Time Apps Like WorkAdventure
Service Level Agreements are often summarized with a single number: uptime percentage. It is simple, it is easy to compare across vendors, and it has become the default way to communicate reliability.
But if you build or operate a real time product like WorkAdventure, that single number can hide the exact failure mode your users care about most.
The classic SLA metric: uptime percentage
Most SLAs revolve around availability over a given period.
A popular target is 99.99% uptime, often seen as the gold standard for production grade services.
What does 99.99% mean in human terms?
A year has 365 days:
- 365 days × 24 hours = 8760 hours
- 8760 hours × 60 minutes = 525,600 minutes
- 0.01% of 525,600 minutes = 52.56 minutes
So 99.99% uptime allows about 52 minutes of downtime per year.
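The same arithmetic applies to any uptime target. Here is a minimal sketch, assuming a 365-day year; the function name `downtime_budget_minutes` is only illustrative:

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a non-leap year


def downtime_budget_minutes(uptime_percent: float) -> float:
    """Return the yearly downtime (in minutes) allowed by an uptime target."""
    if not 0 <= uptime_percent <= 100:
        raise ValueError("uptime_percent must be between 0 and 100")
    return MINUTES_PER_YEAR * (100 - uptime_percent) / 100


# Compare a few common targets side by side.
for target in (99.0, 99.9, 99.99):
    print(f"{target}% uptime -> {downtime_budget_minutes(target):.2f} minutes/year")
```

Each extra "nine" divides the yearly budget by ten, which is why 99.99% looks so strict on paper.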
So far, so good.
Now comes the part that matters: how that downtime happens.
Same uptime, totally different user experience
Let’s take two systems that both meet 99.99% uptime.
Scenario 1: one big outage
There is a single outage of 52 minutes in the year.
It is painful, but it is a clear event. Teams can communicate, incident response kicks in, users can plan around it, and most importantly it is not constantly breaking trust.
Also, it might happen when many users are asleep, offline, or simply not using the product.
Scenario 2: thousands of tiny outages
Now imagine the same total downtime, but split into 1 second chunks spread across the year.
52 minutes equals 3120 seconds.
That means 3120 separate interruptions in a year.
3120 interruptions ÷ 365 days ≈ 8.5 interruptions per day.
Same uptime percentage. Completely different reality.
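To make the contrast concrete, the sketch below summarizes both scenarios from a plain list of outage durations. The `summarize` helper and its field names are illustrative assumptions, not an existing tool:

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # 31,536,000 seconds


def summarize(outage_durations_s: list[float]) -> dict[str, float]:
    """Summarize one year of outages given each outage's duration in seconds."""
    total_down = sum(outage_durations_s)
    return {
        "uptime_percent": 100 * (1 - total_down / SECONDS_PER_YEAR),
        "interruptions": len(outage_durations_s),
        "interruptions_per_day": len(outage_durations_s) / 365,
    }


one_big_outage = summarize([52 * 60])      # Scenario 1: a single 52-minute outage
many_tiny_outages = summarize([1] * 3120)  # Scenario 2: 3120 one-second outages

for name, stats in (("scenario 1", one_big_outage), ("scenario 2", many_tiny_outages)):
    print(name, stats)
```

Both summaries report the same uptime percentage. Only the interruption count tells the two scenarios apart.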

Why tiny outages barely matter for many web apps
For a typical website or classic CRUD application, scenario 2 might be annoying, but often it is survivable.
Most of the time:
- The browser is not continuously connected to the server
- The user reads, scrolls, thinks, types
- Requests happen in bursts: page load, API call, form submit
If the server disappears for one second while someone is reading an article, nothing happens on screen. No one notices.
And if a request fails, the usual patterns save the day:
- Refresh the page
- Retry the request
- Blame the WiFi, not the backend
Even if the issue is technically “downtime”, users experience it as a minor glitch.
Why tiny outages are a big deal for WorkAdventure
WorkAdventure is not a “request, response, done” kind of application.
It is a real time environment where a lot of value depends on continuous connectivity:
- WebSocket connections to share presence and live state
- Real time chat and events
- Audio and video streams for meetings and talks
- A shared space where interactions happen live, not after a refresh
In this world, a one second outage is not invisible.
It breaks the session.
It drops connections.
It interrupts audio and video.
It triggers user facing “something went wrong” signals.
Now replay scenario 2: around eight or nine interruptions per day.
That is not an SLA users will perceive as “99.99%”. That is a product that feels unstable.
And once users start anticipating interruptions, they stop trusting the platform for important moments: meetings, onboarding sessions, live events, training, customer demos.
Scenario 1 is often forgivable.
Scenario 2 is a slow confidence killer.
This also impacts Ops: host providers optimize for uptime, not interruptions
This isn’t only a product problem, it is also an operations and vendor selection problem.
Most hosting companies and infrastructure providers (whether it is OVH, Hetzner, AWS, or others) structure their own contracts, dashboards, and incident communications around uptime. That means short interruptions are often treated as “acceptable noise” as long as the overall uptime budget stays healthy.
A concrete example: a provider may restart a load balancer and count it as a one second downtime. From a contract perspective, that is negligible when the yearly budget is around 52 minutes at 99.99%.
In practice, this can happen regularly. For instance, we have seen brief service interruptions that look like a load balancer restart about once a week on our OVH Kubernetes cluster. From the provider standpoint, everything is normal: the uptime target is still met.
For WorkAdventure, however, these events are user visible. The hard part is that hosts rarely document an “interruption frequency budget”, so when you pick a host, it is difficult to make an informed comparison on the metric that actually matters for real time experiences.
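When a provider does not publish this number, one option is to measure it yourself. The sketch below assumes a hypothetical external probe that records one `(timestamp_seconds, ok)` sample at a regular interval, and groups consecutive failures into interruption events. The function name and sample format are illustrative and not part of any provider's tooling:

```python
def find_interruptions(samples: list[tuple[float, bool]]) -> list[tuple[float, float]]:
    """Group consecutive failed probes into (start, end) interruption windows.

    `samples` must be sorted by timestamp. `end` is the timestamp of the first
    successful probe after the failures, or the last sample's timestamp if the
    log ends while the service is still down.
    """
    interruptions = []
    start = None
    last_ts = None
    for ts, ok in samples:
        if not ok and start is None:
            start = ts  # first failed probe of a new interruption
        elif ok and start is not None:
            interruptions.append((start, ts))  # service recovered
            start = None
        last_ts = ts
    if start is not None:
        interruptions.append((start, last_ts))  # still down at end of log
    return interruptions


probe_log = [(0, True), (1, False), (2, True), (3, True), (4, False), (5, False), (6, True)]
events = find_interruptions(probe_log)
print(len(events), "interruptions:", events)
```

Counting these events per week on each candidate host gives a rough interruption frequency to compare. The probe interval limits the shortest outage that can be detected.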
What WorkAdventure users actually care about
For real time products, reliability is not only about total downtime.
It is also about session continuity.
Two questions matter more than “how many minutes were we down this year”:
- How often does my session get interrupted?
- When it happens, how disruptive is it?
This is exactly why uptime percentage alone is a weak SLA metric for real time collaboration and game like systems.
When one second interruptions are inevitable, software can reduce the pain
In the real world, one second interruptions are sometimes unavoidable (network jitter, load balancer restarts, brief routing issues, node maintenance, and so on).
When we cannot fully prevent them at the infrastructure level, we can work around them in the product.
One important strategy is to reconnect silently first.
Instead of immediately showing a loud error to the user, the client can attempt a quick retry loop for a few seconds, aiming to restore the connection before the interruption becomes noticeable.
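A quiet retry loop of this kind could look roughly like the sketch below. It is a generic illustration, not WorkAdventure's client code: the `connect` callable, the 5 second silent window, and the retry delay are all assumptions chosen for the example.

```python
import asyncio
import time
from typing import Awaitable, Callable, Optional


async def reconnect_silently(
    connect: Callable[[], Awaitable[object]],
    silent_window_s: float = 5.0,
    retry_delay_s: float = 0.25,
) -> Optional[object]:
    """Retry `connect` quietly until it succeeds or the silent window expires.

    Returns the new connection, or None when the caller should fall back to
    showing a visible error to the user.
    """
    deadline = time.monotonic() + silent_window_s
    while time.monotonic() < deadline:
        try:
            return await connect()
        except OSError:  # includes ConnectionError and its subclasses
            await asyncio.sleep(retry_delay_s)
    return None


async def demo() -> None:
    attempts = 0

    async def flaky_connect() -> str:
        # Hypothetical connection function that fails twice, then succeeds.
        nonlocal attempts
        attempts += 1
        if attempts < 3:
            raise ConnectionError("server unreachable")
        return "connected"

    result = await reconnect_silently(flaky_connect, silent_window_s=2.0, retry_delay_s=0.01)
    print(result, "after", attempts, "attempts")


if __name__ == "__main__":
    asyncio.run(demo())
```

Returning `None` instead of raising keeps the decision to show an error with the caller, which can choose when the interruption has lasted long enough to tell the user.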
That sounds simple, but it comes with its own set of challenges. Once connectivity is back, the client must rebuild state safely:
- Resynchronize the world state
- Reposition every user on the map
- Handle users who moved while you were disconnected
- Handle users who connected or disconnected during the gap
- Recover chat and real time signals without duplications or missing events
Done well, this can turn a one second outage into something users never notice. Done poorly, it can create confusing “teleportation”, desync, or duplicate events.
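To illustrate the reconciliation steps listed above, here is a minimal sketch under strong assumptions: the server can send a full snapshot of user positions after a reconnect, and every replayed chat event carries a unique id. All names (`ClientState`, `resync`) are hypothetical and do not describe an existing WorkAdventure API.

```python
from dataclasses import dataclass, field


@dataclass
class ClientState:
    positions: dict[str, tuple[int, int]] = field(default_factory=dict)  # user id -> (x, y)
    seen_event_ids: set[str] = field(default_factory=set)
    chat_log: list[str] = field(default_factory=list)


def resync(
    state: ClientState,
    snapshot: dict[str, tuple[int, int]],
    replayed_events: list[tuple[str, str]],
) -> dict[str, list[str]]:
    """Apply a fresh server snapshot and replayed events after a reconnect.

    Returns which users joined, left, or moved during the gap, so the UI can
    decide how to present each change.
    """
    old = state.positions
    changes = {
        "joined": sorted(snapshot.keys() - old.keys()),
        "left": sorted(old.keys() - snapshot.keys()),
        "moved": sorted(u for u in snapshot.keys() & old.keys() if snapshot[u] != old[u]),
    }
    state.positions = dict(snapshot)
    for event_id, message in replayed_events:
        if event_id in state.seen_event_ids:
            continue  # already delivered before the disconnect
        state.seen_event_ids.add(event_id)
        state.chat_log.append(message)
    return changes


state = ClientState(
    positions={"alice": (1, 1), "bob": (5, 5)},
    seen_event_ids={"e1"},
    chat_log=["hello"],
)
changes = resync(
    state,
    snapshot={"alice": (2, 1), "carol": (0, 0)},
    replayed_events=[("e1", "hello"), ("e2", "are you still there?")],
)
print(changes, state.chat_log)
```

Computing an explicit diff rather than blindly replacing the state gives the client a chance to animate moves and departures. Tracking event ids covers both failure modes at once: replay fills in missed events, and the id check drops duplicates.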
So far, WorkAdventure does not have this kind of world resynchronization. We know we need to tackle it in the coming months.
Version upgrades also cause downtime
This also changes how you think about maintenance and upgrades.
We used to perform upgrades at night, when almost no one was connected.
Today, WorkAdventure is used all over the world. There is no real quiet moment anymore. When European users are asleep, users in the Americas are awake. And when the Americas finally go offline, users in Asia are starting their workday.
When we performed upgrades in the past, we focused on reducing total downtime as much as possible.
We managed to reduce upgrade related downtime from 5 minutes to 20 seconds.
That is a great operational improvement.
But from a user’s perspective, both outcomes can still feel the same: the session is interrupted.
Whether it is 5 minutes or 20 seconds, the experience is “my meeting stopped”.
For a real time product, the real win is not “shorter interruption”.
The real win is no interruption at all.
And that is a much harder challenge when many users are connected simultaneously and expect a continuous real time experience.
What a better SLA could look like for real time apps
We believe SLAs for products like WorkAdventure should include availability, but also reliability signals tied to continuity.
Here are examples of the kinds of commitments that match what users actually feel:
- Interruption frequency
  Example: maximum number of disconnect events over a period
- Interruption duration
  Example: maximum length of a service interruption before recovery
- Session impact
  Example: percentage of sessions that complete without a disconnect
- Recovery behavior
  Example: reconnect time targets and how quickly real time features resume
You can still keep uptime percentage.
But you should pair it with an “interruption budget”, because that is what separates a product that is technically up from a product that feels dependable.
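As a rough illustration, several of these signals can be computed from per-session records. The `Session` structure and the metric names below are assumptions made for the example, not an existing reporting format:

```python
from dataclasses import dataclass


@dataclass
class Session:
    duration_s: float
    disconnect_durations_s: list[float]  # one entry per disconnect in this session


def continuity_report(sessions: list[Session]) -> dict[str, float]:
    """Compute continuity-oriented reliability signals from session records."""
    all_disconnects = [d for s in sessions for d in s.disconnect_durations_s]
    clean = sum(1 for s in sessions if not s.disconnect_durations_s)
    total_hours = sum(s.duration_s for s in sessions) / 3600
    return {
        # Interruption frequency, normalized by how long users were connected
        "disconnects_per_session_hour": len(all_disconnects) / total_hours if total_hours else 0.0,
        # Interruption duration: the worst single interruption observed
        "longest_interruption_s": max(all_disconnects, default=0.0),
        # Session impact: share of sessions that never dropped
        "clean_session_percent": 100 * clean / len(sessions) if sessions else 100.0,
    }


report = continuity_report([
    Session(duration_s=3600, disconnect_durations_s=[]),
    Session(duration_s=1800, disconnect_durations_s=[1.0, 2.5]),
    Session(duration_s=1800, disconnect_durations_s=[]),
    Session(duration_s=7200, disconnect_durations_s=[1.0]),
])
print(report)
```

In this sample, total downtime is only 4.5 seconds over four hours of sessions, yet half of the sessions were interrupted. That gap between the two numbers is the one an interruption budget is meant to expose.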
What comes next
In the next articles, we will explore how we plan to push toward upgrades and operations that do not interrupt ongoing sessions, even at scale, even with real time constraints.
Because for WorkAdventure, reliability is not only about being online.
It is about staying with you, continuously, when it matters.