David Négrier
CTO & Founder

Why Uptime Percentage Alone Is a Bad SLA for Real Time Apps Like WorkAdventure

Service Level Agreements are often summarized with a single number: uptime percentage. It is simple, it is easy to compare across vendors, and it has become the default way to communicate reliability.

But if you build or operate a real time product like WorkAdventure, that single number can hide the exact failure mode your users care about most.

The classic SLA metric: uptime percentage

Most SLAs revolve around availability over a given period.

A popular target is 99.99% uptime, often seen as the gold standard for production grade services.

What does 99.99% mean in human terms?

A year has 365 days:

  1. 365 days × 24 hours = 8760 hours
  2. 8760 hours × 60 minutes = 525,600 minutes
  3. 0.01% of 525,600 minutes = 52.56 minutes

So 99.99% uptime allows about 52 minutes of downtime per year.
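
The same arithmetic applies to any uptime target. Here is a minimal sketch in Python, assuming a 365 day year as above; the function name `downtime_budget_minutes` is ours, chosen for illustration.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a 365 day year


def downtime_budget_minutes(uptime_percent: float) -> float:
    """Return the downtime allowed per year, in minutes, for an uptime target."""
    return MINUTES_PER_YEAR * (100.0 - uptime_percent) / 100.0


# Compare a few common targets side by side.
for target in (99.0, 99.9, 99.99):
    budget = downtime_budget_minutes(target)
    print(f"{target}% uptime -> {budget:.2f} minutes of downtime per year")
```

Each extra "nine" divides the yearly budget by ten, which is why 99.99% sounds so strict on paper.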

So far, so good.

Now comes the part that matters: how that downtime happens.

Same uptime, totally different user experience

Let’s take two systems that both meet 99.99% uptime.

Scenario 1: one big outage

There is a single outage of 52 minutes in the year.

It is painful, but it is a clear event. Teams can communicate, incident response kicks in, users can plan around it, and most importantly it is not constantly breaking trust.

Also, it might happen when many users are asleep, offline, or simply not using the product.

Scenario 2: thousands of tiny outages

Now imagine the same total downtime, but split into 1 second chunks spread across the year.

52 minutes equals 3120 seconds.

That means 3120 separate interruptions in a year.

3120 interruptions ÷ 365 days ≈ 8.5 interruptions per day.

Same uptime percentage. Completely different reality.
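
To make the comparison concrete, the two scenarios can be described as lists of outage durations and measured both ways. This is only an illustrative sketch: the `OutageProfile` class and its field names are hypothetical, not something taken from a monitoring tool.

```python
from dataclasses import dataclass

SECONDS_PER_YEAR = 365 * 24 * 60 * 60


@dataclass
class OutageProfile:
    name: str
    outage_durations_s: list[float]  # one entry per outage, in seconds

    @property
    def uptime_percent(self) -> float:
        downtime_s = sum(self.outage_durations_s)
        return 100.0 * (1 - downtime_s / SECONDS_PER_YEAR)

    @property
    def interruptions_per_day(self) -> float:
        return len(self.outage_durations_s) / 365


# Scenario 1: a single 52 minute outage. Scenario 2: 3120 outages of 1 second.
one_big = OutageProfile("one big outage", [52 * 60])
many_tiny = OutageProfile("thousands of tiny outages", [1.0] * 3120)

for profile in (one_big, many_tiny):
    print(
        f"{profile.name}: {profile.uptime_percent:.4f}% uptime, "
        f"{len(profile.outage_durations_s)} interruptions per year "
        f"({profile.interruptions_per_day:.1f} per day)"
    )
```

Both profiles report the same uptime percentage. Only the interruption count tells them apart.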

Why tiny outages barely matter for many web apps

For a typical website or classic CRUD application, scenario 2 might be annoying, but often it is survivable.

Most of the time:

  1. The browser is not continuously connected to the server
  2. The user reads, scrolls, thinks, types
  3. Requests happen in bursts: page load, API call, form submit

If the server disappears for one second while someone is reading an article, nothing happens on screen. No one notices.

And if a request fails, the usual patterns save the day:

  1. Refresh the page
  2. Retry the request
  3. Blame the WiFi, not the backend

Even if the issue is technically “downtime”, users experience it as a minor glitch.

Why tiny outages are a big deal for WorkAdventure

WorkAdventure is not a “request, response, done” kind of application.

It is a real time environment where a lot of value depends on continuous connectivity:

  1. WebSocket connections to share presence and live state
  2. Real time chat and events
  3. Audio and video streams for meetings and talks
  4. A shared space where interactions happen live, not after a refresh

In this world, a one second outage is not invisible.

It breaks the session.

It drops connections.

It interrupts audio and video.

It triggers user facing “something went wrong” signals.

Now replay scenario 2: around eight or nine interruptions per day.

That is not an SLA users will perceive as “99.99%”. That is a product that feels unstable.
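
A rough way to quantify this is to ask how likely a given meeting is to be hit. The sketch below assumes, purely for illustration, that the 3120 interruptions of scenario 2 occur independently and uniformly over the year (a Poisson process). Real interruptions may cluster, so treat the result as an order of magnitude, not a measurement.

```python
import math

INTERRUPTIONS_PER_YEAR = 3120  # scenario 2
HOURS_PER_YEAR = 365 * 24


def probability_of_interruption(meeting_hours: float) -> float:
    """Chance that a session of this length sees at least one interruption.

    Assumption (ours): interruptions follow a Poisson process with a constant rate.
    """
    rate_per_hour = INTERRUPTIONS_PER_YEAR / HOURS_PER_YEAR
    return 1 - math.exp(-rate_per_hour * meeting_hours)


for hours in (0.5, 1, 2):
    chance = probability_of_interruption(hours)
    print(f"{hours} h meeting: {chance:.0%} chance of at least one interruption")
```

Under that assumption, a one hour meeting has roughly a 30% chance of being interrupted at least once, even though the service is "up" 99.99% of the time.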

And once users start anticipating interruptions, they stop trusting the platform for important moments: meetings, onboarding sessions, live events, training, customer demos.

Scenario 1 is often forgivable.

Scenario 2 is a slow confidence killer.

This also impacts Ops: host providers optimize for uptime, not interruptions

This isn’t only a product problem. It is also an operations and vendor selection problem.

Most hosting companies and infrastructure providers (whether it is OVH, Hetzner, AWS, or others) structure their own contracts, dashboards, and incident communications around uptime. That means short interruptions are often treated as “acceptable noise” as long as the overall uptime budget stays healthy.

A concrete example: a provider may restart a load balancer and count it as one second of downtime. From a contract perspective, that is negligible when the yearly budget is around 52 minutes at 99.99%.

In practice, this can happen regularly. For instance, we have seen brief service interruptions that look like a load balancer restart about once a week on our OVH Kubernetes cluster. From the provider standpoint, everything is normal: the uptime target is still met.
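
The gap between the two viewpoints is easy to put in numbers. The sketch below assumes each weekly restart costs about one second, as in the example above. We have not measured the actual duration precisely, so that value is an assumption.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60
budget_s = SECONDS_PER_YEAR * (100 - 99.99) / 100  # about 3,154 seconds per year

restarts_per_year = 52     # roughly one per week
seconds_per_restart = 1.0  # assumed duration of each interruption

used_s = restarts_per_year * seconds_per_restart
print(f"Provider view: {used_s:.0f} s used out of {budget_s:.0f} s "
      f"({used_s / budget_s:.1%} of the downtime budget)")
print(f"User view: {restarts_per_year} visible interruptions per year")
```

Less than 2% of the downtime budget is consumed, yet connected users are interrupted every week.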

For WorkAdventure, however, these events are user visible. The hard part is that hosts rarely document an “interruption frequency budget”, so when you pick a host, it is difficult to make an informed comparison on the metric that actually matters for real time experiences.

What WorkAdventure users actually care about

For real time products, reliability is not only about total downtime.

It is also about session continuity.

Two questions matter more than “how many minutes were we down this year”:

  1. How often does my session get interrupted?
  2. When it happens, how disruptive is it?

This is exactly why uptime percentage alone is a weak SLA metric for real time collaboration and game like systems.

When one second interruptions are inevitable, software can reduce the pain

In the real world, one second interruptions are sometimes unavoidable (network jitter, load balancer restarts, brief routing issues, node maintenance, and so on).

When we cannot fully prevent them at the infrastructure level, we can work around them in the product.

One important strategy is to reconnect silently first.

Instead of immediately showing a loud error to the user, the client can attempt a quick retry loop for a few seconds, aiming to restore the connection before the interruption becomes noticeable.
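
As an illustration of the idea, here is a minimal sketch of such a retry loop using Python's `asyncio`. It is not WorkAdventure's client code: `reconnect_silently`, its parameters, and the `flaky_connect` stand-in are hypothetical, and a real client would pass its own connection function.

```python
import asyncio
import time
from typing import Awaitable, Callable, Optional, TypeVar

T = TypeVar("T")


async def reconnect_silently(
    connect: Callable[[], Awaitable[T]],
    grace_period_s: float = 5.0,
    initial_delay_s: float = 0.2,
    max_delay_s: float = 1.0,
) -> Optional[T]:
    """Retry `connect` until it succeeds or the grace period runs out.

    Returns the new connection, or None when it is time to show an error.
    """
    deadline = time.monotonic() + grace_period_s
    delay = initial_delay_s
    while True:
        try:
            return await connect()
        except (ConnectionError, asyncio.TimeoutError):
            pass  # stay quiet: the user has not been told anything yet
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            return None
        await asyncio.sleep(min(delay, remaining))
        delay = min(delay * 2, max_delay_s)  # back off, but stay responsive


async def demo() -> None:
    attempts = 0

    async def flaky_connect() -> str:
        # Stand-in for a real connection attempt: fails twice, then succeeds.
        nonlocal attempts
        attempts += 1
        if attempts < 3:
            raise ConnectionError("server unreachable")
        return "connected"

    result = await reconnect_silently(flaky_connect, initial_delay_s=0.05)
    print(result, "after", attempts, "attempts")


asyncio.run(demo())
```

The grace period is a trade-off: if it is too short, the user sees errors for glitches that would have healed on their own; if it is too long, the user stares at a frozen screen without explanation.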

That sounds simple, but it comes with its own set of challenges. Once connectivity is back, the client must rebuild state safely:

  1. Resynchronize the world state
  2. Reposition every user on the map
  3. Handle users who moved while you were disconnected
  4. Handle users who connected or disconnected during the gap
  5. Recover chat and real time signals without duplications or missing events

Done well, this can turn a one second outage into something users never notice. Done poorly, it can create confusing “teleportation”, desync, or duplicate events.
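
To give a feel for the bookkeeping involved, here is a deliberately simplified sketch. It assumes the server can send a full snapshot of user positions after a reconnect and that events carry increasing numeric ids. Both are assumptions made for the example, and `diff_world`, `new_events`, and `WorldDiff` are hypothetical names, not part of WorkAdventure.

```python
from dataclasses import dataclass, field

Position = tuple[int, int]


@dataclass
class WorldDiff:
    joined: dict[str, Position] = field(default_factory=dict)
    left: set[str] = field(default_factory=set)
    moved: dict[str, Position] = field(default_factory=dict)


def diff_world(before: dict[str, Position], after: dict[str, Position]) -> WorldDiff:
    """Compare the state seen before the gap with a fresh server snapshot."""
    diff = WorldDiff()
    for user, position in after.items():
        if user not in before:
            diff.joined[user] = position  # connected during the gap
        elif before[user] != position:
            diff.moved[user] = position   # moved during the gap
    diff.left = set(before) - set(after)  # disconnected during the gap
    return diff


def new_events(replayed: list[tuple[int, str]], last_seen_id: int) -> list[tuple[int, str]]:
    """Keep only events newer than the last one processed, to avoid duplicates."""
    return [event for event in replayed if event[0] > last_seen_id]


before = {"alice": (1, 1), "bob": (4, 2), "carol": (7, 7)}
after = {"alice": (1, 1), "bob": (5, 2), "dave": (0, 3)}
diff = diff_world(before, after)
print(diff)
print(new_events([(41, "hi"), (42, "hello"), (43, "anyone there?")], last_seen_id=42))
```

Applying a diff like this, instead of redrawing everything, is one way to avoid visible "teleportation". A production version would also have to handle movement that is still in progress and events that arrive while the snapshot is being applied.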

WorkAdventure does not yet implement this kind of world resynchronization, but we know we need to tackle it in the coming months.

Version upgrades also cause downtime

This also changes how you think about maintenance and upgrades.

In the past, we used to perform upgrades at night, when almost no one was connected.

Today, WorkAdventure is used all over the world. There is no real quiet moment anymore. When European users are asleep, users in the Americas are awake. And when the Americas finally go offline, users in Asia are starting their workday.

When we performed upgrades in the past, we focused on reducing total downtime as much as possible.

We managed to reduce upgrade related downtime from 5 minutes to 20 seconds.

That is a great operational improvement.

But from a user’s perspective, both outcomes can still feel the same: the session is interrupted.

Whether it is 5 minutes or 20 seconds, the experience is “my meeting stopped”.

For a real time product, the real win is not “shorter interruption”.

The real win is no interruption at all.

And that is a much harder challenge when many users are connected simultaneously and expect a continuous real time experience.

What a better SLA could look like for real time apps

We believe SLAs for products like WorkAdventure should include availability, but also reliability signals tied to continuity.

Here are examples of the kinds of commitments that match what users actually feel:

  1. Interruption frequency
    Example: maximum number of disconnect events over a period
  2. Interruption duration
    Example: maximum length of a service interruption before recovery
  3. Session impact
    Example: percentage of sessions that complete without a disconnect
  4. Recovery behavior
    Example: reconnect time targets and how quickly real time features resume

You can still keep uptime percentage.

But you should pair it with an “interruption budget”, because that is what separates a product that is technically up from a product that feels dependable.
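
As a sketch of how the four signals above could be computed from session logs, consider the following. The `Session` record, the report keys, and the budget thresholds are all hypothetical. They illustrate the shape of an "interruption budget", not a proposed standard.

```python
from dataclasses import dataclass


@dataclass
class Session:
    duration_s: float
    disconnect_durations_s: list[float]  # per disconnect: seconds until recovery


def continuity_report(sessions: list[Session]) -> dict[str, float]:
    """Summarize interruption frequency, duration, session impact, and recovery."""
    if not sessions:
        raise ValueError("at least one session is required")
    disconnects = [d for s in sessions for d in s.disconnect_durations_s]
    clean = sum(1 for s in sessions if not s.disconnect_durations_s)
    return {
        "disconnect_events": len(disconnects),
        "longest_interruption_s": max(disconnects, default=0.0),
        "clean_session_percent": 100.0 * clean / len(sessions),
        "mean_recovery_s": sum(disconnects) / len(disconnects) if disconnects else 0.0,
    }


def within_budget(report: dict[str, float], max_events: int,
                  max_interruption_s: float, min_clean_percent: float) -> bool:
    """Check a report against an example interruption budget."""
    return (
        report["disconnect_events"] <= max_events
        and report["longest_interruption_s"] <= max_interruption_s
        and report["clean_session_percent"] >= min_clean_percent
    )


sessions = [
    Session(3600, []),
    Session(1800, [1.0]),
    Session(5400, [1.0, 3.0]),
    Session(600, []),
]
report = continuity_report(sessions)
print(report)
print(within_budget(report, max_events=2, max_interruption_s=5.0, min_clean_percent=90.0))
```

In this made-up sample, total downtime is only five seconds, yet half of the sessions were interrupted, so the example budget is not met.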

What comes next

In upcoming articles, we will explore how we plan to push toward upgrades and operations that do not interrupt ongoing sessions, even at scale, even with real time constraints.

Because for WorkAdventure, reliability is not only about being online.

It is about staying with you, continuously, when it matters.
