Building an easy-to-use browser background noise detector








Over the past few months, we spent a significant amount of time setting up noise suppression for WorkAdventure. When the feature was released, we realized that we needed to do more and answer one important question: how does a user know when they need to enable noise suppression?
The people on the other side of a call hear your microphone. You generally do not. A fan, traffic through an open window, or the activity of a busy cafe can therefore disturb a meeting without the person producing the noise realizing it.
This often leads to the same conversation at the beginning of a call:
- “I hear some background noise. Where is it coming from?”
- “Could everyone mute for a moment?”
- “Can you enable noise suppression?”
We wanted the application to identify the likely source first. If a microphone is carrying loud background noise while its owner is not speaking, the user can be notified privately and offered a useful action: mute the microphone or enable noise suppression.
That sounds like a simple audio-level check. It was not. This article explains the approaches we tried, why a lightweight WebRTC voice activity detector failed on the signal we cared about, and why the final implementation combines Silero VAD with audio energy and time-window rules behind a small MediaStream API.
You can try the background noise detector in your browser and explore the complete workadventure/noise-suppression source code on GitHub.
Noise detection and noise suppression are different jobs
Our existing @workadventure/noise-suppression package uses DTLN to remove noise from a live audio stream. The DTLN models and their LiteRT runtime run in an AudioWorklet, where they continuously transform the audio before it is sent to other participants.
Background noise detection has a different purpose. It does not need to produce clean audio. It only needs enough information to say: “this microphone has been loud for a while, but the signal is unlikely to be speech.”
That difference matters because detection should be available before noise suppression is enabled. Loading our complete DTLN path for that decision would mean downloading about 17 MB of model and runtime assets, consuming additional memory, and continuously spending CPU time on denoising. It would effectively enable most of noise suppression just to decide whether to recommend noise suppression.
Enabling the denoiser by default was not the right answer either. Any speech enhancement model can alter the signal it processes. This is especially noticeable for quiet voices or sounds that differ from the speech on which the model was trained. Noise suppression also has a real CPU cost on low-end machines.
We therefore set three constraints for the detector:
- it should remain independent from the DTLN models and runtime;
- it should be cheap enough to run whenever the microphone is active;
- it should avoid warning users while they are simply talking.
The last constraint is the hard one. High audio energy is easy to measure. Deciding whether that energy belongs to speech is not.
First attempt: look for a microphone that never becomes quiet
Our first idea did not require a machine learning model at all.
For each short frame of microphone audio, we can calculate its root mean square (RMS). RMS is a useful measure of signal energy: silence is close to zero, while louder audio produces a higher value. If the RMS never drops below a threshold for several seconds, the microphone probably contains persistent background noise.
This works for an idealized source such as a steady fan or electrical hum. It also appears to provide a way to distinguish noise from speech. Ordinary speech contains short pauses, so its level repeatedly drops between words and sentences. A constant noise floor does not.
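As a minimal sketch of this first heuristic (not the project's actual code), the RMS of a frame and the "never quiet" check could look like the following. The function names and the threshold parameter are hypothetical, chosen only for illustration.

```typescript
// Root mean square of one frame of samples in the [-1, 1] range.
// Returns 0 for an empty frame so callers never divide by zero.
function frameRms(frame: Float32Array): number {
  if (frame.length === 0) {
    return 0;
  }
  const sumOfSquares = frame.reduce(
    (sum, sample) => sum + sample * sample,
    0
  );
  return Math.sqrt(sumOfSquares / frame.length);
}

// "Never quiet" rule (hypothetical helper): true when every frame RMS in the
// observed window stays at or above the quiet threshold.
function neverQuiet(rmsValues: number[], quietThreshold: number): boolean {
  return (
    rmsValues.length > 0 &&
    rmsValues.every((rms) => rms >= quietThreshold)
  );
}
```

A single frame that dips below the threshold is enough to reset this rule, which is exactly why it struggles with the irregular noise described next.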
The problem is that real noisy environments are rarely constant.
In a cafe, cups hit tables, chairs move, and several conversations overlap. Near a road, cars arrive and leave. Keyboard sounds form a series of short impulses. Even a fan changes as the microphone’s automatic gain control reacts to speech and silence.
A strict “never quiet” rule misses much of this noise. Relaxing the rule enough to catch intermittent sounds makes it much easier to flag normal speech. We could keep adding thresholds and timing heuristics, but the detector would become highly dependent on microphone gain, browser processing, and the user’s speaking style.
Audio level could tell us that something was happening. It could not reliably tell us what it was.
The Jitsi idea: high energy without speech
The more robust direction came from Jitsi’s noisy-microphone detection work. Jitsi has used RNNoise in WebAssembly for noise detection as well as noise suppression. The general idea is more useful than any particular implementation:
- measure whether the input is loud enough to matter;
- use voice activity detection (VAD) to estimate whether someone is speaking;
- warn only when the input stays loud without enough evidence of speech.
This changes the problem from “recognize every kind of noise” to “find sustained audio that is probably not voice.” We do not need to classify traffic, fans, music, or keyboard sounds individually. We need a good speech discriminator and a conservative aggregation rule.
We considered reusing RNNoise directly. It is a credible option and already proven in Jitsi, but it would add another complete denoising stack to a package that already contains DTLN. Our detector only needed voice activity plus raw audio level, so we first looked for a smaller VAD.
Choosing the lightweight option: libfvad
libfvad packages the voice activity detector from WebRTC as a standalone C library. It consumes short PCM frames and returns a binary result: speech or no speech. It is small, fast, runs entirely on the client, and does not require a neural-network runtime.
Those properties matched our operational constraints very well.
The available JavaScript wrappers were less convincing. Some were archived; others had little adoption and appeared unmaintained. Rather than expose an abandoned dependency as part of our public API, we decided to vendor libfvad, compile the small Wasm artifact ourselves, and keep it behind a project-owned TypeScript interface. We recorded that choice, including the alternatives and expected tradeoffs, in ADR 0008.
We then built the complete detector around it:
- a reusable libfvad Wasm wrapper;
- a detector core combining binary VAD decisions with RMS windows;
- a dedicated background-noise AudioWorkletProcessor;
- a public worklet entrypoint and packaged assets;
- browser tests and a demo for microphone and audio-clip inputs.
Keeping this detector in a separate worklet from DTLN ensured that enabling it did not initialize the noise-suppression models. Reusing buffers and skipping per-frame option validation kept unnecessary work out of the audio hot path.
Architecturally, it was a clean implementation. Then we tested its actual classification behavior.
Why libfvad failed for this use case
Our noise-only fixtures exposed the problem immediately. Both pure-noise.wav and a shortened white-noise sample were classified as speech for nearly every frame, in every libfvad aggressiveness mode.
At first, a Wasm integration error seemed plausible. Audio code crosses several boundaries: floating-point Web Audio samples must become 16-bit PCM, frames must have an accepted sample rate and duration, and the correct memory region must be passed to the C function. A mistake in any of those steps could produce meaningless VAD decisions.
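To make those boundaries concrete, here is a sketch of two of the checks involved. It is an illustration, not the code of the wrapper we removed, and the helper names are hypothetical. It assumes the constraints of the WebRTC VAD: 16-bit PCM input, a sample rate of 8, 16, 32, or 48 kHz, and frames of 10, 20, or 30 ms.

```typescript
// Convert Web Audio float samples ([-1, 1]) to signed 16-bit PCM.
// Out-of-range input is clamped instead of being allowed to wrap around.
function floatToInt16Pcm(input: Float32Array): Int16Array {
  const output = new Int16Array(input.length);
  input.forEach((sample, index) => {
    const clamped = Math.max(-1, Math.min(1, sample));
    // int16 covers -32768..32767, so the negative side scales by one more step.
    output[index] =
      clamped < 0
        ? Math.round(clamped * 0x8000)
        : Math.round(clamped * 0x7fff);
  });
  return output;
}

const ACCEPTED_SAMPLE_RATES = [8000, 16000, 32000, 48000];
const ACCEPTED_FRAME_DURATIONS_MS = [10, 20, 30];

// True when a frame of `sampleCount` samples at `sampleRate` Hz matches one
// of the accepted sample rate and duration combinations.
function isAcceptedFrame(sampleRate: number, sampleCount: number): boolean {
  return (
    ACCEPTED_SAMPLE_RATES.indexOf(sampleRate) !== -1 &&
    ACCEPTED_FRAME_DURATIONS_MS.some(
      (durationMs) => (sampleRate * durationMs) / 1000 === sampleCount
    )
  );
}
```

Note that the 128-sample render quantum used by Web Audio does not match any accepted frame size, so a processor has to buffer samples before handing a frame to the VAD.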
We therefore bypassed Wasm and ran the same fixtures through native libfvad. The result was the same. The wrapper was not the problem.
The result makes sense once we consider what WebRTC VAD is designed to do. It is an online, energy-oriented detector with adaptive estimates of foreground and background audio. It analyzes several frequency sub-bands using a Gaussian mixture model whose parameters favor speech-like bands. That makes it useful for many real-time communication tasks, but loud broadband or speech-band noise can look like foreground voice.
Our feature depends on the opposite judgment. We specifically need a VAD that can say “this loud signal is not speech.” A binary detector that labels our representative noise fixtures as speech gives the aggregation layer no useful signal. No amount of careful RMS windowing can recover information that the VAD does not provide.
This was an important engineering result. libfvad was the better architecture on paper: compact, fast, self-contained, and easy to run in an AudioWorklet. But model behavior mattered more than integration elegance. We removed the wrapper, the Wasm artifact, and the vendored source instead of retaining an unused fallback.
Moving to Silero VAD
We replaced libfvad with Silero VAD, integrated through @ricky0123/vad-web. Silero is a neural voice activity detector designed to work across varied languages, recording conditions, and background noise. More importantly for our detector, it exposes a speech probability for each frame instead of only a binary decision.
ADR 0009 documents why this replaced the libfvad decision and what the larger runtime means for the package.
That probability gives the aggregation layer more information. A single frame does not force an immediate yes-or-no conclusion. Over an analysis window, we can consider:
- the average RMS;
- the ratio of frames above the speech threshold;
- the average speech probability;
- the highest observed speech probability;
- how much of the window contains active audio.
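These window statistics can be sketched as follows. This is an illustrative reduction over per-frame observations, not the package's internal code. The type names, field names, and the two threshold parameters are assumptions made for the example.

```typescript
// One analyzed frame (hypothetical shape): its energy and the VAD output.
interface FrameObservation {
  rms: number;
  speechProbability: number; // in [0, 1]
}

interface WindowStats {
  averageRms: number;
  speechFrameRatio: number;
  averageSpeechProbability: number;
  maxSpeechProbability: number;
  activeFrameRatio: number;
}

// Summarize a window of frames. `speechThreshold` and `activeRms` are
// hypothetical tuning parameters, not values taken from the package.
function summarizeWindow(
  frames: FrameObservation[],
  speechThreshold: number,
  activeRms: number
): WindowStats {
  const count = frames.length;
  let rmsSum = 0;
  let probabilitySum = 0;
  let maxProbability = 0;
  let speechFrames = 0;
  let activeFrames = 0;
  for (const frame of frames) {
    rmsSum += frame.rms;
    probabilitySum += frame.speechProbability;
    maxProbability = Math.max(maxProbability, frame.speechProbability);
    if (frame.speechProbability >= speechThreshold) speechFrames += 1;
    if (frame.rms >= activeRms) activeFrames += 1;
  }
  // An empty window yields all zeros rather than NaN.
  const divisor = Math.max(count, 1);
  return {
    averageRms: rmsSum / divisor,
    speechFrameRatio: speechFrames / divisor,
    averageSpeechProbability: probabilitySum / divisor,
    maxSpeechProbability: maxProbability,
    activeFrameRatio: activeFrames / divisor,
  };
}
```

With a binary VAD, only the speech-frame ratio would be available. The average and maximum probabilities are what the per-frame probability adds.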
The current detector starts a candidate window when a frame crosses the trigger RMS and is not classified as speech. After 1.5 s, it emits a background-noise-detected event only if the complete window remains loud enough and stays below the configured speech limits. A 15 s cooldown prevents the application from repeatedly notifying the user about the same environment.
These are deliberately windowed, conservative rules. A noisy-microphone warning does not need sample-accurate timing. It needs enough evidence to make a useful suggestion without interrupting ordinary conversation.
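The rule described above can be sketched as a small state machine. This is a simplified illustration under stated assumptions, not the detector's real implementation: the configuration names are hypothetical, only two window criteria are checked, and only the 1.5 s window and 15 s cooldown come from the description above.

```typescript
// Hypothetical configuration; only windowMs and cooldownMs mirror the article.
interface NoiseWindowConfig {
  triggerRms: number; // frame RMS needed to open a candidate window
  minAverageRms: number; // the whole window must stay this loud on average
  speechThreshold: number; // probability at or above which a frame is speech
  maxSpeechFrameRatio: number; // allowed share of speech frames in a window
  windowMs: number; // 1500 in the article
  cooldownMs: number; // 15000 in the article
}

function createNoiseWindowDetector(config: NoiseWindowConfig) {
  let windowStart: number | null = null;
  let rmsSum = 0;
  let frames = 0;
  let speechFrames = 0;
  let lastDetection = Number.NEGATIVE_INFINITY;

  // Feed one frame. Returns true only when this frame completes a window
  // that qualifies as background noise.
  function push(timeMs: number, rms: number, speechProbability: number): boolean {
    const isSpeech = speechProbability >= config.speechThreshold;
    if (windowStart === null) {
      const coolingDown = timeMs - lastDetection < config.cooldownMs;
      if (coolingDown || rms < config.triggerRms || isSpeech) return false;
      windowStart = timeMs;
      rmsSum = 0;
      frames = 0;
      speechFrames = 0;
    }
    rmsSum += rms;
    frames += 1;
    if (isSpeech) speechFrames += 1;
    if (timeMs - windowStart < config.windowMs) return false;

    // The window is complete: evaluate it, then wait for a new trigger.
    const detected =
      rmsSum / frames >= config.minAverageRms &&
      speechFrames / frames <= config.maxSpeechFrameRatio;
    windowStart = null;
    if (detected) lastDetection = timeMs;
    return detected;
  }

  return { push };
}
```

Because the window is evaluated only once it is complete, a short burst of loud non-speech audio followed by silence or speech produces no event.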
Silero has a cost. Its model weighs a few megabytes, and browser inference also requires ONNX Runtime Web and its Wasm assets. That is heavier than libfvad. We accepted the footprint because the lighter model did not solve the classification problem, and we kept the detector in a separate package entrypoint so applications only initialize these assets when they enable noise detection.
Realizing that our own AudioWorklet was unnecessary
Our first Silero implementation preserved the architecture built for libfvad: the package still exposed a dedicated background-noise AudioWorklet API.
That boundary no longer matched the work being performed.
With libfvad, inference was synchronous and small enough to live directly in our processor. With @ricky0123/vad-web, the library already handles audio capture and framing. Its MicVAD API supports custom audio streams through getStream, so despite the name it can process any supplied MediaStream, not only a stream it opens from the microphone.
Our worklet had become a pass-through layer around another library’s pipeline. It added initialization messages, node lifecycle, asset wiring, and another public abstraction without owning the actual inference. We removed it.
The final public API accepts the stream the application already owns:
    import {
      createBackgroundNoiseDetector,
      isBackgroundNoiseDetectedMessage,
      observeBackgroundNoiseDetectorMessages,
    } from "@workadventure/noise-suppression/background-noise";

    const detector = await createBackgroundNoiseDetector(
      audioContext,
      microphoneStream
    );

    const stopObserving = observeBackgroundNoiseDetectorMessages(
      detector,
      (message) => {
        if (isBackgroundNoiseDetectedMessage(message)) {
          // Offer to mute the microphone or enable noise suppression.
          showNoisyMicrophoneWarning();
        }
      }
    );

    await detector.ready;

    // When detection is no longer needed:
    stopObserving();
    detector.dispose();
The caller remains responsible for the MediaStream and AudioContext. Disposing the detector releases its own VAD resources but does not stop the microphone tracks or close the context.
This API also works for non-microphone sources. An application can pass a stream captured from a media element or mirror part of an existing Web Audio graph into a MediaStreamAudioDestinationNode. That capability made it possible to use the same public integration in the demo and in tests with deterministic audio fixtures.
There is one subtle distinction: @ricky0123/vad-web may use its own small helper worklet to capture and frame audio. Silero inference runs outside the audio render callback. What we removed was our package’s redundant public background-noise worklet, not every internal use of Web Audio worklets.
Packaging matters as much as detection
An algorithm is not yet an easy-to-use browser library. Silero, its ONNX model, the VAD helper worklet, and ONNX Runtime’s Wasm files must all resolve correctly after an application bundles and deploys the package.
The background-noise entrypoint therefore ships with those assets under the package’s dist/vendor/ directory and resolves their default locations relative to the JavaScript module. Applications with a different asset pipeline can override the Silero and ONNX Runtime base paths.
This preserves the separation we wanted from the beginning:
- importing noise suppression does not start Silero or ONNX Runtime;
- importing background-noise detection does not start DTLN or LiteRT;
- applications opt into either feature independently;
- both features expose browser-oriented APIs instead of model-runtime details.
It is a slightly larger detector than we initially hoped to build, but its cost is explicit and lazy rather than hidden in the default application path.
Testing the signal, not only the plumbing
The libfvad experiment changed how we approached validation. Unit tests for RMS calculation, timing windows, thresholds, and cooldown behavior are necessary, but they cannot prove that a VAD assigns useful probabilities to real audio.
We added browser integration tests using actual fixtures:
- a pure-noise clip must produce a background-noise event;
- a white-noise clip must produce a background-noise event;
- a clean, sufficiently loud voice clip must not produce one.
The tests run the public stream API in Chromium through Vitest Browser Mode. Audio files are decoded into an AudioBuffer, connected to a MediaStreamAudioDestinationNode, and passed to the detector exactly as an application would pass a microphone stream. This exercises Web Audio, the helper worklet, model and Wasm asset loading, ONNX inference, our aggregation rules, and the public event API together.
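The fixture wiring can be sketched with standard Web Audio calls. The helper name streamFromAudioBuffer is hypothetical and this is not the test suite's actual code. It assumes the clip has already been decoded into an AudioBuffer, for example with decodeAudioData.

```typescript
// Play a decoded clip into a MediaStream that can be handed to any API
// expecting a microphone-like stream. Browser-only: requires Web Audio.
function streamFromAudioBuffer(
  context: AudioContext,
  buffer: AudioBuffer
): { stream: MediaStream; source: AudioBufferSourceNode } {
  const source = context.createBufferSource();
  source.buffer = buffer;

  // Route the clip to a stream destination instead of the speakers.
  const destination = context.createMediaStreamDestination();
  source.connect(destination);
  source.start();

  // The caller keeps `source` so it can stop playback or await `onended`.
  return { stream: destination.stream, source };
}
```

The returned stream can then be passed to createBackgroundNoiseDetector in place of a microphone stream, which is what lets the same public API be exercised with deterministic audio.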
These fixtures also helped tune an important detail. Silero’s probability for the pure-noise sample varies somewhat with real-time frame alignment. A test that assumed every noise frame had a near-zero speech probability was too strict. The final thresholds allow that variation while still separating the noise fixtures from the clean voice fixture.
This is not a claim that three clips represent every microphone and acoustic environment. Noise detection remains advisory, and field behavior should guide future threshold tuning. The tests establish a more useful baseline: the exact failure that invalidated libfvad cannot silently return, and ordinary clean speech is covered as a negative case.
Where we ended up
The final detector is built from four pieces:
- Silero VAD estimates speech probability from framed audio.
- RMS measures whether the input is loud enough to matter.
- A sustained analysis window rejects isolated frames and speech-heavy input.
- A cooldown turns repeated detections into an application-friendly advisory event.
The package exposes that pipeline as a MediaStream API and keeps its runtime assets separate from DTLN noise suppression. Consumers do not need to create a custom AudioWorklet node, manage ONNX sessions, resample frames, or understand the detector’s internal state machine.
The route to this design was not direct. A level-only heuristic was too fragile for irregular environments. libfvad was operationally ideal but classified our loud noise as speech. The first Silero architecture carried forward a custom worklet that no longer served a purpose. Each discarded approach clarified a different part of the real requirement.
The main lesson is that background noise detection is not simply a quieter form of noise suppression. It is a decision problem with asymmetric consequences: missing occasional noise is acceptable, while repeatedly warning a person who is speaking is disruptive. The useful implementation combines a capable VAD with conservative signal and timing rules, then packages the result so an application can act on one event.
That is where we ended up: not with the smallest detector we could compile, but with the smallest public abstraction that reliably solved the problem we could test.