
erm Takes a Practical Swing at Local Filler-Word Removal

AI & ML Reporter

erm is a local CLI for cutting ums and uhs from recordings, but the interesting part is not the interface. It is the audio engineering required after Whisper gives you imperfect timestamps.

erm is a small Python CLI for a familiar editing problem: removing ums, uhs, ers, and elongated filler sounds from spoken audio without sending recordings to a cloud service. The common path is intentionally plain: uvx erm input.wav reads an input file, writes a cleaned WAV, and emits a JSON cut list. That makes the tool look simple from the outside, but the implementation described in Doug Calobrisi's May 2026 write-up is mostly about why the obvious version fails.

What's claimed is straightforward. erm uses local speech recognition, word-level timestamps, direct audio analysis, and ffmpeg splicing to remove disfluencies while preserving the rest of the speech. It is aimed at podcasters, voice-note users, and anyone editing spoken recordings where manual removal is tedious. The post frames the tool as local-first: audio stays on the user's machine, transcription runs through faster-whisper, and rendering is handled by FFmpeg.

The model story is specific enough to be useful. erm uses OpenAI's Whisper model family through faster-whisper, with medium.en as the default. The author suggests large-v3 when accuracy matters more than compute cost, especially because filler words are exactly the kind of low-information speech that transcription systems often suppress. small.en is available as a faster option. What the post does not provide is a formal benchmark for erm itself, and that gap matters: there are no reported word error rates before and after cleaning, no click-detection metric, no human listening study, and no runtime comparison across model sizes on a fixed corpus. The only benchmark-adjacent claim is inherited from faster-whisper, whose project documentation reports speed and memory advantages over OpenAI's reference Whisper runtime. That supports the runtime choice, but it is not an evaluation of erm's editing quality.

What is actually new is not that a transcript can identify um and uh. That baseline would be easy: transcribe the audio, find filler tokens, cut those time ranges, and stitch the remaining audio together. The post is useful because it explains why that baseline sounds bad. Whisper often omits fillers because its training data tends to resemble cleaned transcripts. Even when a filler appears in the transcript, the timestamp boundary may land at an arbitrary point in the waveform. Cutting there creates a discontinuity, which the ear hears as a click. A third problem is more subtle: room tone changes around edits. Even if the speech splice is technically clean, the background noise on the two sides may not match.
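To make the naive baseline concrete, here is a minimal sketch of its last step: turning a list of cut ranges into the spans of audio that survive. The function name and the (start, end) tuple format are illustrative assumptions, not erm's actual code. Note that it trusts the timestamps completely, which is exactly the weakness described above.

```python
def keep_ranges(duration, cuts):
    """Return the (start, end) spans, in seconds, left after removing `cuts`.

    `cuts` is a list of (start, end) tuples. This layout is an assumption
    made for illustration; it is not erm's documented cut-list format.
    """
    kept = []
    cursor = 0.0
    for start, end in sorted(cuts):
        start = max(start, cursor)      # clamp cuts that overlap the previous one
        if start > cursor:
            kept.append((cursor, start))
        cursor = max(cursor, end)
    if cursor < duration:
        kept.append((cursor, duration))
    return kept


if __name__ == "__main__":
    # Two fillers removed from a ten-second clip.
    print(keep_ranges(10.0, [(2.0, 3.0), (5.0, 6.5)]))
```

Joining those kept spans end to end with no further processing is the version that clicks.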

erm handles detection with four passes. First, it asks Whisper for word-level timestamps and flags words that match known filler tokens, including stretched forms like ummmm. Second, it checks suspicious gaps between transcribed words. If Whisper reports a long pause but the audio contains voiced sound inside that pause, erm treats that region as a possible deleted filler. Third, it looks for fillers hidden inside a neighboring word, such as a long token where Whisper effectively glued uhhhhh onto an adjacent word. Fourth, it checks words whose duration is too long for the text, then scans the tail for held-vowel behavior using a pitch check. That check is a practical guardrail: slow speech and a sustained uh can both be long, but their acoustic shapes differ.
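The first two passes can be sketched in a few lines of plain Python. This is a rough illustration under stated assumptions, not erm's implementation: the regular expression, the word dictionary layout, the RMS energy test standing in for "voiced sound", and both thresholds are all hypothetical choices.

```python
import math
import re
import string

# Assumed pattern: um, uh, uhm, er, erm and stretched variants such as "ummmm".
FILLER_RE = re.compile(r"^(?:u+h*m+|u+h+|e+r+m*)$")


def is_filler(token):
    """Pass one: match a transcribed token against known filler shapes."""
    cleaned = token.strip().strip(string.punctuation).lower()
    return bool(FILLER_RE.match(cleaned))


def rms(samples):
    """Root-mean-square level of a list of float samples."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def voiced_gaps(words, samples, rate, min_gap=0.35, rms_threshold=0.02):
    """Pass two: find long inter-word gaps that still contain sound.

    `words` is a list of dicts with "start" and "end" in seconds (an assumed
    layout). A gap counts as suspicious when it is at least `min_gap` long
    and its RMS level reaches `rms_threshold`.
    """
    hits = []
    for prev, nxt in zip(words, words[1:]):
        start, end = prev["end"], nxt["start"]
        if end - start < min_gap:
            continue
        segment = samples[int(start * rate):int(end * rate)]
        if rms(segment) >= rms_threshold:
            hits.append((start, end))
    return hits
```

A plain energy threshold would also fire on breaths and background noise, which is presumably why the later passes look at duration and pitch rather than level alone.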

That design choice is more interesting than a generic AI wrapper. The LLM-era failure mode would be to trust the transcript too much. erm assumes the transcript is a lossy view of the recording, then uses the audio signal to recover what the model hid or blurred. This is the right instinct for production audio tooling. Speech recognition models optimize for readable text, not necessarily for editorial operations at millisecond precision.

The cut refinement is also grounded in old audio facts rather than model magic. erm slides each cut endpoint within a small window to find a quieter nearby spot, then snaps to a zero crossing so the waveform joins without a hard step. It also merges tiny leftover fragments, because a 120 ms island between two edits is more likely to sound like a glitch than intentional speech. Rendering uses an ffmpeg crossfade whose length scales with the size of the removed region, bounded between short and longer fades. Fixed fade lengths are a common beginner mistake: they can smear short edits and still leave longer edits audible.
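Three of the refinements above are simple enough to sketch directly: snapping to a zero crossing, merging cuts that would leave a tiny island, and scaling the fade to the size of the cut. The function names, the 150 ms island threshold, and the 10 to 60 ms fade bounds are assumptions for illustration; the post does not give erm's actual values.

```python
def snap_to_zero_crossing(samples, index, radius):
    """Return the sign-change position nearest `index`, or `index` if none is in range."""
    lo = max(1, index - radius)
    hi = min(len(samples) - 1, index + radius)
    # A crossing sits at i when the samples on either side differ in sign
    # (or one of them is exactly zero).
    crossings = [i for i in range(lo, hi + 1) if samples[i - 1] * samples[i] <= 0]
    if not crossings:
        return index
    return min(crossings, key=lambda i: abs(i - index))


def merge_close_cuts(cuts, min_island=0.15):
    """Merge cuts whose gap is shorter than `min_island` seconds.

    The short stretch of audio between them is removed along with the cuts.
    """
    merged = []
    for start, end in sorted(cuts):
        if merged and start - merged[-1][1] < min_island:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def crossfade_ms(removed_ms, fraction=0.1, shortest=10.0, longest=60.0):
    """Scale the fade with the removed region, clamped to assumed bounds."""
    return max(shortest, min(longest, removed_ms * fraction))


if __name__ == "__main__":
    wave = [0.5, 0.4, 0.3, 0.1, -0.2, -0.4, -0.3, 0.2, 0.5]
    print(snap_to_zero_crossing(wave, 2, radius=3))
    print(merge_close_cuts([(1.0, 1.4), (1.52, 1.9), (3.0, 3.2)]))
    print(crossfade_ms(50), crossfade_ms(2000))
```

The clamp is the part that addresses the fixed-fade mistake: a short cut gets the minimum fade instead of being smeared, and a long cut gets a fade that is capped rather than growing without limit.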

The room-tone trick is intentionally low-tech. erm finds a quiet stretch from the original recording and loops it quietly under the output. This masks small background mismatches around edits. In professional editing, room tone is a normal concept, not an AI feature. The practical value is that the tool applies it automatically enough for casual recordings.
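As a rough sketch of the idea, the quiet stretch can be found by sliding a window over the recording and keeping the position with the lowest energy, then looping that slice under the output at reduced gain. Everything here (function names, the hop size, the gain value) is a hypothetical illustration rather than erm's code.

```python
def quietest_window(samples, window, hop=1):
    """Return the start index of the lowest-energy window of `window` samples."""
    best_start, best_energy = 0, float("inf")
    for start in range(0, len(samples) - window + 1, hop):
        energy = sum(s * s for s in samples[start:start + window])
        if energy < best_energy:
            best_start, best_energy = start, energy
    return best_start


def mix_room_tone(output, tone, gain=0.5):
    """Loop `tone` under `output`, scaled by `gain` (an assumed level)."""
    if not tone:
        return list(output)
    return [s + gain * tone[i % len(tone)] for i, s in enumerate(output)]


if __name__ == "__main__":
    recording = [0.5] * 4 + [0.01, -0.01, 0.01, -0.01] + [0.4] * 4
    start = quietest_window(recording, window=4)
    tone = recording[start:start + 4]
    print(start, mix_room_tone([0.0] * 6, tone))
```

A real implementation would likely crossfade the loop point as well, since a looped slice has its own seam.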

The denoising mode is another place where the post shows engineering judgment. erm supports none, pre, post, and hybrid. The default, hybrid, detects fillers on the original audio but cuts from a denoised copy. That makes sense because denoising can remove the small volume and pitch cues the detectors need. Running detection after denoising may produce cleaner-looking audio for the algorithm, but it can hide the evidence the algorithm is searching for.

Validation is basic but useful. erm validate input.wav cleaned.wav --cuts cuts.json checks that the output opens, that its duration is shorter by approximately the expected cut length, and that a fresh transcription of the cleaned file does not contain fillers. The last check is the most meaningful because it tests the full pipeline, not just the intermediate cut list. It is still not a substitute for listening tests. A cleaned file can pass transcription validation while sounding unnatural, clipping breaths, or changing cadence.
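The duration check is the easiest of the three to picture. The sketch below assumes the cut list is a JSON array of objects with start and end fields in seconds, which is a guess at the layout and not erm's documented schema; the tolerance is likewise an assumed value that leaves room for crossfades trimming a little extra.

```python
import json

# Hypothetical cut list; the real cuts.json layout may differ.
EXAMPLE_CUTS = '[{"start": 1.0, "end": 1.5}, {"start": 4.0, "end": 4.25}]'


def total_cut_seconds(cuts):
    """Sum the length of every cut in the list."""
    return sum(c["end"] - c["start"] for c in cuts)


def duration_matches(original_s, cleaned_s, cuts, tolerance_s=0.05):
    """True when the cleaned file is shorter by roughly the total cut length."""
    expected = original_s - total_cut_seconds(cuts)
    return abs(cleaned_s - expected) <= tolerance_s


if __name__ == "__main__":
    cuts = json.loads(EXAMPLE_CUTS)
    print(total_cut_seconds(cuts))
    print(duration_matches(10.0, 9.27, cuts))
```

A check like this only catches gross rendering errors, such as a cut applied twice or not at all.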

The limitations are clear. erm does not remove like, you know, I mean, repeated words, false starts, or long thinking pauses. That is a defensible boundary. Ums and uhs are closer to non-lexical sound. Phrases like I mean may be discourse markers, hedges, repairs, or part of the speaker's meaning. Removing them automatically can change tone or intent. The tool is not a full podcast editor, and it should not be treated as one.

There are also practical constraints. Whisper model choice affects both accuracy and compute cost. large-v3 may catch more fillers, but it will be slower and heavier than small.en or medium.en. The pipeline depends on FFmpeg and FFprobe being installed. Audio quality, microphone noise, accents, speaking style, overlapping speakers, music beds, and compression artifacts can all make filler detection harder. The post focuses on English filler tokens, and the default model is medium.en, so multilingual behavior should not be assumed without testing.

The project is best understood as applied speech engineering around an imperfect model. Whisper supplies timestamps and a rough linguistic map. erm adds signal-level heuristics, zero-crossing-aware cuts, adaptive crossfades, room tone masking, and validation. That combination is much more credible than a product claim that transcription alone can clean speech.

For users, the practical path is simple: install uv (or use another Python packaging route), make sure FFmpeg is on your PATH, then run uvx erm input.wav --dry-run before rendering. For developers, the more useful lesson is architectural: when a model output is used to edit media, the model should not be the only source of truth. The waveform still matters.
