When Speech Recognition Gets Too Helpful for Language Learning
May 7, 2026

By: Seongjin Park, Machine Learning Engineer, Voice
At Speak, we think deeply about what the best speech transcription is for somebody learning a language. We'll focus here just on the automatic speech recognition (ASR) model - the component that converts audio into a written transcript.
In most products, the answer is simple: turn noisy speech into clean text. If someone hesitates, restarts, or speaks with a heavy accent, a polished transcript feels like success. That's the right call when downstream systems care about intent. A voice assistant should still trigger the weather tool when you say "wha's the wathe like."
For language learning, the same behavior's a problem.
A learner doesn't always need a readable transcript. What they do need is feedback that reflects what they actually said. However, modern speech recognition is so good at recovering intended meaning that it erases the cues we need to teach.
Two kinds of speech recognition
There are two kinds worth naming up front:
- Intent-based recognition asks: what did the speaker mean? It produces the most likely clean sentence given the audio. This kind of transcript is often called a normalized transcript.
- Surface-level recognition asks: what did the speaker actually produce? It preserves hesitations, repetitions, partial words, grammar mistakes, and pronunciation errors.
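To make the distinction concrete, here's a toy sketch of what intent-based recognition effectively does to a surface-level utterance. Real ASR models do this implicitly inside the network rather than with hand-written rules; the filler list and cleanup steps below are assumptions chosen to match the examples in this post.

```python
# Toy illustration only: real intent-based ASR does this implicitly inside
# the model, not with rules. The filler list and cleanup steps are assumptions.
FILLERS = {"uh", "um", "erm"}

def normalize(surface: str) -> str:
    """Collapse a surface-level transcript into an intent-style one."""
    cleaned = []
    for token in surface.lower().split():
        word = token.strip(",.")
        if word in FILLERS:
            continue                      # drop filler words
        if word.endswith("-"):
            continue                      # drop partial words ("part-")
        if cleaned and word == cleaned[-1]:
            continue                      # drop immediate repetitions
        cleaned.append(word)
    return " ".join(cleaned)

print(normalize("I part- participated in the beta, uh, beta program"))
# -> "i participated in the beta program"  (every sign of struggle is gone)
```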
Almost every modern ASR model optimizes for intent transcription. Most learning products quietly assume surface-level. Use an intent model for language learning by mistake, and it will fail in several serious ways.
What gets lost
A learner says "she coming" and the system outputs "she is coming." Or they say "I part- participated in the beta, uh, beta program" and the system cleans it up to "I participated in the beta program."
From the usual intent-based perspective, that's a win. ASR evaluation datasets often use native speech that is correct and fluent (i.e., without any disfluencies), so a normalized transcript will score better than a surface-level one. The normalized transcript is more fluent, more grammatical, and easier to read.
From a teaching perspective, however, it's very different. The missing verb, the false start, the repetition, the filler: those aren't noise. They're the clearest evidence of where the learner struggled.
This goes beyond just pronunciation. Yes, when a learner says something closer to "sheep" and the system confidently outputs "ship," that's important to know. But the same erasure happens with grammar (omitted function words), fluency (mid-sentence restarts), and self-correction ("I thin-, thought…"). All of these aspects are meaningful in a language-learning setting. All of them disappear when a model is trained to produce the most likely final sentence rather than the actual surface form.
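One way to surface that evidence is to diff the faithful transcript against the cleaned one and look at what was dropped. The sketch below is a rough illustration of the idea, assuming whitespace tokenization and a simple diff-based alignment are good enough for the running example; a production system would need real alignment.

```python
import difflib

def erased_tokens(surface: str, normalized: str) -> list[str]:
    """Return the tokens a cleaned-up transcript dropped from the surface form."""
    s_toks = surface.lower().replace(",", "").split()
    n_toks = normalized.lower().replace(",", "").split()
    erased = []
    for op, i1, i2, _, _ in difflib.SequenceMatcher(a=s_toks, b=n_toks).get_opcodes():
        if op in ("delete", "replace"):
            erased.extend(s_toks[i1:i2])   # spoken by the learner, gone in the text
    return erased

print(erased_tokens(
    "I part- participated in the beta, uh, beta program",
    "I participated in the beta program",
))
# -> ['part-', 'uh', 'beta']  (the false start, the filler, the repetition)
```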
Why ASR works this way
This isn't a bug. Modern ASR models combine acoustic evidence with strong expectations about what people usually say, and in practice that means what native speakers say. This is why ASR models handle noise, accents, and ambiguity so well. For tool calling, conversational AI, or meeting transcription, recovering intent is the right job. Nobody needs the "ums" in their meeting notes.
The trouble is that this behavior gets rewarded even though it's the wrong target for language learners. Standard ASR metrics like Word Error Rate score models on whether they produced the words they were supposed to recognize. Many evaluation datasets pair audio with transcripts that human annotators already cleaned up for subtitles or captions, which means the prediction targets are already normalized. The model that scores best on those benchmarks may be precisely the model that's worst for teaching, because it's been trained, and graded, on its ability to make learner speech disappear.
This creates two failure modes for a learning product:
- The learner loses the chance to improve. If the transcript quietly repairs the utterance, the learner walks away believing the sentence was fine. You can't give feedback like "you used filler words frequently" when the fillers were never recorded.
- The system looks better than it is. Standard metrics reward intent transcription over surface transcription: the model gets credit for predicting the intended sentence even though it left out what the learner actually said.
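To see how the scoring works against us, here's Word Error Rate computed from scratch on the running example, against a benchmark-style reference that annotators already cleaned up. The arithmetic matches what standard evaluation toolkits compute; the sentences are just the example from above.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: (substitutions + deletions + insertions) / reference length."""
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    # Levenshtein edit distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution or match
    return d[-1][-1] / len(ref)

# The benchmark reference was already cleaned up by annotators.
reference = "I participated in the beta program"
faithful  = "I part- participated in the beta uh beta program"   # what was said
polished  = "I participated in the beta program"                 # what the model guessed

print(wer(reference, faithful))   # 0.5 -> the honest transcript looks "worse"
print(wer(reference, polished))   # 0.0 -> erasing the learner scores perfectly
```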
Two outputs, not one
A single kind of transcript can't do every job.
The learner-facing transcript should be readable. A clean version makes the experience feel approachable, and downstream LLMs handle polished text better anyway. Show that one in the UI.
But the system generating feedback should work from a faithful representation of the utterance, one that preserves hesitations, repetitions, partial words, self-repairs, omissions, and pronunciation errors. That's where teaching happens.
In a learning context, raw transcripts with disfluencies aren't noise. They're evidence. They tell us where fluency broke down, where the learner was uncertain, where pronunciation drifted, and where a feedback system should respond with correction or coaching.
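As a rough sketch of that split, imagine carrying both transcripts through the pipeline and deriving feedback signals only from the surface one. The field names and heuristics below are made up for illustration; they're not our production feature set.

```python
from dataclasses import dataclass

FILLERS = {"uh", "um", "erm"}

@dataclass
class TranscriptPair:
    surface: str   # what the learner actually produced -> drives feedback
    display: str   # cleaned-up version -> shown in the UI, handed to the LLM

    def feedback_signals(self) -> dict[str, int]:
        tokens = [t.strip(",.").lower() for t in self.surface.split()]
        content = [t for t in tokens if t not in FILLERS]   # ignore fillers when checking repeats
        return {
            "filler_count": sum(t in FILLERS for t in tokens),
            "false_starts": sum(t.endswith("-") for t in tokens),
            "repetitions": sum(a == b for a, b in zip(content, content[1:])),
        }

pair = TranscriptPair(
    surface="I part- participated in the beta, uh, beta program",
    display="I participated in the beta program",
)
print(pair.feedback_signals())
# -> {'filler_count': 1, 'false_starts': 1, 'repetitions': 1}
```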
The takeaway
Speech recognition keeps getting better at recovering intended text. For language learning, that isn't always the same as helping someone improve. A cleaner transcript isn't always a better one. If the model removes the "uh," repairs the grammar, or snaps a mispronounced word back to the target, it deletes the evidence the learner needed.
The best speech system for language learners isn't the one that sounds smartest. It's the one that's accurate enough for honest feedback.