Speech recognition: Whisper and CTC

Automatic speech recognition turns a waveform into text. The central difficulty is alignment: you have thousands of audio frames and a short transcript, but no label saying which frames produced which characters. Three families solve this: CTC, attention encoder-decoders, and transducers. Modern large models combine large-scale (often self-supervised or weakly supervised) pretraining with these decoders. This is the backbone of my speech transcription and diarization project.

The pipeline is: turn audio into features, model the sequence, and decode to text. Each stage has a why worth knowing, and the choice of alignment mechanism is what most distinguishes the approaches.

From waveform to features

Raw audio is sampled tens of thousands of times per second, which is far too fine and too high-variance to model directly. It is first converted to a log-mel spectrogram: short overlapping windows are Fourier-transformed, the energies are pooled into mel bands spaced to match human pitch perception, and a log compresses the range. The why is twofold: this throws away phase detail that does not affect the words, and it yields a compact time-by-frequency image that the rest of the system can treat much like vision input.
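
The steps in that paragraph can be sketched in plain NumPy. This is a minimal illustration, not the front end of any particular model: the 16 kHz sample rate, 25 ms window (`n_fft=400`), 10 ms hop, 40 mel bands, and the common 2595·log10(1 + f/700) mel formula are all assumptions chosen for the example, and the function names are hypothetical.

```python
import numpy as np

def hz_to_mel(f):
    # One common mel formula (an assumption; other variants exist).
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_mels, n_fft, sample_rate):
    """Triangular filters, shape (n_mels, n_fft // 2 + 1)."""
    fft_freqs = np.fft.rfftfreq(n_fft, d=1.0 / sample_rate)
    # Band edges are evenly spaced on the mel scale, then mapped back to Hz.
    mel_edges = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0), n_mels + 2)
    hz_edges = mel_to_hz(mel_edges)
    fb = np.zeros((n_mels, fft_freqs.size))
    for i in range(n_mels):
        lo, mid, hi = hz_edges[i], hz_edges[i + 1], hz_edges[i + 2]
        rising = (fft_freqs - lo) / (mid - lo)
        falling = (hi - fft_freqs) / (hi - mid)
        fb[i] = np.clip(np.minimum(rising, falling), 0.0, None)
    return fb

def log_mel_spectrogram(wave, sample_rate=16000, n_fft=400, hop=160, n_mels=40):
    """wave: 1-D float array. Returns (n_frames, n_mels) log-mel features."""
    wave = np.asarray(wave, dtype=float)
    n_frames = 1 + (len(wave) - n_fft) // hop
    # Short overlapping windows: row i covers samples [i*hop, i*hop + n_fft).
    idx = np.arange(n_fft)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = wave[idx] * np.hanning(n_fft)
    # Taking the squared magnitude is where the phase is discarded.
    power = np.abs(np.fft.rfft(frames, axis=-1)) ** 2
    mel = power @ mel_filterbank(n_mels, n_fft, sample_rate).T
    return np.log(mel + 1e-10)  # small floor avoids log(0)

# One second of a 1 kHz tone gives a (98, 40) time-by-frequency array.
t = np.arange(16000) / 16000.0
features = log_mel_spectrogram(np.sin(2 * np.pi * 1000.0 * t))
print(features.shape)  # (98, 40)
```

With one second of 16 kHz audio the 16,000 samples shrink to 98 frames of 40 values each, which is the compact representation the later stages consume.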

CTC: alignment by marginalizing

Connectionist temporal classification (Graves et al. 2006) handles missing alignments by adding a blank symbol and summing the probability over every frame-level alignment that collapses to the target text,

$$p(y \mid x) = \sum_{a \,\in\, \mathcal{B}^{-1}(y)} \prod_{t=1}^{T} p(a_t \mid x),$$

where $\mathcal{B}$ collapses an alignment by merging repeated symbols and then removing blanks. The sum runs over exponentially many alignments, but a dynamic-programming forward algorithm computes it in $O(T \cdot |y|)$ time. The crucial assumption is that the per-frame outputs are conditionally independent given the audio. That is what makes CTC fast and easy to stream, but also why it struggles with context-dependent spellings unless paired with an external language model. Decoding can be greedy (argmax per frame, then collapse) or beam search over hypotheses.
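
To make the forward algorithm concrete, here is a small log-space sketch of the CTC likelihood. It assumes the input is already a `(T, V)` array of per-frame log-probabilities and that blank is index 0; the function name `ctc_log_likelihood` is hypothetical, and a real training setup would use a framework's batched, differentiable CTC loss instead.

```python
import numpy as np

def ctc_log_likelihood(log_probs, target, blank=0):
    """log p(y | x) via the CTC forward recursion.

    log_probs: (T, V) per-frame log-probabilities.
    target:    list of label ids, without blanks.
    """
    T = log_probs.shape[0]
    # Extended target: blank, y1, blank, y2, ..., blank.
    ext = [blank]
    for s in target:
        ext += [s, blank]
    S = len(ext)

    # alpha[t, s] = log-prob of all alignments of frames 0..t ending at ext[s].
    alpha = np.full((T, S), -np.inf)
    alpha[0, 0] = log_probs[0, ext[0]]
    if S > 1:
        alpha[0, 1] = log_probs[0, ext[1]]

    for t in range(1, T):
        for s in range(S):
            terms = [alpha[t - 1, s]]                 # stay on the same symbol
            if s >= 1:
                terms.append(alpha[t - 1, s - 1])     # advance by one
            # Skipping the blank is allowed only between two different labels.
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                terms.append(alpha[t - 1, s - 2])
            alpha[t, s] = np.logaddexp.reduce(terms) + log_probs[t, ext[s]]

    # A valid alignment ends on the last label or on the trailing blank.
    tail = [alpha[T - 1, S - 1]]
    if S > 1:
        tail.append(alpha[T - 1, S - 2])
    return float(np.logaddexp.reduce(tail))

# Two frames, each with p(blank)=0.4, p(1)=0.35, p(2)=0.25.
lp = np.log(np.array([[0.4, 0.35, 0.25], [0.4, 0.35, 0.25]]))
# Alignments for [1]: (1,blank), (blank,1), (1,1) -> 0.14 + 0.14 + 0.1225.
print(round(float(np.exp(ctc_log_likelihood(lp, [1]))), 4))  # 0.4025
```

The table has $T$ rows and $2|y|+1$ columns, and each cell looks at no more than three predecessors, which is where the polynomial cost comes from.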

Attention and transducers

An attention encoder-decoder (listen, attend, spell; Chan et al. 2016) instead generates text autoregressively, attending over the encoded audio, which drops CTC's independence assumption and models output dependencies, at the cost of being harder to stream. The RNN transducer (Graves 2012) adds a prediction network to CTC so outputs are conditioned on previous outputs while staying streamable, which is why RNN-T is the workhorse of on-device, real-time ASR.
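
As a rough picture of what "attending over the encoded audio" means, the sketch below computes one decoder step of attention. It uses scaled dot-product scoring as a simplifying assumption (the original listen, attend, spell model scores with learned projections), and the name `attend` and the toy shapes are hypothetical.

```python
import numpy as np

def attend(query, enc):
    """One attention step.

    query: (d,)   current decoder state.
    enc:   (T, d) encoded audio frames.
    Returns (context of shape (d,), weights of shape (T,)).
    """
    scores = enc @ query / np.sqrt(enc.shape[1])   # similarity to every frame
    scores = scores - scores.max()                 # stabilise the softmax
    weights = np.exp(scores)
    weights /= weights.sum()                       # soft alignment over frames
    context = weights @ enc                        # weighted sum of frames
    return context, weights

# Toy input: four frames, query most similar to frame 2.
enc = np.eye(4)
context, weights = attend(5.0 * enc[2], enc)
print(weights.argmax())  # 2
```

The weights are a soft alignment recomputed for every output token, and because they span all $T$ frames the decoder normally needs the full utterance, which is the streaming difficulty mentioned above.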

Self-supervised pretraining

Labeled speech is scarce, but raw audio is abundant, so the field learned to pretrain on unlabeled audio. wav2vec 2.0 (Baevski et al. 2020) masks spans of latent audio features and learns by a contrastive task to pick the true quantized latent from distractors; HuBERT (Hsu et al. 2021) instead predicts cluster labels of masked frames, a BERT-style masked-prediction objective. The payoff is large: a model pretrained this way reaches strong word-error rates after fine-tuning on a tiny fraction of the labeled data otherwise required.
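
The contrastive task can be sketched for a single masked time step as follows. This is an illustration only: it assumes cosine similarity with a temperature of 0.1, takes the distractors as given rather than sampling them from other masked steps, and leaves out the codebook diversity term; `contrastive_loss` and its arguments are hypothetical names.

```python
import numpy as np

def cosine(a, b):
    return (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8)

def contrastive_loss(context, true_latent, distractors, temperature=0.1):
    """Negative log-probability of picking the true quantized latent.

    context:     (d,) context-network output at a masked step.
    true_latent: (d,) quantized latent for that step.
    distractors: iterable of (d,) quantized latents from other steps.
    """
    candidates = [true_latent] + list(distractors)
    logits = np.array([cosine(context, q) for q in candidates]) / temperature
    logits = logits - logits.max()                     # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum())  # log-softmax
    return float(-log_probs[0])                        # true latent is index 0

c = np.array([1.0, 0.0])
negatives = [np.array([0.0, 1.0]), np.array([-1.0, 0.0])]
# Low loss when the context points at the true latent, high when it does not.
print(contrastive_loss(c, np.array([1.0, 0.0]), negatives) < 0.01)               # True
print(contrastive_loss(c, np.array([0.0, 1.0]), [np.array([1.0, 0.0])]) > 1.0)   # True
```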

Conformer: local plus global

Speech has both local structure (formants, phones) and long-range structure (prosody, context). The Conformer (Gulati et al. 2020) interleaves convolution, which captures local patterns efficiently, with self-attention, which captures global dependencies, and became the standard acoustic encoder because it gets the best of both.

Whisper: weak supervision at scale

Whisper (Radford et al. 2022) takes a different bet: a plain transformer encoder-decoder trained on about 680,000 hours of weakly supervised multilingual audio paired with transcripts scraped from the web. The why is robustness, since breadth of data beats architectural cleverness for generalization: Whisper handles noise, accents, and many languages out of the box, and produces punctuation, casing, and even translation into English from the same model, without per-dataset fine-tuning. The trade-off is that its autoregressive decoder is heavier than a CTC head.

Making it fast for deployment

For real use the model has to be cheap. I run Whisper through faster-whisper, a reimplementation on the CTranslate2 inference engine, which quantizes weights and fuses kernels for a large speedup, with voice-activity detection to skip silence and word-level timestamps for alignment. Distillation goes further: Distil-Whisper (Gandhi et al. 2023) trains a smaller student on the large model's pseudo-labels and runs several times faster while staying within about one percentage point of the teacher's word-error rate, which is what makes near real-time transcription on a single GPU practical.
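
A setup along these lines might look like the sketch below. It assumes the third-party `faster-whisper` package is installed and that model weights can be downloaded on first use; the wrapper name `transcribe_file`, the `"small"` model size, and the CPU `int8` settings are illustrative choices rather than the configuration used in the project.

```python
def transcribe_file(path, model_size="small", device="cpu", compute_type="int8"):
    """Transcribe one audio file and return (language, [(start, end, word), ...])."""
    # Imported lazily so this module loads even where the package is absent.
    from faster_whisper import WhisperModel

    # compute_type selects the quantization used by the CTranslate2 backend.
    model = WhisperModel(model_size, device=device, compute_type=compute_type)

    # vad_filter skips non-speech; word_timestamps adds per-word times.
    segments, info = model.transcribe(path, vad_filter=True, word_timestamps=True)

    words = []
    for segment in segments:          # segments is a lazy generator
        for word in segment.words:
            words.append((word.start, word.end, word.word))
    return info.language, words
```

Calling `transcribe_file("meeting.wav")` (a placeholder path) would return the detected language and a flat list of timed words. On a GPU, `device="cuda"` with a `compute_type` such as `"float16"` is a typical alternative.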

Step by step: CTC greedy decode

Take the most likely symbol at each frame (argmax).
Collapse runs of the same symbol into one.
Remove the blank symbol.
What remains is the decoded sequence.
Beam search keeps several hypotheses and can fold in a language model.

import numpy as np

def ctc_greedy_decode(logits, blank=0):
    """logits: (T, vocab). Greedy best-path decode."""
    best = logits.argmax(axis=-1)        # most likely symbol per frame
    out, prev = [], None
    for s in best:
        if s != prev and s != blank:     # collapse repeats, then drop blanks
            out.append(int(s))
        prev = s
    return out
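
For comparison with the greedy decoder, here is a minimal sketch of CTC prefix beam search without a language model. It assumes per-frame log-probabilities as input and blank at index 0; `ctc_prefix_beam_search` and `log_add` are hypothetical names, and production decoders add pruning and language-model scoring that are left out here.

```python
import math
from collections import defaultdict
import numpy as np

NEG_INF = -math.inf

def log_add(*xs):
    """log(sum(exp(x))) for a few scalars, safe when all are -inf."""
    m = max(xs)
    if m == NEG_INF:
        return NEG_INF
    return m + math.log(sum(math.exp(x - m) for x in xs))

def ctc_prefix_beam_search(log_probs, beam_size=4, blank=0):
    """log_probs: (T, vocab). Returns (best labels, log-prob summed over alignments)."""
    # prefix -> (log-prob of alignments ending in blank, ending in non-blank)
    beams = {(): (0.0, NEG_INF)}
    for frame in log_probs:
        nxt = defaultdict(lambda: (NEG_INF, NEG_INF))
        for prefix, (p_b, p_nb) in beams.items():
            for s, p in enumerate(frame):
                p = float(p)
                if s == blank:
                    # A blank keeps the prefix unchanged.
                    b, nb = nxt[prefix]
                    nxt[prefix] = (log_add(b, p_b + p, p_nb + p), nb)
                elif prefix and s == prefix[-1]:
                    # Same symbol again: it extends the prefix only after a blank,
                    b, nb = nxt[prefix + (s,)]
                    nxt[prefix + (s,)] = (b, log_add(nb, p_b + p))
                    # otherwise it is a repeat and collapses into the same prefix.
                    b, nb = nxt[prefix]
                    nxt[prefix] = (b, log_add(nb, p_nb + p))
                else:
                    b, nb = nxt[prefix + (s,)]
                    nxt[prefix + (s,)] = (b, log_add(nb, p_b + p, p_nb + p))
        ranked = sorted(nxt.items(), key=lambda kv: log_add(*kv[1]), reverse=True)
        beams = dict(ranked[:beam_size])
    best, (p_b, p_nb) = max(beams.items(), key=lambda kv: log_add(*kv[1]))
    return list(best), log_add(p_b, p_nb)

# Blank is the argmax in both frames, so greedy decoding returns [],
# but the alignments for [1] sum to 0.4025 versus 0.16 for the empty output.
lp = np.log(np.array([[0.4, 0.35, 0.25], [0.4, 0.35, 0.25]]))
print(ctc_prefix_beam_search(lp)[0])  # [1]
```

The example shows why beam search can beat greedy decoding: greedy follows the single best alignment, while the beam scores each output by the total probability of all alignments that collapse to it.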

Complexity (time and space)

CTC greedy decoding is $O(T \cdot V)$ for $T$ frames and vocabulary size $V$, dominated by the per-frame argmax; the collapse pass only needs to remember the previous symbol, so it adds $O(1)$ state beyond the output. Computing the CTC loss is $O(T \cdot |y|)$ per utterance via the forward-backward dynamic program, on top of the encoder's cost. An attention or transducer decoder costs more because it is autoregressive over output tokens, which is the price of dropping CTC's independence assumption.

Worked example

With a blank class at index 0, two confident frames for symbol 1 collapse to a single 1, a blank frame is dropped, and a final frame for symbol 2 is appended:

import numpy as np
logits = np.array([[0, 9, 0],   # argmax 1
                   [0, 9, 0],   # argmax 1 (repeat, collapsed)
                   [9, 0, 0],   # argmax 0 = blank (dropped)
                   [0, 0, 9]])  # argmax 2
print(ctc_greedy_decode(logits))   # [1, 2]

Follow-up questions

Why does CTC need a blank symbol? Blanks let the model emit "no output here" and separate repeated characters, so many frame alignments can collapse to the same text and be summed over.
What assumption makes CTC fast but limited? Conditional independence of the per-frame outputs given the audio, which enables the efficient forward algorithm and streaming but weakens modeling of output dependencies.
Why convert to a log-mel spectrogram first? It discards phase that does not affect the words and matches human pitch perception, giving a compact, robust representation.
Why pretrain on unlabeled audio? Labels are scarce; masked or contrastive pretraining (wav2vec 2.0, HuBERT) learns representations that fine-tune with a tiny fraction of the labeled data.
Why did Whisper favor data scale over architecture? Broad weakly-supervised data buys robustness to noise, accents, and languages, which transfers better than a cleverer model on narrow data.
CTC vs transducer for streaming? Both stream, but the transducer conditions on previous outputs via a prediction network, modeling dependencies CTC cannot.

References

Graves et al., Connectionist Temporal Classification (2006).
Graves, Sequence Transduction with Recurrent Neural Networks (RNN-T, 2012).
Chan et al., Listen, Attend and Spell (2016).
Baevski et al., wav2vec 2.0 (2020).
Gulati et al., Conformer (2020).
Hsu et al., HuBERT (2021).
Radford et al., Robust Speech Recognition via Large-Scale Weak Supervision (Whisper, 2022).
Gandhi et al., Distil-Whisper (2023).