People record meetings, interviews, lectures, and support calls, and then the recordings mostly sit unused, because an hour of audio is opaque, and even a plain transcript of it cannot answer "who agreed to do what" or "what did the customer object to". The purpose of this project is to turn a raw audio file into the form in which a conversation becomes useful: a transcript that says who said what and when, searchable and quotable, built entirely from open components and cheap enough to run on hardware I already have.
A second purpose shaped the design more than the first: to measure the system well enough that the numbers can carry real decisions. The benchmark on this page answers the questions a team shipping this would actually face. Which Whisper size should be served, where the answer turned out not to be the biggest one. Whether int8 quantization is safe, where the answer is mostly, with one failure mode worth knowing about before deploying it. Where the diarization errors come from, where the answer was speaker counting and boundary trimming, both of which calibration later recovered on the clean benchmark. And whether a pipeline built from parts holds up against the pretrained pyannote reference, where the answer reverses between clean benchmarks and real meetings, and the reversal turned out to be the most useful finding on this page. Whisper, run through CTranslate2, produces the words; a diarization pipeline assembled from open parts decides the speakers; a small, fully tested alignment joins the two; and the scoring metrics are implemented from scratch and checked against examples worked out by hand, which step 4 explains concretely. The code is at github.com/gradientsj/speech-diarization-lab.
The pipeline at a glance
Two paths leave the audio file and meet at the join: the ASR path produces words with timestamps, the diarization path produces speaker turns, and the alignment assigns each word to the turn it overlaps most.
Step 1: transcription, and why CTranslate2
The words come from Whisper, OpenAI's open speech recognition model. It is an encoder-decoder transformer: the audio is first converted into a spectrogram, which is a picture of how much energy each frequency carries at each moment in time, the encoder reads that picture, and the decoder writes out text tokens one at a time, the same way a translation model writes a sentence.
The reference implementation runs in PyTorch, and for serving that is the slow way. faster-whisper runs the same trained weights on CTranslate2, an inference engine written in C++ specifically for transformer decoding. Its speedups are unglamorous and they add up: operations that PyTorch dispatches one by one are fused into single kernels, weights are laid out in memory for sequential access, and the beam search that picks the best transcription shares computation across its candidates. The option that matters most here is int8 quantization: the weights are stored as 8-bit integers instead of 16-bit floating point numbers, which halves the memory the processor has to stream and lets matrix multiplies run in cheap integer arithmetic. Quantization rounds every weight to its nearest representable value, and rounding can cost accuracy, so whether int8 is safe is not something to assume; it is one of the things the benchmark below measures, and the answer turned out to have an interesting exception.
Speed is reported as a real-time factor, RTF, which is processing time divided by audio duration: an RTF of 0.1 means one minute of audio transcribes in six seconds. Whisper is also asked for timestamps per word rather than per sentence, because the speaker assignment in step 3 happens word by word.
Step 2: diarization built from parts
Diarization answers "who spoke when" without knowing who anyone is. The system has to discover how many voices a recording contains and which spans of time belong to which voice, with no enrollment and no names. Most products call a pretrained end-to-end pipeline for this and treat it as a black box; this one is assembled from four visible stages instead, because the joins between stages are where such systems quietly fail, and a join you can see is a join you can test.
First, find the speech. Silero VAD is a small open voice-activity-detection network. It slides along the waveform in chunks a few tens of milliseconds long and outputs, for each chunk, the probability that someone is speaking in it. Keeping the chunks above a threshold and joining adjacent ones produces speech regions: spans like 0.4 s to 7.2 s with talking inside and silence or noise outside. Everything downstream looks only inside these regions, because a voice fingerprint computed over silence describes the room, not a person.
Second, turn the speech into voice fingerprints. A speaker embedding model maps a short clip of speech to a list of 192 numbers. ECAPA-TDNN is such a model, trained on a large corpus of labeled speakers for the task of speaker verification, deciding whether two clips come from the same person. That training pushes clips of the same voice toward nearby vectors and clips of different voices far apart, so the distance between two fingerprints measures how unlikely they are to be the same speaker. The speech regions are cut into windows 1.5 seconds long, one starting every 0.75 seconds, and each window gets its fingerprint. Windows are short so that each one mostly contains a single voice, and they never cross a silence boundary, which is where the speaker is most likely to have changed.
Third, group the fingerprints. At this point there are a few dozen vectors and no labels, which is a clustering problem. Agglomerative clustering starts with every window as its own cluster and repeatedly merges the two closest clusters, stopping when the closest remaining pair is further apart than a threshold. Closeness here is cosine distance, the angle between two vectors with their lengths ignored, and the distance between two clusters is the average over all pairs of their members, called average linkage, which keeps one stray window from bridging two otherwise clean groups. However many clusters remain when the merging stops is the number of speakers the system believes it heard. That stopping threshold is the one real hyperparameter in the pipeline: too loose and two people merge into one speaker, too strict and one person splits into two. The results below show that this is exactly where the remaining error lives.
Fourth, draw the turns. Consecutive windows that landed in the same cluster merge into one span of speech per speaker. Where two overlapping windows disagree about the speaker, the boundary goes at the midpoint of the overlap, since neither window has better evidence inside the contested region.
The pretrained pyannote 3.1 pipeline is wired in as a second backend behind the same interface, so the from-parts pipeline gets judged against a strong reference under identical metrics rather than only against itself.
Step 3: joining words to speakers
The transcriber and the diarizer run independently and disagree at the edges, so the rules of the join decide the quality of the final transcript. The main rule is maximum overlap, and it is easiest to see with numbers. Suppose the diarizer decided speaker A holds 12.0 to 15.4 seconds and speaker B holds 15.4 to 19.0, and Whisper timed the word "really" at 15.3 to 15.6 seconds. The word overlaps A's turn by 0.1 seconds and B's by 0.2, so it goes to B. Exact ties go to the earlier turn, so running the pipeline twice gives the same transcript. A word that touches no turn at all, which happens when the diarizer's speech detection trimmed more audio than Whisper did, is given to the nearest turn only if the gap is under a second; past that it stays unattributed, because in a transcript a visible gap is more useful than a confident guess. Finally, consecutive words by the same speaker group into display segments, breaking whenever the speaker changes or a pause runs longer than a second, which is what keeps subtitles readable.
Step 4: scoring you can audit
Two numbers decide whether any of the above actually works, so both are implemented in plain Python in the repository rather than imported from a library, and "checked against examples worked out by hand" means exactly that: each test case was computed on paper first, and the code is required to reproduce it. WER, word error rate, lines the system's transcript up against the reference and counts the cheapest set of edits that turns one into the other. If the reference is "the cat sat" and the system wrote "the cat sat down", that is one inserted word against three reference words, a WER of 1/3. The test suite holds dozens of such cases, substitutions, deletions, insertions, and combinations, each with its precomputed answer.
DER, diarization error rate, scores who spoke when as time rather than words, and it splits the error three ways: miss is reference speech the system attributed to nobody, false alarm is claimed speech where nobody was talking, and confusion is speech handed to the wrong person. Two subtleties make it easy to get wrong. The system's labels are arbitrary, its SPEAKER_00 might be the reference's bob, so before scoring, the labels are paired up by whichever assignment explains the most time. And a quarter-second collar around every true boundary is excluded from scoring, because even human annotators cannot place a turn boundary more precisely than that. The hand-worked test cases include the awkward situations: two people talking at once, every span given to the wrong speaker, boundaries jittered just inside the collar. Where a metric is undefined it returns NaN, and NaN is treated as failure, never as a pass, because "this could not be measured" must never read as a perfect score.
The benchmark is built rather than downloaded: single-speaker LibriSpeech utterances are interleaved into 12 seeded conversations of two to four speakers, 10.9 minutes of audio in total, so the turn boundaries and transcripts are exact by construction and anyone can rebuild the identical benchmark with no credentials. The trade-off is stated plainly: there is no overlapped speech, no channel mismatch, and the acoustics are read speech, so these scores are an upper bound on conversational performance, useful for comparing systems under identical conditions rather than for quoting as real-world accuracy.
Results
All numbers are pooled over the 12 mixtures. GPU rows ran on an
NVIDIA A10, CPU rows on my desktop, and the per-mixture data
lives in the repo's reports/ directory.
| model | compute | hardware | WER | RTF |
|---|---|---|---|---|
| tiny | float16 | A10 | 5.50% | 0.015 |
| base | float16 | A10 | 3.87% | 0.016 |
| small | float16 | A10 | 2.84% | 0.022 |
| large-v3 | float16 | A10 | 4.23% | 0.087 |
| large-v3 | int8_float16 | A10 | 4.59% | 0.098 |
| large-v3 | int8 | A10 | 3.87% | 0.083 |
| tiny | int8 | CPU | 5.68% | 0.117 |
| base | int8 | CPU | 3.57% | 0.185 |
| small | int8 | CPU | 6.17% | 0.559 |
Two findings in that table were worth the whole exercise. The first is that small beats large-v3 here, 2.84% against 4.23% at float16, at a quarter of the runtime cost; on clean read speech the largest model buys nothing, which is why serving decisions should come from a measured table rather than from model reputation. The second is hiding in the last row: int8 quantization is accuracy-neutral at every other size, but small at int8 on CPU matches float16 on 11 of the 12 mixtures and then silently drops whole utterances on the twelfth, deleting 59 of its 157 reference words and inflating that mixture's WER to 40.1% against 3.8% at float16. The deletions come in consecutive runs, which points at the quantized model misjudging stretches of real speech as silence and skipping past them, rather than mistranscribing them. A pooled average would have read as "int8 is a bit worse"; the per-mixture rows are what turned it into "int8 can silently skip speech", which is the kind of tail failure a serving system needs to know exists.
| diarizer | constants | DER | miss | false alarm | confusion |
|---|---|---|---|---|---|
| built from parts | calibrated (0.75 / 0.25) | 0.38% | 0.32% | 0.06% | 0.00% |
| built from parts | original defaults (0.60 / 0.05) | 5.64% | 4.36% | 0.00% | 1.28% |
| built from parts | oracle speaker count | 4.36% | 4.36% | 0.00% | 0.00% |
| pyannote 3.1 (reference) | as shipped | 9.80% | 7.57% | 0.00% | 2.24% |
Three things in that table carried the project forward. First, the diarization number is identical in every ASR configuration and identical between the Windows CPU and Linux A10 runs per mixture to three decimals: the pipeline is decoupled from the ASR and deterministic across platforms. Second, the oracle row was the diagnosis that drove the calibration: told the true number of speakers, the clusterer attributes nothing to the wrong person, so the original 1.28% confusion was all speaker counting and the 4.36% floor was missed speech at the VAD boundaries, and the next section shows both components being recovered without giving the system the answer. Third, the from-parts pipeline beats the pretrained reference here, and the caveat matters as much as the number: pyannote's extra miss (7.57% against 4.36%) comes from trimming utterance boundaries more aggressively on exactly the kind of clean, hard-cut read speech this benchmark constructs. The claim has to stay narrow, on this regime the simple pipeline is sufficient, and the real test of the reference comes further down.
Calibrating the two constants, with held-out discipline
The pipeline has two tunable numbers: the clustering distance threshold and the VAD edge padding. Each was swept on six mixtures and scored on the six it never saw, so the chosen value has to generalize rather than memorize the calibration set. The threshold sweep lands on 0.75, where the speaker count is right on all 12 mixtures and confusion is exactly zero, and it exposes an asymmetric failure mode worth knowing: below the optimum the DER curve degrades gently, but above 0.80 it collapses (8.5% at 0.85, 31.7% at 0.90) because cutting too high merges two real speakers into one cluster, which no amount of later mapping can forgive. When in doubt, err low.
The padding sweep then attacked the miss floor and found it was VAD edge trimming all along: padding speech regions by 250 ms takes pooled DER from 4.36% to 0.38% with the speaker count still right everywhere. One caveat belongs next to that number: the calibrated pad equals the scoring collar, the 0.25 s around each true boundary that DER deliberately does not score, so part of the gain is recovering speech in a region the metric forgives by design. That is why the collar is stated with every number rather than buried in a config file.
Overlapped speech
A second benchmark set makes consecutive speakers overlap with probability 0.5, summing the waveforms, so 6.5% of all speech time has two people talking at once. With no recalibration, the from-parts pipeline scores 4.50% DER and the structure of the error is exactly what the architecture predicts: the pipeline emits one speaker at a time by construction, so almost the whole 4.13% miss component is the overlapped time it cannot represent. The expected upset did not happen, though: pyannote 3.1 can emit overlapping turns and this set was its chance, but it scored 11.73%, with more miss than the simple pipeline and confusion growing to 4.10%. Transcription also pays a structural tax, with WER rising from 2.84% to 3.75%, since Whisper decodes one stream and words spoken over other words are unrecoverable.
Real meetings, where the ranking flips
The last set drops the synthetic crutch entirely: three four-speaker meetings from the AMI corpus test partition, 77 minutes of real conversation with genuinely overlapping reference turns, scored by the same DER implementation with nothing recalibrated. Here the reference finally wins, 13.25% against 18.71% for the from-parts pipeline on its read-speech constants, and that flip is the point of the whole exercise: the simple pipeline is sufficient exactly up to the regime it was calibrated for, and a benchmark on which the pretrained model could never win was measuring the wrong thing.
The failure is specific and instructive. The 0.75 threshold, calibrated on read speech, estimates 24 to 50 speakers for four-person meetings, because spontaneous speech scatters embeddings far wider than read speech. The calibration procedure transfers even though the constant does not: rerunning the sweep on three AMI development meetings (never the test meetings) chose 0.80 and closed 40% of the gap to the reference (18.71% to 16.55%), but no fixed-distance cut will ever count speakers correctly on this audio, which makes a proper stopping rule the top roadmap item. Long real audio also surfaced a second ASR tail failure: on one meeting, small at float16 fell into a repetition loop (92.5% WER on that file) while the same model at int8 scored 23.6% on the same audio, reproduced across two runs, and once again invisible in pooled averages until the per-file rows were read.
Hearing it, and serving it
Scores do not convey what output feels like, so there is a demo page with three benchmark mixtures and the speaker-attributed transcript synced to the audio: a per-speaker timeline, words that highlight as they are spoken, click-any-word-to-seek, and a toggle that draws the ground-truth turns under the predictions so the errors are visible rather than just scored. The transcripts are raw pipeline output, hallucinations included; one mixture ends with Whisper inventing "Thank you for watching." over silence, plainly audible in one listen and invisible in a pooled WER. The repo also grew a serving layer: a FastAPI job API that accepts an upload and streams speaker-attributed segments over a WebSocket as the decode produces them, and a live mode that captures a browser tab or microphone and returns attributed words about five seconds behind real time, with speaker identities held stable across the session by an online centroid tracker whose matching threshold was measured on the benchmark (same-speaker centroids sit at 0.20 to 0.42 cosine distance, different speakers at 0.58 and up, so it cuts at 0.50).
Limitations
The clean and overlapped sets are synthetic, and the real-data check is three AMI meetings, enough to flip the ranking but not to characterize it. Speaker counting on spontaneous speech is broken in a way no constant fixes, since fixed-distance dendrogram cutting estimates 13 to 32 speakers on four-person meetings even after domain calibration, and a real stopping rule is the first roadmap item. The overlap miss is structural until the pipeline can assign a window to two speakers at once. The ASR repetition loop on long audio wants a decode guard and a regression test. And everything is one run per configuration; the numbers come with their configs in the repo so anyone can shake them.
Links
- The demo: listen to the output with the transcript synced and ground truth overlaid
- Source on GitHub
- The results section of the README (full tables, sweeps, and findings)
- Per-mixture reports and sweep grids (every number above, with its configuration)