Nearly everything here is built on the transformer, the architecture that now powers language, vision, audio, and speech models alike. I studied these models in school just as they were emerging, and what holds my attention now is less the architecture itself than what you can build on it: applications that retrieve, remember, and measure their own output rather than models in isolation. The projects cover the areas I keep coming back to: speech and natural language, retrieval-augmented generation, computer vision, and making models cheaper to adapt and serve with techniques like LoRA, along with notes on what I learn as I go.
When you build on top of a language model, you need to check whether its answers are any good, and there are far too many answers for a person to read them all. The usual fix is to have another model grade them, but that moves the problem: how do you know the grader is right? This project is my attempt to handle that carefully: I wrote a small benchmark of question-answer pairs, labeled each one by hand against a written rubric, and then measured how closely an automatic judge agrees with those human labels (Cohen's kappa, quadratic-weighted kappa, Spearman, MAE). A CI gate fails the build if the judge drifts away from human judgment, or if answer quality drops against a frozen baseline. It is a small study of 30 hand-labeled examples, but the calibration machinery is the part I cared about getting right.
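As an illustration of the agreement check described above, here is a minimal sketch of Cohen's kappa (unweighted or quadratic-weighted) and MAE between human and judge labels. It is not the project's actual code: the function names, the integer label encoding `0..num_classes-1`, and the `min_kappa=0.6` gate threshold are all assumptions made for the example.

```python
from collections import Counter


def weighted_kappa(human, judge, num_classes, quadratic=True):
    """Cohen's kappa between two raters using integer labels 0..num_classes-1.

    quadratic=True penalizes a disagreement by its squared distance;
    quadratic=False gives the plain (unweighted) kappa.
    """
    if len(human) != len(judge) or not human:
        raise ValueError("need two equal-length, non-empty label lists")
    if num_classes < 2:
        raise ValueError("need at least two classes")
    n = len(human)
    span = (num_classes - 1) ** 2

    def penalty(i, j):
        if quadratic:
            return (i - j) ** 2 / span
        return 0.0 if i == j else 1.0

    # Disagreement actually observed between the two raters.
    observed = sum(penalty(h, j) for h, j in zip(human, judge)) / n

    # Disagreement expected by chance, from each rater's label frequencies.
    h_counts, j_counts = Counter(human), Counter(judge)
    expected = sum(
        penalty(i, j) * h_counts[i] * j_counts[j]
        for i in range(num_classes)
        for j in range(num_classes)
    ) / n**2

    if expected == 0:
        # Both raters used one identical label throughout; treating that as
        # perfect agreement is a convention chosen for this sketch.
        return 1.0
    return 1.0 - observed / expected


def mean_absolute_error(human, judge):
    """Average absolute gap between human and judge scores."""
    return sum(abs(h - j) for h, j in zip(human, judge)) / len(human)


def judge_passes_gate(human, judge, num_classes, min_kappa=0.6):
    """Hypothetical CI gate: True when agreement stays above the threshold."""
    return weighted_kappa(human, judge, num_classes) >= min_kappa
```

A CI job could call `judge_passes_gate` on the hand-labeled set and exit non-zero when it returns `False`; where to set the threshold is a judgment call that depends on the rubric.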
A transcript is only useful when it says who spoke, so this project joins Whisper, run through CTranslate2 for fast int8 inference, with a speaker-diarization pipeline assembled from open parts and unit-tested at every join. It is scored by word error rate (WER) and diarization error rate (DER), both written from scratch and checked against hand-computed values. Calibrating the pipeline's two constants on held-out data drops DER from 5.64% to 0.38% on the clean benchmark, where it beats the pretrained pyannote reference twice over; on real AMI meetings the ranking flips, and the write-up explains exactly why both results are true. There is a demo page with the transcript synced to the audio and ground truth overlaid, and a serving layer that streams speaker-attributed words from a browser tab about five seconds behind real time.
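For readers unfamiliar with WER, a from-scratch version can be sketched as a word-level edit distance divided by the reference length. This is an illustrative sketch, not the project's implementation: the function name `wer` and the whitespace tokenization are assumptions, and real scoring normally normalizes case and punctuation first.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + deletions + insertions) / reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words (standard Levenshtein recurrence).
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1        # reference word missing from hypothesis
            insertion = curr[j - 1] + 1   # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)
```

Small cases are easy to check by hand: dropping one word from a six-word reference gives 1/6, and swapping one word in a three-word reference gives 1/3.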
Rather than reaching for LoRA, this project runs a full BF16 fine-tune of Qwen3-8B for idiomatic Rust on four NVLinked H100s. The corpus is 14,080 pairs harvested from crates.io through a license gate stricter than SPDX semantics, deduped, and blended 1:2 with anchor data to guard against forgetting. The gates were chosen so the run could fail, and one fired: the domain metrics jumped (edit similarity 0.26 to 0.57, exact match 0 to 23%) and MBPP held, but HumanEval dropped 7.3 points, a real forgetting cost that an unmeasured fine-tune would have shipped. The model is then quantized to FP8, served through vLLM, and answers a smoke prompt that this repo's own cargo judge verifies. A GRPO reinforcement pass with the compiler as the reward improves held-out pass rate by 2.5 points and compile rate by 4.4 points on 158 unseen problems.
Flags defective parts from product images, turning a manual visual QA step into an automated check. A fine-tuned vision transformer labels each image pass or defect with a confidence score.
Answers natural-language questions over a private document set and returns grounded answers with citations to the source passages, instead of hallucinating.
Generates on-brand images from text prompts and reproduces a custom subject or style after LoRA fine-tuning, producing synthetic training data on demand.
Surfaces the most relevant items for a query by meaning rather than keywords, and recommends related content from the same embedding space.
Lowers model serving latency and raises throughput by converting to ONNX and TensorRT and serving on Triton, with a before-and-after report of p50/p99 latency and throughput.
Decision-focused machine learning on tabular data, where the constraints are proof, fairness, and governance rather than scale.
Replacing a bank's scoring model is a proof problem: the challenger must beat the incumbent, survive a fair-lending audit, and explain every decline. This project builds that whole arc on an 800,000-application simulated portfolio where protected attributes are causally inert and bias is injected deliberately, so the fairness suite is tested against ground truth: it has to catch the planted redlining proxy, and it does, rejecting the challenger with the best Gini. The mitigated champion keeps a +0.102 Gini lift over the incumbent scorecard, worth an estimated $13.8M a year in avoided default losses at a constant approval rate. Declined applicants get SHAP reason codes that sum to their actual score margin, and the governance set (model card, SR 11-7 style validation, monitoring plan) ships with it. A twelve-model zoo on the approved features maps the rest of the trade space: deep models tie the boosted trees, a glass-box EBM recovers 97% of the lift, and regulator-friendly monotone constraints price out at 0.053 Gini.
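The Gini figures above are presumably the usual credit-scoring Gini, which equals 2 × AUC − 1. Under that assumption, a lift between two models can be sketched as follows; the function names are hypothetical, higher scores are assumed to mean higher default risk, and the pairwise loop is only meant for small samples.

```python
def gini(scores, defaults):
    """Gini coefficient of a risk score, computed as 2 * AUC - 1.

    scores: higher means riskier (an assumption of this sketch).
    defaults: 1 if the applicant defaulted, 0 otherwise.
    """
    bad = [s for s, d in zip(scores, defaults) if d == 1]
    good = [s for s, d in zip(scores, defaults) if d == 0]
    if not bad or not good:
        raise ValueError("need at least one default and one non-default")

    # AUC is the share of (defaulter, non-defaulter) pairs ranked correctly,
    # with ties counted as half. O(len(bad) * len(good)), fine for a sketch.
    wins = sum(
        1.0 if b > g else 0.5 if b == g else 0.0
        for b in bad
        for g in good
    )
    auc = wins / (len(bad) * len(good))
    return 2.0 * auc - 1.0


def gini_lift(challenger_scores, incumbent_scores, defaults):
    """Difference in Gini between two models scored on the same applicants."""
    return gini(challenger_scores, defaults) - gini(incumbent_scores, defaults)
```

For example, `gini([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1])` ranks three of the four defaulter and non-defaulter pairs correctly, so AUC is 0.75 and Gini is 0.5.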
Write-ups on models, papers, and coursework I'm working through.