AI · Stanley Jacob

Projects

Two-tier quality gate in CI

An eval harness for LLM-as-judge grading

When you build on top of a language model, you need to check whether its answers are any good, and there are far too many answers for a person to read them all. The usual fix is to have another model grade them, but that moves the problem: how do you know the grader is right? This project is my attempt to handle that carefully: I wrote a small benchmark of question-answer pairs, labeled each one by hand against a written rubric, and then measured how closely an automatic judge agrees with those human labels (Cohen's kappa, quadratic-weighted kappa, Spearman, MAE). A CI gate fails the build if the judge drifts away from human judgment, or if answer quality drops against a frozen baseline. It is a small study of 30 hand-labeled examples, but the calibration machinery is the part I cared about getting right.

LLM-as-judge · Evaluation & calibration · Python · Claude API · GitHub Actions · pytest

Read the project write-up · Source on GitHub ↗ · Calibration report
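As a rough illustration of the judge-versus-human tier of such a gate, here is a minimal sketch that computes quadratic-weighted kappa and MAE from two lists of integer rubric scores and turns them into a pass/fail result. It is not the project's code: the function names, the 1 to 5 scale, and the thresholds (`min_qwk`, `max_mae`) are all assumptions made for the example, and the frozen-baseline tier is not shown.

```python
from collections import Counter


def quadratic_weighted_kappa(human, judge, min_score, max_score):
    """Chance-corrected agreement on an ordinal scale; big misses cost more."""
    if not human or len(human) != len(judge):
        raise ValueError("need two non-empty rating lists of equal length")
    span = max_score - min_score
    if span < 1:
        raise ValueError("the scale needs at least two score levels")
    if any(not min_score <= s <= max_score for s in (*human, *judge)):
        raise ValueError("score outside the declared scale")
    n = len(human)
    observed = Counter(zip(human, judge))
    human_hist, judge_hist = Counter(human), Counter(judge)

    disagreement = 0.0  # weighted disagreement actually observed
    by_chance = 0.0     # weighted disagreement expected from the marginals
    for i in range(min_score, max_score + 1):
        for j in range(min_score, max_score + 1):
            weight = ((i - j) / span) ** 2
            disagreement += weight * observed[(i, j)] / n
            by_chance += weight * human_hist[i] * judge_hist[j] / (n * n)
    if by_chance == 0.0:
        # Both raters gave the same single score throughout; treating that
        # as perfect agreement is a choice made for this sketch.
        return 1.0
    return 1.0 - disagreement / by_chance


def mean_absolute_error(human, judge):
    """Average distance between the two raters, in rubric points."""
    return sum(abs(h - j) for h, j in zip(human, judge)) / len(human)


def judge_gate(human, judge, min_score=1, max_score=5,
               min_qwk=0.6, max_mae=0.75):
    """Hypothetical gate: thresholds are placeholders, not the project's."""
    qwk = quadratic_weighted_kappa(human, judge, min_score, max_score)
    mae = mean_absolute_error(human, judge)
    return {"qwk": qwk, "mae": mae,
            "passed": qwk >= min_qwk and mae <= max_mae}
```

In a pytest-based CI job, a test could simply assert `judge_gate(human_labels, judge_scores)["passed"]`, so that a drifting judge fails the build.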

Benchmarked against pyannote, demoed live

Speaker-attributed transcription, measured end to end

A transcript is only useful when it says who spoke, so this project joins Whisper, run through CTranslate2 for fast int8 inference, with a speaker-diarization pipeline assembled from open parts and unit-tested at every join. The output is scored by WER and diarization error rate (DER), both written from scratch and checked against hand-computed values. Calibrating the diarization pipeline's two constants on held-out data drops DER from 5.64% to 0.38% on the clean benchmark, where it beats the pretrained pyannote reference twice over; on real AMI meetings the ranking flips, and the write-up explains exactly why both results are true. There is a demo page with the transcript synced to the audio and ground truth overlaid, and a serving layer that streams speaker-attributed words from a browser tab about five seconds behind real time.

Whisper · CTranslate2 · Speaker diarization · ECAPA-TDNN · Python · GitHub Actions

Read the project write-up · Source on GitHub ↗
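For readers unfamiliar with the metric, word error rate is the word-level edit distance between a reference and a hypothesis transcript, divided by the number of reference words. The sketch below is one from-scratch way to compute it. It assumes whitespace tokenization with no text normalization, and the function name is hypothetical; the project's own scorer, and its DER implementation, may differ.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """(substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript has no words")

    # Levenshtein distance over words, keeping only one previous row.
    prev = list(range(len(hyp) + 1))  # cost of building hyp[:j] from nothing
    for i, ref_word in enumerate(ref, start=1):
        cur = [i]  # cost of deleting the first i reference words
        for j, hyp_word in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,                           # deletion
                cur[j - 1] + 1,                        # insertion
                prev[j - 1] + (ref_word != hyp_word),  # match or substitution
            ))
        prev = cur
    return prev[-1] / len(ref)
```

A hand-computed check in the spirit of the description above: against the reference "the cat sat on the mat", the hypothesis "the cat sit on mat" has one substitution and one deletion, so the expected WER is 2/6.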

Fine-tuned, served, and RL-trained on 4x H100

A full fine-tune of an 8B Rust coder, then RL with the compiler

Rather than reaching for LoRA, this project runs a full BF16 fine-tune of Qwen3-8B for idiomatic Rust on four NVLinked H100s. The corpus is 14,080 pairs harvested from crates.io through a license gate stricter than SPDX semantics, deduped, and blended 1:2 with anchor data against forgetting. The gates were chosen so the run could fail, and one fired: the domain metrics jumped (edit similarity 0.26 to 0.57, exact match 0% to 23%) and MBPP held, but HumanEval dropped 7.3 points, a real forgetting cost that an unmeasured fine-tune would have shipped. The model then quantizes to FP8, serves through vLLM, and answers a smoke prompt that this repo's own cargo judge verifies. A GRPO reinforcement pass with the compiler as the reward improves held-out pass rate by 2.5 points and compile rate by 4.4 points on 158 unseen problems.

Full fine-tune · FSDP2 · Qwen3-8B · Axolotl · FP8 · vLLM

Read the project write-up · Source on GitHub ↗
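To make "the compiler as the reward" concrete, here is a small sketch of the group-relative step that gives GRPO its name: several completions are sampled for one prompt, each gets a scalar reward, and each reward is normalized against its own group. The reward shaping below (0 for no compile, 0.5 for compiling, 1.0 for passing tests) is an assumption for illustration, as are the function names; the write-up defines the actual reward.

```python
from statistics import mean, pstdev


def compiler_reward(compiled: bool, tests_passed: bool) -> float:
    """Hypothetical shaping: the project's real reward values may differ."""
    if not compiled:
        return 0.0
    return 1.0 if tests_passed else 0.5


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each sample's reward against its own sampling group.

    Uses the population standard deviation; some implementations use the
    sample standard deviation instead.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps keeps the division defined when every sample scored the same.
    return [(r - mu) / (sigma + eps) for r in rewards]


# One prompt, four sampled completions judged by the compiler and tests.
outcomes = [(False, False), (True, False), (True, True), (True, False)]
rewards = [compiler_reward(c, t) for c, t in outcomes]
advantages = group_relative_advantages(rewards)
```

Completions that beat their group's average get a positive advantage and are reinforced; a group where every sample scores the same contributes no learning signal, since all advantages are zero.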

Visual defect inspection

Flags defective parts from product images, turning a manual visual QA step into an automated check. A fine-tuned vision transformer labels each image as pass or defect, with a confidence score.

ViT · PyTorch · Transfer learning

Model: vit-base-patch16-224 ↗

Document Q&A with RAG

Answers natural-language questions over a private document set and returns grounded answers with citations to the source passages, instead of hallucinating.

Mistral-7B · RAG · FAISS

Model: Mistral-7B-Instruct ↗

Text-to-image with LoRA

Generates on-brand images from text prompts and reproduces a custom subject or style after LoRA fine-tuning, producing synthetic training data on demand.

Stable Diffusion XL · Diffusers · LoRA

Model: SDXL base ↗

Semantic search & recommendations

Surfaces the most relevant items for a query by meaning rather than keywords, and recommends related content from the same embedding space.

Embeddings · sentence-transformers · FAISS

Model: all-mpnet-base-v2 ↗
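The idea of searching "by meaning" comes down to ranking items by how close their embedding vectors are to the query's. The sketch below shows that ranking step with brute-force cosine similarity over hand-written toy vectors. It is only an illustration: in the project the vectors would come from the embedding model and the lookup from a FAISS index, and the names `cosine` and `top_k` are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is zero."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec, item_vecs, k=3):
    """Return the k item names most similar to the query, best first."""
    scored = [(cosine(query_vec, vec), name) for name, vec in item_vecs.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [name for _, name in scored[:k]]


# Toy 2-d "embeddings"; real ones have hundreds of dimensions.
items = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
nearest = top_k([1.0, 0.1], items, k=2)
```

Recommendations from the same embedding space reuse the same function: pass an item's own vector as the query and drop the item itself from the results.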

Inference optimization benchmark

Lowers model serving latency and raises throughput by converting to ONNX and TensorRT and serving on Triton, with a before-and-after report of p50/p99 latency and throughput.

TensorRT · ONNX Runtime · Triton

Model: bert-base-uncased ↗
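A before-and-after report of this kind reduces to a few summary statistics over recorded request latencies. The sketch below computes p50 and p99 with the nearest-rank method, and throughput as requests per second of wall time. The function names and the choice of nearest-rank (rather than an interpolating percentile) are assumptions for the example, not a description of the project's benchmark script.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: smallest sample with at least p% at or below it."""
    if not samples:
        raise ValueError("no samples recorded")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


def summarize(latencies_ms, wall_seconds):
    """Summary row for one serving configuration."""
    return {
        "p50_ms": percentile(latencies_ms, 50),
        "p99_ms": percentile(latencies_ms, 99),
        "throughput_rps": len(latencies_ms) / wall_seconds,
    }
```

Running `summarize` once on the baseline deployment and once on the optimized one gives the two rows of the comparison; p99 is worth reporting alongside p50 because tail latency can move independently of the median.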

Machine learning

Decision-focused machine learning on tabular data, where the constraints are proof, fairness, and governance rather than scale.

Audited against planted bias

A credit risk model whose fairness audit can be proven to work

Replacing a bank's scoring model is a proof problem: the challenger must beat the incumbent, survive a fair-lending audit, and explain every decline. This project builds that whole arc on an 800,000-application simulated portfolio where protected attributes are causally inert and bias is injected deliberately, so the fairness suite is tested against ground truth: it has to catch the planted redlining proxy, and it does, rejecting the challenger with the best Gini. The mitigated champion keeps a +0.102 Gini lift over the incumbent scorecard, worth an estimated $13.8M a year in avoided default losses at a constant approval rate. Declined applicants get SHAP reason codes that sum to their actual score margin, and the governance set (model card, SR 11-7-style validation, monitoring plan) ships with it. A twelve-model zoo on the approved features maps the rest of the trade space: deep models tie the boosted trees, a glass-box EBM recovers 97% of the lift, and regulator-friendly monotone constraints price out at 0.053 Gini.

LightGBM · FT-Transformer · SHAP · Fair lending · FastAPI · Python

Read the project write-up · Source on GitHub ↗
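The Gini figures quoted above follow the usual credit-scoring convention, Gini = 2 × AUC − 1, where AUC is the probability that a randomly chosen defaulter is scored as riskier than a randomly chosen non-defaulter. Here is a minimal sketch of that calculation and of a "Gini lift" between two models. It assumes higher scores mean higher risk, uses a brute-force pairwise AUC that only suits small samples, and all function names are hypothetical.

```python
def auc(scores, defaulted):
    """Probability a defaulter outscores a non-defaulter (ties count half)."""
    bad = [s for s, d in zip(scores, defaulted) if d]
    good = [s for s, d in zip(scores, defaulted) if not d]
    if not bad or not good:
        raise ValueError("need at least one defaulter and one non-defaulter")
    wins = sum(
        1.0 if b > g else 0.5 if b == g else 0.0
        for b in bad
        for g in good
    )
    return wins / (len(bad) * len(good))


def gini(scores, defaulted):
    """Gini coefficient of a risk score: 0 is random, 1 is perfect ranking."""
    return 2.0 * auc(scores, defaulted) - 1.0


def gini_lift(challenger_scores, incumbent_scores, defaulted):
    """How much better the challenger ranks risk than the incumbent."""
    return gini(challenger_scores, defaulted) - gini(incumbent_scores, defaulted)
```

On a real portfolio a rank-based AUC is the practical choice, since the pairwise loop grows with the product of defaulters and non-defaulters.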

Notes