General code models spread their capacity across dozens of languages, and Rust, with its borrow checker, lifetimes, and distinctive error-handling idioms, is where that dilution shows. The purpose of this project is a model a Rust-heavy team could actually serve themselves: an 8B-parameter coder specialized on permissively licensed Rust through a full fine-tune, on rented hardware, with the data pipeline, the legal hygiene, and the evaluation gates all built rather than assumed. The full path has now run end to end: the corpus harvest, the fine-tune, the evaluation gates, a quantized serving deployment verified by the compiler, and a reinforcement learning pass that uses the compiler as its reward. The forgetting guardrail caught a real regression along the way, which is precisely the job it was built for. This page walks through the decisions first and the measured results after, because the decisions are what make the numbers mean something. The code is at github.com/gradientsj/rust-coder-ft.
The pipeline at a glance
Step 1: full fine-tune, not LoRA, and the arithmetic that allows it
LoRA, the usual budget option, freezes the model and trains small low-rank adapter matrices alongside it, cheap but capacity-limited. A full fine-tune updates every weight, which is the stronger tool for genuinely shifting a model's distribution toward a domain, and the reason it is rarely done at home is memory. The arithmetic for an 8B model: weights in bf16 take 16 GB, their gradients another 16 GB, and the optimizer's fp32 master copy plus Adam's two moment tensors another 96 GB, 128 GB in total before a single activation. No consumer GPU holds that, but FSDP (fully sharded data parallel) splits every one of those tensors across GPUs, so on four H100s the resident cost is about 32 GB per GPU, leaving roughly 45 GB for activations at a 4096-token sequence length with activation checkpointing. Sharding means the GPUs constantly exchange weight shards, which is why the hardware verification mattered: all four GPUs are fully NVLinked at roughly 900 GB/s bidirectional, making full resharding after every forward pass cheap enough that no hybrid scheme is needed.
The effective batch is 2 sequences per GPU times 4 GPUs times 8 gradient-accumulation steps, 64 sequences or roughly 256k tokens per optimizer step, at a learning rate of 1.5e-5 on a cosine schedule. Full fine-tunes can diverge early, so checkpoints are written every 25 steps with a rolling window of eight kept, cheap rollback insurance against a torched run. One detail from the base-model selection worth stating plainly: there is no dense 7-8B model in the Qwen3-Coder family, which ships only mixture-of-experts sizes, so the base is the dense Qwen3-8B and the code specialization is exactly what this project adds. Training data is formatted with the tokenizer's own chat template, including the empty think tags Qwen3 emits in non-thinking mode, so the format the model is trained on is byte-identical to the format it will be served with; a mismatch there is an easy way to lose quality and a hard one to debug.
Step 2: a corpus that survives a license review
The training data is harvested from crates.io, Rust's package registry, as published source tarballs rather than git checkouts, because the registry reports a machine-readable SPDX license expression per version while a repository's head is whatever its README claims. The license gate is deliberately stricter than the law requires: every token of the SPDX expression must be on the allowlist (MIT, Apache-2.0, BSD), so "MIT OR GPL-3.0", which a lawyer would accept by electing MIT, is rejected anyway, meaning no reviewer of this corpus ever has to reason about license election. Files carrying their own contradicting license header are dropped individually, and every surviving file logs its crate, version, and license, so any training example can be traced to its legal basis.
Quality filtering runs cheapest-first with every rejection counted: size bounds against stubs and vendored blobs, line-shape bounds against minified output, an alphanumeric-fraction check against binary-ish content, and marker detection against generated code (bindgen, protobuf, "do not edit"). Then deduplication, which matters twice over: duplicated training text wastes steps and encourages memorization, and any overlap between training and held-out data quietly inflates every evaluation number. Exact duplicates are caught by hashing whitespace-normalized text; near-duplicates, the renamed forks and vendored copies, are caught with MinHash, a sketching technique that estimates how much two documents' 7-token shingles overlap without comparing them pairwise, dropping anything above 85% estimated similarity.
Step 3: training pairs from documented functions
The supervised pairs come from documented public functions in the surviving files, in two shapes: the function's doc comment plus its signature with a todo!() stub, completing to the real implementation, and a natural-language instruction derived from the doc, answering with the implementation. Extracting a function body from Rust source sounds trivial and is not: a brace inside a string literal, a nested block comment (legal in Rust), or a lifetime tick that looks like a character literal will all derail naive brace matching, so the extractor first builds a character-level mask of what is real code versus comment, string, or char literal, and only then matches braces. Functions with fewer than 30 characters of documentation or three lines of body are skipped; what remains is the kind of pair where the prompt genuinely specifies the code.
Step 4: the forgetting guardrail
The known failure mode of domain fine-tuning is catastrophic forgetting: the model gets better at Rust and quietly worse at everything else, including general coding ability. Two defenses are built in. First, the training mix blends one domain pair with two anchor records drawn from open instruction datasets, split 40% code instructions, 40% general instructions, and 20% raw Rust text, so the fine-tune never sees a pure-domain distribution. Second, and more importantly, the harness runs HumanEval and MBPP, the standard general coding benchmarks, before and after training, and the delta is an explicit gate: a Rust gain that costs general coding is a failed run, not a shipped model.
Step 5: measuring with the compiler, not perplexity
Perplexity, the default language-model metric, rewards reproducing text, and code that resembles the reference can still fail to build. Rust offers something better for free: the compiler is the ground truth. The primary domain metric is therefore compile-pass rate on held-out Rust tasks, with cargo build, cargo test, and clippy (Rust's linter, as a proxy for idiomatic style) as graduated bars. After training, the FSDP shards consolidate back to a standard checkpoint, an FP8-quantized export follows for cheap serving (FP8 is an 8-bit floating-point format H100s execute natively), and a vLLM endpoint serves the result. Using FP8 during training itself was measured rather than assumed: on a one-batch overfit harness the FP8 loss curve matched BF16 step for step, but throughput dropped 37% because the torchao float8 path needs torch.compile to pay off, so the real run trained in BF16 and FP8 stayed a serving-side story.
The fine-tune, measured
The full-scale harvest attempted 201 crates and kept 180, with 21 rejected by the strict license gate doing exactly its job (rustls, for example, carries "Apache-2.0 OR ISC OR MIT" and is refused despite being electable). 4,916 files became 4,248 after quality filtering and dedup, yielding 14,080 domain pairs blended with 28,160 anchor records at exactly the 1:2 target, with a 668-pair domain holdout split at the file level so near duplicates of training files cannot leak into evaluation. Training ran 3 epochs, 225 optimizer steps, 54 minutes on the four H100s, with evaluation loss falling from 1.22 to 0.625 and essentially flat after the first epoch.
| held-out domain (100 pairs) | base Qwen3-8B | fine-tuned | delta |
|---|---|---|---|
| mean edit similarity | 0.260 | 0.566 | +0.306 |
| exact-match rate | 0.0% | 23% | +23 pts |
One measurement plan had to bend to reality: compile-pass rate on this holdout is void, because only 1 of the 100 held-out references compiles standalone, the rest referencing private crate internals. The harness was designed for this (reference-compile filtering admits a pair to the compile metric only when its reference solution itself compiles), and the compile signal instead comes from the served model's output passing build, clippy, and tests downstream.
| retention | before | after | delta |
|---|---|---|---|
| HumanEval pass@1 | 0.628 | 0.555 | -7.3 pts |
| MBPP pass@1 | 0.658 | 0.656 | -0.2 pts |
This is the guardrail earning its place. The domain gain is large and MBPP is fully retained, but HumanEval regressed 7.3 points, a real forgetting cost that an unmeasured fine-tune would have shipped silently. The diagnosis is plausible and testable: HumanEval is raw-completion Python, the protocol most distant from chat-format Rust training, and the queued levers are two epochs instead of three (the loss was flat after one), a larger general-instruct anchor share, and a raw-Python anchor pool alongside the raw-Rust one.
Serving the result
The trained checkpoint consolidates to a standard 8.19B-parameter model, quantizes data-free to FP8 (16.4 GB down to 9.44 GB, with the output head kept in bf16), and serves through vLLM. The smoke test closes the loop in the most satisfying way available: the served model is asked for a Rust function over the OpenAI-compatible endpoint, and the response is piped through this repo's own cargo judge, where it compiles, passes clippy clean, and passes its tests in the sandbox. The serving stack and the evaluation stack compose, which is what makes the next section possible.
Teaching with the compiler: a GRPO pass
Supervised fine-tuning imitates reference code; it cannot reward an answer for being correct, only for being similar. GRPO (group relative policy optimization) closes that gap with reinforcement learning: for each problem the model samples a group of candidate solutions, each is scored by a reward, and the policy shifts toward candidates that scored above their group's average, no learned reward model required when the reward is verifiable. Here the reward is the cargo judge with graded tiers (penalty for malformed output, credit for compiling, full credit for passing tests, a bonus for clippy-clean), and the problem banks are kept apart deliberately: 354 Rust-translated MBPP problems for training, verified compilable by the harness, and 158 HumanEvalPack-Rust problems never trained on for evaluation, with the contamination consequences for the Python retention metrics documented rather than discovered.
| 158 unseen Rust problems | fine-tuned | + GRPO | delta |
|---|---|---|---|
| pass rate | 36.1% | 38.6% | +2.5 |
| compile rate | 58.2% | 62.7% | +4.4 |
| format rate | 94.9% | 97.5% | +2.5 |
The proof-of-concept run was deliberately conservative, 120 steps of 128 completions in about 52 minutes, and the training reward stayed flat while every held-out axis moved in the reward's gradient direction, which is the pattern of a policy improving on what the reward actually measures rather than overfitting its training prompts. The obvious levers for a full run are written down next to the result: more steps, a curriculum filter that drops zero-variance groups, and a heavier clippy weight, since zero percent of passing solutions from either model are clippy-clean, a wide-open headroom for exactly the idiomatic-Rust objective this project exists for.
Limitations
This is one supervised run and one conservative RL proof-of-concept, not a sweep. The HumanEval regression is diagnosed but not yet fixed, and the fix list is speculation until V2 runs. Compile-pass on the crate-internal holdout is structurally unavailable, so the domain gain rests on edit similarity and exact match, which are weaker evidence than the compiler. Absolute HumanEval and MBPP numbers are protocol-dependent, which is why only the before-and-after deltas under an identical protocol are treated as the metric of record. And the GRPO evaluation bank is 158 problems, big enough to see consistent movement, small enough that a couple of points ride on a handful of solutions.
Links
- Source on GitHub
- The curated crate allowlist and license policy
- The primary training config (memory budget and decisions annotated inline)
- The build report (every phase, every failure, every fix, in order)