What it takes to train an LLM

Training a large language model is, at its core, fitting a single objective (predict the next token) over an enormous corpus. That one self-supervised target is enough to induce grammar, world knowledge, and patterns of reasoning, because compressing text well demands all of them (Brown et al. 2020). The conceptual recipe is short; almost all of the difficulty is engineering: curating the data, fitting the model across many machines, and keeping a run that lasts days or weeks numerically stable.

It pays to separate the pipeline into stages, because each has distinct failure modes and the cost of a mistake grows with scale. Data is gathered and filtered, then tokenized; the model is trained by optimizing a loss over those tokens; and finally the pretrained model is adapted to follow instructions. One of the most transparent end-to-end accounts of this pipeline is the OLMo report (Groeneveld et al. 2024), which releases data, code, and logs alongside the weights.

Data: the dominant lever

Model quality is set more by data than by architecture, so curation is where much of the real work lives. Web crawl, books, and code are filtered for quality and then deduplicated, and the reason matters: near-duplicate documents make the model waste capacity memorizing repeats, and copies of benchmark material in the crawl leak test sets into training, so evaluations overstate ability. The mixture is tuned deliberately, since adding code is reported to improve reasoning and adding many languages to improve multilingual transfer (Llama Team 2024). Because a model can only learn what is present, the corpus is effectively part of the architecture.
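
The two checks described above can be sketched in a few lines. This is a minimal illustration, not how any particular lab does it: the function names, the normalization rule, and the n-gram length are all assumptions, and production pipelines typically use fuzzier near-duplicate detection (for example MinHash) rather than exact hashing.

```python
import hashlib
import re

def normalize(text):
    # Lowercase and collapse whitespace so trivial variants hash identically.
    return re.sub(r"\s+", " ", text.lower()).strip()

def dedupe(docs):
    # Keep the first document seen for each normalized-content hash.
    seen, kept = set(), []
    for doc in docs:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(doc)
    return kept

def ngrams(text, n):
    # Set of word n-grams in the normalized text.
    words = normalize(text).split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def overlaps_benchmark(doc, benchmark_items, n=8):
    # Flag a document that shares any word n-gram with a held-out test item.
    doc_grams = ngrams(doc, n)
    return any(doc_grams & ngrams(item, n) for item in benchmark_items)

docs = ["The cat sat.", "the  cat sat.", "A different page."]
print(dedupe(docs))  # ['The cat sat.', 'A different page.']
```

Exact hashing only catches copies that are identical after normalization; the n-gram check is the simplest form of a contamination filter and would be run against every evaluation set one intends to report.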

Tokenization

Models read tokens (subword units produced by a tokenizer), not characters or words. Byte-pair encoding (Sennrich et al. 2016) starts from individual characters, or from raw bytes in byte-level variants, and repeatedly merges the most frequent adjacent pair into a new symbol, building a fixed vocabulary of common fragments; SentencePiece (Kudo & Richardson 2018) packages this to run directly on raw text. The why behind subwords is a balance: frequent words become one token while rare words split into a few pieces, and a byte-level fallback guarantees nothing is ever out of vocabulary. Vocabulary size is itself a trade-off, since a larger vocabulary shortens sequences and makes attention cheaper but enlarges the embedding and output-projection matrices.
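
To make the vocabulary-size trade-off concrete, the parameter cost of the embedding and output-projection matrices is simple arithmetic. The sizes below are hypothetical, chosen only for illustration, and the helper name is not from any library.

```python
def embedding_params(vocab_size, d_model, tied=False):
    # Input embedding plus output projection, each vocab_size x d_model.
    # Tying the two shares a single matrix.
    return vocab_size * d_model * (1 if tied else 2)

# Hypothetical model width of 4096 with two candidate vocabulary sizes.
small = embedding_params(32_000, 4096)
large = embedding_params(128_000, 4096)
print(small, large)  # 262144000 1048576000
```

Quadrupling the vocabulary quadruples these matrices, which matters most for small models where they are a large share of the total.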

The objective

At every position the model emits a distribution over the vocabulary, and training minimizes the average cross-entropy of the true next token,

$$\mathcal{L}(\theta) = -\frac{1}{T}\sum_{t=1}^{T} \log p_\theta\!\left(x_t \mid x_{<t}\right).$$

Cross-entropy is the right loss because minimizing it is exactly maximum-likelihood estimation, and it penalizes confident mistakes sharply (the $-\log p$ term blows up as the probability assigned to the true token goes to zero). Causal masking makes every position a valid prediction, so one forward and backward pass over a length-$T$ sequence yields $T$ supervised signals, which is what makes pretraining efficient in wall-clock terms. The loss is usually reported as perplexity, $\mathrm{ppl} = e^{\mathcal{L}}$, which reads as the effective number of tokens the model is choosing between at each step. Because of the exponential, improvements are multiplicative rather than additive: lowering the loss by $\ln 2 \approx 0.69$ nats halves the perplexity.
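
A small numeric check of the loss and of the perplexity reading, as a sketch. The input is simply the list of probabilities the model assigned to each true next token; the function names are illustrative.

```python
import math

def cross_entropy(probs_of_true_token):
    # Average negative log-probability assigned to the true next tokens.
    return -sum(math.log(p) for p in probs_of_true_token) / len(probs_of_true_token)

def perplexity(loss):
    # Exponentiated loss: the effective number of equally likely choices.
    return math.exp(loss)

# A model that is always torn between 4 equally likely tokens.
uniform = [1 / 4] * 8
loss = cross_entropy(uniform)
print(round(loss, 4), round(perplexity(loss), 4))  # 1.3863 4.0

# One confident mistake (p = 0.001 on the true token) costs about 6.9 nats.
print(round(-math.log(0.001), 2))  # 6.91
```

Subtracting $\ln 2$ from the loss in this example would bring the perplexity from 4 to 2, which is the multiplicative behaviour of the metric.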

Optimization

The standard optimizer is AdamW (Loshchilov & Hutter 2019). Adam keeps running estimates of the gradient mean $m_t$ and uncentered variance $v_t$ and steps $\theta \leftarrow \theta - \eta\, \hat m_t / (\sqrt{\hat v_t} + \epsilon)$, which rescales each coordinate by its own recent gradient magnitude. The why is that language-model loss surfaces are badly conditioned: parameters need very different effective learning rates, and a single global rate would be too large for some directions and too small for others. AdamW's contribution is to apply weight decay separately from that adaptive rescaling, so regularization is not distorted by the per-coordinate scaling. The learning rate uses a short linear warmup, because the early $\hat v_t$ estimates are noisy and a large step at that point can destabilize the run, followed by a cosine decay to a small final value. Gradient clipping by global norm caps the occasional outsized gradient so one bad step cannot blow up the weights.
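
The three pieces of this recipe (schedule, clipping, and the AdamW update) fit in a short pure-Python sketch over flat lists of floats. The hyperparameter values, the function names, and the toy quadratic loss are assumptions for illustration only; real training code would use a framework's optimizer.

```python
import math

def lr_at(step, max_lr, warmup_steps, total_steps, min_lr):
    # Linear warmup to max_lr, then cosine decay down to min_lr.
    if step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * min(1.0, progress)))

def clip_by_global_norm(grads, max_norm):
    # Rescale the whole gradient vector if its L2 norm exceeds max_norm.
    norm = math.sqrt(sum(g * g for g in grads))
    scale = min(1.0, max_norm / (norm + 1e-12))
    return [g * scale for g in grads]

def adamw_step(params, grads, m, v, t, lr,
               beta1=0.9, beta2=0.95, eps=1e-8, weight_decay=0.1):
    # t is the 1-based step count used for bias correction.
    for i, g in enumerate(grads):
        m[i] = beta1 * m[i] + (1 - beta1) * g          # running gradient mean
        v[i] = beta2 * v[i] + (1 - beta2) * g * g      # running uncentered variance
        m_hat = m[i] / (1 - beta1 ** t)
        v_hat = v[i] / (1 - beta2 ** t)
        # Decoupled weight decay: applied to the parameter directly,
        # outside the adaptive m_hat / sqrt(v_hat) rescaling.
        params[i] -= lr * (m_hat / (math.sqrt(v_hat) + eps) + weight_decay * params[i])
    return params

# Toy run on the loss sum(p^2), whose gradient is 2p.
params, m, v = [1.0, -2.0], [0.0, 0.0], [0.0, 0.0]
for step in range(100):
    grads = clip_by_global_norm([2 * p for p in params], max_norm=1.0)
    lr = lr_at(step, max_lr=0.1, warmup_steps=10, total_steps=100, min_lr=0.01)
    adamw_step(params, grads, m, v, t=step + 1, lr=lr)
print(params)
```

Note that on the very first step the bias-corrected ratio is close to the sign of the gradient, so every coordinate moves by roughly the learning rate regardless of gradient scale. That is one way to see why an unwarmed, large initial rate is risky.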

Memory and numerical precision

Large models typically train in bf16 rather than fp16, because bf16 keeps the full 8-bit exponent of fp32 and therefore the same dynamic range, so gradients rarely overflow or underflow; fp16's 5-bit exponent has a narrow range that forces loss-scaling to stop small gradients from flushing to zero (Micikevicius et al. 2018). Memory is dominated by two things people often forget. Adam's optimizer state is two extra values per parameter, usually kept in fp32, so it can take more memory than the parameters themselves; ZeRO (Rajbhandari et al. 2020) shards optimizer state, gradients, and parameters across devices to remove that redundancy. And activations, retained for the backward pass, dominate the rest; activation checkpointing stores only a few and recomputes the others, trading roughly one extra forward pass for a large memory saving.
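
A back-of-the-envelope estimate shows why sharding matters. The byte layout below is an assumption (bf16 weights and gradients, plus an fp32 master copy of the weights and two fp32 Adam moments, which is the accounting used in the ZeRO paper); the 7B model size and the device count are hypothetical, and activations are left out entirely.

```python
def training_state_bytes(num_params, shards=1):
    # Assumed mixed-precision layout, per parameter:
    #   2 bytes bf16 weight + 2 bytes bf16 gradient
    #   + 4 bytes fp32 master weight + 4 + 4 bytes fp32 Adam moments.
    weights = 2 * num_params
    grads = 2 * num_params
    optimizer = (4 + 4 + 4) * num_params
    # Sharding all three across devices divides the per-device total.
    return (weights + grads + optimizer) / shards

GB = 1e9
n = 7e9  # hypothetical 7B-parameter model
print(training_state_bytes(n) / GB)       # 112.0 (GB, unsharded)
print(training_state_bytes(n, 64) / GB)   # 1.75 (GB per device across 64 devices)
```

Under these assumptions the optimizer-related state is 12 of the 16 bytes per parameter, which is why it is the first thing to shard.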

Parallelism

No single accelerator holds a frontier model, so training combines several kinds of parallelism. Data parallelism replicates the model and splits the batch, averaging gradients across replicas; tensor parallelism splits individual weight matrices across devices (Shoeybi et al. 2019); and pipeline parallelism assigns different layers to different devices and streams micro-batches through them. Gradient accumulation reaches a very large effective batch without holding it all at once, by summing gradients over several micro-batches before each optimizer step. Keeping thousands of devices busy and recovering from inevitable failures is itself a major systems effort, which the Llama 3 report documents in unusual detail.
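
Both data parallelism and gradient accumulation rest on the same identity: for a loss that is a mean over examples, averaging the gradients of equal-sized micro-batches gives the full-batch gradient. The sketch below checks this on a toy one-parameter least-squares model; the model, data, and function names are illustrative assumptions.

```python
def grad_mse(w, batch):
    # d/dw of mean((w * x - y)^2) over the batch.
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)

def accumulated_grad(w, batch, num_micro):
    # Split into equal-sized micro-batches (len(batch) must divide evenly)
    # and average their gradients, as replicas or accumulation steps would.
    size = len(batch) // num_micro
    micro = [batch[i * size:(i + 1) * size] for i in range(num_micro)]
    return sum(grad_mse(w, mb) for mb in micro) / num_micro

data = [(1.0, 2.0), (2.0, 3.0), (3.0, 7.0), (4.0, 9.0)]
full = grad_mse(0.5, data)
accum = accumulated_grad(0.5, data, 2)
print(full, accum)  # -25.0 -25.0
```

With data parallelism the micro-batches live on different devices and the average is an all-reduce; with accumulation they are processed one after another on the same device.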

Scaling laws

How to spend compute is governed by empirical power laws. Early work (Kaplan et al. 2020) found loss falls smoothly with parameters, data, and compute. The Chinchilla analysis (Hoffmann et al. 2022) then showed many models were undertrained and that, for a fixed compute budget $C \approx 6ND$ (parameters $N$, tokens $D$), loss is minimized when $N$ and $D$ grow together, roughly twenty tokens per parameter. The why is a constrained optimization: with $C$ fixed, pouring it all into parameters starves the model of data and vice versa, and the optimum balances the two.
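
Under the two approximations in this paragraph, $C \approx 6ND$ and $D \approx 20N$, the compute-optimal sizes follow from one square root. This is a sketch of that arithmetic only; the ratio of 20 is the rough rule of thumb quoted above, not a fitted constant, and the budget is chosen so the result lands near the 70B parameters and 1.4T tokens reported for Chinchilla.

```python
import math

def compute_optimal(flops, tokens_per_param=20.0):
    # Solve C = 6 * N * D subject to D = tokens_per_param * N.
    n = math.sqrt(flops / (6.0 * tokens_per_param))
    return n, tokens_per_param * n

n, d = compute_optimal(5.88e23)
print(f"{n:.2e} parameters, {d:.2e} tokens")  # 7.00e+10 parameters, 1.40e+12 tokens
```

Because both $N$ and $D$ scale as $\sqrt{C}$ under this rule, a 100x larger budget calls for roughly 10x more parameters and 10x more tokens.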

After pretraining: SFT, RLHF, and DPO

A pretrained model continues text but does not reliably follow instructions. Supervised fine-tuning on instruction-response pairs teaches the format. Alignment to human preference then typically follows one of two routes. Classic RLHF (Ouyang et al. 2022) trains a reward model from human comparisons and optimizes the policy against it with reinforcement learning, with a KL penalty to the reference model so it does not drift into degenerate text. DPO (Rafailov et al. 2023) shows that the same KL-regularized preference objective has a closed-form optimum, so you can skip the separate reward model and the RL loop and instead minimize a simple classification loss on preferred-versus-rejected pairs, which is simpler and often more stable.
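
The classification loss that DPO minimizes can be written for a single preference pair as $-\log \sigma(\beta \cdot \text{margin})$, where the margin compares how much the policy has raised the chosen response relative to the reference model against how much it has raised the rejected one. The sketch below assumes the four inputs are summed log-probabilities of whole responses; the argument names and the value of `beta` are illustrative.

```python
import math

def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    # Log-ratio of policy to reference for each response, then their difference.
    margin = (policy_chosen - ref_chosen) - (policy_rejected - ref_rejected)
    z = beta * margin
    # -log(sigmoid(z)) computed as a numerically stable softplus(-z).
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))

# Policy identical to the reference: no preference expressed, loss is log 2.
print(round(dpo_loss(-10.0, -10.0, -10.0, -10.0), 4))  # 0.6931
# Policy has raised the chosen response and lowered the rejected one.
print(round(dpo_loss(-5.0, -15.0, -10.0, -10.0), 4))   # 0.3133
```

The reference log-probabilities play the role of the KL anchor in RLHF: the loss rewards moving away from the reference only in the direction the preference data supports.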

Step by step: byte-pair encoding

Start with the text as a list of characters.
Count every adjacent pair of symbols.
Merge the most frequent pair everywhere into one new symbol.
Record the merge and repeat to the target vocabulary size.
To tokenize new text, apply the learned merges in order.

from collections import Counter

def get_pairs(tokens):
    return Counter(zip(tokens, tokens[1:]))      # adjacent symbol pairs

def merge(tokens, pair):
    out, i = [], 0
    while i < len(tokens):
        if i < len(tokens) - 1 and (tokens[i], tokens[i + 1]) == pair:
            out.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out

def train_bpe(text, num_merges):
    tokens, merges = list(text), []
    for _ in range(num_merges):
        pairs = get_pairs(tokens)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)         # most frequent pair; ties go to the pair seen first
        tokens = merge(tokens, best)
        merges.append(best)
    return tokens, merges
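
The last step in the list, tokenizing new text by replaying the learned merges in order, could be sketched as follows. The `merge` helper is repeated so the block runs on its own, `encode` is an illustrative name, and the merge list is a hypothetical one written by hand rather than the output of a real training run.

```python
def merge(tokens, pair):
    # Replace every adjacent occurrence of `pair` with the concatenated symbol.
    out, i = [], 0
    while i < len(tokens):
        if i < len(tokens) - 1 and (tokens[i], tokens[i + 1]) == pair:
            out.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out

def encode(text, merges):
    # Start from characters and apply each learned merge in training order.
    tokens = list(text)
    for pair in merges:
        tokens = merge(tokens, pair)
    return tokens

merges = [("l", "o"), ("lo", "w"), ("e", "s"), ("es", "t")]  # hypothetical learned merges
print(encode("lowest", merges))  # ['low', 'est']
print(encode("slow", merges))    # ['s', 'low']
```

A word never seen in training, such as "lowest" here, still tokenizes cleanly into known fragments, which is the practical payoff of subword vocabularies.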

Complexity (time and space)

Training compute is about $C \approx 6ND$ floating-point operations, roughly two for the forward pass and four for the backward pass per parameter per token. Device memory must hold parameters, gradients, and optimizer state (Adam adds two values per parameter, so optimizer state alone is about twice the parameter count, and more than that in bytes when the moments are kept in fp32 next to bf16 weights), plus activations, which is why ZeRO sharding and activation checkpointing are standard. The naive BPE trainer shown is about $O(\text{merges} \times \text{corpus length})$, since every merge rescans the whole token list; production tokenizers use smarter counting to scale.

Worked example

Byte-pair encoding builds frequent fragments from characters. On a tiny repetitive corpus, the first merges assemble the common piece "low" and then move on to "est", the suffix shared by "newest" and "widest":

toks, merges = train_bpe("low low low lower lower newest newest newest widest", 6)
print(merges)
# [('l', 'o'), ('lo', 'w'), (' ', 'low'), ('e', 's'), ('es', 't'), (' ', 'n')]

Follow-up questions

Why minimize cross-entropy specifically? It is the negative log-likelihood, so minimizing it is maximum-likelihood estimation of the data distribution, and it punishes confident errors via the $-\log p$ blow-up.
Why warm up the learning rate? Adam's variance estimate $\hat v_t$ is unreliable in the first steps, so a large rate at that point can destabilize training; warmup keeps the rate small until the estimates settle.
Why does optimizer state dominate memory? Adam stores two extra fp32 values (first and second moment) per parameter, often exceeding the parameters themselves, which is what ZeRO shards.
What exactly does Chinchilla prescribe? For a fixed compute budget $C \approx 6ND$, scale parameters and tokens together (about 20 tokens per parameter); many earlier models were too large for their data.
Why is DPO simpler than RLHF? The RLHF objective has a closed-form optimal policy, so DPO replaces the reward model and RL loop with a single classification loss on preference pairs.
Why deduplicate the corpus? Duplicates waste capacity on memorization and can leak evaluation data into training, inflating benchmark scores.

References

Sennrich et al., Neural Machine Translation of Rare Words with Subword Units (BPE, 2016).
Micikevicius et al., Mixed Precision Training (2018).
Kudo & Richardson, SentencePiece (2018).
Loshchilov & Hutter, Decoupled Weight Decay Regularization (AdamW, 2019).
Shoeybi et al., Megatron-LM (2019).
Rajbhandari et al., ZeRO (2020).
Kaplan et al., Scaling Laws for Neural Language Models (2020).
Brown et al., Language Models are Few-Shot Learners (GPT-3, 2020).
Hoffmann et al., Training Compute-Optimal LLMs (Chinchilla, 2022).
Ouyang et al., Training LMs to Follow Instructions with Human Feedback (InstructGPT, 2022).
Rafailov et al., Direct Preference Optimization (DPO, 2023).
Groeneveld et al., OLMo: Accelerating the Science of Language Models (2024).
Llama Team, The Llama 3 Herd of Models (2024).
DeepSeek-AI, DeepSeek-V3 Technical Report (2024).