Parameter-efficient fine-tuning with LoRA

Notes · adapting large models cheaply · Feb 2024

Fine-tuning a large model the obvious way updates every weight, and Adam stores two extra values (the first and second moment estimates) per parameter. Adapting a multi-billion-parameter model therefore needs about as much memory as training it from scratch, and every task leaves behind a full-size copy of the weights. LoRA (Hu et al. 2021) avoids nearly all of that by freezing the pretrained weights and learning only a small low-rank update. I use it in my text-to-image project to teach a custom style cheaply.

The method rests on an empirical observation about fine-tuning, not a trick: the change a model needs to specialize to a task appears to live in a very low-dimensional subspace, so a low-rank matrix can capture most of it.

The idea

Freeze the pretrained weight $W \in \mathbb{R}^{d_\text{out} \times d_\text{in}}$ and add a trainable low-rank update $\Delta W = BA$ with $A \in \mathbb{R}^{r \times d_\text{in}}$, $B \in \mathbb{R}^{d_\text{out} \times r}$, and rank $r \ll \min(d_\text{in}, d_\text{out})$. The layer computes $h = Wx + \tfrac{\alpha}{r} BA\,x$, and only $A$ and $B$ receive gradients.

Two design choices matter, and each has a reason. $A$ is random (Gaussian in the paper) while $B$ starts at zero, so $\Delta W = BA = 0$ at the start and the adapted model is exactly the pretrained one; training then departs smoothly instead of jolting the model with a random perturbation. The scalar $\tfrac{\alpha}{r}$ rescales the update so that changing the rank requires less re-tuning of the learning rate.

Why a low-rank update is enough

The justification is the intrinsic-dimension result (Aghajanyan et al. 2020): pretrained models can be fine-tuned inside a tiny, randomly chosen subspace of parameter space and still reach most of full-fine-tuning quality. That does not prove the weight update is low-rank, but it motivates the hypothesis, and Hu et al. report that very small ranks recover most of the quality in practice. LoRA makes that low-rank structure explicit and learnable. The payoff follows from the shapes: $BA$ has $r(d_\text{in} + d_\text{out})$ parameters instead of $d_\text{in} d_\text{out}$, a reduction by a factor of $d/(2r)$ for a square $d \times d$ layer, which is tens to hundreds of times fewer at typical widths and ranks.
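
To make the shape argument concrete, here is a small counting sketch. The helper name `lora_param_counts` and the three square widths are illustrative assumptions of mine, not values from the paper.

```python
def lora_param_counts(d_in, d_out, r):
    """Return (full, lora) trainable-parameter counts for one weight matrix."""
    full = d_in * d_out            # every entry of W
    lora = r * (d_in + d_out)      # A is r x d_in, B is d_out x r
    return full, lora

for d in (768, 4096, 12288):       # illustrative square widths
    full, lora = lora_param_counts(d, d, r=8)
    print(d, full, lora, round(full / lora, 1))
# 768 589824 12288 48.0
# 4096 16777216 65536 256.0
# 12288 150994944 196608 768.0
```

The per-layer saving grows with the layer width and shrinks with the rank.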

Properties worth remembering

  • Tiny trainable footprint, usually well under one percent of the model, so the optimizer state, which is the real memory hog, shrinks in proportion.
  • Swappable adapters: one frozen base serves many tasks, each a few megabytes, loaded on demand.
  • No inference cost: because $\Delta W = BA$ has the same shape as $W$, you can fold it in as $W' = W + \tfrac{\alpha}{r} BA$ after training, so a merged model runs exactly as fast as the original, unlike adapter layers that stay in the forward path.
  • Placement: LoRA is usually applied to the attention projection matrices, sometimes the feed-forward layers, where adaptation has the most leverage.
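
The merge property can be checked numerically. This is a minimal sketch on random tensors, assuming small illustrative sizes and a non-zero `B` standing in for a trained adapter; it only verifies that the merged and unmerged paths agree.

```python
import torch

torch.manual_seed(0)
d_in, d_out, r, alpha = 64, 32, 4, 8      # illustrative sizes, not from the notes
scale = alpha / r

W = torch.randn(d_out, d_in)              # stands in for the frozen pretrained weight
A = torch.randn(r, d_in) * 0.01
B = torch.randn(d_out, r)                 # non-zero, as it would be after training
x = torch.randn(5, d_in)

# Unmerged: base path plus the low-rank path
h_unmerged = x @ W.t() + scale * (x @ A.t() @ B.t())

# Merged: a single matrix with the same shape as W
W_merged = W + scale * (B @ A)
h_merged = x @ W_merged.t()

same_shape = W_merged.shape == W.shape
same_output = torch.allclose(h_unmerged, h_merged, atol=1e-4)
print(same_shape, same_output)
```

Both checks should come out true up to floating-point tolerance. After merging, the forward pass is a single matmul again.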

QLoRA

QLoRA (Dettmers et al. 2023) pushes memory further by quantizing the frozen base to 4-bit, using the NF4 data type designed for normally distributed weights, and training the LoRA adapters on top in higher precision. The 4-bit base takes roughly a quarter of the memory of a 16-bit one, and only the small adapters and their optimizer state need higher precision, so models that would otherwise not fit can be fine-tuned on a single consumer GPU. Why it does not hurt quality much: the base is frozen and dequantized on the fly whenever it is used, in both the forward and backward pass, and the adapters are trained against the quantized base, so they can absorb much of the quantization error along with the task-specific update.
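
The following is a toy sketch of the storage idea only. It assumes blockwise absmax scaling onto 16 uniformly spaced levels, which is a simplified stand-in and not the actual NF4 codebook. The function names, block size and tensor sizes are hypothetical. Real implementations also pack two 4-bit codes per byte, which is skipped here.

```python
import torch

LEVELS = torch.linspace(-1.0, 1.0, 16)    # 16 codes = 4 bits; uniform, NOT the NF4 codebook

def quantize_4bit(w, block_size=64):
    """Blockwise absmax quantization to 4-bit codes (simplified stand-in for NF4)."""
    blocks = w.reshape(-1, block_size)
    scales = blocks.abs().amax(dim=1, keepdim=True).clamp_min(1e-12)
    normalized = blocks / scales                           # each block now lies in [-1, 1]
    codes = (normalized.unsqueeze(-1) - LEVELS).abs().argmin(dim=-1)
    return codes.to(torch.uint8), scales                   # one code per byte here, unpacked

def dequantize_4bit(codes, scales, shape):
    """Map codes back to floats and undo the per-block scaling."""
    return (LEVELS[codes.long()] * scales).reshape(shape)

torch.manual_seed(0)
W = torch.randn(128, 128)                 # stands in for a frozen pretrained weight
codes, scales = quantize_4bit(W)
W_hat = dequantize_4bit(codes, scales, W.shape)

# Adapters stay in full precision and are applied on top of the dequantized base
r, alpha = 8, 16
A = torch.randn(r, 128) * 0.01
B = torch.zeros(128, r)
x = torch.randn(4, 128)
h = x @ W_hat.t() + (alpha / r) * (x @ A.t() @ B.t())

max_err = float((W - W_hat).abs().max())
print(codes.dtype, int(codes.max()), max_err)
```

With uniform levels the per-element error is bounded by the block scale divided by 15. NF4 instead places its levels to match a normal distribution, which is the part this sketch leaves out.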

The PEFT family

LoRA sits in a family of parameter-efficient fine-tuning methods. Adapters (Houlsby et al. 2019) insert small trainable bottleneck layers between frozen ones; prefix tuning (Li & Liang 2021) and prompt tuning (Lester et al. 2021) prepend trainable vectors and leave the model untouched. LoRA's edge over adapters is that it adds no layers, so once merged it has zero inference overhead; its edge over prompt methods is that it can express larger changes. DoRA (Liu et al. 2024) refines LoRA by splitting each weight into a magnitude and a direction, training the magnitude directly and applying the low-rank update to the direction, which the authors report closes much of the remaining gap to full fine-tuning.
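
The DoRA decomposition can be sketched in a few lines. This is a rough illustration under my reading of the paper, not the authors' code: the per-column norm axis, the omission of the $\tfrac{\alpha}{r}$ scale, and the helper name `dora_weight` are all assumptions.

```python
import torch

torch.manual_seed(0)
d_in, d_out, r = 32, 16, 4                # illustrative sizes
W0 = torch.randn(d_out, d_in)             # frozen pretrained weight
A = torch.randn(r, d_in) * 0.01           # trainable
B = torch.zeros(d_out, r)                 # trainable, zero at init

# Magnitude: one scalar per column, initialised from the pretrained weight (trainable)
m = torch.linalg.vector_norm(W0, dim=0, keepdim=True)

def dora_weight(W0, A, B, m):
    """Recombine a trainable magnitude with a low-rank-updated, renormalised direction."""
    V = W0 + B @ A                         # low-rank update applied to the direction
    return m * V / torch.linalg.vector_norm(V, dim=0, keepdim=True)

x = torch.randn(3, d_in)
starts_at_base = torch.allclose(x @ dora_weight(W0, A, B, m).t(), x @ W0.t(), atol=1e-4)
print(starts_at_base)
```

As with plain LoRA, the zero-initialised `B` means the recombined weight starts out equal to the pretrained one.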

Step by step

  1. Freeze the pretrained linear layer.
  2. Add two small matrices, A (random) and B (zero), of rank r.
  3. Scale their product by alpha/r and add it to the frozen output.
  4. Train only A and B; the base never updates.
  5. Optionally merge (alpha/r) * BA into W after training to remove all inference overhead.

A minimal PyTorch module for steps 1 to 4:

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, in_f, out_f, r=8, alpha=16):
        super().__init__()
        self.base = nn.Linear(in_f, out_f)
        for p in self.base.parameters():
            p.requires_grad_(False)            # freeze the pretrained weights
        self.A = nn.Parameter(torch.randn(r, in_f) * 0.01)
        self.B = nn.Parameter(torch.zeros(out_f, r))   # B = 0 so the adapter starts as a no-op
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.t() @ self.B.t())
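
Step 4 can be illustrated with a single optimizer step. This is a self-contained sketch on raw tensors rather than the class above; the toy regression data, sizes and learning rate are made up for illustration.

```python
import torch

torch.manual_seed(0)
d_in, d_out, r, alpha = 16, 8, 4, 8       # illustrative sizes
W = torch.randn(d_out, d_in)              # frozen: a plain tensor, so it never gets a gradient
A = torch.nn.Parameter(torch.randn(r, d_in) * 0.01)
B = torch.nn.Parameter(torch.zeros(d_out, r))
opt = torch.optim.Adam([A, B], lr=1e-2)   # optimizer state exists only for A and B

x, y = torch.randn(32, d_in), torch.randn(32, d_out)   # toy regression data
W_before, A_before, B_before = W.clone(), A.detach().clone(), B.detach().clone()

pred = x @ W.t() + (alpha / r) * (x @ A.t() @ B.t())
loss = torch.nn.functional.mse_loss(pred, y)
opt.zero_grad()
loss.backward()
opt.step()

base_unchanged = torch.equal(W, W_before)
B_moved = not torch.equal(B, B_before)
A_unchanged = torch.equal(A, A_before)    # A's gradient is exactly zero while B = 0
print(base_unchanged, B_moved, A_unchanged)
```

One detail worth noticing: because the gradient with respect to `A` passes through `B`, the first step moves only `B`. `A` starts learning once `B` is non-zero.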

Complexity (time and space)

Per adapted layer, trainable parameters drop from $d_\text{in} d_\text{out}$ to $r(d_\text{in} + d_\text{out})$, and the dominant saving is that Adam's optimizer state now covers only those few parameters. The extra forward compute is two thin matmuls, through $A$ then $B$, negligible against the frozen $W$, and it disappears entirely after merging. QLoRA adds 4-bit dequantization of $W$ whenever the base weight is used, in the forward and backward pass, in exchange for roughly a fourfold reduction in base-model memory relative to 16-bit storage.
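
A back-of-the-envelope estimate of the optimizer-state saving, assuming Adam keeps two fp32 values per trainable parameter. The model shape (32 blocks, four 4096 x 4096 attention projections each) is hypothetical and chosen only to give round numbers.

```python
def adam_state_bytes(n_trainable, bytes_per_value=4):
    """Adam keeps two moment estimates per trainable parameter (fp32 assumed)."""
    return 2 * n_trainable * bytes_per_value

# Hypothetical model: 32 blocks, 4 attention projections each, all 4096 x 4096
d, r, n_matrices = 4096, 8, 32 * 4
full = n_matrices * d * d                 # trainable parameters under full fine-tuning
lora = n_matrices * r * (d + d)           # trainable parameters under LoRA

print(adam_state_bytes(full) / 2**30)     # GiB of Adam state, full fine-tuning -> 16.0
print(adam_state_bytes(lora) / 2**20)     # MiB of Adam state, LoRA             -> 64.0
```

The estimate ignores weights, gradients and activations; it isolates the optimizer state that the paragraph above calls the dominant saving.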

Worked example

A single 768x768 layer at rank 8: the base has about 590k frozen weights, while LoRA trains only the two thin matrices, about two percent of the total, and the adapter is a no-op at initialization:

import torch
layer = LoRALinear(768, 768, r=8)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
total = sum(p.numel() for p in layer.parameters())
print(trainable, total, round(100 * trainable / total, 2))   # 12288 602880 2.04

x = torch.randn(2, 768)
print(torch.allclose(layer(x), layer.base(x)))               # True (B = 0 at init)

Follow-up questions

  • Why initialize B to zero? So the update BA is zero at the start, leaving the adapted model identical to the pretrained one and letting training depart smoothly rather than with a random shock.
  • Why does a low-rank update suffice? Fine-tuning updates have a very low intrinsic dimension (Aghajanyan et al. 2020), which suggests that a small rank can capture most of the needed adjustment.
  • Why does LoRA add no inference latency? The update BA has the same shape as W and can be merged into the weights after training, unlike adapter layers that remain in the forward path.
  • What is the role of alpha/r? It rescales the update so the effective learning rate is roughly invariant to the chosen rank.
  • How does QLoRA fit big models on one GPU? It stores the frozen base in 4-bit and trains only the small high-precision adapters, so both the base memory and the optimizer state shrink dramatically.
  • LoRA vs prompt tuning? LoRA edits weights (more expressive, mergeable); prompt and prefix tuning only prepend learned vectors and leave the weights frozen.

References

  1. Houlsby et al., Parameter-Efficient Transfer Learning for NLP (2019).
  2. Aghajanyan et al., Intrinsic Dimensionality Explains the Effectiveness of Language Model Fine-Tuning (2020).
  3. Li & Liang, Prefix-Tuning: Optimizing Continuous Prompts for Generation (2021).
  4. Lester et al., The Power of Scale for Parameter-Efficient Prompt Tuning (2021).
  5. Hu et al., LoRA: Low-Rank Adaptation of Large Language Models (2021).
  6. Dettmers et al., QLoRA: Efficient Finetuning of Quantized LLMs (2023).
  7. Liu et al., DoRA: Weight-Decomposed Low-Rank Adaptation (2024).