What is LoRA?

LoRA (Low-Rank Adaptation) is a parameter-efficient fine-tuning method that freezes a model's weights and trains small low-rank adapter matrices instead, so you customize a model with a tiny fraction of the compute and storage of full fine-tuning.

LoRA, short for Low-Rank Adaptation, is the most widely used parameter-efficient fine-tuning technique. Instead of updating all of a model's billions of weights, LoRA freezes the base model and injects small trainable matrices, the low-rank adapters, into its layers. Only those adapters are trained, which slashes the memory and compute needed to fine-tune and produces a tiny adapter file (often megabytes) rather than a full model copy. Because the base stays frozen, you can keep many LoRA adapters for different tasks or customers and swap them at serve time, even merging them back into the base when desired. QLoRA combines LoRA with 4-bit quantization of the frozen base, making it possible to fine-tune large models on a single consumer GPU. LoRA is how most teams customize open-weight models today: cheap to train, cheap to store, easy to version. It contrasts with full fine-tuning (more expensive, one big artifact) and with in-context approaches like retrieval or a memory layer. The distinction matters for agents: LoRA changes the model's behavior and style, while a memory layer like Glen changes what the model knows at runtime, and most production systems use both, a fine-tuned base plus durable retrieved context.