Low-Rank Adaptation (LoRA) provides a practical method for fine-tuning a pre-trained transformer model on a downstream task. Using the peft library from Hugging Face, this approach significantly reduces the number of trainable parameters compared to full fine-tuning. The process becomes faster and requires less memory, without substantial compromises in performance for many tasks.

We will walk through the essential steps: setting up the environment, preparing the data, configuring LoRA, training the adapter, and performing inference using the fine-tuned model.

## 1. Setup and Environment

First, ensure you have the necessary libraries installed. We'll primarily use transformers for the base model and training utilities, peft for implementing LoRA, datasets for data handling, and accelerate to simplify running PyTorch code on any infrastructure. If you plan to load the base model in 8-bit as shown below, also install bitsandbytes.

```bash
pip install transformers datasets peft accelerate torch
```

Now, let's import the required modules and define our base model checkpoint. For this example, we'll use a relatively small sequence-to-sequence model, `google/flan-t5-small`, and fine-tune it on a summarization task. Using a smaller model makes the process quicker and accessible even without high-end GPUs.

```python
import torch
from datasets import load_dataset
from transformers import (
    AutoTokenizer,
    AutoModelForSeq2SeqLM,
    TrainingArguments,
    Trainer,
    DataCollatorForSeq2Seq,
)
from peft import LoraConfig, get_peft_model, TaskType, prepare_model_for_kbit_training

# Define the base model checkpoint
model_checkpoint = "google/flan-t5-small"

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

# Load the base model. We use device_map="auto" to leverage accelerate for placing layers across devices.
# We also load in 8-bit for further memory savings, which is compatible with LoRA.
# Note: 8-bit loading is optional but useful for larger models.
# If not using 8-bit, remove load_in_8bit and prepare_model_for_kbit_training.
model = AutoModelForSeq2SeqLM.from_pretrained(model_checkpoint, load_in_8bit=True, device_map="auto")

# Prepare the model for k-bit training (if using quantization).
# This step is needed when loading models in 8-bit or 4-bit.
model = prepare_model_for_kbit_training(model)
```
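Depending on your transformers version, passing `load_in_8bit=True` directly to `from_pretrained` may emit a deprecation warning; more recent releases expect the quantization settings to be wrapped in a `BitsAndBytesConfig`. Here is a minimal sketch of that variant, as an alternative to the loading call above (it assumes bitsandbytes is installed and a GPU is available):

```python
from transformers import AutoModelForSeq2SeqLM, BitsAndBytesConfig

# Equivalent 8-bit loading via an explicit quantization config (newer transformers API).
bnb_config = BitsAndBytesConfig(load_in_8bit=True)
model = AutoModelForSeq2SeqLM.from_pretrained(
    model_checkpoint,
    quantization_config=bnb_config,
    device_map="auto",
)
# As above, follow this with prepare_model_for_kbit_training(model).
```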
## 2. Data Preparation

We need a dataset suitable for our chosen task: summarization. The samsum dataset, containing dialogues and their summaries, is a good choice. We'll load it using the datasets library and preprocess it. For efficiency, we'll only use a small fraction of the dataset for this demonstration.

```python
# Load the dataset
dataset_name = "samsum"
dataset = load_dataset(dataset_name, split="train[:1%]")  # Using only 1% for the demo
dataset = dataset.train_test_split(test_size=0.1)         # Create train/test splits

print(f"Train dataset size: {len(dataset['train'])}")
print(f"Test dataset size: {len(dataset['test'])}")
# Example: Train dataset size: 132
# Example: Test dataset size: 15

# Preprocessing parameters
max_input_length = 512
max_target_length = 128

def preprocess_function(examples):
    # Add the task prefix expected by T5 models
    inputs = ["summarize: " + doc for doc in examples["dialogue"]]
    model_inputs = tokenizer(inputs, max_length=max_input_length, truncation=True, padding="max_length")

    # Set up the tokenizer for targets
    with tokenizer.as_target_tokenizer():
        labels = tokenizer(examples["summary"], max_length=max_target_length, truncation=True, padding="max_length")

    model_inputs["labels"] = labels["input_ids"]

    # Replace tokenizer.pad_token_id in the labels by -100 to ignore padding in the loss calculation
    model_inputs["labels"] = [
        [(l if l != tokenizer.pad_token_id else -100) for l in label]
        for label in model_inputs["labels"]
    ]
    return model_inputs

# Apply preprocessing
tokenized_datasets = dataset.map(preprocess_function, batched=True)

# Remove columns not needed for training
tokenized_datasets = tokenized_datasets.remove_columns(["id", "dialogue", "summary"])
print(f"Columns in tokenized dataset: {tokenized_datasets['train'].column_names}")
# Example: Columns in tokenized dataset: ['input_ids', 'attention_mask', 'labels']
```

We also need a data collator to assemble batches. DataCollatorForSeq2Seq is suitable for sequence-to-sequence tasks; since our preprocessing already pads to a fixed length, its main job here is to keep label padding at -100 so it is ignored by the loss.

```python
# Create the data collator
data_collator = DataCollatorForSeq2Seq(
    tokenizer,
    model=model,
    label_pad_token_id=-100,  # Important: pad labels with the ignore index
    pad_to_multiple_of=8      # Optional: optimizes hardware usage
)
```

## 3. LoRA Configuration

This is where we define how LoRA will modify the base model. We use the LoraConfig class from the peft library. Its key arguments are:

- `r`: The rank of the low-rank matrices ($A$ and $B$). A smaller `r` means fewer trainable parameters but might capture less task-specific information. Common values range from 4 to 32.
- `lora_alpha`: The scaling factor for the LoRA updates. It's often set equal to `r` or `2*r`. The update is scaled by $\frac{\alpha}{r}$.
- `target_modules`: A list of module names within the base model where the LoRA matrices will be injected. For T5 models, targeting the query (`q`) and value (`v`) projections in the self-attention mechanism is standard practice. You can find these names by inspecting `model.named_modules()` (see the short inspection sketch after the configuration below).
- `lora_dropout`: Dropout applied to the LoRA layers.
- `bias`: Specifies which biases to train. `"none"` is common, freezing all original biases and not adding new ones.
- `task_type`: Defines the model type and task. For flan-t5, it's `TaskType.SEQ_2_SEQ_LM`.

```python
# Define the LoRA configuration
lora_config = LoraConfig(
    r=16,                             # Rank of the update matrices
    lora_alpha=32,                    # Scaling factor
    target_modules=["q", "v"],        # Apply LoRA to query and value projections
    lora_dropout=0.05,                # Dropout probability
    bias="none",                      # Do not train biases
    task_type=TaskType.SEQ_2_SEQ_LM,  # Task type for sequence-to-sequence models
)
```
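As mentioned in the list above, the entries in `target_modules` must match module names that actually exist inside the base model. A minimal sketch for inspecting them (assuming the `model` loaded earlier; in T5-style models the attention projections are named `q`, `k`, `v`, and `o`):

```python
# Collect the leaf names of candidate attention projections so we can pick LoRA targets.
projection_names = set()
for name, _module in model.named_modules():
    leaf = name.split(".")[-1]
    if leaf in {"q", "k", "v", "o"}:
        projection_names.add(leaf)

print(sorted(projection_names))  # expected for T5: ['k', 'o', 'q', 'v']
```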
## 4. Wrapping the Model with PEFT

Now, we apply the LoRA configuration to our base model using get_peft_model.

```python
# Get the PEFT model
peft_model = get_peft_model(model, lora_config)

# Print the number of trainable parameters
peft_model.print_trainable_parameters()
# Example output: trainable params: 884,736 || all params: 77,822,464 || trainable%: 1.13685...
```

Notice the significant reduction: we are only training around 1% of the total parameters. This drastically reduces memory requirements and speeds up training compared to updating all ~77 million parameters of flan-t5-small.

```dot
digraph LoRA {
    rankdir=LR;
    node [shape=box, style=filled, fillcolor="#e9ecef", fontname="sans-serif"];
    edge [fontname="sans-serif"];

    subgraph cluster_0 {
        label = "Original Transformer Layer";
        style = filled;
        color = "#dee2e6";
        W [label="Pre-trained Weight W (Frozen)", fillcolor="#adb5bd"];
    }

    subgraph cluster_1 {
        label = "LoRA Modification (Trainable)";
        style = filled;
        color = "#dee2e6";
        A [label="Low-Rank Matrix A (Trainable)", fillcolor="#a5d8ff"];
        B [label="Low-Rank Matrix B (Trainable)", fillcolor="#a5d8ff"];
        B -> A [label="r", arrowhead=none];
    }

    Input [label="Input x", shape=ellipse, fillcolor="#ced4da"];
    Output [label="Output y", shape=ellipse, fillcolor="#ced4da"];
    Sum [label="+", shape=circle, fillcolor="#ffec99"];

    Input -> W [label="x * W"];
    W -> Sum;
    Input -> B [label="x * B"];
    A -> Sum [label="(x * B) * A * (α/r)"];
    Sum -> Output [label="y = x*W + x*B*A*(α/r)"];
}
```

Diagram illustrating how LoRA injects trainable low-rank matrices (A and B) alongside the frozen pre-trained weight matrix (W). Only A and B are updated during training.

## 5. Training the LoRA Adapter

We use the standard Trainer from the transformers library. The setup is almost identical to full fine-tuning, but the Trainer will automatically handle the PEFT model and only update the LoRA parameters.

```python
# Define the training arguments
output_dir = "flan-t5-small-samsum-lora"

training_args = TrainingArguments(
    output_dir=output_dir,
    auto_find_batch_size=True,  # Automatically find a suitable batch size
    learning_rate=1e-3,         # A higher learning rate is typical for LoRA
    num_train_epochs=3,         # Number of training epochs
    logging_strategy="epoch",   # Log metrics every epoch
    save_strategy="epoch",      # Save a checkpoint every epoch
    # evaluation_strategy="epoch",  # Evaluate every epoch if eval data is available
    report_to="none",           # Disable reporting to wandb/tensorboard for this example
    # fp16=torch.cuda.is_available(),  # Use fp16 for faster training if supported
)

# Create the Trainer instance
trainer = Trainer(
    model=peft_model,                         # Pass the PEFT model
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["test"],  # Optional: pass an eval dataset
    data_collator=data_collator,
    tokenizer=tokenizer,
)

# Disable the KV cache during training; it is only useful for generation
# and conflicts with gradient checkpointing.
peft_model.config.use_cache = False

# Start training
print("Starting LoRA training...")
trainer.train()
print("Training finished.")
```

Training should be significantly faster and require less GPU memory than fully fine-tuning the flan-t5-small model.
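Before saving the adapter, a quick sanity check can confirm that training only touched the LoRA weights. A minimal sketch (assuming the `peft_model` from above; peft includes `lora_A` and `lora_B` in the names of the injected parameters):

```python
# List the parameters that require gradients; all of them should belong to LoRA layers.
trainable = [name for name, param in peft_model.named_parameters() if param.requires_grad]
print(f"{len(trainable)} trainable parameter tensors")
print(trainable[:4])  # expect names containing 'lora_A' / 'lora_B'
```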
## 6. Saving the Adapter

After training, we save the trained LoRA adapter weights. Importantly, this saves only the adapter parameters (matrices A and B for each targeted module), not the entire base model. This makes the saved artifact very small.

```python
# Define the path to save the adapter
adapter_path = f"{output_dir}/final_adapter"

# Save the adapter weights
peft_model.save_pretrained(adapter_path)
tokenizer.save_pretrained(adapter_path)  # Save the tokenizer alongside the adapter

print(f"LoRA adapter saved to: {adapter_path}")

# You can check the size of the saved adapter - it should be relatively small (MBs).
# For example, using: !ls -lh {adapter_path}
```

## 7. Inference with the LoRA Adapter

To use the fine-tuned model for inference, we first load the original base model and then load the LoRA adapter weights on top of it.

```python
from peft import PeftModel, PeftConfig

# Load the base model again (if not already in memory)
base_model = AutoModelForSeq2SeqLM.from_pretrained(model_checkpoint, torch_dtype=torch.float16, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

# Load the PEFT model with the saved adapter
lora_model = PeftModel.from_pretrained(base_model, adapter_path)
lora_model = lora_model.to("cuda" if torch.cuda.is_available() else "cpu")  # Ensure the model is on the right device
lora_model.eval()  # Set the model to evaluation mode

# Prepare a sample input from the test set (or any new dialogue)
sample_idx = 5
dialogue = dataset["test"][sample_idx]["dialogue"]
reference_summary = dataset["test"][sample_idx]["summary"]

input_text = "summarize: " + dialogue
input_ids = tokenizer(input_text, return_tensors="pt").input_ids.to(lora_model.device)

print("Dialogue:")
print(dialogue)
print("\nReference Summary:")
print(reference_summary)

# Generate a summary using the LoRA model
with torch.no_grad():
    outputs = lora_model.generate(input_ids=input_ids, max_new_tokens=100, do_sample=True, top_p=0.9)

generated_summary = tokenizer.decode(outputs[0], skip_special_tokens=True)
print("\nGenerated Summary (LoRA):")
print(generated_summary)

# Optional: compare with the base model's generation
# base_model.to(lora_model.device)
# base_model.eval()
# with torch.no_grad():
#     base_outputs = base_model.generate(input_ids=input_ids, max_new_tokens=100)
# base_summary = tokenizer.decode(base_outputs[0], skip_special_tokens=True)
# print("\nGenerated Summary (Base Model):")
# print(base_summary)
```

You should observe that the summary generated by the LoRA-adapted model is more aligned with the summarization task than the output from the original base model (which might just repeat parts of the input or give irrelevant responses without fine-tuning).

## 8. Merging LoRA Weights (Optional)

For deployment scenarios where you don't need to switch between different adapters frequently, you can merge the LoRA weights directly into the base model's weights. This creates a standard transformers model that incorporates the fine-tuning adjustments. After merging, the peft library is no longer needed for inference.

```python
# Merge the adapter weights into the base model
# merged_model = lora_model.merge_and_unload()

# Now 'merged_model' is a standard transformers model with the LoRA updates applied.
# It can be saved and loaded like any regular Hugging Face model.
# merged_model.save_pretrained(f"{output_dir}/final_merged_model")
# tokenizer.save_pretrained(f"{output_dir}/final_merged_model")

# Note: after merging, the model size increases back to the original base model size,
# as the low-rank updates are now part of the main weight matrices.
```
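If you do merge and save the model, loading it back later follows the standard transformers workflow with no peft dependency. A brief sketch (assuming the `final_merged_model` directory from the commented example above was actually written):

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Load the merged checkpoint like any ordinary Hugging Face model.
merged_dir = f"{output_dir}/final_merged_model"
merged_model = AutoModelForSeq2SeqLM.from_pretrained(merged_dir)
merged_tokenizer = AutoTokenizer.from_pretrained(merged_dir)
```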
This practical exercise demonstrated how to apply LoRA for efficient fine-tuning: you configured the LoRA parameters, trained the adapter, and performed inference, adapting a pre-trained model with significantly fewer trainable parameters than full fine-tuning would require. Experimenting with different ranks (`r`), `lora_alpha` values, and `target_modules` can help optimize performance for specific tasks and datasets.
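As a starting point for that kind of experimentation, here is one hypothetical variant of the configuration used above; whether a larger rank or additional target modules actually helps will depend on the task and dataset:

```python
from peft import LoraConfig, TaskType

# Hypothetical alternative configuration to try: a higher rank, a proportionally
# larger alpha, and LoRA applied to the key and output projections as well.
experiment_config = LoraConfig(
    r=32,
    lora_alpha=64,
    target_modules=["q", "k", "v", "o"],
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.SEQ_2_SEQ_LM,
)
```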