What is LLM Fine-Tuning?
Fine-tuning is a machine learning technique where a pre-trained Large Language Model (LLM) is further trained on a smaller, task-specific dataset to adapt it for particular applications or domains. Instead of training a model from scratch, fine-tuning leverages the general knowledge already learned by the base model and specializes it for specific use cases.
LoRA: Low-Rank Adaptation of Large Language Models - Original Microsoft Paper Summary
The original LoRA paper, titled "LoRA: Low-Rank Adaptation of Large Language Models," was authored by Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, and others from Microsoft Research.
ArXiv Paper: https://arxiv.org/abs/2106.09685
The Low-Rank Hypothesis
The authors hypothesized that the change in weights during model adaptation has a low "intrinsic rank," meaning that the weight updates can be represented using matrices of much lower dimensionality than the original weight matrices. This insight led to the development of the LoRA approach, which dramatically reduces the number of trainable parameters needed for fine-tuning.
Technical Methodology
LoRA works by freezing the pre-trained model weights and injecting trainable rank decomposition matrices into each layer of the Transformer architecture. Instead of updating the original weight matrix W, LoRA adds a low-rank decomposition ΔW = BA, where B and A are much smaller matrices with rank r << min(d, k).
The core equation is h = Wx + ΔWx = Wx + BAx, where:
W ∈ ℝ^(d×k) is the frozen pre-trained weight matrix
B ∈ ℝ^(d×r) and A ∈ ℝ^(r×k) are the trainable low-rank matrices
r is the rank (typically r << min(d, k)), so B and A together contribute only r·(d+k) trainable parameters versus the d·k parameters of W
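To make the equation and the parameter savings concrete, here is a minimal, self-contained PyTorch sketch of a LoRA-style linear layer. It is an illustration only, not the implementation used by Unsloth or the peft library, and the dimensions and scaling convention are assumptions.

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, d, k, r=16, alpha=16):
        super().__init__()
        self.W = nn.Linear(k, d, bias=False)              # frozen pre-trained weight W ∈ R^(d×k)
        self.W.weight.requires_grad_(False)
        self.A = nn.Parameter(torch.randn(r, k) * 0.01)   # trainable A ∈ R^(r×k)
        self.B = nn.Parameter(torch.zeros(d, r))          # trainable B ∈ R^(d×r), zero-init so ΔW = BA starts at 0
        self.scaling = alpha / r

    def forward(self, x):
        # h = Wx + BAx (scaled), i.e. the core LoRA equation above
        return self.W(x) + (x @ self.A.T @ self.B.T) * self.scaling

layer = LoRALinear(d=4096, k=4096, r=16)
frozen = sum(p.numel() for p in layer.parameters() if not p.requires_grad)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(f"frozen: {frozen:,} | trainable: {trainable:,}")  # 16,777,216 frozen vs 131,072 trainable

For a single 4096x4096 projection, the adapter trains roughly 0.8% of the parameters of the frozen matrix, which is why LoRA fine-tuning fits on much smaller GPUs.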
The LoRA paper represents a paradigm shift in how we approach fine-tuning large language models. By demonstrating that effective adaptation can be achieved with minimal parameter updates, it opened the door to more accessible and efficient model customization. The work has inspired numerous follow-up research directions and has become essential knowledge for AI engineers working with large language models in production environments.
What is LLM Quantization?
Quantization is a crucial optimization technique that reduces the precision of model weights and activations from higher-precision formats (like 32-bit floating point) to lower-precision formats (such as 8-bit or 4-bit integers). This process dramatically reduces memory usage, storage requirements, and computational costs while maintaining acceptable model performance.
Modern frameworks like PyTorch, Hugging Face Transformers, and specialized tools like llama.cpp provide comprehensive quantization support. These tools make it easier for engineers to implement quantization without deep technical expertise in the underlying algorithms.
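For example, with Hugging Face Transformers plus bitsandbytes, a model can be loaded in 4-bit with a few lines. This is a hedged sketch; the model name and the NF4/bfloat16 settings are example choices, not requirements.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# Example: load a causal LM with 4-bit NF4 quantization via bitsandbytes
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model_name = "meta-llama/Meta-Llama-3-8B"  # example model; any causal LM works
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto",  # requires the accelerate package
)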
Engineers should carefully evaluate quantized models using appropriate benchmarks and real-world test cases to ensure acceptable performance for their specific applications. This includes testing edge cases and monitoring for potential degradation in model capabilities.
Benefits of Fine-Tuning with Unsloth
Unsloth is an open-source framework purpose-built for fast and efficient fine-tuning of large language models (LLMs). It provides an optimized training backend, making fine-tuning possible even on limited hardware setups by drastically improving training speed and memory efficiency.
Website: https://unsloth.ai
The model
The model used in this project comes from Unsloth, which maintains a model catalog of Dynamic GGUF, 4-bit, and 16-bit models.
The model for this project is unsloth/llama-3-8b-bnb-4bit, a Llama-3 8B base model pre-quantized to 4-bit with bitsandbytes.
The dataset
Datasets play a critical role in Large Language Model (LLM) quantization, particularly in post-training quantization (PTQ) methods where calibration data is essential for determining optimal quantization parameters. The calibration dataset serves as a representative sample that helps the quantization algorithm understand the typical activation patterns and weight distributions of the model during inference.
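As a conceptual sketch only (not any specific library's API), PTQ calibration amounts to running representative inputs through a layer, observing activation ranges, and deriving quantization scales from them:

import torch

def calibrate_scale(layer, calibration_batches):
    """Derive a per-tensor symmetric int8 scale from observed activation ranges (illustrative only)."""
    max_abs = 0.0
    with torch.no_grad():
        for batch in calibration_batches:      # batches drawn from a representative dataset
            activations = layer(batch)         # observe typical activation values
            max_abs = max(max_abs, activations.abs().max().item())
    return max_abs / 127.0                     # map the observed range onto int8

def quantize(tensor, scale):
    return torch.clamp(torch.round(tensor / scale), -128, 127).to(torch.int8)

layer = torch.nn.Linear(16, 16)
calib = [torch.randn(4, 16) for _ in range(8)]  # stand-in for real calibration data
scale = calibrate_scale(layer, calib)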
The dataset used in this project comes from Hugging Face: yahma/alpaca-cleaned.
Alpaca is a dataset of 52,000 instructions and demonstrations generated by OpenAI's text-davinci-003 engine. This instruction data can be used for instruction-tuning language models so that they follow instructions better. yahma/alpaca-cleaned is a cleaned version of the original Alpaca dataset released by Stanford.
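Each record contains an instruction, an optional input providing context, and the expected output. An illustrative (made-up, not literal) record looks like this:

# Illustrative Alpaca-style record (instruction / input / output)
example = {
    "instruction": "Summarize the following paragraph in one sentence.",
    "input": "Large language models are pre-trained on vast text corpora and can be adapted to new tasks.",
    "output": "LLMs learn from huge text corpora and can then be specialized for new tasks.",
}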
Hardware considerations and Google Colab
Fine-tuning LLMs is extremely memory-intensive. For full fine-tuning in half precision with an 8-bit optimizer, a 7B model can require approximately 70GB of VRAM (14GB for model weights + 42GB for gradients + 14GB for optimizer states). Most practitioners use high-end GPUs such as the A100, V100, and H100 for fine-tuning 7B and 13B parameter models. These GPUs provide the necessary VRAM and compute power for efficient training, with A100s offering 40-80GB of VRAM and H100s providing even more capacity.
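As a rough back-of-the-envelope sketch, the 70GB figure can be reproduced by multiplying the parameter count by bytes per parameter for each component. The per-component byte counts below simply mirror the estimate above and will vary with precision, optimizer choice, and activation memory.

def estimate_vram_gb(num_params_billions, bytes_weights, bytes_grads, bytes_optim):
    """Very rough VRAM estimate in GB; ignores activations, KV cache, and framework overhead."""
    bytes_per_param = bytes_weights + bytes_grads + bytes_optim
    return num_params_billions * bytes_per_param  # billions of params x bytes/param ~ GB

# Figures matching the article's estimate for a 7B model: 2 + 6 + 2 bytes per parameter
print(estimate_vram_gb(7, bytes_weights=2, bytes_grads=6, bytes_optim=2))  # ~70 GB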
Google Colab's free VMs have hard limits on RAM and VRAM usage. The free tier typically provides limited GPU access with unclear usage limits across multiple sessions, making it challenging for extensive LLM fine-tuning projects. Colab also offers various GPU tiers, including T4, V100, and A100 options, through Colab Pro and Pro+ subscriptions. However, even with paid tiers, fine-tuning large models remains challenging due to memory constraints and session limitations. Despite these limitations, Colab can be useful for experimenting with parameter-efficient fine-tuning methods like LoRA on smaller models (7B parameters or less), and researchers have successfully demonstrated fine-tuning techniques on single GPUs in Colab.
Below is a step-by-step explanation of the notebook.
Installation
%%capture
import torch

# Check the GPU's CUDA compute capability to decide which extra packages to install
major_version, minor_version = torch.cuda.get_device_capability()

# Install Unsloth from source (Colab build)
!pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"

if major_version >= 8:
    # Newer GPUs (Ampere and later, e.g. A100, H100, RTX 30/40 series): include flash-attn
    !pip install --no-deps packaging ninja einops flash-attn xformers trl peft accelerate bitsandbytes
else:
    # Older GPUs (e.g. T4, V100): skip flash-attn
    !pip install --no-deps xformers trl peft accelerate bitsandbytes
pass
Next, we prepare to load one of a range of pre-quantized language models, including the Llama-3 model (trained on roughly 15 trillion tokens), optimized for memory efficiency with 4-bit quantization.
from unsloth import FastLanguageModel
import torch
max_seq_length = 2048 # Choose any! Llama 3 supports up to 8k context.
dtype = None # None = auto-detect. Use float16 on older GPUs (T4, V100), bfloat16 on Ampere and newer.
load_in_4bit = True # Use 4-bit quantization to reduce memory usage. Can be False.
# Pre-quantized 4-bit models available from Unsloth
fourbit_models = [
    "unsloth/mistral-7b-bnb-4bit",
    "unsloth/mistral-7b-instruct-v0.2-bnb-4bit",
    "unsloth/llama-2-7b-bnb-4bit",
    "unsloth/gemma-7b-bnb-4bit",
    "unsloth/gemma-7b-it-bnb-4bit",
    "unsloth/gemma-2b-bnb-4bit",
    "unsloth/gemma-2b-it-bnb-4bit",
    "unsloth/llama-3-8b-bnb-4bit",
]
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/llama-3-8b-bnb-4bit", # Llama-3 70b also works (just change the model name)
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
    # token = "hf_...", # use one if using gated models like meta-llama/Llama-2-7b-hf
)
Next, we integrate LoRA adapters into our model, which allows us to efficiently update just a fraction of the model's parameters, enhancing training speed and reducing computational load.
model = FastLanguageModel.get_peft_model(
    model,
    r = 16, # LoRA rank. Choose any number > 0; suggested values are 8, 16, 32, 64, 128
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 16,
    lora_dropout = 0, # 0 is the optimized setting in Unsloth
    bias = "none", # "none" is the optimized setting in Unsloth
    use_gradient_checkpointing = "unsloth", # saves VRAM, useful for long contexts
    random_state = 3407,
    use_rslora = False, # rank-stabilized LoRA
    loftq_config = None, # LoftQ quantization-aware initialization
)
Data Prep
We now use the Alpaca dataset from yahma, a filtered version of the original 52K-example Alpaca dataset. You can replace this code section with your own data prep.
Then, we define a system prompt that formats tasks into instructions, inputs, and responses, and apply it to a dataset to prepare our inputs and outputs for the model, with an EOS token to signal completion.
# this is basically the system prompt
alpaca_prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.
### Instruction:
{}
### Input:
{}
### Response:
{}"""
EOS_TOKEN = tokenizer.eos_token # do not forget this part!
def formatting_prompts_func(examples):
    instructions = examples["instruction"]
    inputs = examples["input"]
    outputs = examples["output"]
    texts = []
    for instruction, input, output in zip(instructions, inputs, outputs):
        # Append EOS_TOKEN; without it, generation goes on forever!
        text = alpaca_prompt.format(instruction, input, output) + EOS_TOKEN
        texts.append(text)
    return { "text" : texts, }
pass
from datasets import load_dataset
dataset = load_dataset("yahma/alpaca-cleaned", split = "train")
dataset = dataset.map(formatting_prompts_func, batched = True,)
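As an optional sanity check (not part of the original notebook), you can print one formatted example to confirm the template and the EOS token were applied:

print(dataset[0]["text"])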
Train the model
For a full training run, set num_train_epochs = 1 and remove the max_steps cap (set it to None); the max_steps = 20 used below is only a quick demo. At this stage, we configure the model's training setup, defining things like batch size and learning rate, so the model can learn effectively from the data we have prepared.
from trl import SFTTrainer
from transformers import TrainingArguments
trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    dataset_num_proc = 2,
    packing = False, # Can make training 5x faster for short sequences.
    args = TrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        warmup_steps = 5,
        max_steps = 20, # small value for a quick demo; overrides num_train_epochs. Increase or remove for a longer run.
        num_train_epochs = 1,
        learning_rate = 2e-4,
        fp16 = not torch.cuda.is_bf16_supported(),
        bf16 = torch.cuda.is_bf16_supported(),
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
    ),
)
Start training
We now kick off the actual training of our model, which prints statistics showing how well it is learning.
trainer_stats = trainer.train()
Unsloth - 2x faster free finetuning | Num GPUs used = 1
Num examples = 51,760 | Num Epochs = 1 | Total steps = 20
Batch size per device = 2 | Gradient accumulation steps = 4
Data Parallel GPUs = 1 | Total batch size (2 x 4 x 1) = 8
Trainable parameters = 41,943,040 of 8,072,204,288 (0.52% trained)
Unsloth: Will smartly offload gradients to save VRAM!
Inference
Let's run the model!
FastLanguageModel.for_inference(model) # enable Unsloth's faster inference mode
inputs = tokenizer(
    [
        alpaca_prompt.format(
            "List the prime numbers contained within the range.", # instruction
            "1-50", # input
            "", # output - leave this blank for generation!
        )
    ], return_tensors = "pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens = 128, use_cache = True)
tokenizer.batch_decode(outputs)
You can also use a TextStreamer for continuous inference - so you can see the generation token by token, instead of waiting the whole time!
FastLanguageModel.for_inference(model)
inputs = tokenizer(
    [
        alpaca_prompt.format(
            "Convert these binary numbers to decimal.", # instruction
            "1010, 1101, 1111", # input
            "", # output - leave this blank for generation!
        )
    ], return_tensors = "pt").to("cuda")
from transformers import TextStreamer
text_streamer = TextStreamer(tokenizer)
_ = model.generate(**inputs, streamer = text_streamer, max_new_tokens = 128)
Conclusion
LLM fine-tuning is a powerful technique that enables AI engineers to create specialized, high-performing models for specific applications while leveraging the broad knowledge of pre-trained models. Success requires careful attention to data quality, training methodology, and evaluation practices. As the field evolves, new techniques continue to make fine-tuning more efficient and accessible, making it an essential skill for AI engineers working with language models.