What is LLM Fine-Tuning?
Fine-tuning is a machine learning technique where a pre-trained Large Language Model (LLM) is further trained on a smaller, task-specific dataset to adapt it for particular applications or domains. Instead of training a model from scratch, fine-tuning leverages the general knowledge already learned by the base model and specializes it for specific use cases.
LoRA: Low-Rank Adaptation of Large Language Models - Original Microsoft Paper Summary
The original LoRA paper, titled "LoRA: Low-Rank Adaptation of Large Language Models," was authored by Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, and others from Microsoft Research.
ArXiv Paper: https://arxiv.org/abs/2106.09685
The Low-Rank Hypothesis
The authors hypothesized that the change in weights during model adaptation has a low "intrinsic rank," meaning that the weight updates can be represented using matrices of much lower dimensionality than the original weight matrices. This insight led to the development of the LoRA approach, which dramatically reduces the number of trainable parameters needed for fine-tuning.
Technical Methodology
LoRA works by freezing the pre-trained model weights and injecting trainable rank decomposition matrices into each layer of the Transformer architecture. Instead of updating the original weight matrix W, LoRA adds a low-rank decomposition ΔW = BA, where B and A are much smaller matrices with rank r << min(d, k).
The core equation is h = Wx + ΔWx = Wx + BAx, where:
W ∈ ℝ^(d×k) is the frozen pre-trained weight matrix
B ∈ ℝ^(d×r) and A ∈ ℝ^(r×k) are the trainable low-rank matrices
r is the rank (typically r << min(d, k)), so B and A together contribute only r·(d+k) trainable parameters versus the d·k parameters of W
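To make the equation and the parameter savings concrete, here is a minimal, self-contained PyTorch sketch of a LoRA-style linear layer. It is an illustration only, not the implementation used by Unsloth or the peft library, and the dimensions and scaling convention are assumptions.

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, d, k, r=16, alpha=16):
        super().__init__()
        self.W = nn.Linear(k, d, bias=False)              # frozen pre-trained weight W ∈ R^(d×k)
        self.W.weight.requires_grad_(False)
        self.A = nn.Parameter(torch.randn(r, k) * 0.01)   # trainable A ∈ R^(r×k)
        self.B = nn.Parameter(torch.zeros(d, r))          # trainable B ∈ R^(d×r), zero-init so ΔW = BA starts at 0
        self.scaling = alpha / r

    def forward(self, x):
        # h = Wx + BAx (scaled), i.e. the core LoRA equation above
        return self.W(x) + (x @ self.A.T @ self.B.T) * self.scaling

layer = LoRALinear(d=4096, k=4096, r=16)
frozen = sum(p.numel() for p in layer.parameters() if not p.requires_grad)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(f"frozen: {frozen:,} | trainable: {trainable:,}")  # 16,777,216 frozen vs 131,072 trainable

For a single 4096x4096 projection, the adapter trains roughly 0.8% of the parameters of the frozen matrix, which is why LoRA fine-tuning fits on much smaller GPUs.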
The LoRA paper represents a paradigm shift in how we approach fine-tuning large language models. By demonstrating that effective adaptation can be achieved with minimal parameter updates, it opened the door to more accessible and efficient model customization. The work has inspired numerous follow-up research directions and has become essential knowledge for AI engineers working with large language models in production environments.
What is LLM Quantization?
Quantization is a crucial optimization technique that reduces the precision of model weights and activations from higher-precision formats (like 32-bit floating point) to lower-precision formats (such as 8-bit or 4-bit integers). This process dramatically reduces memory usage, storage requirements, and computational costs while maintaining acceptable model performance.
Modern frameworks like PyTorch, Hugging Face Transformers, and specialized tools like llama.cpp provide comprehensive quantization support. These tools make it easier for engineers to implement quantization without deep technical expertise in the underlying algorithms.
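For example, with Hugging Face Transformers plus bitsandbytes, a model can be loaded in 4-bit with a few lines. This is a hedged sketch; the model name and the NF4/bfloat16 settings are example choices, not requirements.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# Example: load a causal LM with 4-bit NF4 quantization via bitsandbytes
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model_name = "meta-llama/Meta-Llama-3-8B"  # example model; any causal LM works
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto",  # requires the accelerate package
)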
Engineers should carefully evaluate quantized models using appropriate benchmarks and real-world test cases to ensure acceptable performance for their specific applications. This includes testing edge cases and monitoring for potential degradation in model capabilities.
Benefits of Fine-Tuning with Unsloth
Unsloth is an open-source framework purpose-built for fast and efficient fine-tuning of large language models (LLMs). It provides an optimized training backend, making fine-tuning possible even on limited hardware setups by drastically improving training speed and memory efficiency.
Website: https://unsloth.ai
The model
The model used in this project comes from Unsloth, which maintains a model catalog of Dynamic GGUF, 4-bit, and 16-bit models.
The model for this project is unsloth/llama-3-8b-bnb-4bit, a Llama-3 8B base model pre-quantized to 4-bit with bitsandbytes.
The dataset
Datasets play a critical role in Large Language Model (LLM) quantization, particularly in post-training quantization (PTQ) methods where calibration data is essential for determining optimal quantization parameters. The calibration dataset serves as a representative sample that helps the quantization algorithm understand the typical activation patterns and weight distributions of the model during inference.
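As a conceptual sketch only (not any specific library's API), PTQ calibration amounts to running representative inputs through a layer, observing activation ranges, and deriving quantization scales from them:

import torch

def calibrate_scale(layer, calibration_batches):
    """Derive a per-tensor symmetric int8 scale from observed activation ranges (illustrative only)."""
    max_abs = 0.0
    with torch.no_grad():
        for batch in calibration_batches:      # batches drawn from a representative dataset
            activations = layer(batch)         # observe typical activation values
            max_abs = max(max_abs, activations.abs().max().item())
    return max_abs / 127.0                     # map the observed range onto int8

def quantize(tensor, scale):
    return torch.clamp(torch.round(tensor / scale), -128, 127).to(torch.int8)

layer = torch.nn.Linear(16, 16)
calib = [torch.randn(4, 16) for _ in range(8)]  # stand-in for real calibration data
scale = calibrate_scale(layer, calib)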
The dataset used in this project comes from Hugging Face: yahma/alpaca-cleaned.
Alpaca is a dataset of 52,000 instructions and demonstrations generated by OpenAI's text-davinci-003 engine. This instruction data can be used for instruction-tuning language models so that they follow instructions better. yahma/alpaca-cleaned is a cleaned version of the original Alpaca dataset released by Stanford.
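Each record contains an instruction, an optional input providing context, and the expected output. An illustrative (made-up, not literal) record looks like this:

# Illustrative Alpaca-style record (instruction / input / output)
example = {
    "instruction": "Summarize the following paragraph in one sentence.",
    "input": "Large language models are pre-trained on vast text corpora and can be adapted to new tasks.",
    "output": "LLMs learn from huge text corpora and can then be specialized for new tasks.",
}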
Hardware considerations and Google Colab
Fine-tuning LLMs is extremely memory-intensive. For full fine-tuning in half precision with an 8-bit optimizer, a 7B model can require approximately 70GB of VRAM (14GB for model weights + 42GB for gradients + 14GB for optimizer states). Most practitioners use high-end GPUs such as the A100, V100, and H100 for fine-tuning 7B and 13B parameter models. These GPUs provide the necessary VRAM and compute power for efficient training, with A100s offering 40-80GB of VRAM and H100s providing even more capacity.
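As a rough back-of-the-envelope sketch, the 70GB figure can be reproduced by multiplying the parameter count by bytes per parameter for each component. The per-component byte counts below simply mirror the estimate above and will vary with precision, optimizer choice, and activation memory.

def estimate_vram_gb(num_params_billions, bytes_weights, bytes_grads, bytes_optim):
    """Very rough VRAM estimate in GB; ignores activations, KV cache, and framework overhead."""
    bytes_per_param = bytes_weights + bytes_grads + bytes_optim
    return num_params_billions * bytes_per_param  # billions of params x bytes/param ~ GB

# Figures matching the article's estimate for a 7B model: 2 + 6 + 2 bytes per parameter
print(estimate_vram_gb(7, bytes_weights=2, bytes_grads=6, bytes_optim=2))  # ~70 GB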
Google Colab's free VMs have hard limits on RAM and VRAM usage. The free tier typically provides limited GPU access with unclear usage limits across multiple sessions, making it challenging for extensive LLM fine-tuning projects. Colab also offers various GPU tiers, including T4, V100, and A100 options, through Colab Pro and Pro+ subscriptions. However, even with paid tiers, fine-tuning large models remains challenging due to memory constraints and session limitations. Despite these limitations, Colab can be useful for experimenting with parameter-efficient fine-tuning methods like LoRA on smaller models (7B parameters or less), and researchers have successfully demonstrated fine-tuning techniques on single GPUs in Colab.
Below is a step-by-step explanation of the notebook.
Installation
%%capture
import torch

# Check the GPU's CUDA compute capability to decide which extra packages to install
major_version, minor_version = torch.cuda.get_device_capability()

# Install Unsloth from source (Colab build)
!pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"

if major_version >= 8:
    # Newer GPUs (Ampere and later, e.g. A100, H100, RTX 30/40 series): include flash-attn
    !pip install --no-deps packaging ninja einops flash-attn xformers trl peft accelerate bitsandbytes
else:
    # Older GPUs (e.g. T4, V100): skip flash-attn
    !pip install --no-deps xformers trl peft accelerate bitsandbytes
pass
Next, we prepare to load one of a range of pre-quantized language models, including the Llama-3 model (trained on roughly 15 trillion tokens), optimized for memory efficiency with 4-bit quantization.
from unsloth import FastLanguageModel
import torch
max_seq_length = 2048 # Choose any! Llama 3 supports up to 8k context.
dtype = None # None = auto-detect. Use float16 on older GPUs (T4, V100), bfloat16 on Ampere and newer.
load_in_4bit = True # Use 4-bit quantization to reduce memory usage. Can be False.
# Pre-quantized 4-bit models available from Unsloth
fourbit_models = [
    "unsloth/mistral-7b-bnb-4bit",
    "unsloth/mistral-7b-instruct-v0.2-bnb-4bit",
    "unsloth/llama-2-7b-bnb-4bit",
    "unsloth/gemma-7b-bnb-4bit",
    "unsloth/gemma-7b-it-bnb-4bit",
    "unsloth/gemma-2b-bnb-4bit",
    "unsloth/gemma-2b-it-bnb-4bit",
    "unsloth/llama-3-8b-bnb-4bit",
]
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/llama-3-8b-bnb-4bit", # Llama-3 70b also works (just change the model name)
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
    # token = "hf_...", # use one if using gated models like meta-llama/Llama-2-7b-hf
)
Next, we integrate LoRA adapters into our model, which allows us to efficiently update just a fraction of the model's parameters, enhancing training speed and reducing computational load.
model = FastLanguageModel.get_peft_model(
    model,
    r = 16, # LoRA rank. Choose any number > 0; suggested values are 8, 16, 32, 64, 128
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 16,
    lora_dropout = 0, # 0 is the optimized setting in Unsloth
    bias = "none", # "none" is the optimized setting in Unsloth
    use_gradient_checkpointing = "unsloth", # saves VRAM, useful for long contexts
    random_state = 3407,
    use_rslora = False, # rank-stabilized LoRA
    loftq_config = None, # LoftQ quantization-aware initialization
)
Data Prep
We now use the Alpaca dataset from yahma, a filtered version of the original 52K-example Alpaca dataset. You can replace this code section with your own data prep.
Then, we define a system prompt that formats tasks into instructions, inputs, and responses, and apply it to a dataset to prepare our inputs and outputs for the model, with an EOS token to signal completion.
# this is basically the system prompt
alpaca_prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.
### Instruction:
{}
### Input:
{}
### Response:
{}"""
EOS_TOKEN = tokenizer.eos_token # do not forget this part!
def formatting_prompts_func(examples):
    instructions = examples["instruction"]
    inputs = examples["input"]
    outputs = examples["output"]
    texts = []
    for instruction, input, output in zip(instructions, inputs, outputs):
        # Append EOS_TOKEN; without it, generation goes on forever!
        text = alpaca_prompt.format(instruction, input, output) + EOS_TOKEN
        texts.append(text)
    return { "text" : texts, }
pass
from datasets import load_dataset
dataset = load_dataset("yahma/alpaca-cleaned", split = "train")
dataset = dataset.map(formatting_prompts_func, batched = True,)
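As an optional sanity check (not part of the original notebook), you can print one formatted example to confirm the template and the EOS token were applied:

print(dataset[0]["text"])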
Train the model
For a full training run, set num_train_epochs = 1 and remove the max_steps cap (set it to None); the max_steps = 20 used below is only a quick demo. At this stage, we configure the model's training setup, defining things like batch size and learning rate, so the model can learn effectively from the data we have prepared.
from trl import SFTTrainer
from transformers import TrainingArguments
trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    dataset_num_proc = 2,
    packing = False, # Can make training 5x faster for short sequences.
    args = TrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        warmup_steps = 5,
        max_steps = 20, # small value for a quick demo; overrides num_train_epochs. Increase or remove for a longer run.
        num_train_epochs = 1,
        learning_rate = 2e-4,
        fp16 = not torch.cuda.is_bf16_supported(),
        bf16 = torch.cuda.is_bf16_supported(),
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
    ),
)
Start training
We now kick off the actual training of our model, which prints statistics showing how well it is learning.
trainer_stats = trainer.train()
Unsloth - 2x faster free finetuning | Num GPUs used = 1
Num examples = 51,760 | Num Epochs = 1 | Total steps = 20
Batch size per device = 2 | Gradient accumulation steps = 4
Data Parallel GPUs = 1 | Total batch size (2 x 4 x 1) = 8
Trainable parameters = 41,943,040 of 8,072,204,288 (0.52% trained)
Unsloth: Will smartly offload gradients to save VRAM!
Inference
Let's run the model!
FastLanguageModel.for_inference(model) # enable Unsloth's faster inference mode
inputs = tokenizer(
    [
        alpaca_prompt.format(
            "List the prime numbers contained within the range.", # instruction
            "1-50", # input
            "", # output - leave this blank for generation!
        )
    ], return_tensors = "pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens = 128, use_cache = True)
tokenizer.batch_decode(outputs)
You can also use a TextStreamer for continuous inference - so you can see the generation token by token, instead of waiting the whole time!
FastLanguageModel.for_inference(model)
inputs = tokenizer(
    [
        alpaca_prompt.format(
            "Convert these binary numbers to decimal.", # instruction
            "1010, 1101, 1111", # input
            "", # output - leave this blank for generation!
        )
    ], return_tensors = "pt").to("cuda")
from transformers import TextStreamer
text_streamer = TextStreamer(tokenizer)
_ = model.generate(**inputs, streamer = text_streamer, max_new_tokens = 128)
Conclusion
LLM fine-tuning is a powerful technique that enables AI engineers to create specialized, high-performing models for specific applications while leveraging the broad knowledge of pre-trained models. Success requires careful attention to data quality, training methodology, and evaluation practices. As the field evolves, new techniques continue to make fine-tuning more efficient and accessible, making it an essential skill for AI engineers working with language models.