Fine-tuning open-source LLMs (Llama 3.1, Mistral, Qwen 2.5) has taken off between 2024 and 2026. Costs have dropped roughly 10x, which lets SMEs with proprietary data get a highly specialized LLM for 500-5000€. Here is how to do it in 2026, and when it is actually worth it.
TL;DR
- LoRA / QLoRA: the modern techniques, reaching roughly 90% of full fine-tuning quality at 1-5% of the cost.
- Datasets: 500-5000 quality examples beat 50K noisy examples.
- Total cost: 200-3000€ training (one-off) + 50-500€/month hosting.
- Clear ROI over RAG when you need a specific style/format, latency is critical, or LLM API costs are high.
RAG vs fine-tuning vs prompting
| Need | Recommended solution |
|---|---|
| Dynamic knowledge (docs change) | RAG |
| Specific style or format | Fine-tuning |
| Specialized domain vocabulary | Fine-tuning |
| Multi-task one-shot | Prompting (with few-shot) |
| Reduce inference cost | Fine-tuning (smaller model) |
| Compliance / no data outside | Self-hosted fine-tuning |
| Start fast | Prompting + RAG |
A common path: prompting on day 1 → RAG by day 7 → fine-tuning around day 90, if justified.
2026 models to fine-tune
| Model | Sizes | Best use case |
|---|---|---|
| Llama 3.1 | 8B, 70B, 405B | All-purpose strong |
| Mistral 7B / Mixtral | 7B, 8x7B, 8x22B | Fast, efficient |
| Qwen 2.5 | 7B, 14B, 72B | Multilingual (good Chinese, French) |
| Phi-3 / Phi-3.5 | 3.8B, 14B | Small + smart |
| DeepSeek | 7B, 67B | Code + math |
| Gemma 3 | 1B, 4B, 12B, 27B | Edge / mobile |
For an SME in 2026, Llama 3.1 8B or Qwen 2.5 7B is the performance/cost sweet spot.
2026 fine-tuning techniques
Full fine-tuning (avoid)
- Re-trains all parameters
- Cost: 10K-100K€ for Llama 70B
- GPU memory: multiple 80GB GPUs (A100 / H100)
- "Catastrophic forgetting" risk
Only if truly necessary (rarely).
LoRA (Low-Rank Adaptation)
- Trains only 0.1-1% of the parameters
- Cost: 200-2000€ for Llama 8B
- Memory: 16-24GB (RTX 4090, A6000)
- Performance: 90-95% of full fine-tuning
Modern standard 2024-2026.
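To see where the 0.1-1% figure comes from, here is a rough count of LoRA parameters. It is a sketch under stated assumptions: the published Llama 3.1 8B shapes (hidden size 4096, 32 layers, 1024-wide key/value projections), rank 16 on the four attention projections, and an approximate total of 8.03B parameters. The function name is illustrative.

```python
def lora_param_count(shapes, rank, num_layers):
    """LoRA adds two matrices per adapted weight: A (rank x d_in) and B (d_out x rank)."""
    per_layer = sum(rank * (d_in + d_out) for d_in, d_out in shapes)
    return per_layer * num_layers

# (d_in, d_out) for q_proj, k_proj, v_proj, o_proj; assumed Llama 3.1 8B shapes
attention_shapes = [(4096, 4096), (4096, 1024), (4096, 1024), (4096, 4096)]

lora_params = lora_param_count(attention_shapes, rank=16, num_layers=32)
total_params = 8.03e9  # approximate size of the base model (assumption)

print(f"LoRA parameters: {lora_params:,}")
print(f"Share of base model: {lora_params / total_params:.2%}")
```

Under these assumptions the adapter holds about 13.6M trainable parameters, or roughly 0.17% of the base model. Raising the rank or targeting the MLP layers as well moves the share toward the upper end of the range.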
QLoRA (Quantized LoRA)
- 4-bit quantization + LoRA
- Cost: 100-500€ for Llama 8B
- Memory: 12-16GB (RTX 3090, 4090)
- Performance: 85-92% of full fine-tuning
- Allows fine-tuning Llama 70B on a single high-memory GPU (48-80GB)
For tight budget.
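A back-of-the-envelope calculation shows why 4-bit quantization changes what fits on one GPU. This sketch counts the weights only, using rounded parameter counts of 8B and 70B. Activations, LoRA gradients, optimizer state, and quantization metadata come on top, so real usage is higher.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Memory for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9

for name, params in [("Llama 8B", 8e9), ("Llama 70B", 70e9)]:
    bf16 = weight_memory_gb(params, 16)
    int4 = weight_memory_gb(params, 4)
    print(f"{name}: {bf16:.0f} GB in bf16, {int4:.0f} GB in 4-bit")
```

At 4 bits the 70B weights drop from about 140 GB to about 35 GB, which is what brings the model within reach of a single large GPU.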
Dataset preparation
Conversation format (chat)
`json
{
"messages": [
{"role": "system", "content": "You are a Senegal legal expert."},
{"role": "user", "content": "What documents to create SARL?"},
{"role": "assistant", "content": "To create SARL in Senegal, you need: 1) signed articles..."}
]
}
`
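Before training, it can help to check that every line of the JSONL file follows this structure. The validator below is a minimal sketch: the function names and the exact rules (allowed roles, assistant speaks last) are assumptions for illustration, not requirements of any particular library.

```python
import json

ALLOWED_ROLES = {"system", "user", "assistant"}

def validate_record(record):
    """Return a list of problems found in one chat-format training record."""
    if not isinstance(record, dict):
        return ["record must be a JSON object"]
    messages = record.get("messages")
    if not isinstance(messages, list) or not messages:
        return ["'messages' must be a non-empty list"]
    problems = []
    for i, msg in enumerate(messages):
        if not isinstance(msg, dict):
            problems.append(f"message {i}: must be an object")
            continue
        if msg.get("role") not in ALLOWED_ROLES:
            problems.append(f"message {i}: unknown role {msg.get('role')!r}")
        content = msg.get("content")
        if not isinstance(content, str) or not content.strip():
            problems.append(f"message {i}: empty content")
    last = messages[-1]
    if not isinstance(last, dict) or last.get("role") != "assistant":
        problems.append("last message should come from the assistant")
    return problems

def validate_jsonl(lines):
    """Map 1-based line numbers to problems; valid lines are omitted."""
    report = {}
    for line_no, line in enumerate(lines, start=1):
        try:
            problems = validate_record(json.loads(line))
        except json.JSONDecodeError as exc:
            problems = [f"invalid JSON: {exc.msg}"]
        if problems:
            report[line_no] = problems
    return report

sample = [
    '{"messages": [{"role": "user", "content": "Hi"}, {"role": "assistant", "content": "Hello"}]}',
    '{"messages": [{"role": "user", "content": "Hi"}]}',
]
report = validate_jsonl(sample)
print(report)
```

The second sample line is flagged because it has no assistant reply for the model to learn from.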
Optimal quantity
- Simple domain adaptation: 200-500 examples
- Specific style / format: 500-1500
- Complex new skill: 2000-10K
More isn't better. Quality > quantity.
Dataset sources
- Generated by GPT-4 / Claude (synthetic)
- Production logs (anonymized)
- Internal documentation (converted into Q&A pairs)
- Existing reviews / FAQ
- Human annotations (gold standard)
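Whatever the source, a quick cleaning pass supports the "quality over quantity" point. The sketch below drops duplicate prompts and very short answers from chat-format records. The function name and the 20-character threshold are hypothetical, and real pipelines usually add near-duplicate detection and manual review.

```python
def clean_examples(records, min_answer_chars=20):
    """Drop repeated prompts and very short answers, keeping the first occurrence."""
    seen = set()
    kept = []
    for record in records:
        messages = record["messages"]
        # Normalize the user turns so trivial spacing or case changes count as duplicates
        prompt = " ".join(
            m["content"].strip().lower() for m in messages if m["role"] == "user"
        )
        answer = messages[-1]["content"].strip()
        if prompt in seen or len(answer) < min_answer_chars:
            continue
        seen.add(prompt)
        kept.append(record)
    return kept

records = [
    {"messages": [{"role": "user", "content": "What is a SARL?"},
                  {"role": "assistant", "content": "A SARL is a limited liability company."}]},
    {"messages": [{"role": "user", "content": "what is a SARL? "},
                  {"role": "assistant", "content": "It is a limited liability company form."}]},
    {"messages": [{"role": "user", "content": "Fees?"},
                  {"role": "assistant", "content": "Yes."}]},
]
cleaned = clean_examples(records)
print(len(records), "->", len(cleaned))
```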
Llama 8B with LoRA fine-tuning pipeline
Step 1 — environment
`bash
pip install transformers peft trl bitsandbytes accelerate
`
Step 2 — load model
`python
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# Load the base weights in 4-bit to cut GPU memory (this makes the run QLoRA-style)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype='bfloat16',  # computations run in bf16
)
model = AutoModelForCausalLM.from_pretrained(
    'meta-llama/Llama-3.1-8B-Instruct',
    quantization_config=bnb_config,
    device_map='auto',  # place layers on the available GPU(s) automatically
)
tokenizer = AutoTokenizer.from_pretrained('meta-llama/Llama-3.1-8B-Instruct')

# Attach small trainable LoRA matrices to the attention projections
lora_config = LoraConfig(
    r=16,  # LoRA rank
    lora_alpha=32,  # scaling factor (effective scale = lora_alpha / r)
    target_modules=['q_proj', 'k_proj', 'v_proj', 'o_proj'],
    lora_dropout=0.05,
    bias='none',
    task_type='CAUSAL_LM',
)
model = get_peft_model(model, lora_config)
`
Step 3 — train
`python
from transformers import TrainingArguments
from trl import SFTTrainer
from datasets import load_dataset
dataset = load_dataset('json', data_files='train.jsonl', split='train')
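# Argument names below match older TRL releases; recent versions renamed or moved
# some of them, so check the SFTTrainer docs for the version you have installed
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    args=TrainingArguments(
        output_dir='./llama-finetune',
        num_train_epochs=3,
        per_device_train_batch_size=4,
        gradient_accumulation_steps=4,  # effective batch size: 4 x 4 = 16 per GPU
        learning_rate=2e-4,
        bf16=True,
        save_strategy='epoch',  # write a checkpoint after every epoch
    ),
    peft_config=lora_config,
    max_seq_length=2048,  # longer examples are truncated
)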
trainer.train()
trainer.save_model()
`
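The batch settings above determine how many optimizer steps a run takes. A rough estimate, assuming a single GPU and the 5K-example dataset used elsewhere in this article (the helper name is illustrative, and the count the trainer reports may differ slightly depending on how it handles the last partial batch):

```python
import math

def training_steps(num_examples, per_device_batch, grad_accum, epochs, num_gpus=1):
    """Return (effective batch size, approximate number of optimizer steps)."""
    effective_batch = per_device_batch * grad_accum * num_gpus
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    return effective_batch, steps_per_epoch * epochs

effective_batch, total_steps = training_steps(
    5000, per_device_batch=4, grad_accum=4, epochs=3
)
print(f"effective batch: {effective_batch}, optimizer steps: {total_steps}")
```

With these values the model sees 16 examples per weight update and takes on the order of 940 updates in total.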
Step 4 — merge + deploy
`python
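# Merge LoRA into base model
# Note: the base was loaded in 4-bit in Step 2; merging into quantized weights is lossy,
# so consider reloading the base in 16-bit before merging
model = model.merge_and_unload()
model.save_pretrained('./llama-finetuned-merged')
tokenizer.save_pretrained('./llama-finetuned-merged')  # serving tools expect the tokenizer next to the weights
# Deploy via vLLM, TGI, or Ollama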
`
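Merging works because a LoRA update is a low-rank matrix added to the frozen weight: the layer computes W x + (alpha / r) B A x, which equals (W + (alpha / r) B A) x. The toy example below checks that identity in pure Python on tiny made-up matrices. It illustrates the arithmetic only and is not how PEFT implements the merge.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def add(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def scale(a, s):
    return [[x * s for x in row] for row in a]

# Toy sizes: 2 outputs, 3 inputs, rank 1 (values chosen for easy arithmetic)
W = [[1.0, 0.0, 2.0], [0.0, 1.0, 0.0]]  # frozen base weight (2x3)
B = [[1.0], [2.0]]                      # LoRA B matrix (2x1)
A = [[0.5, 0.0, 1.0]]                   # LoRA A matrix (1x3)
alpha, r = 2, 1
scaling = alpha / r

x = [[1.0], [2.0], [3.0]]               # one input column vector

# Unmerged adapter: two extra small matmuls on every forward pass
unmerged = add(matmul(W, x), scale(matmul(B, matmul(A, x)), scaling))

# Merged adapter: fold the update into W once, then a single matmul
W_merged = add(W, scale(matmul(B, A), scaling))
merged = matmul(W_merged, x)

print(unmerged, merged)
```

Both paths give the same output, which is why a merged model can be served like any ordinary checkpoint, without the adapter's extra computation.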
Hosting fine-tuned model
Option 1 — self-hosted vLLM
`bash
vllm serve ./llama-finetuned-merged --port 8000
`
A dedicated GPU costs 50-300€/month (cloud RTX 4090) or 200-1500€/month (A100/H100).
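Once the server is up, it can be queried through vLLM's OpenAI-compatible HTTP API. The sketch below only builds the request with the standard library. It assumes the server runs on localhost:8000 and that the model is registered under the path passed to `vllm serve`, which is the default unless a served model name is configured. The helper name and the `max_tokens` value are illustrative.

```python
import json
import urllib.request

def build_chat_request(prompt,
                       base_url="http://localhost:8000",
                       model="./llama-finetuned-merged"):
    """Build a POST request for the OpenAI-compatible chat completions endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 256,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

request = build_chat_request("What documents are needed to create a SARL?")

# To send it (requires the server to be running):
# with urllib.request.urlopen(request) as response:
#     print(json.load(response)["choices"][0]["message"]["content"])
```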
Option 2 — Together AI / Replicate / Modal
- Upload model → API endpoint
- Cost: pay-per-use, 0.10-2€/1M tokens
- Good for MVP, auto-scales
Option 3 — local Ollama (dev)
`bash
ollama create my-model -f Modelfile
ollama run my-model
`
Fine-tuned model eval
Classic metrics
- Perplexity on validation set
- BLEU / ROUGE for summarization
- Exact match / accuracy for QA
Modern metrics (LLM-as-judge)
- LLM judge on quality / faithfulness
- A/B test with base model
- Human eval on 100 examples
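As a minimal illustration of the exact-match metric and the comparison against the base model, the sketch below scores two sets of answers against the same references. The strings are toy placeholders rather than real model outputs, and the normalization (lowercasing, collapsing whitespace) is one possible choice among many.

```python
def normalize(text):
    """Lowercase and collapse whitespace so trivial differences do not count as errors."""
    return " ".join(text.lower().split())

def exact_match_accuracy(predictions, references):
    """Fraction of predictions that equal their reference after normalization."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    matches = sum(normalize(p) == normalize(r) for p, r in zip(predictions, references))
    return matches / len(references)

# Toy placeholders, not real model outputs
references = ["Dakar", "SARL", "25%"]
base_outputs = ["dakar", "SA", "about 30%"]
tuned_outputs = ["Dakar", "sarl", "25%"]

base_acc = exact_match_accuracy(base_outputs, references)
tuned_acc = exact_match_accuracy(tuned_outputs, references)
print(f"base: {base_acc:.2f}, fine-tuned: {tuned_acc:.2f}")
```

Running the same function on the base model first gives the baseline that the fine-tuned score has to beat.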
Total Llama 8B fine-tuning cost 2026
| Item | Cost |
|---|---|
| Dataset prep (5K examples) | 0-500€ (depends on source) |
| GPU rental for training (RunPod/Lambda) | 30-200€ (3-12h on an H100) |
| Dedicated model hosting (monthly) | 50-500€ |
| Eval + iteration | 100-500€ |
| Startup total (including the first month of hosting) | 180-1700€ |
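Compared with GPT-4o at $2.50 per 1M tokens (its input-token list price when this was written; output tokens are billed higher), 100M tokens cost about $250. Above that volume, a low-end fine-tune starts to pay for itself.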
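The comparison can be made concrete with a small calculation. This sketch uses the $2.50 per 1M tokens figure quoted above and the 50-500€ monthly hosting range from the table. It treats euros and dollars as roughly equal and ignores the one-off training cost, so it is a sizing aid rather than a precise model.

```python
def api_cost(tokens_millions, price_per_million):
    """API spend for a given volume, in the currency of the price."""
    return tokens_millions * price_per_million

def breakeven_tokens_millions(monthly_hosting, price_per_million):
    """Monthly token volume (in millions) at which API spend equals a fixed hosting bill."""
    return monthly_hosting / price_per_million

print(api_cost(100, 2.50))  # cost of 100M tokens through the API
for hosting in (50, 500):
    volume = breakeven_tokens_millions(hosting, 2.50)
    print(f"hosting at {hosting}/month breaks even at {volume:.0f}M tokens/month")
```

Under these assumptions, self-hosting matches the API bill somewhere between 20M and 200M tokens per month, depending on the GPU tier.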
Common mistakes
- Dataset too small (<100 examples): very likely to overfit.
- Noisy dataset: the model learns the noise.
- No baseline eval: you cannot tell whether the fine-tune improved anything.
- Not merging LoRA before deployment: adds inference latency.
- Bad hyperparameters: a learning rate that is too high causes catastrophic forgetting.
- No regression test: the fine-tuned model may lose general abilities unnoticed.
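The last two points can be guarded against with a simple regression check: score the base and fine-tuned models on the same tasks and flag any task where the score dropped. The sketch below assumes the scores are already computed. The task names, the example scores, and the 0.02 tolerance are hypothetical.

```python
def find_regressions(base_scores, tuned_scores, tolerance=0.02):
    """Return tasks where the fine-tuned score fell more than `tolerance` below the base score."""
    regressions = {}
    for task, base in base_scores.items():
        tuned = tuned_scores.get(task)
        if tuned is None:
            continue  # task was not evaluated on the fine-tuned model
        drop = base - tuned
        if drop > tolerance:
            regressions[task] = round(drop, 4)
    return regressions

# Hypothetical scores for illustration only
base_scores = {"general_qa": 0.80, "domain_qa": 0.50}
tuned_scores = {"general_qa": 0.70, "domain_qa": 0.90}

regressions = find_regressions(base_scores, tuned_scores)
print(regressions)
```

Here the domain task improves while the general task regresses, the pattern to watch for after an aggressive fine-tune.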
FAQ
Q: How long does training Llama 8B with LoRA take?
A: 3-12 hours on 1 H100 or 12-48h on RTX 4090, for 5K examples.
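Q: Fine-tuning Llama vs the OpenAI fine-tuning API?
A: OpenAI fine-tuning costs about $25 per 1M training tokens (GPT-4o), and the fine-tuned model is billed at a higher inference rate than the base model. Llama LoRA + self-hosting is often cheaper at scale.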
Q: When to use RAG vs fine-tuning?
A: RAG if the knowledge changes. Fine-tuning if you need a specific style or skill. Often both.
Conclusion
In 2026, fine-tuning open-source LLMs is within reach of SMEs: 200-2000€ for Llama 8B with LoRA. Standard workflow: dataset → LoRA → eval → deploy via vLLM. The ROI is clear when style, format, or cost matter. Otherwise, RAG suffices.
Mohamed Bah
Founder, Kolonell
Passionate about digital and entrepreneurship in Africa, Mohamed has been helping Senegalese businesses with their digital transformation since 2020. Founder of Kolonell, he believes every SME deserves a professional and accessible online presence.