Supervised Fine-Tuning: When Should You Train the Model — and When Should You Not?

Supervised Fine-Tuning (SFT) has become the new buzzword in the AI world. OpenAI, Mistral, Google and a handful of open-source platforms now all offer to "train your own model" on your data. It sounds like the ultimate solution: an AI that speaks your language, knows your domain, and writes in your tone.

But SFT is one tool out of four, and it's rarely the first one you should reach for. In this article we walk through the whole spectrum — from pure prompt engineering, via RAG, to SFT — and we add a fourth approach that we've developed ourselves at broberg.ai: Trail, a compile-at-ingest knowledge engine built on a principle we call Compacting Neurons.

What is Supervised Fine-Tuning, really?

SFT means taking an existing, already-trained language model and training it further on your own dataset of input/output pairs. "Supervised" refers to the fact that you show the model the answer key: this is what a question looks like, and this is what the answer should look like.

The result is a model where your knowledge, tone, or understanding of the task is baked directly into the weights themselves. The model doesn't need long instructions or lookups — it has been shaped by your examples.

In practice, most providers use a technique called LoRA (Low-Rank Adaptation), where you don't retrain the entire model but only a small "adapter" on top of it. This makes training both faster and significantly cheaper, and it's more than sufficient for most purposes: tone of voice, fixed formats, domain language, and classification tasks.

The four approaches — an overview

Before choosing SFT, you should know the alternatives. We see them as a staircase, where each step costs more in complexity — and where you should only climb to the next step if the previous one isn't enough.

1. BASE: Prompt Engineering

The simplest approach is to not touch the model at all. Instead, you write a well-crafted instruction — a system prompt — possibly with a few examples (few-shot). The model is unchanged; all the customization lives in the context.

The strength is speed and flexibility. You can change behavior in five minutes, switch models without losing anything, and there is no training cost whatsoever.

The weakness is that the instruction has to be sent along with every single request. At high volume you pay for the same thousand tokens over and over, and the model can still "forget" instructions in long conversations. Prompt engineering also doesn't change what the model knows — only how it behaves.

2. RAG: Retrieval Augmented Generation

RAG solves the knowledge problem by giving the model access to a reference library. Your documents are chopped into pieces (chunks), converted into embeddings, and stored in a vector database. When the user asks a question, the most "similar" chunks are found and pasted into the prompt alongside the question.

The strength is that knowledge can be updated continuously without touching the model, and that answers can reference sources.

The weakness is that RAG is, fundamentally, a guess based on semantic similarity. Chunking destroys coherence — a sentence in the middle of a document loses its context. Vector search finds what resembles the question, not necessarily what answers it. And the entire pipeline (embedding model, vector database, chunking strategy, re-ranking) is a standalone system that has to be operated, tuned, and paid for.

3. SFT: Supervised Fine-Tuning

When neither instructions nor lookups are enough, you can shape the model itself.

The strength shows up in three places in particular. At high volume: a fine-tuned model doesn't need a long system prompt, so you save tokens on every single request — at hundreds of thousands of daily calls, that adds up to real money. With consistent tone and format: the model hits the style every time, without needing to be reminded. And with narrow, repetitive tasks such as classification, extraction, or domain-specific generation, where general-purpose models consistently miss the mark.

The weakness is that knowledge becomes frozen in place. If your domain changes, you have to retrain. Fine-tuning requires a curated dataset — typically hundreds of clean examples — and data curation is often the hidden cost. And most importantly: SFT teaches the model patterns, not facts. It's excellent at teaching a model to write like you, but unreliable at teaching it what you know — that's where it starts guessing confidently.

4. Trail: Compacting Neurons

At broberg.ai we work with a fourth path, one that attacks the problem from a different angle than both RAG and SFT: Trail — a compile-at-ingest knowledge engine.

Where RAG defers all the work to query time ("find something that looks similar, and hope it fits"), and SFT bakes knowledge statically into model weights, Trail does the work when knowledge arrives. Every time a document, an email, a meeting note, or an article is ingested, it is compiled: the essence is distilled, facts are structured, and typed relationships are built to the knowledge that already exists. That's what we call Compacting Neurons — knowledge is compressed into dense, connected units rather than sitting as raw text fragments in a vector database.

The consequences are significant:

No embeddings, no vector database. Trail uses classic full-text search combined with a typed graph layer. That means deterministic, explainable lookups — you can always see why a piece of knowledge was found, and where it comes from. Provenance is built in, not bolted on.

Living knowledge. Unlike SFT, nothing is frozen. New knowledge is compiled in continuously and automatically connected to what already exists. Outdated knowledge can be replaced or flagged — without retraining.

Curation at the source. Because compilation happens at ingest, the quality work happens in one place, once. RAG systems, by contrast, pay for embedding and re-ranking of the same raw material on every single query.

Comparison

	Prompt Engineering	RAG	SFT	Trail
Changes the model?	No	No	Yes (weights)	No
Knowledge can be updated	Instantly	Continuously	Requires retraining	Continuously (compiled at ingest)
Best for	Behavior and quick experiments	Lookups across large, raw document volumes	Tone, format, narrow tasks at high volume	Curated, connected knowledge with provenance
Explainability	High	Low (semantic guess)	Low (black box)	High (typed relationships, source trail)
Upfront cost	Minimal	Medium (pipeline + vector DB)	High (dataset + training)	Medium (compilation at ingest)
Ongoing cost	Tokens per call	Embedding + search per call	Low per call, retraining on change	Low (the work is done at ingest)

Our recommendation: climb the staircase — don't skip down it

Always start with prompt engineering. It solves more problems than most people think, and it costs nothing to try.

If the problem is knowledge — the model needs to know your documents, customers, or history — then the answer is a knowledge layer, not training. Here the question is whether your need is raw search across large document volumes (RAG) or curated, connected knowledge with traceability (Trail). Our experience is that the vast majority of business cases are, in reality, the latter: you don't just want to find something, you want to be able to trust it and see where it comes from.

If the problem, on the other hand, is behavior at scale — thousands of daily calls where the tone has to be locked in, or a narrow task where general-purpose models consistently fail — then SFT is the right tool. And with EU-based providers like Mistral offering hosted fine-tuning with European data residency, it doesn't have to be expensive or compliance-heavy either.

The most important point is that the four approaches don't exclude each other. The strongest architecture is often a combination: a lightly fine-tuned model for tone and format, connected to a Trail knowledge layer for facts and provenance, guided by a tight prompt. The model knows how to speak — the knowledge layer knows what is true.

broberg.ai builds AI-native tools and infrastructure out of Aalborg, Denmark. Trail is our compile-at-ingest knowledge engine — read more at trailmem.com.