RAG production patterns: LLM with private data architecture 2026

Mohamed Bah · Founder, Kolonell
May 31, 2026

RAG (Retrieval-Augmented Generation) has been the dominant pattern from 2024 to 2026 for combining an LLM (GPT-4, Claude, Llama) with private data (docs, knowledge bases, product databases). Simple RAG MVPs are everywhere, but production RAG is a different problem. Here are the architecture patterns that actually work in 2026.

TL;DR

- Smart chunking > naïve 500-token chunking.

- Hybrid search (vector + keyword BM25) > vector alone.

- Re-ranking essential (Cohere, Voyage, Cross-encoder).

- Citations + source attribution non-negotiable.

- Multi-step RAG (decompose, route, synthesize) for complex queries.

Production 2026 RAG architecture

```
Ingestion pipeline:
  Source docs → parser → chunker → embedder → vector DB

Query pipeline:
  User query → query rewriter → hybrid search → re-ranker
  → context assembly → LLM → citation
```

Pattern 1 — Smart chunking

Naive chunking (avoid)

```python
# split_every_500_tokens stands for any fixed-size splitter that ignores structure
chunks = split_every_500_tokens(document)  # ❌ MVP, not production
```

Problems:

  • Cuts mid-sentences / sections
  • Loss of structural context (titles → paragraphs)
  • Degraded embeddings

Production semantic chunking

```python
def smart_chunk(doc, max_tokens=800, overlap=100):
    # parse_markdown, token_count and split_with_overlap are helpers you supply.

    # 1. Preserve structure
    sections = parse_markdown(doc)  # or XML, HTML

    chunks = []
    for section in sections:
        metadata = {
            'title': section.title,
            'level': section.level,
            'parent_titles': section.parents,
        }
        # 2. One chunk per section when it fits in max_tokens
        if token_count(section.text) <= max_tokens:
            texts = [section.text]
        else:
            # 3. Otherwise sub-chunks with a 100-token overlap
            texts = split_with_overlap(section.text, max_tokens=max_tokens, overlap=overlap)
        # Sub-chunks carry the section metadata too, so step 4 works for every chunk
        chunks.extend({'text': text, 'metadata': metadata} for text in texts)

    # 4. Enrich each chunk with parent context
    for chunk in chunks:
        chunk['contextualized'] = (
            f"Document: {doc.title}\n"
            f"Section: {' > '.join(chunk['metadata']['parent_titles'])}\n\n"
            f"{chunk['text']}"
        )
    return chunks
```
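The `split_with_overlap` helper used for oversized sections is not defined in this article. Here is a minimal sketch of what it could look like, assuming plain text as input and whitespace-separated words as a stand-in for real tokens (a production version would count tokens with the embedding model's tokenizer). The parameter names are illustrative.

```python
def split_with_overlap(text, max_tokens=800, overlap=100):
    """Split text into windows of at most max_tokens words.

    Each window repeats the last `overlap` words of the previous one.
    Words are a rough stand-in for tokens (assumption for this sketch).
    """
    if overlap >= max_tokens:
        raise ValueError("overlap must be smaller than max_tokens")
    tokens = text.split()
    if not tokens:
        return []
    step = max_tokens - overlap
    windows = []
    for start in range(0, len(tokens), step):
        windows.append(" ".join(tokens[start:start + max_tokens]))
        # Stop once a window has reached the end of the text
        if start + max_tokens >= len(tokens):
            break
    return windows


# Tiny example with small numbers so the overlap is visible
print(split_with_overlap("a b c d e f g", max_tokens=4, overlap=1))
# → ['a b c d', 'd e f g']
```

The overlap means a sentence cut at a window boundary still appears whole in at least one chunk, at the cost of embedding some text twice.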

Pattern 2 — Hybrid Search (vector + BM25)

Vector search alone misses queries that hinge on exact terms (product codes, IDs). BM25 alone misses semantics. Combine the two:

```python
RRF_K = 60  # commonly used smoothing constant for Reciprocal Rank Fusion

def hybrid_search(query, k=20):
    # embed, vector_db and bm25_index are your own embedding function and indexes

    # 1. Vector search (semantic)
    query_embedding = embed(query)
    vector_results = vector_db.search(query_embedding, k=k)

    # 2. Keyword search (BM25)
    keyword_results = bm25_index.search(query, k=k)

    # 3. Reciprocal Rank Fusion (RRF): each list adds 1 / (RRF_K + rank), ranks start at 1
    fused = {}
    for results in (vector_results, keyword_results):
        for rank, doc in enumerate(results, start=1):
            fused[doc.id] = fused.get(doc.id, 0) + 1 / (RRF_K + rank)

    # 4. Top k (doc_id, score) pairs by fused score
    return sorted(fused.items(), key=lambda x: -x[1])[:k]
```
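To see what the fusion step does in isolation, here is a small self-contained sketch of Reciprocal Rank Fusion over plain lists of document IDs. The function name and the sample IDs are illustrative, and ranks start at 1 as in the usual formulation of RRF.

```python
def reciprocal_rank_fusion(rankings, k=60, top_n=None):
    """Fuse several ranked lists of doc IDs into (doc_id, score) pairs.

    Each list contributes 1 / (k + rank) for every document it contains,
    with rank starting at 1 for the best hit.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    fused = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return fused if top_n is None else fused[:top_n]


vector_ranking = ["doc_a", "doc_b", "doc_c"]   # semantic hits, best first
keyword_ranking = ["doc_c", "doc_a", "doc_d"]  # BM25 hits, best first

fused = reciprocal_rank_fusion([vector_ranking, keyword_ranking])
print([doc_id for doc_id, _ in fused])
# → ['doc_a', 'doc_c', 'doc_b', 'doc_d']
```

Documents that appear in both lists (`doc_a`, `doc_c`) rise above documents found by only one retriever. RRF works on ranks only, so the vector and BM25 scores never need to be put on a common scale.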

Pattern 3 — Mandatory re-ranking

Hybrid search retrieves the top 20-50 candidates. A re-ranker (cross-encoder) then scores each candidate together with the query and re-sorts them by relevance.

```python
import cohere

co = cohere.Client(api_key=...)  # supply your own API key

def rerank(query, candidates, top_n=5):
    # Send the candidate texts to the re-ranker
    docs = [c.text for c in candidates]
    response = co.rerank(
        model='rerank-english-v3.0',  # or 'rerank-multilingual-v3.0'
        query=query,
        documents=docs,
        top_n=top_n
    )
    # Each result points back to a candidate by its index in `docs`
    return [candidates[r.index] for r in response.results]
```
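To exercise the retrieve-then-re-rank flow in tests without an API key, the scoring function can be made pluggable. The sketch below assumes candidates are dicts with a `text` field. `rerank_with` and `term_overlap` are hypothetical names, and the term-overlap scorer is only a toy stand-in for a real cross-encoder.

```python
def rerank_with(score_fn, query, candidates, top_n=5):
    """Re-sort candidates by score_fn(query, text), highest first, keep top_n."""
    scored = [(score_fn(query, c["text"]), c) for c in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [candidate for _, candidate in scored[:top_n]]


def term_overlap(query, text):
    """Toy scorer: fraction of query terms that also appear in the text."""
    query_terms = set(query.lower().split())
    text_terms = set(text.lower().split())
    if not query_terms:
        return 0.0
    return len(query_terms & text_terms) / len(query_terms)


candidates = [
    {"text": "shipping times by region"},
    {"text": "how to request a refund"},
    {"text": "refund policy for orders"},
]
top = rerank_with(term_overlap, "refund policy", candidates, top_n=2)
print([c["text"] for c in top])
# → ['refund policy for orders', 'how to request a refund']
```

In production, `score_fn` would wrap a cross-encoder or a managed re-rank API. The surrounding code stays the same.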

Popular 2026 re-rankers:

  • Cohere Rerank: managed, multilingual, fast
  • Voyage rerank-2: excellent quality
  • BGE Reranker: open source, self-host
  • OpenAI text-embedding-3-large: an embedding model, not a dedicated re-ranker; usable for re-scoring candidates by embedding similarity, but slower

Re-ranking costs roughly $1-3 per 1,000 queries with Cohere. The ROI on answer quality is massive.

Pattern 4 — Multi-step RAG (Decompose-then-Recombine)

For complex queries, single-shot RAG fails. Decompose:

```python
def multi_step_rag(query):
    # llm_decompose, rag_retrieve and llm_synthesize are your own LLM / retrieval helpers

    # Step 1: Decompose into sub-questions
    subquestions = llm_decompose(query)
    # Ex: "Compare Stripe vs PayPal for Africa SaaS"
    # → ["What is Stripe pricing?", "What is PayPal pricing?",
    #    "Stripe Africa coverage?", "PayPal Africa coverage?"]

    # Step 2: Independent RAG for each
    contexts = []
    for sq in subquestions:
        ctx = rag_retrieve(sq)
        contexts.append((sq, ctx))

    # Step 3: Final synthesis with all contexts
    response = llm_synthesize(query, contexts)
    return response
```

This increases latency (3-5x), but the quality gain on comparisons and analyses is spectacular.
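Because the sub-question retrievals are independent of each other, part of that latency can be recovered by running them concurrently. Here is a sketch using the standard library, where `retrieve` is any callable you pass in (for example a retrieval function like the one in step 2). `retrieve_all` and `max_workers=4` are illustrative choices.

```python
from concurrent.futures import ThreadPoolExecutor


def retrieve_all(subquestions, retrieve, max_workers=4):
    """Run retrieve(sub_question) for every sub-question in parallel threads.

    Returns (sub_question, context) pairs in the same order as the input.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order even if calls finish out of order
        contexts = list(pool.map(retrieve, subquestions))
    return list(zip(subquestions, contexts))


# Stand-in retriever for demonstration only
def fake_retrieve(question):
    return [f"context for: {question}"]


pairs = retrieve_all(["Stripe pricing?", "PayPal pricing?"], fake_retrieve)
print(pairs[0])
# → ('Stripe pricing?', ['context for: Stripe pricing?'])
```

Threads suit this step because retrieval is mostly network-bound (embedding API, vector DB, re-ranker). The decomposition and synthesis LLM calls remain sequential.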

Pattern 5 — Citations + source attribution

No production RAG without citations. Users trust an answer only when they can verify it.

```python
def generate_with_citations(query, contexts):
    # format_sources, llm and parse_citations are your own helpers / LLM client
    prompt = f"""
Answer the question using only the sources.
For each statement, cite the source [1], [2], etc.

Sources:
{format_sources(contexts)}

Question: {query}

Answer (with citations):
"""
    response = llm.complete(prompt)
    return parse_citations(response, contexts)
```

The UI must render each citation as a clickable link to its source.
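The two helpers named in the prompt code, `format_sources` and `parse_citations`, are left undefined in this article. One possible shape for them is sketched below, assuming each context is a dict with `title` and `text` fields and that the model's answer is available as a plain string. Both assumptions are mine, not part of the original code.

```python
import re


def format_sources(contexts):
    """Number each source so the model can cite it as [1], [2], ..."""
    return "\n".join(
        f"[{i}] {ctx['title']}: {ctx['text']}"
        for i, ctx in enumerate(contexts, start=1)
    )


def parse_citations(answer, contexts):
    """Map [n] markers in the answer back to sources.

    Markers that point to no source are reported separately, since they
    usually signal a hallucinated citation.
    """
    cited, invalid = [], []
    for marker in re.findall(r"\[(\d+)\]", answer):
        n = int(marker)
        bucket = cited if 1 <= n <= len(contexts) else invalid
        if n not in bucket:
            bucket.append(n)
    return {
        "answer": answer,
        "citations": [{"marker": n, "source": contexts[n - 1]} for n in cited],
        "invalid_markers": invalid,
    }


contexts = [
    {"title": "Pricing page", "text": "Plan A costs $10 per month."},
    {"title": "FAQ", "text": "Plan A includes email support."},
]
result = parse_citations("Plan A costs $10 [1] with support [2]. It is free [5].", contexts)
print([c["marker"] for c in result["citations"]], result["invalid_markers"])
# → [1, 2] [5]
```

The `citations` list gives the UI what it needs to link each marker to its source, and a non-empty `invalid_markers` list is a cheap signal to log or to trigger a retry.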

Pattern 6 — Smart caching

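```python
import hashlib

class RAGCache:
    # embed and cosine are your own embedding function and similarity helper
    def __init__(self, threshold=0.95):
        self.exact = {}      # cache_key -> response
        self.semantic = []   # (query_embedding, response) pairs
        self.threshold = threshold

    def cache_key(self, query):
        # Normalize: lowercase, strip, hash
        return hashlib.md5(query.lower().strip().encode()).hexdigest()

    def get(self, query):
        # Exact cache: same normalized query, no embedding call needed
        key = self.cache_key(query)
        if key in self.exact:
            return self.exact[key]
        # Semantic cache: embeddings similar > 0.95 → cache hit
        query_emb = embed(query)
        for cached_emb, response in self.semantic:
            if cosine(query_emb, cached_emb) > self.threshold:
                return response
        return None

    def put(self, query, response):
        # Store the response under both the exact key and the embedding
        self.exact[self.cache_key(query)] = response
        self.semantic.append((embed(query), response))
```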

This can save 30-50% of API calls in production, depending on how often users repeat or rephrase the same questions.
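The semantic lookup relies on a `cosine` helper that the cache code does not define. Here is a minimal pure-Python sketch, assuming embeddings are plain lists of floats of equal length.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors, in [-1, 1]."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    # A zero vector has no direction; treat it as similar to nothing
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


print(cosine([1.0, 0.0], [0.0, 1.0]))
# → 0.0
```

Scanning every cached embedding is linear in the cache size. Once the cache grows large, storing the cached query embeddings in the vector DB and doing a nearest-neighbour lookup is the more scalable option.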

2026 vector DBs compared

| Vector DB | Best for | Pricing |
|---|---|---|
| Pinecone | Managed, enterprise | $70+/mo starter |
| Qdrant | Self-host, OSS | Free + cloud $25+ |
| Weaviate | Hybrid (vector + filter) | OSS + cloud |
| Postgres + pgvector | Combine relational | $0 (already DB) |
| MongoDB Atlas Vector | Already MongoDB | $0 (already DB) |
| Chroma | Dev/prototype | OSS |
| Milvus / Zilliz | High scale | Cloud $50+ |

For a startup already on MongoDB, Atlas Vector Search is a no-brainer.

2026 embedding leaders

  • OpenAI text-embedding-3-large: excellent quality, $0.13/1M tokens
  • OpenAI text-embedding-3-small: good quality/price ratio, $0.02/1M tokens
  • Voyage voyage-3: very good quality, multilingual
  • Cohere embed-multilingual-v3.0: best for multilingual content
  • BGE-M3: open source, self-host friendly

Common RAG mistakes

  • Chunks too small (<200 tokens): context is lost.
  • Chunks too big (>2000 tokens): the embedding is diluted across too many topics.
  • No metadata filtering: every query searches the whole corpus.
  • No re-ranking: degraded quality.
  • No citations: hallucinations stay hidden.
  • No continuous evaluation: quality drifts silently.
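Metadata filtering is the easiest of these to illustrate. The sketch below narrows the candidate set before any similarity scoring, assuming chunks are dicts with a `metadata` field. The `product` and `lang` keys and the sample chunks are made up for the example.

```python
def filter_by_metadata(chunks, **required):
    """Keep only chunks whose metadata matches every required key/value pair."""
    return [
        chunk for chunk in chunks
        if all(chunk["metadata"].get(key) == value for key, value in required.items())
    ]


chunks = [
    {"text": "Refunds take 5 days.", "metadata": {"product": "billing", "lang": "en"}},
    {"text": "Les remboursements prennent 5 jours.", "metadata": {"product": "billing", "lang": "fr"}},
    {"text": "Reset your password in settings.", "metadata": {"product": "auth", "lang": "en"}},
]

candidates = filter_by_metadata(chunks, product="billing", lang="en")
print([c["text"] for c in candidates])
# → ['Refunds take 5 days.']
```

In a real deployment the filter would normally be passed to the vector DB as part of the query rather than applied in application code, so that the index only scans matching chunks.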

Production RAG eval

```python
# RAG metrics (pseudocode: the counts come from a labeled evaluation set)
retrieval_recall = relevant_docs_in_top_k / total_relevant_docs
retrieval_precision = relevant_docs_in_top_k / k
answer_faithfulness = supported_claims / total_claims
answer_relevance = LLM_judge(query, answer)            # score from an LLM acting as judge
answer_correctness = LLM_judge(answer, ground_truth)   # score from an LLM acting as judge
```

Tools: Ragas, LangSmith, Arize Phoenix.
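The two retrieval metrics need no LLM and can be computed directly from a labeled set. Here is a small sketch, assuming each evaluation query comes with the ranked IDs the retriever returned and the set of IDs a human marked as relevant (the function name and sample IDs are illustrative).

```python
def retrieval_metrics(retrieved_ids, relevant_ids, k):
    """Recall@k and precision@k for one query.

    retrieved_ids: ranked doc IDs returned by the retriever, best first.
    relevant_ids: doc IDs labeled as relevant for this query.
    """
    top_k = retrieved_ids[:k]
    hits = len(set(top_k) & set(relevant_ids))
    recall = hits / len(relevant_ids) if relevant_ids else 0.0
    precision = hits / k if k else 0.0
    return {"recall": recall, "precision": precision}


metrics = retrieval_metrics(
    retrieved_ids=["d1", "d7", "d3", "d9"],
    relevant_ids={"d1", "d3", "d4", "d5"},
    k=4,
)
print(metrics)
# → {'recall': 0.5, 'precision': 0.5}
```

Averaging these per-query values over the evaluation set, and tracking them after every change to chunking, embeddings, or re-ranking, is what catches silent drift.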

Estimated production RAG cost

| Component | Monthly cost |
|---|---|
| Embedding 1M docs | $20 (OpenAI small) |
| Vector DB | $50-300 |
| Re-ranker | $50-200 (per volume) |
| LLM calls (10K req/day) | $300-2000 |
| Average SaaS RAG total | $500-2500/month |
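The $20 embedding line is consistent with text-embedding-3-small at $0.02 per 1M tokens if each document averages about 1,000 tokens. That average is an assumption made here to reproduce the figure, not a number stated in the table. The arithmetic:

```python
def embedding_cost_usd(num_docs, avg_tokens_per_doc, price_per_million_tokens):
    """One-pass embedding cost: total tokens times the per-million-token price."""
    total_tokens = num_docs * avg_tokens_per_doc
    return total_tokens / 1_000_000 * price_per_million_tokens


# 1M docs, ~1,000 tokens each (assumed), text-embedding-3-small at $0.02 / 1M tokens
cost = embedding_cost_usd(1_000_000, 1_000, 0.02)
print(f"${cost:.2f}")
# → $20.00
```

The same function with $0.13 per 1M tokens (text-embedding-3-large) gives about $130 for the same corpus. Embedding is paid again only for new or changed documents, so after the initial load this line usually shrinks.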

FAQ

Q: RAG vs fine-tuning?

A: RAG for dynamic knowledge. Fine-tuning for specific style/format/domain. Often both combined.

Q: Open source RAG framework?

A: LlamaIndex or LangChain to get started. In production, custom code is often better.

Q: How many chunks to retrieve?

A: Retrieve 20-50 candidates with hybrid search, re-rank them, and keep the top 5-10 in the final context. More than that adds noise.

Conclusion

Production RAG in 2026 is not MVP RAG. The patterns: smart chunking, hybrid search, re-ranking, citations, continuous eval. Expect $500-2500/month for an average SaaS. Investing in retrieval quality delivers a massive ROI compared with relying on the LLM alone.

Tags: #RAG #LLM #AI #Architecture #Vector DB #Embeddings