Data recruitment guide

RAG technical test: architecture, embeddings, vector databases

RAG is the dominant pattern for deploying LLMs on internal data. Interviews assess a candidate's ability to choose the right infrastructure and to optimize retrieval quality.

Data Builder · June 2025 · 7 min read · Data Scientist · Data Engineer

Contents

RAG vs fine-tuning
End-to-end pipeline
Embeddings and chunking
Vector databases
Retrieval optimization
Choosing the LLM
Level grid

RAG (Retrieval-Augmented Generation) connects an LLM to internal data without retraining it. In senior-level interviews, the aim is to tell candidates who have run RAG in production apart from those who have only built a POC.

1. RAG vs fine-tuning: the discriminating question

Discriminating question

What is the difference between RAG and fine-tuning? In which case do you choose one over the other?

RAG: enriches the prompt with relevant documents at query time. The model is not trained; it only receives context
Fine-tuning: retrains the model on specific data to change its behavior. More costly and less flexible
When to use RAG: frequently changing data, need for traceability, confidentiality constraints
When to use fine-tuning: very specific style, highly specialized domain, on-device model with no network access

Fundamental reminder: A RAG DOES NOT LEARN. It searches for context to enrich the prompt. Confusing RAG and fine-tuning is an immediate NO-GO.
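
To make the point concrete, here is a minimal sketch of what "enriching the prompt" can look like. The function name `build_rag_prompt`, the chunk fields (`source`, `text`) and the sample chunks are assumptions for illustration, not part of any library. Nothing in it touches model weights: the retrieved text is simply placed in the prompt.

```python
def build_rag_prompt(question: str, chunks: list[dict]) -> str:
    """Assemble a prompt from retrieved chunks; no model weights are touched."""
    # Number each chunk so the answer can cite its sources as [1], [2], ...
    context = "\n".join(
        f"[{i}] ({chunk['source']}) {chunk['text']}"
        for i, chunk in enumerate(chunks, start=1)
    )
    return (
        "Answer using only the context below and cite sources by number.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )


# Hypothetical retrieved chunks, hard-coded for illustration
retrieved = [
    {"source": "hr_policy.pdf", "text": "Employees accrue 2.5 days of leave per month."},
    {"source": "faq.html", "text": "Leave requests go through the HR portal."},
]
prompt = build_rag_prompt("How many leave days do I get per month?", retrieved)
print(prompt)
```

Updating the knowledge base only changes which chunks are retrieved, which is why RAG suits frequently changing data.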

2. End-to-end RAG pipeline

Discriminating question

Describe the steps of a RAG pipeline from the document source to the final response.

Offline pipeline (ingestion):
1. Document loading (PDF, Word, HTML, SQL...)
2. Chunking: splitting into pieces of 256-512 tokens
3. Embedding: conversion of each chunk into a numerical vector
4. Indexing in the vector database

Online pipeline (real time):
1. User question
2. Embedding of the question, with the same model used at ingestion
3. Search for the k closest chunks
4. Prompt construction: question + chunks
5. LLM generation
6. Source citation
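
The two phases above can be sketched end to end in plain Python. This is a toy under stated assumptions: `embed` is a bag-of-words stand-in for a real embedding model, the "vector database" is an in-memory list, each document is a single chunk, and the LLM call is left out. The document names and texts are invented for the example.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: bag-of-words term counts (stand-in for a real embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


# Offline: load, chunk (here one chunk per document), embed, index
documents = {
    "leave.md": "Employees accrue paid leave every month.",
    "expenses.md": "Expense reports are reimbursed within 30 days.",
    "security.md": "Laptops must use full disk encryption.",
}
index = [(source, text, embed(text)) for source, text in documents.items()]


def retrieve(question: str, k: int = 2) -> list[tuple[str, str]]:
    """Online steps 2-3: embed the question and return the k closest chunks."""
    q = embed(question)
    ranked = sorted(index, key=lambda item: cosine(q, item[2]), reverse=True)
    return [(source, text) for source, text, _ in ranked[:k]]


def build_prompt(question: str) -> str:
    """Online step 4: question + chunks, with the source kept for citation."""
    hits = retrieve(question)
    context = "\n".join(f"[{source}] {text}" for source, text in hits)
    return f"Context:\n{context}\n\nQuestion: {question}"  # step 5 would send this to the LLM


top = retrieve("When are expense reports reimbursed?", k=1)
print(top)
```

In a real deployment only `embed` and `index` change (an embedding model and a vector database); the shape of the flow stays the same.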

3. Embeddings and chunking

Discriminating question

What is the optimal chunk size? How do you handle overlap?

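from langchain_text_splitters import RecursiveCharacterTextSplitter

# The plain constructor counts chunk_size in characters.
# from_tiktoken_encoder counts it in tokens (requires the tiktoken package).
splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    encoding_name="cl100k_base",
    chunk_size=512,
    chunk_overlap=50,  # roughly 10% overlap
    separators=["\n\n", "\n", ".", " "],  # try paragraphs first, then lines, sentences, words
)
chunks = splitter.split_documents(docs)  # docs: list of LangChain Document objects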

256-512 tokens: a common starting range. Too small loses context; too large dilutes relevance
Chunk overlap of 10-15%: avoids splitting an idea across two chunks
Embedding models: text-embedding-3-small (OpenAI), BGE-M3 (open source, multilingual), Cohere Embed v3
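
To show what overlap does mechanically, here is a small sliding-window sketch. It is a simplification: `chunk_tokens` is a hypothetical helper that works on an already tokenized list and ignores separators, unlike the recursive splitter above. With 512-token chunks, a 10-15% overlap corresponds to roughly 51-77 tokens; the example uses tiny numbers so the windows are easy to read.

```python
def chunk_tokens(tokens: list[str], chunk_size: int, overlap: int) -> list[list[str]]:
    """Split tokens into windows of chunk_size that share `overlap` tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reaches the end of the text
    return chunks


tokens = [f"t{i}" for i in range(20)]
chunks = chunk_tokens(tokens, chunk_size=8, overlap=2)
for chunk in chunks:
    print(chunk[0], "...", chunk[-1], f"({len(chunk)} tokens)")
```

The last two tokens of each window reappear at the start of the next one, so a sentence that straddles a boundary is still present in full in at least one chunk, provided it is shorter than the overlap.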

4. Vector databases: knowing how to choose

Discriminating question

Which vector database do you choose for a RAG in production on GCP? And for a quick POC?

Solution	Type	Ideal for
Vertex AI Vector Search	GCP Managed	High scalability GCP production
Pinecone	Third-party managed	Quick POC, SaaS
Qdrant	Open source self-hosted	Sensitive projects, full control
ChromaDB	Open source local	Local dev, POC
pgvector	PostgreSQL extension	SQL teams, existing infrastructure

HNSW: high recall, high memory use. Default index in Qdrant and Weaviate
IVF-PQ: more scalable on very large volumes, with slightly reduced recall
Cosine similarity: standard metric for text embeddings
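
For reference, cosine similarity and the exact (brute-force) search that HNSW and IVF-PQ approximate can be sketched as follows. The vectors and names are made up for the example, and `exact_top_k` is a hypothetical helper, not a database API.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between a and b; the vectors' magnitudes cancel out."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def exact_top_k(query: list[float], vectors: dict[str, list[float]], k: int):
    """Brute-force search: score every stored vector, keep the k best."""
    scored = [(cosine_similarity(query, v), name) for name, v in vectors.items()]
    return sorted(scored, reverse=True)[:k]


vectors = {
    "doc_a": [1.0, 0.0],
    "doc_b": [0.6, 0.8],
    "doc_c": [0.0, 1.0],
}
# The query has twice the length of doc_a but the same direction
results = exact_top_k([2.0, 0.0], vectors, k=2)
print(results)
```

Brute force compares the query with every vector, which is fine for a POC but grows linearly with the collection. Approximate indexes such as HNSW or IVF-PQ trade a little recall to avoid that full scan.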

5. Retrieval optimization

Discriminating question

How do you improve retrieval quality when results are not relevant enough?

Hybrid search: combine vector search (semantic) with BM25 (lexical)
Re-ranking: after retrieval, rescore the candidates with a cross-encoder model (Cohere Rerank, BGE Reranker)
Query reformulation: rewrite the question into several queries
Metadata filtering: filter by source, date, or author before the vector search
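
One common way to merge the semantic and lexical result lists in hybrid search is Reciprocal Rank Fusion (RRF). The sketch below is a plain-Python illustration, assuming each retriever returns an ordered list of document ids. The ids are invented, and `k=60` is the constant commonly used with RRF rather than a tuned value.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse ranked lists: each document scores sum(1 / (k + rank)) over the lists."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


vector_hits = ["doc_3", "doc_1", "doc_7"]  # hypothetical semantic results, best first
bm25_hits = ["doc_1", "doc_9", "doc_3"]    # hypothetical lexical results, best first
fused = reciprocal_rank_fusion([vector_hits, bm25_hits])
print([doc_id for doc_id, _ in fused])
```

RRF only uses ranks, so it avoids having to normalize BM25 scores against cosine similarities. Documents found by both retrievers rise to the top, and the fused list is then a natural input for a re-ranker.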

6. Choosing the generation LLM

LLM	Advantages	Use cases
GPT-4o	High quality, 128k context, multimodal	General production
Claude 3.5	200k context, excellent for long documents	Document analysis
Gemini 1.5 Pro	1M token context	GCP stack, very long documents
Llama 3.1 / Mistral	Open source, self-hosted	Confidential data, on-premise

from qdrant_client import QdrantClient
from qdrant_client.http.models import (
    Distance, Fusion, FusionQuery, Prefetch, SparseVectorParams, VectorParams,
)
import cohere

# Assumed to be defined upstream: dense_vector, sparse_vector, user_question, dataset

# Qdrant in production + hybrid search: one dense and one sparse named vector
client = QdrantClient("http://qdrant:6333")
client.create_collection(
    "knowledge_base",
    # 1536 matches the output dimension of text-embedding-3-small
    vectors_config={"dense": VectorParams(size=1536, distance=Distance.COSINE)},
    sparse_vectors_config={"sparse": SparseVectorParams()},  # for hybrid search
)

# Hybrid search: vector (semantic) + BM25 (lexical)
results = client.query_points(
    collection_name="knowledge_base",
    prefetch=[
        Prefetch(query=dense_vector, using="dense", limit=20),
        Prefetch(query=sparse_vector, using="sparse", limit=20),
    ],
    query=FusionQuery(fusion=Fusion.RRF),  # Reciprocal Rank Fusion
    limit=10,
)

# Cohere re-ranking after retrieval
co = cohere.Client()  # API key expected in the environment
reranked = co.rerank(
    model="rerank-multilingual-v3.0",
    query=user_question,
    documents=[r.payload["text"] for r in results.points],
    top_n=3,
)

# Evaluation with RAGAS
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_recall
score = evaluate(dataset, metrics=[faithfulness, answer_relevancy, context_recall])

Structure-based chunking: for structured documents (PDF, markdown), chunk by heading rather than by fixed size. Better semantic coherence
Parent-child chunking: index small chunks for retrieval precision, but return the parent chunk (broader context) to the LLM
Metadata filtering: filter by date, source, or author before the vector search. Reduces noise and improves precision without changing the model
RAGAS evaluation: faithfulness (is the response grounded in the context?), answer_relevancy (does it answer the question?), context_recall (was the relevant context retrieved?). Set this up before going to production
Observability: trace each request (query, chunks, response, score) with Langfuse or LangSmith to detect knowledge gaps and systematic hallucinations
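
The first two techniques combine naturally, as the following sketch suggests. It is a toy under stated assumptions: parents are markdown sections split on headings, children are single sentences, and relevance is a shared-word count standing in for embedding similarity. All function names and the sample document are hypothetical.

```python
import re


def split_by_heading(markdown: str) -> dict[str, str]:
    """Structure-based chunking: one parent chunk per markdown heading."""
    parents = {}
    # Split before each line that starts with '#', keeping the heading with its body
    for section in re.split(r"(?m)^(?=#)", markdown):
        if section.strip():
            title = section.splitlines()[0].lstrip("# ").strip()
            parents[title] = section.strip()
    return parents


def build_children(parents: dict[str, str]) -> list[tuple[str, str]]:
    """Child chunks: individual sentences, each tagged with its parent title."""
    children = []
    for title, body in parents.items():
        text = body.split("\n", 1)[1] if "\n" in body else ""
        for sentence in re.split(r"(?<=[.!?])\s+", text.strip()):
            if sentence:
                children.append((title, sentence))
    return children


def retrieve_parents(question: str, children, parents, k: int = 1) -> list[str]:
    """Match on small child chunks, then return the deduplicated parent sections."""
    q_terms = set(re.findall(r"\w+", question.lower()))
    # Toy relevance: number of shared words (a real system would compare embeddings)
    scored = sorted(
        children,
        key=lambda child: len(q_terms & set(re.findall(r"\w+", child[1].lower()))),
        reverse=True,
    )
    titles = []
    for title, _ in scored:
        if title not in titles:
            titles.append(title)
    return [parents[t] for t in titles[:k]]


doc = """# Leave policy
Employees accrue paid leave monthly. Requests go through the HR portal.

# Expenses
Reports are reimbursed within 30 days. Receipts are mandatory.
"""
parents = split_by_heading(doc)
children = build_children(parents)
context = retrieve_parents("Are receipts mandatory for expenses?", children, parents)
print(context[0])
```

The question matches the short sentence about receipts, but the LLM receives the whole "Expenses" section, including the reimbursement delay it would otherwise miss.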

7. Level grid

Level	Mastery	GO signal	NO-GO
Junior	Understands RAG, built a LangChain POC	Explains offline/online pipeline, has used ChromaDB	Confuses RAG and fine-tuning
Mid-level	Chunking, embedding selection, production vector databases	Justifies chunk size, has deployed on Qdrant or pgvector	Has only used local ChromaDB
Senior	Hybrid search, re-ranking, RAGAS evaluation	Has implemented re-ranking, measures faithfulness with RAGAS	Has never evaluated the quality of their RAG
