"RAG Core Study (16/26) — Experiment Automation with LangSmith, Phoenix, and MLflow"
If you cannot reproduce why retrieval improved last week, then you do not yet have an evaluation system. You only have impressions.
RAG pipelines change constantly: chunk size, retriever choice, fusion weight, reranker, prompt, context budget, and filtering rules. Without experiment automation, those changes accumulate into anecdotal confusion. Part 16 explains how to treat RAG improvement as a real experimental workflow: version the inputs, trace the pipeline, log the metrics, and preserve failure cases.
0. Prerequisites
- Part 14 evaluation sets
- Part 15 retrieval metrics
- Part 13 rerankers and multi-stage retrieval
1. Learning Objectives
- Explain why RAG needs experiment automation instead of ad hoc testing.
- Distinguish trace logging, metric logging, and artifact logging.
- Understand where LangSmith, Phoenix, and MLflow each fit.
- Design a minimal experiment record that can actually be replayed later.
2. Core Summary
Experiment automation means each run records at least: dataset version, corpus/index version, retrieval settings, prompt/model settings, trace samples, and evaluation results. LangSmith is strong for LLM application traces and experiment comparisons. Phoenix is strong for observability and evaluation inspection. MLflow is strong as a general-purpose experiment registry. The most important decision is not the tool. It is the recording unit. A single failed query should remain inspectable later with its retrieved documents and final answer.
3. Intuition — Why “It Was Better Last Week” Is Not an Explanation
Suppose you changed:
- chunk size from 400 to 600
- Dense + BM25 weighting from 0.5 to 0.7
- reranker model from bge-reranker-v2-m3 to jina-reranker-v2
- prompt wording for citation style
If quality rises, what caused it? Without run tracking, you cannot tell whether the gain came from retrieval, reranking, or generation formatting.
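A minimal sketch of what run tracking buys you here: if each run stores its settings as a plain dictionary, a small diff shows exactly how many things changed at once. The key names (`fusion_weight`, `prompt_version`) and the prompt labels are hypothetical, chosen only to mirror the four changes listed above.

```python
def diff_configs(old: dict, new: dict) -> dict:
    """Return {key: (old_value, new_value)} for every setting that differs."""
    keys = sorted(set(old) | set(new))
    return {k: (old.get(k), new.get(k)) for k in keys if old.get(k) != new.get(k)}


# Hypothetical config snapshots for two runs (key names are assumptions).
last_week = {
    "chunk_size": 400,
    "fusion_weight": 0.5,
    "reranker": "bge-reranker-v2-m3",
    "prompt_version": "cite-v1",
    "top_k": 20,
}
this_week = {
    "chunk_size": 600,
    "fusion_weight": 0.7,
    "reranker": "jina-reranker-v2",
    "prompt_version": "cite-v2",
    "top_k": 20,
}

changed = diff_configs(last_week, this_week)
for key, (before, after) in changed.items():
    print(f"{key}: {before} -> {after}")
```

Four simultaneous changes mean a single before/after score cannot attribute the gain to any one of them. The diff at least makes that visible.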
4. Definitions — The Core Objects of Experiment Tracking
| Item | Meaning |
|---|---|
| Run | One recorded experiment execution |
| Trace | Step-by-step record from query to retrieval to answer |
| Artifact | Persisted outputs such as reports, sample answers, configuration snapshots |
| Dataset Version | Frozen version of the evaluation question set |
| Prompt Version | Versioned prompt template |
| Index Version | Identifiable version of the document index used in retrieval |
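One way to make "Dataset Version" concrete is a content fingerprint: hash the evaluation set so that any edit produces a new version string. This is a sketch under assumptions, not a standard. The example questions, the `gold_doc_id` field, and the 12-character truncation are all illustrative.

```python
import hashlib
import json


def dataset_fingerprint(examples: list) -> str:
    """Short content hash of an evaluation set, independent of dict key order."""
    canonical = json.dumps(examples, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


# Hypothetical evaluation examples.
eval_set = [
    {"question": "What is the refund window?", "gold_doc_id": "policy-7"},
    {"question": "Who approves travel expenses?", "gold_doc_id": "hr-12"},
]
version = dataset_fingerprint(eval_set)

# Adding a question changes the fingerprint, so the two sets cannot be confused.
edited = eval_set + [{"question": "How do I reset my badge?", "gold_doc_id": "faq-3"}]
print(version, dataset_fingerprint(edited))
```

Note that the order of examples still affects the hash in this sketch. Sort the list first if reordering should not count as a new version.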
5. Structure — What Must Be Logged Together
At minimum, each run should preserve:
- eval dataset version
- corpus or index version
- retrieval settings
- generation settings
- selected trace examples
- numeric metric outputs
If any one of these is missing, later comparison becomes ambiguous.
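The checklist above can be sketched as a small record type that reports which parts are still empty. The class name, field names, and version strings below are hypothetical. The metric values are reused from the cheat sheet in this post.

```python
from dataclasses import asdict, dataclass, field
import json


@dataclass(frozen=True)
class ExperimentRecord:
    run_name: str
    dataset_version: str
    index_version: str
    retrieval: dict
    generation: dict
    metrics: dict
    trace_ids: list = field(default_factory=list)

    def missing_fields(self) -> list:
        """Names of fields that are empty and would make later comparison ambiguous."""
        return [name for name, value in asdict(self).items() if not value]


# Hypothetical run; version labels are illustrative only.
record = ExperimentRecord(
    run_name="hybrid_rerank_v3",
    dataset_version="eval-v2",
    index_version="index-v5",
    retrieval={"chunk_size": 600, "top_k": 20, "reranker": "bge-reranker-v2-m3"},
    generation={"prompt_version": "cite-v2"},
    metrics={"mrr": 0.71, "recall_at_10": 0.92},
)

print(json.dumps(asdict(record), indent=2))
print("missing:", record.missing_fields())
```

Here the record has every setting and metric but no trace examples, so `missing_fields()` flags `trace_ids` before the run is accepted as comparable.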
6. Walkthrough — A Minimal Experiment Record
6.1 Logging with MLflow
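import mlflow

# evaluate_pipeline and eval_dataset are assumed to be defined elsewhere;
# evaluate_pipeline should return a flat dict of metric name -> number.
with mlflow.start_run(run_name="hybrid_rerank_v3"):
    # Record the settings that define this run.
    mlflow.log_param("embedding_model", "bge-m3")
    mlflow.log_param("reranker", "bge-reranker-v2-m3")
    mlflow.log_param("chunk_size", 600)
    mlflow.log_param("top_k", 20)
    # Score the pipeline on the frozen evaluation set and log the results.
    metrics = evaluate_pipeline(eval_dataset)
    mlflow.log_metrics(metrics)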
6.2 Saving a trace
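# All values below come from one query's pass through the pipeline.
trace = {
    "question": question,
    "retrieved_ids": retrieved_ids,
    "reranked_ids": reranked_ids,
    "final_answer": answer,
    "faithfulness": score["faithfulness"],
}
# save_trace is a project-specific helper that persists the record.
save_trace(trace)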
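The post does not define `save_trace`. One plausible minimal version, assuming traces are kept as JSON Lines on local disk, appends one record per line so that a failed query stays inspectable later. The `path` parameter, the `load_traces` helper, and the sample values are assumptions for illustration.

```python
import json
import tempfile
from pathlib import Path


def save_trace(trace: dict, path: Path = Path("traces.jsonl")) -> None:
    """Append one trace as a single JSON line."""
    with path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(trace, ensure_ascii=False) + "\n")


def load_traces(path: Path) -> list:
    """Read every stored trace back as a list of dicts."""
    with path.open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Demo with a temporary file and made-up values.
trace_file = Path(tempfile.mkdtemp()) / "traces.jsonl"
save_trace(
    {
        "question": "What is the refund window?",
        "retrieved_ids": ["d4", "d1", "d9"],
        "reranked_ids": ["d1", "d4", "d9"],
        "final_answer": "Refunds are accepted within 30 days.",
        "faithfulness": 0.9,
    },
    trace_file,
)
loaded = load_traces(trace_file)
print(loaded[0]["reranked_ids"])
```

Append-only JSON Lines keeps each query independent, which makes it easy to filter or replay a single failure without loading a whole run.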
6.3 Why traces matter
Metrics tell you that something changed. Traces tell you where it changed:
- retrieval missed the correct document
- reranker pushed the correct document down
- answer ignored the right context
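The three failure sites above can be told apart mechanically once a trace is stored. The sketch below assumes each evaluation question has a known gold document id and that some judge has already marked the answer as correct or not. `gold_id`, `answer_correct`, and `context_k` are hypothetical names, not fields from the trace shown earlier.

```python
def locate_failure(trace: dict, gold_id: str, context_k: int = 3) -> str:
    """Name the first stage at which the gold document was lost."""
    if gold_id not in trace["retrieved_ids"]:
        return "retrieval_miss"
    # Only the top context_k reranked documents are assumed to reach the prompt.
    if gold_id not in trace["reranked_ids"][:context_k]:
        return "reranker_demotion"
    if not trace["answer_correct"]:
        return "answer_ignored_context"
    return "ok"


# Made-up traces, one per case.
traces = [
    {"retrieved_ids": ["d4", "d9"], "reranked_ids": ["d9", "d4"], "answer_correct": False},
    {
        "retrieved_ids": ["d1", "d4", "d9", "d2"],
        "reranked_ids": ["d4", "d9", "d2", "d1"],
        "answer_correct": False,
    },
    {"retrieved_ids": ["d1", "d4"], "reranked_ids": ["d1", "d4"], "answer_correct": False},
    {"retrieved_ids": ["d1", "d4"], "reranked_ids": ["d1", "d4"], "answer_correct": True},
]
stages = [locate_failure(t, gold_id="d1") for t in traces]
print(stages)
```

Counting these labels per run gives a rough answer to where a regression came from, which an aggregate score alone cannot provide.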
Self-explanation: Why are metrics without traces usually insufficient for debugging RAG regressions?
7. Variants and Use Cases
7.1 LangSmith — Best for LLM pipeline traces
What changes
Each query execution becomes inspectable as a multi-step chain.
Why it matters
RAG failures often come from interactions between several stages rather than one isolated model call.
What it enables
You can compare two prompt or retrieval configurations query by query.
Limit and next step
For broader MLOps-style experiment registries, MLflow may remain the simpler backbone.
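Independently of any tool, the query-by-query comparison described above reduces to lining up per-query scores from two runs. This is a plain-Python sketch, not LangSmith's API. The score dictionaries (reciprocal rank per query id) are made-up data.

```python
def compare_runs(baseline: dict, candidate: dict) -> dict:
    """Bucket query ids shared by both runs by how their score moved."""
    buckets = {"improved": [], "regressed": [], "unchanged": []}
    for qid in sorted(baseline.keys() & candidate.keys()):
        delta = candidate[qid] - baseline[qid]
        if delta > 0:
            buckets["improved"].append(qid)
        elif delta < 0:
            buckets["regressed"].append(qid)
        else:
            buckets["unchanged"].append(qid)
    return buckets


# Hypothetical per-query reciprocal ranks for two configurations.
baseline_rr = {"q1": 1.0, "q2": 0.5, "q3": 0.5}
candidate_rr = {"q1": 1.0, "q2": 1.0, "q3": 0.0, "q4": 0.5}

report = compare_runs(baseline_rr, candidate_rr)
print(report)
```

Queries that exist in only one run (`q4` here) are skipped, since a comparison across different question sets is exactly the ambiguity that dataset versioning is meant to prevent.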
7.2 Phoenix — Best for observability and eval inspection
Phoenix is especially useful when you want to inspect production-like traces and evaluation behaviour visually.
7.3 MLflow — Best as a neutral experiment registry
If your team already runs model pipelines in MLflow, it can hold RAG runs too with less organisational friction.
8. Limits and Failure Modes
8.1 Tooling without conventions produces clutter
If run names, dataset versions, and baseline tags are inconsistent, the dashboard becomes hard to trust.
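A convention only helps if it is checked. As a sketch, assume run names follow the pattern `<pipeline>_v<number>`, which matches the `hybrid_rerank_v3` example used earlier. The pattern itself is an assumption, not a rule from any of the tools.

```python
import re

# Assumed convention: lowercase words joined by underscores, then _v<integer>.
RUN_NAME = re.compile(r"(?P<pipeline>[a-z0-9]+(?:_[a-z0-9]+)*)_v(?P<version>\d+)")


def parse_run_name(name: str):
    """Return the parsed parts of a valid run name, or None if it breaks the convention."""
    match = RUN_NAME.fullmatch(name)
    if match is None:
        return None
    return {"pipeline": match["pipeline"], "version": int(match["version"])}


for candidate in ["hybrid_rerank_v3", "final-final2", "Hybrid_Rerank_v3"]:
    print(candidate, "->", parse_run_name(candidate))
```

Rejecting a badly named run at logging time is cheaper than cleaning up a dashboard afterwards.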
8.2 Over-logging creates cost and privacy problems
Logging every prompt, chunk, and answer indefinitely may be operationally expensive or legally risky.
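A small filter applied before a trace is stored addresses both concerns at once. The sketch below masks e-mail addresses and truncates long strings. The regular expression is deliberately simple and the 200-character limit is arbitrary. Real privacy requirements usually need more than this.

```python
import re

# Simplified e-mail pattern; an assumption, not a complete PII detector.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def redact_trace(trace: dict, max_chars: int = 200) -> dict:
    """Return a copy with e-mail addresses masked and long strings truncated."""
    cleaned = {}
    for key, value in trace.items():
        if isinstance(value, str):
            value = EMAIL.sub("[EMAIL]", value)
            if len(value) > max_chars:
                value = value[:max_chars] + "...[truncated]"
        cleaned[key] = value
    return cleaned


# Made-up trace containing an address and an overly long answer.
raw = {
    "question": "Contact jane.doe@example.com about the refund",
    "retrieved_ids": ["d1", "d4"],
    "final_answer": "x" * 500,
}
safe = redact_trace(raw)
print(safe["question"])
print(len(safe["final_answer"]))
```

The original trace is left untouched, so the unredacted version can still be used in memory during the run while only the cleaned copy is persisted.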
8.3 Metrics without narrative notes stay hard to learn from
Teams often need a short explanation of why a run was considered better, not only the scores themselves.
8.4 Next step — The pipeline should adapt to different query types
Once experiments are logged well, the next question becomes: which query types are still failing? That leads directly to Part 17.
8.5 Common Pitfalls
| # | Pitfall | Symptom | Fast Check |
|---|---|---|---|
| 1 | no dataset version | impossible comparisons | version eval sets explicitly |
| 2 | no traces | poor failure diagnosis | store sample traces per run |
| 3 | no baseline run | drift feels subjective | keep one stable reference run |
| 4 | inconsistent run naming | dashboard confusion | define a naming convention |
| 5 | no privacy filter | sensitive content leaks into logs | mask or redact where needed |
9. Self-check — Answer Before Looking
Q1. What is the real purpose of experiment automation in RAG?
Answer: To make changes reproducible and comparable over time.
Why: RAG pipelines change across many interacting stages, not one isolated parameter.
Q2. Why are traces essential in addition to metrics?
Answer: Because they show which stage caused the score change.
Why: Retrieval, reranking, and answer generation can fail differently.
Q3. What is one minimal rule every team should define early?
Answer: A versioning rule for datasets, runs, and baselines.
Why: Tooling without conventions quickly turns into clutter.
Q4. Why can over-logging be dangerous?
Answer: It can increase storage cost and expose sensitive content.
Why: RAG traces may contain internal documents, prompts, and answers.
Cheat Sheet — One-page Summary
Definitions
- Run: one experiment execution
- Trace: step-by-step pipeline log
- Artifact: saved output such as config snapshot or report
Minimal code
mlflow.log_param("chunk_size", 600)
mlflow.log_metrics({"mrr": 0.71, "recall_at_10": 0.92})
When to use what
| Situation | Best fit |
|---|---|
| query-by-query LLM tracing | LangSmith |
| observability and eval inspection | Phoenix |
| broad experiment registry | MLflow |
References
Official docs
- LangSmith docs
- Arize Phoenix docs
- MLflow tracking docs
Supporting notes
- User notes, chapter 13 experiment automation
Bridge to the Next Part
Once experiments are recorded properly, the next natural question is not whether the system fails, but which types of questions still fail. Part 17 turns to query classification.