"RAG Core Study (16/26) — Experiment Automation with LangSmith, Phoenix, and MLflow"

May 18, 2026

If you cannot reproduce why retrieval improved last week, then you do not yet have an evaluation system. You only have impressions.

RAG pipelines change constantly: chunk size, retriever choice, fusion weight, reranker, prompt, context budget, and filtering rules. Without experiment automation, those changes accumulate into anecdotal confusion. Part 16 explains how to treat RAG improvement as a real experimental workflow: version the inputs, trace the pipeline, log the metrics, and preserve failure cases.

0. Prerequisites

Part 14 evaluation sets
Part 15 retrieval metrics
Part 13 rerankers and multi-stage retrieval

1. Learning Objectives

Explain why RAG needs experiment automation instead of ad hoc testing.
Distinguish trace logging, metric logging, and artifact logging.
Understand where LangSmith, Phoenix, and MLflow each fit.
Design a minimal experiment record that can actually be replayed later.

2. Key Summary

Experiment automation means each run records at least: dataset version, corpus/index version, retrieval settings, prompt/model settings, trace samples, and evaluation results. LangSmith is strong for LLM application traces and experiment comparisons. Phoenix is strong for observability and evaluation inspection. MLflow is strong as a general-purpose experiment registry. The most important decision is not the tool. It is the recording unit. A single failed query should remain inspectable later with its retrieved documents and final answer.

3. Intuition — Why “It Was Better Last Week” Is Not an Explanation

Suppose you changed:

chunk size from 400 to 600
dense + BM25 fusion weight from 0.5 to 0.7
reranker model from bge-reranker-v2-m3 to jina-reranker-v2
prompt wording for citation style

If quality rises, what caused it? Without run tracking, you cannot tell whether the gain came from retrieval, reranking, or generation formatting.
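As a minimal sketch of why this matters, the snippet below diffs two run configurations and lists every setting that changed. The config keys and the prompt version labels are hypothetical; the values mirror the four changes listed above. When the diff contains more than one entry, a quality gain cannot be attributed to any single change.

```python
from typing import Any


def diff_configs(old: dict[str, Any], new: dict[str, Any]) -> dict[str, tuple[Any, Any]]:
    """Return {key: (old_value, new_value)} for every setting that differs."""
    missing = object()  # sentinel so a key absent on one side still counts as a change
    changed = {}
    for key in sorted(set(old) | set(new)):
        before, after = old.get(key, missing), new.get(key, missing)
        if before != after:
            changed[key] = (
                None if before is missing else before,
                None if after is missing else after,
            )
    return changed


# Hypothetical configs; key names and prompt labels are illustrative only.
last_week = {
    "chunk_size": 400,
    "fusion_weight": 0.5,
    "reranker": "bge-reranker-v2-m3",
    "prompt_version": "cite-v1",
}
this_week = {
    "chunk_size": 600,
    "fusion_weight": 0.7,
    "reranker": "jina-reranker-v2",
    "prompt_version": "cite-v2",
}

changes = diff_configs(last_week, this_week)
for name, (before, after) in changes.items():
    print(f"{name}: {before} -> {after}")
```

Four simultaneous changes means four candidate explanations. Changing one setting per run keeps the diff to a single line.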

4. Definitions — The Core Objects of Experiment Tracking

Item	Meaning
Run	One recorded experiment execution
Trace	Step-by-step record from query to retrieval to answer
Artifact	Persisted outputs such as reports, sample answers, configuration snapshots
Dataset Version	Frozen version of the evaluation question set
Prompt Version	Versioned prompt template
Index Version	Identifiable version of the document index used in retrieval
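One possible way to make "Dataset Version" concrete is to derive the identifier from the content itself, so that any edit to the evaluation set produces a new version. The sketch below assumes the eval set is a list of JSON-serialisable dicts; the field names (`id`, `question`, `gold_doc_ids`) and the `eval-` prefix are illustrative assumptions, not a convention from this series.

```python
import hashlib
import json


def dataset_version(examples: list[dict]) -> str:
    """Derive a short, stable version id from the content of an eval set."""
    # Canonical JSON (sorted keys, fixed separators) so that key order and
    # whitespace do not change the hash. The order of examples still matters.
    canonical = json.dumps(
        examples, sort_keys=True, ensure_ascii=False, separators=(",", ":")
    )
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    return f"eval-{digest[:12]}"


# Hypothetical evaluation examples.
eval_set = [
    {"id": "q1", "question": "What is the refund window?", "gold_doc_ids": ["doc-12"]},
    {"id": "q2", "question": "Who approves travel requests?", "gold_doc_ids": ["doc-40"]},
]
extra = {"id": "q3", "question": "How is overtime paid?", "gold_doc_ids": ["doc-7"]}

v1 = dataset_version(eval_set)
v2 = dataset_version(eval_set + [extra])
print(v1, v2)  # two different ids, because the content differs
```

The same idea could be applied to an index version by hashing the corpus manifest and chunking settings, although that is a design choice rather than something the tools enforce.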

5. Structure — What Must Be Logged Together

At minimum, each run should preserve:

eval dataset version
corpus or index version
retrieval settings
generation settings
selected trace examples
numeric metric outputs

If any one of these is missing, later comparison becomes ambiguous.
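The six items above can be sketched as a single record type. This is an illustrative structure, not a schema used by LangSmith, Phoenix, or MLflow; the class name, field names, and example version strings are assumptions. The `missing_fields` helper flags any empty item before the run is accepted.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class RunRecord:
    """One experiment run; each field maps to one item in the list above."""

    eval_dataset_version: str
    index_version: str
    retrieval_settings: dict
    generation_settings: dict
    trace_samples: list
    metrics: dict

    def missing_fields(self) -> list[str]:
        """Names of fields that are empty and would make later comparison ambiguous."""
        return [name for name, value in asdict(self).items() if not value]


# Hypothetical run with no trace samples attached.
record = RunRecord(
    eval_dataset_version="eval-v3",
    index_version="index-2026-05-11",
    retrieval_settings={"chunk_size": 600, "top_k": 20},
    generation_settings={"prompt_version": "cite-v2"},
    trace_samples=[],
    metrics={"mrr": 0.71},
)
print(record.missing_fields())  # ['trace_samples']
```

Rejecting a run whose `missing_fields()` is non-empty is one simple way to enforce the rule at logging time rather than discovering the gap weeks later.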

6. Walkthrough — A Minimal Experiment Record

6.1 Logging with MLflow

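import mlflow

# evaluate_pipeline and eval_dataset are assumed to be defined elsewhere in the project.
with mlflow.start_run(run_name="hybrid_rerank_v3"):
    # Settings that distinguish this run from others
    mlflow.log_param("embedding_model", "bge-m3")
    mlflow.log_param("reranker", "bge-reranker-v2-m3")
    mlflow.log_param("chunk_size", 600)
    mlflow.log_param("top_k", 20)

    # log_metrics expects a flat dict of metric name -> numeric value
    metrics = evaluate_pipeline(eval_dataset)
    mlflow.log_metrics(metrics)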
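The `evaluate_pipeline` call above is not defined in the text. As a rough sketch of what such a function might return, the version below computes MRR and recall@k over a toy eval set. The extra `retrieve` argument, the `gold_doc_ids` field, and the toy index are all assumptions for illustration; a real pipeline would plug in its own retriever and the metrics from Part 15.

```python
from typing import Callable


def evaluate_pipeline(
    eval_dataset: list[dict],
    retrieve: Callable[[str], list[str]],
    k: int = 10,
) -> dict[str, float]:
    """Compute MRR and recall@k; returns a flat dict of metric name -> float."""
    if not eval_dataset:
        raise ValueError("eval_dataset must not be empty")
    reciprocal_ranks, recalls = [], []
    for example in eval_dataset:
        gold = set(example["gold_doc_ids"])
        ranked = retrieve(example["question"])[:k]
        # Reciprocal rank of the first relevant document; 0.0 if none is in the top k.
        rr = next(
            (1.0 / rank for rank, doc_id in enumerate(ranked, start=1) if doc_id in gold),
            0.0,
        )
        reciprocal_ranks.append(rr)
        recalls.append(len(gold & set(ranked)) / len(gold))
    n = len(eval_dataset)
    return {"mrr": sum(reciprocal_ranks) / n, f"recall_at_{k}": sum(recalls) / n}


# Toy stand-in for a retriever: question -> ranked document ids (hypothetical data).
fake_index = {
    "q-a": ["d1", "d2", "d3"],
    "q-b": ["d9", "d4", "d5"],
}
eval_dataset = [
    {"question": "q-a", "gold_doc_ids": ["d1"]},
    {"question": "q-b", "gold_doc_ids": ["d4", "d8"]},
]
metrics = evaluate_pipeline(eval_dataset, retrieve=lambda q: fake_index[q], k=3)
print(metrics)  # {'mrr': 0.75, 'recall_at_3': 0.75}
```

Because the result is a flat dictionary of floats, it has the shape that a metric logger such as `mlflow.log_metrics` accepts.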

6.2 Saving a trace

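# question, retrieved_ids, reranked_ids, answer, and score come from one pipeline execution.
# save_trace is a project-specific helper, not a library call.
trace = {
    "question": question,
    "retrieved_ids": retrieved_ids,
    "reranked_ids": reranked_ids,
    "final_answer": answer,
    "faithfulness": score["faithfulness"],
}
save_trace(trace)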
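The text leaves `save_trace` undefined. One simple possibility, sketched here under the assumption that traces are stored locally as JSON Lines, is to append each trace to a per-run file. The extra `path` argument, the `saved_at` field, the `load_traces` helper, and the directory layout are illustrative choices, not part of any of the three tools.

```python
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def save_trace(trace: dict, path: Path) -> None:
    """Append one trace as a single JSON line, stamped with the save time (UTC)."""
    record = {"saved_at": datetime.now(timezone.utc).isoformat(), **trace}
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")


def load_traces(path: Path) -> list[dict]:
    """Read back every trace stored in a JSON Lines file."""
    with path.open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Hypothetical trace written to a temporary directory for demonstration.
trace_path = Path(tempfile.mkdtemp()) / "hybrid_rerank_v3" / "traces.jsonl"
save_trace(
    {
        "question": "What is the refund window?",
        "retrieved_ids": ["doc-12", "doc-3"],
        "reranked_ids": ["doc-12", "doc-3"],
        "final_answer": "30 days.",
        "faithfulness": 0.9,
    },
    trace_path,
)
loaded = load_traces(trace_path)
print(loaded[0]["question"])
```

An append-only file per run keeps a failed query inspectable later without requiring a tracing backend, at the cost of having no query UI.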

6.3 Why traces matter

Metrics tell you that something changed. Traces tell you where it changed:

retrieval missed the correct document
reranker pushed the correct document down
answer ignored the right context
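These three failure locations can be approximated mechanically when a trace stores document ids and a gold document is known. The sketch below is a simplification: the labels, the `context_k` cutoff, and the `answer_correct` field are hypothetical, and "rerank_drop" only means the gold document was retrieved but did not reach the top `context_k` after reranking.

```python
def diagnose(trace: dict, gold_id: str, context_k: int = 5) -> str:
    """Label the first stage at which a query went wrong, or 'ok'."""
    if gold_id not in trace["retrieved_ids"]:
        return "retrieval_miss"  # the correct document was never retrieved
    if gold_id not in trace["reranked_ids"][:context_k]:
        return "rerank_drop"  # retrieved, but not in the reranked top context_k
    if not trace.get("answer_correct", False):
        return "generation_miss"  # right context was available, answer still wrong
    return "ok"


# Hypothetical traces, one per failure mode.
traces = {
    "missed": {"retrieved_ids": ["d1", "d2"], "reranked_ids": ["d2", "d1"]},
    "demoted": {"retrieved_ids": ["d9", "d1", "d2"], "reranked_ids": ["d1", "d2", "d9"]},
    "ignored": {"retrieved_ids": ["d9", "d1"], "reranked_ids": ["d9", "d1"], "answer_correct": False},
    "fine": {"retrieved_ids": ["d9", "d1"], "reranked_ids": ["d9", "d1"], "answer_correct": True},
}
labels = {name: diagnose(t, gold_id="d9", context_k=2) for name, t in traces.items()}
print(labels)
```

Counting these labels across a run gives a per-stage failure breakdown that a single aggregate score cannot provide.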

Self-explanation: Why are metrics without traces usually insufficient for debugging RAG regressions?

7. Variants and Use Cases

7.1 LangSmith — Best for LLM pipeline traces

What changes
Each query execution becomes inspectable as a multi-step chain.

Why it matters
RAG failures often come from interactions between several stages rather than one isolated model call.

What it enables
You can compare two prompt or retrieval configurations query by query.

Limit and next step
For broader MLOps-style experiment registries, MLflow may remain the simpler backbone.
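The query-by-query comparison described above does not depend on a particular tool. As a plain-Python sketch of the idea (it does not use the LangSmith API), the function below takes per-query scores from two runs and lists which queries regressed or improved. The score dictionaries and the `min_delta` threshold are illustrative assumptions.

```python
def compare_runs(
    baseline: dict[str, float],
    candidate: dict[str, float],
    min_delta: float = 0.05,
) -> dict[str, list[tuple[str, float]]]:
    """Split shared queries into regressions and improvements by score delta."""
    regressions, improvements = [], []
    # Only queries present in both runs can be compared fairly.
    for qid in sorted(baseline.keys() & candidate.keys()):
        delta = candidate[qid] - baseline[qid]
        if delta <= -min_delta:
            regressions.append((qid, delta))
        elif delta >= min_delta:
            improvements.append((qid, delta))
    return {"regressions": regressions, "improvements": improvements}


# Hypothetical per-query scores (for example, reciprocal rank per question).
baseline_scores = {"q1": 1.0, "q2": 0.5, "q3": 0.0, "q4": 1.0}
candidate_scores = {"q1": 1.0, "q2": 0.0, "q3": 1.0}

result = compare_runs(baseline_scores, candidate_scores)
print(result)
```

Two runs with the same average can still differ on which queries they get right, which is why the per-query view is worth keeping alongside the aggregate.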

7.2 Phoenix — Best for observability and eval inspection

Phoenix is especially useful when you want to inspect production-like traces and evaluation behaviour visually.

7.3 MLflow — Best as a neutral experiment registry

If your team already runs model pipelines in MLflow, it can hold RAG runs too with less organisational friction.

8. Limits and Failure Modes

8.1 Tooling without conventions produces clutter

If run names, dataset versions, and baseline tags are inconsistent, the dashboard becomes hard to trust.
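A convention is easier to keep when it is checked by code. The sketch below validates run names against one hypothetical pattern, `<pipeline>__<dataset version>__<role>`; the pattern itself, the `eval-vN` format, and the two role names are assumptions chosen for illustration, and a team would substitute its own.

```python
import re

# Hypothetical convention: pipeline name, dataset version, and role joined by "__".
RUN_NAME_PATTERN = re.compile(
    r"^(?P<pipeline>[a-z0-9]+(?:_[a-z0-9]+)*)"
    r"__(?P<dataset>eval-v\d+)"
    r"__(?P<role>baseline|candidate)$"
)


def parse_run_name(name: str) -> dict[str, str]:
    """Split a run name into its parts, or raise if it breaks the convention."""
    match = RUN_NAME_PATTERN.match(name)
    if match is None:
        raise ValueError(f"run name does not follow the convention: {name!r}")
    return match.groupdict()


parts = parse_run_name("hybrid_rerank_v3__eval-v2__candidate")
print(parts)

try:
    parse_run_name("Hybrid rerank final")
except ValueError as err:
    print(err)
```

Calling a check like this before a run starts means a malformed name fails fast instead of appearing in the dashboard.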

8.2 Over-logging creates cost and privacy problems

Logging every prompt, chunk, and answer indefinitely may be operationally expensive or legally risky.
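One mitigation is to redact obvious identifiers before a trace is written. The sketch below masks email addresses and phone-like numbers with two deliberately crude regular expressions; the patterns, the placeholder tokens, and the choice of fields are assumptions, and the phone pattern will also match other long digit sequences such as dates. Production systems would normally rely on dedicated PII tooling and a retention policy.

```python
import re

# Crude illustrative patterns; they over-match and are not a complete PII filter.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def redact(text: str) -> str:
    """Replace email addresses and phone-like numbers with placeholder tokens."""
    text = EMAIL.sub("[EMAIL]", text)
    return PHONE.sub("[PHONE]", text)


def redact_trace(trace: dict, text_fields: tuple[str, ...] = ("question", "final_answer")) -> dict:
    """Return a copy of the trace with the chosen free-text fields redacted."""
    return {
        key: redact(value) if key in text_fields and isinstance(value, str) else value
        for key, value in trace.items()
    }


# Hypothetical trace containing personal data in the question.
raw_trace = {
    "question": "Contact jane.doe@example.com or +1 415-555-0100 today",
    "retrieved_ids": ["doc-12", "doc-3"],
    "final_answer": "Please email jane.doe@example.com.",
}
safe_trace = redact_trace(raw_trace)
print(safe_trace["question"])  # Contact [EMAIL] or [PHONE] today
```

Document ids are left untouched so the trace stays useful for debugging while the free text no longer carries the identifiers.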

8.3 Metrics without narrative notes stay hard to learn from

Teams often need a short explanation of why a run was considered better, not only the scores themselves.

8.4 Next step — The pipeline should adapt to different query types

Once experiments are logged well, the next question becomes: which query types are still failing? That leads directly to Part 17.

8.5 Common Pitfalls

#	Pitfall	Symptom	Fast Check
1	no dataset version	impossible comparisons	version eval sets explicitly
2	no traces	poor failure diagnosis	store sample traces per run
3	no baseline run	drift feels subjective	keep one stable reference run
4	inconsistent run naming	dashboard confusion	define a naming convention
5	no privacy filter	sensitive content leaks into logs	mask or redact where needed
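Pitfall 3 can be turned into an automatic check: compare each new run's aggregate metrics against the stored baseline and flag anything that dropped. This is a minimal sketch that assumes higher is better for every metric; the `tolerance` value and the candidate numbers are hypothetical.

```python
def regressed_metrics(
    baseline: dict[str, float],
    candidate: dict[str, float],
    tolerance: float = 0.01,
) -> list[str]:
    """Names of shared metrics where the candidate is worse than the baseline by more than tolerance."""
    return sorted(
        name
        for name in baseline.keys() & candidate.keys()
        if candidate[name] - baseline[name] < -tolerance
    )


# Baseline values reuse the example metrics from this part; candidate values are made up.
baseline_run = {"mrr": 0.71, "recall_at_10": 0.92}
candidate_run = {"mrr": 0.74, "recall_at_10": 0.88}

flagged = regressed_metrics(baseline_run, candidate_run)
print(flagged)  # ['recall_at_10']
```

A run that improves one metric while regressing another, as here, is exactly the case where a short narrative note explaining the trade-off is useful.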

9. Self-check — Answer Before Looking

Q1. What is the real purpose of experiment automation in RAG?

Answer: To make changes reproducible and comparable over time.
Why: RAG pipelines change across many interacting stages, not one isolated parameter.

Q2. Why are traces essential in addition to metrics?

Answer: Because they show which stage caused the score change.
Why: Retrieval, reranking, and answer generation can fail differently.

Q3. What is one minimal rule every team should define early?

Answer: A versioning rule for datasets, runs, and baselines.
Why: Tooling without conventions quickly turns into clutter.

Q4. Why can over-logging be dangerous?

Answer: It can increase storage cost and expose sensitive content.
Why: RAG traces may contain internal documents, prompts, and answers.

Cheat Sheet — One-page Summary

Definitions

Run: one experiment execution
Trace: step-by-step pipeline log
Artifact: saved output such as config snapshot or report

Minimal code

import mlflow

mlflow.log_param("chunk_size", 600)
mlflow.log_metrics({"mrr": 0.71, "recall_at_10": 0.92})

When to use what

Situation	Best fit
query-by-query LLM tracing	LangSmith
observability and eval inspection	Phoenix
broad experiment registry	MLflow

Bridge to the Next Part

Once experiments are recorded properly, the next natural question is not whether the system fails, but which types of questions still fail. Part 17 turns to query classification.