"RAG Core Study (16/26) — Experiment Automation with LangSmith, Phoenix, and MLflow"

If you cannot reproduce why retrieval improved last week, then you do not yet have an evaluation system. You only have impressions.

RAG pipelines change constantly: chunk size, retriever choice, fusion weight, reranker, prompt, context budget, and filtering rules. Without experiment automation, those changes accumulate into anecdotal confusion. Part 16 explains how to treat RAG improvement as a real experimental workflow: version the inputs, trace the pipeline, log the metrics, and preserve failure cases.


0. Prerequisites

  • Part 14 evaluation sets
  • Part 15 retrieval metrics
  • Part 13 rerankers and multi-stage retrieval

1. Learning Objectives

  1. Explain why RAG needs experiment automation instead of ad hoc testing.
  2. Distinguish trace logging, metric logging, and artifact logging.
  3. Understand where LangSmith, Phoenix, and MLflow each fit.
  4. Design a minimal experiment record that can actually be replayed later.

2. Key Summary

Experiment automation means each run records at least: dataset version, corpus/index version, retrieval settings, prompt/model settings, trace samples, and evaluation results. LangSmith is strong for LLM application traces and experiment comparisons. Phoenix is strong for observability and evaluation inspection. MLflow is strong as a general-purpose experiment registry. The most important decision is not the tool. It is the recording unit. A single failed query should remain inspectable later with its retrieved documents and final answer.


3. Intuition — Why “It Was Better Last Week” Is Not an Explanation

Suppose you changed:

  • chunk size from 400 to 600
  • Dense + BM25 fusion weight from 0.5 to 0.7
  • reranker model from bge-reranker-v2-m3 to jina-reranker-v2
  • prompt wording for citation style

If quality rises, what caused it? Without run tracking, you cannot tell whether the gain came from retrieval, reranking, or generation formatting.
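
One way to make the confounding visible is to diff the two configurations before looking at any scores. The sketch below assumes each run's settings are available as a flat dictionary; the key names and the `cite-v1` / `cite-v2` prompt labels are hypothetical placeholders, not values from a real run.

```python
from typing import Any


def diff_configs(old: dict[str, Any], new: dict[str, Any]) -> dict[str, tuple[Any, Any]]:
    """Return {key: (old_value, new_value)} for every setting that differs."""
    missing = object()  # sentinel so a key absent from one side still counts as a change
    changed = {}
    for key in sorted(old.keys() | new.keys()):
        before, after = old.get(key, missing), new.get(key, missing)
        if before != after:
            changed[key] = (
                None if before is missing else before,
                None if after is missing else after,
            )
    return changed


# Hypothetical config snapshots mirroring the four changes listed above.
last_week = {
    "chunk_size": 400,
    "fusion_weight": 0.5,
    "reranker": "bge-reranker-v2-m3",
    "prompt_version": "cite-v1",
}
this_week = {
    "chunk_size": 600,
    "fusion_weight": 0.7,
    "reranker": "jina-reranker-v2",
    "prompt_version": "cite-v2",
}

changes = diff_configs(last_week, this_week)
for name, (before, after) in changes.items():
    print(f"{name}: {before} -> {after}")

# More than one changed setting means a score delta cannot be pinned on a single cause.
confounded = len(changes) > 1
```

If `confounded` is true, the honest reading of a score gain is "one of these changes helped", and separating them requires one run per change.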


4. Definitions — The Core Objects of Experiment Tracking

| Item | Meaning |
|---|---|
| Run | One recorded experiment execution |
| Trace | Step-by-step record from query to retrieval to answer |
| Artifact | Persisted outputs such as reports, sample answers, configuration snapshots |
| Dataset Version | Frozen version of the evaluation question set |
| Prompt Version | Versioned prompt template |
| Index Version | Identifiable version of the document index used in retrieval |
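
A "frozen" dataset version is easier to enforce when the identifier is derived from the content itself. The following is a minimal sketch, assuming the evaluation set is a list of JSON-serialisable dictionaries; the field names `question` and `gold_doc_ids` and the sample questions are illustrative assumptions.

```python
import hashlib
import json


def dataset_fingerprint(examples: list[dict]) -> str:
    """Hash the evaluation examples so any edit yields a new version identifier."""
    # Sort keys and sort examples so the hash depends on content, not on ordering.
    canonical = sorted(json.dumps(ex, sort_keys=True, ensure_ascii=False) for ex in examples)
    digest = hashlib.sha256("\n".join(canonical).encode("utf-8")).hexdigest()
    return digest[:12]  # a short prefix is enough to tell versions apart in a dashboard


# Hypothetical evaluation examples.
eval_set = [
    {"question": "What is the refund window?", "gold_doc_ids": ["policy-12"]},
    {"question": "Who approves travel expenses?", "gold_doc_ids": ["hr-07"]},
]

version_a = dataset_fingerprint(eval_set)
version_b = dataset_fingerprint(list(reversed(eval_set)))  # same content, different order
version_c = dataset_fingerprint(eval_set + [{"question": "New question", "gold_doc_ids": []}])
```

Here `version_a` and `version_b` match, while `version_c` differs because a question was added. The same idea can be applied to prompt templates or to a manifest of indexed documents.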

5. Structure — What Must Be Logged Together

At minimum, each run should preserve:

  1. eval dataset version
  2. corpus or index version
  3. retrieval settings
  4. generation settings
  5. selected trace examples
  6. numeric metric outputs

If any one of these is missing, later comparison becomes ambiguous.
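
The six items can be held in one small record type so that an incomplete run is detected before it is stored. This is a sketch under assumptions: the class name `ExperimentRecord`, its field names, and the identifiers `evalset-v3` and `index-v7` are hypothetical, and the metric values are placeholders.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass(frozen=True)
class ExperimentRecord:
    eval_dataset_version: str
    index_version: str
    retrieval_settings: dict
    generation_settings: dict
    trace_samples: list = field(default_factory=list)
    metrics: dict = field(default_factory=dict)

    def missing_fields(self) -> list[str]:
        """Names of fields that are empty and would make later comparison ambiguous."""
        return [name for name, value in asdict(self).items() if not value]


record = ExperimentRecord(
    eval_dataset_version="evalset-v3",
    index_version="index-v7",
    retrieval_settings={"chunk_size": 600, "top_k": 20, "reranker": "bge-reranker-v2-m3"},
    generation_settings={"prompt_version": "cite-v2", "temperature": 0.0},
    trace_samples=[{"question": "example question", "retrieved_ids": ["d1", "d2"]}],
    metrics={"mrr": 0.71, "recall_at_10": 0.92},
)

# A sorted, indented JSON snapshot is easy to diff and to store as a run artifact.
serialized = json.dumps(asdict(record), indent=2, sort_keys=True)

incomplete = ExperimentRecord("evalset-v3", "", {}, {})
print(incomplete.missing_fields())
```

Refusing to save a record whose `missing_fields()` is non-empty is one simple way to keep every stored run comparable.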


6. Walkthrough — A Minimal Experiment Record

6.1 Logging with MLflow

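import mlflow

# evaluate_pipeline and eval_dataset are your own evaluation function and
# frozen evaluation set (see Part 14). The function is assumed to return a
# flat dict of metric name -> float, which is what log_metrics expects.
with mlflow.start_run(run_name="hybrid_rerank_v3"):
    # Retrieval settings that define this run
    mlflow.log_param("embedding_model", "bge-m3")
    mlflow.log_param("reranker", "bge-reranker-v2-m3")
    mlflow.log_param("chunk_size", 600)
    mlflow.log_param("top_k", 20)

    # Evaluation results for the same run
    metrics = evaluate_pipeline(eval_dataset)
    mlflow.log_metrics(metrics)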
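
Real configurations are usually nested, while MLflow parameters are flat key-value pairs. A small helper can flatten the config into dotted keys before logging. This is a sketch: `flatten_params`, the config layout, and the version identifiers are assumptions made for illustration, and the MLflow call is left as a comment so the block runs without MLflow installed.

```python
from typing import Any


def flatten_params(config: dict[str, Any], prefix: str = "") -> dict[str, Any]:
    """Turn a nested config into flat dotted keys, e.g. {"retrieval.top_k": 20}."""
    flat = {}
    for key, value in config.items():
        name = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten_params(value, name))
        else:
            flat[name] = value
    return flat


config = {
    "eval_dataset_version": "evalset-v3",  # hypothetical identifier
    "index_version": "index-v7",           # hypothetical identifier
    "retrieval": {"embedding_model": "bge-m3", "chunk_size": 600, "top_k": 20},
    "rerank": {"model": "bge-reranker-v2-m3"},
}

params = flatten_params(config)
for name, value in params.items():
    print(f"{name} = {value}")

# Inside an active MLflow run, the whole dict can be logged in one call:
#     mlflow.log_params(params)
```

Logging the dataset and index versions as parameters alongside the retrieval settings keeps the run filterable by those versions later.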

6.2 Saving a trace

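# question, retrieved_ids, reranked_ids, answer, and score come from one
# pipeline execution; save_trace is your own persistence helper.
trace = {
    "question": question,
    "retrieved_ids": retrieved_ids,   # order returned by the retriever
    "reranked_ids": reranked_ids,     # order after reranking
    "final_answer": answer,
    "faithfulness": score["faithfulness"],
}
save_trace(trace)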
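
The text does not say how `save_trace` is implemented. One simple possibility is an append-only JSON Lines file, sketched below. The explicit `path` argument, the `load_traces` helper, and the sample values are assumptions added for illustration.

```python
import json
import tempfile
from pathlib import Path


def save_trace(trace: dict, path: Path) -> None:
    """Append one trace as a single JSON line, so earlier traces are never rewritten."""
    with path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(trace, ensure_ascii=False) + "\n")


def load_traces(path: Path) -> list[dict]:
    """Read every stored trace back, skipping blank lines."""
    with path.open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Temporary location for the demo; a real run would use a per-run directory.
trace_file = Path(tempfile.mkdtemp()) / "traces.jsonl"

save_trace(
    {
        "question": "What is the refund window?",
        "retrieved_ids": ["d4", "d1", "d9"],
        "reranked_ids": ["d1", "d4", "d9"],
        "final_answer": "30 days.",
        "faithfulness": 0.9,
    },
    trace_file,
)
save_trace({"question": "Second query", "retrieved_ids": [], "reranked_ids": []}, trace_file)

loaded = load_traces(trace_file)
```

The resulting file can be attached to the run as an artifact, which keeps the failed queries inspectable next to the metrics they produced.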

6.3 Why traces matter

Metrics tell you that something changed. Traces tell you where it changed:

  • retrieval missed the correct document
  • reranker pushed the correct document down
  • answer ignored the right context
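
The three cases above can be checked mechanically when the gold document for a query is known. The sketch below assumes the trace also records which documents the answer cited (`cited_ids`) and that only the top `top_n` reranked documents reach the generator; both are assumptions beyond the trace fields shown in 6.2.

```python
def diagnose(trace: dict, gold_id: str, top_n: int = 3) -> str:
    """Attribute a query to the first stage where the gold document was lost."""
    if gold_id not in trace["retrieved_ids"]:
        return "retrieval_miss"
    # Only the first top_n reranked documents are assumed to enter the prompt.
    if gold_id not in trace["reranked_ids"][:top_n]:
        return "reranker_demotion"
    # cited_ids is a hypothetical field: documents the final answer actually used.
    if gold_id not in trace.get("cited_ids", []):
        return "context_ignored"
    return "ok"


traces = {
    "missed": {"retrieved_ids": ["d1", "d2"], "reranked_ids": ["d2", "d1"]},
    "demoted": {
        "retrieved_ids": ["d9", "d1", "d2", "d3"],
        "reranked_ids": ["d1", "d2", "d3", "d9"],
    },
    "ignored": {
        "retrieved_ids": ["d9", "d1", "d2"],
        "reranked_ids": ["d9", "d1", "d2"],
        "cited_ids": ["d1"],
    },
    "fine": {
        "retrieved_ids": ["d9", "d1"],
        "reranked_ids": ["d9", "d1"],
        "cited_ids": ["d9"],
    },
}

labels = {name: diagnose(trace, gold_id="d9") for name, trace in traces.items()}
print(labels)
```

Counting these labels across a run turns a single aggregate score into a per-stage breakdown of where the pipeline lost the right document.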

Self-explanation: Why are metrics without traces usually insufficient for debugging RAG regressions?


7. Variants and Use Cases

7.1 LangSmith — Best for LLM pipeline traces

What changes
Each query execution becomes inspectable as a multi-step chain.

Why it matters
RAG failures often come from interactions between several stages rather than one isolated model call.

What it enables
You can compare two prompt or retrieval configurations query by query.

Limit and next step
For broader MLOps-style experiment registries, MLflow may remain the simpler backbone.
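
Whatever tool stores the traces, the query-by-query comparison itself is a small computation. The sketch below is plain Python and does not use the LangSmith API; it assumes each run has been exported as a mapping from query ID to a per-query score, and the IDs and scores are made up.

```python
def compare_runs(
    baseline: dict[str, float], candidate: dict[str, float], tol: float = 0.0
) -> dict[str, list[str]]:
    """Split shared queries into improved, regressed, and unchanged by per-query score."""
    result: dict[str, list[str]] = {"improved": [], "regressed": [], "unchanged": []}
    for qid in sorted(baseline.keys() & candidate.keys()):
        delta = candidate[qid] - baseline[qid]
        if delta > tol:
            result["improved"].append(qid)
        elif delta < -tol:
            result["regressed"].append(qid)
        else:
            result["unchanged"].append(qid)
    return result


# Hypothetical per-query scores exported from two runs.
baseline_scores = {"q1": 1.0, "q2": 0.0, "q3": 0.5}
candidate_scores = {"q1": 1.0, "q2": 1.0, "q3": 0.0}

comparison = compare_runs(baseline_scores, candidate_scores)
mean_baseline = sum(baseline_scores.values()) / len(baseline_scores)
mean_candidate = sum(candidate_scores.values()) / len(candidate_scores)
print(comparison, mean_baseline, mean_candidate)
```

In this example the candidate's mean score is higher, yet `q3` regressed. That kind of hidden regression is what a per-query view exposes and an aggregate metric conceals.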

7.2 Phoenix — Best for observability and eval inspection

Phoenix is especially useful when you want to inspect production-like traces and evaluation behaviour visually.

7.3 MLflow — Best as a neutral experiment registry

If your team already runs model pipelines in MLflow, it can hold RAG runs too with less organisational friction.


8. Limits and Failure Modes

8.1 Tooling without conventions produces clutter

If run names, dataset versions, and baseline tags are inconsistent, the dashboard becomes hard to trust.
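
A convention only holds if something enforces it. The sketch below shows one possible scheme, `<pipeline>__<dataset version>__<baseline|candidate>`, validated with a regular expression. The scheme itself and the `evalset-vN` format are assumptions; any consistent pattern serves the same purpose.

```python
import re
from typing import Optional

RUN_NAME = re.compile(
    r"^(?P<pipeline>[a-z0-9]+(?:_[a-z0-9]+)*)"  # e.g. hybrid_rerank_v3
    r"__(?P<dataset>evalset-v\d+)"              # e.g. evalset-v3
    r"__(?P<tag>baseline|candidate)$"
)


def build_run_name(pipeline: str, dataset_version: str, tag: str) -> str:
    """Compose a run name and reject anything that breaks the convention."""
    name = f"{pipeline}__{dataset_version}__{tag}"
    if RUN_NAME.match(name) is None:
        raise ValueError(f"run name does not follow the convention: {name!r}")
    return name


def parse_run_name(name: str) -> Optional[dict]:
    """Return the name's components, or None if it does not follow the convention."""
    match = RUN_NAME.match(name)
    return match.groupdict() if match else None


good = build_run_name("hybrid_rerank_v3", "evalset-v3", "candidate")
parsed = parse_run_name(good)
legacy = parse_run_name("hybrid_rerank_v3")  # no dataset version or tag
print(good, parsed, legacy)
```

Because the dataset version and baseline tag are part of the name, a dashboard filter on the run name is enough to find comparable runs.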

8.2 Over-logging creates cost and privacy problems

Logging every prompt, chunk, and answer indefinitely may be operationally expensive or legally risky.
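
A redaction step between the pipeline and the log store addresses both concerns at once: it masks obvious identifiers and caps how much text is kept. The patterns below are deliberately simple and are assumptions for illustration; they are not a complete privacy filter, and real deployments will need rules that match their own data.

```python
import re

EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
LONG_DIGITS = re.compile(r"\b\d{9,}\b")  # account-like or phone-like digit runs


def redact(text: str, max_chars: int = 200) -> str:
    """Mask emails and long digit runs, then cap the stored length."""
    text = EMAIL.sub("[EMAIL]", text)
    text = LONG_DIGITS.sub("[NUMBER]", text)
    if len(text) > max_chars:
        text = text[:max_chars] + "...[truncated]"
    return text


def redact_trace(trace: dict, text_fields: tuple = ("question", "final_answer")) -> dict:
    """Return a copy of the trace with free-text fields redacted and IDs left intact."""
    return {
        key: redact(value) if key in text_fields and isinstance(value, str) else value
        for key, value in trace.items()
    }


raw = {
    "question": "Email jane.doe@example.com about account 123456789012",
    "retrieved_ids": ["d1"],
    "final_answer": "Done.",
}
safe = redact_trace(raw)
print(safe["question"])
```

Document IDs are kept unredacted on purpose: they let you re-fetch the chunk from the index under normal access controls instead of storing its text in the log.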

8.3 Metrics without narrative notes stay hard to learn from

Teams often need a short explanation of why a run was considered better, not only the scores themselves.

8.4 Next step — The pipeline should adapt to different query types

Once experiments are logged well, the next question becomes: which query types are still failing? That leads directly to Part 17.


8.5 Common Pitfalls

| # | Pitfall | Symptom | Fast Check |
|---|---|---|---|
| 1 | no dataset version | impossible comparisons | version eval sets explicitly |
| 2 | no traces | poor failure diagnosis | store sample traces per run |
| 3 | no baseline run | drift feels subjective | keep one stable reference run |
| 4 | inconsistent run naming | dashboard confusion | define a naming convention |
| 5 | no privacy filter | sensitive content leaks into logs | mask or redact where needed |

9. Self-check — Answer Before Looking

Q1. What is the real purpose of experiment automation in RAG?

Answer: To make changes reproducible and comparable over time.
Why: RAG pipelines change across many interacting stages, not one isolated parameter.

Q2. Why are traces essential in addition to metrics?

Answer: Because they show which stage caused the score change.
Why: Retrieval, reranking, and answer generation can fail differently.

Q3. What is one minimal rule every team should define early?

Answer: A versioning rule for datasets, runs, and baselines.
Why: Tooling without conventions quickly turns into clutter.

Q4. Why can over-logging be dangerous?

Answer: It can increase storage cost and expose sensitive content.
Why: RAG traces may contain internal documents, prompts, and answers.


Cheat Sheet — One-page Summary

Definitions

  • Run: one experiment execution
  • Trace: step-by-step pipeline log
  • Artifact: saved output such as config snapshot or report

Minimal code

mlflow.log_param("chunk_size", 600)
mlflow.log_metrics({"mrr": 0.71, "recall_at_10": 0.92})

When to use what

| Situation | Best fit |
|---|---|
| query-by-query LLM tracing | LangSmith |
| observability and eval inspection | Phoenix |
| broad experiment registry | MLflow |


References

Official docs

  • LangSmith docs
  • Arize Phoenix docs
  • MLflow tracking docs

Supporting notes

  • User notes, chapter 13 experiment automation

Bridge to the Next Part

Once experiments are recorded properly, the next natural question is not whether the system fails, but which types of questions still fail. Part 17 turns to query classification.
