"RAG Core Study (8/26) — Embedding Models: BGE-M3 / OpenAI / Upstage / E5 / Jina / Voyage"

Chunks and metadata are ready. The next decision: which embedding model to index with.

Embedding choice is the decision RAG teams regret most. Once a corpus is indexed, switching models means a full re-index, and the real differences in cost, latency, and language quality only surface in production. Part 8 compares the six mainstream models of 2026 (BGE-M3, OpenAI text-embedding-3, Upstage Solar Embeddings, E5-mistral, Jina v3, Voyage 3) across dimension, multilingual quality (especially Korean), cost, local execution, max input, and model-card semantics. It then turns the Model-aware Ingestion principle from author note Chapter 35 §2 into a model-card → schema mapping flow.


0. Prerequisites

  • Part 7 (metadata). A model change forces full metadata re-copy.
  • Cosine similarity / dot product basics (the bi-encoder formulation is detailed in Part 10).
  • Tokens ≠ characters — tokenizer differences matter.

1. Learning Objectives

  1. State each of the six mainstream models in one differentiating line.
  2. Express the impact of vector dimension, normalisation, and similarity choice on quality and cost.
  3. Read a model card and turn it into an ingestion schema.
  4. Estimate the cost of switching embedding models.

2. Key Summary

BGE-M3 (BAAI) is the open-source multilingual standard: 1024 dimensions, strong in both Korean and English. OpenAI text-embedding-3-small/large offers API convenience and Matryoshka (variable-dimension) embeddings. Upstage Solar is a Korean-optimised commercial model. E5-mistral-7b-instruct is open and instruction-tuned, but requires 7B-class inference. Jina v3 is task-aware (separate task adapters for retrieval, matching, and classification). Voyage 3 is a retrieval-tuned commercial model whose weak point is Korean. Three-line selection rule: if Korean share ≥ 30%, prefer BGE-M3 or Upstage; if the stack must be fully open or on-prem, BGE-M3 or E5; if API simplicity wins, OpenAI. The training input format stated on the model card (title/body/query prefix) must be reflected in the ingestion schema to unlock latent quality, which is the core point of author note Chapter 35 §2.


3. Intuition — Same Query, Six Embeddings, Six Top-Ks

Embedding "how do I file an exception to the security policy?" with the six models produces six different vectors. Even on the same corpus and chunking, top-K diverges. Some models are stronger on vocabulary anchors like "exception"; others on semantic chains like "filing → procedure".

Which model is best for your corpus is decided by your evaluation set (Part 14), not by an MTEB average. This article shrinks the candidate set; the right answer needs measurement.


4. Definitions — Six Mainstream Models (2026)

| Model | Dim | Max input (tokens) | Multilingual / Korean (0–3) | Cost (USD per 1M tokens) | Local | Note |
| --- | --- | --- | --- | --- | --- | --- |
| BGE-M3 (BAAI) | 1024 | 8192 | 3 / 3 | $0 (own GPU) | ✅ | dense + sparse + ColBERT in one |
| OpenAI text-embedding-3-small | 1536 (var) | 8191 | 2 / 2 | $0.02 | ❌ | Matryoshka, low cost |
| OpenAI text-embedding-3-large | 3072 (var) | 8191 | 3 / 2 | $0.13 | ❌ | Highest accuracy |
| Upstage Solar Embeddings 1.5 | 4096 | 4000 | 2 / 3 | $0.10 | ❌ | Korean-tuned |
| E5-mistral-7b-instruct | 4096 | 32K (model card advises ≤ 4096) | 3 / 2 | $0 (7B GPU) | ✅ (heavy) | instruction-tuned, MTEB top |
| Jina embeddings v3 | 1024 (var) | 8192 | 3 / 2 | $0.018 | ✅ (CC BY-NC 4.0) | task-aware adapters |
| Voyage 3 | 1024 | 32K | 2 / 0 | $0.06 | ❌ | retrieval-tuned; weak Korean |

Scores are relative, on a 0–3 scale (based on MTEB-Korean and MTEB multilingual averages). Absolute scores need an in-house evaluation set (Part 14).


5. Math — Dimension, Normalisation, Similarity

Three variables govern cost and accuracy.

  • \(d\) = vector dimension
  • \(N\) = chunk count
  • \(\|x\|\) = embedding norm

Storage (float32):

$$\text{Storage} = N \cdot d \cdot 4 \text{ bytes}$$

e.g. \(N=1M\), \(d=1024\) → 4 GB; \(d=4096\) → 16 GB. A 4× swing in storage and memory.
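
The storage formula is easy to sanity-check in code. The sketch below is illustrative only: the function names are hypothetical, it counts raw float32 vectors, and it ignores index overhead such as HNSW graph links.

```python
def index_storage_bytes(n_chunks: int, dim: int, bytes_per_value: int = 4) -> int:
    """Raw vector storage: N * d * bytes per component (float32 = 4 bytes)."""
    return n_chunks * dim * bytes_per_value


def to_gb(n_bytes: int) -> float:
    """Convert bytes to decimal gigabytes (1 GB = 10**9 bytes)."""
    return n_bytes / 1e9


if __name__ == "__main__":
    # Compare the dimensions that appear in the model table for 1M chunks.
    for dim in (1024, 3072, 4096):
        size = index_storage_bytes(1_000_000, dim)
        print(f"d={dim}: {to_gb(size):.1f} GB")
```

Passing bytes_per_value=2 gives a rough float16 estimate, assuming the vector store supports that precision.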

Search latency (HNSW approximation):

$$\text{Latency} \approx \mathcal{O}(d \cdot \log N)$$

Linear in \(d\). 1024 vs 4096 is roughly a 4× latency gap.

Similarity functions:

  • Cosine: \(\cos(\theta) = \frac{x \cdot y}{\|x\| \|y\|}\). Direction only.
  • Dot product: \(x \cdot y\). Norm-sensitive: vectors with a small norm score lower regardless of direction.
  • Euclidean: \(\|x - y\|\). Distance-based.

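Key rule: if the model outputs L2-normalised vectors, cosine and dot product are identical and Euclidean distance is a monotonic function of them (\(\|x - y\|^2 = 2 - 2\,x \cdot y\)), so all three produce the same ranking. Voyage and OpenAI return normalised vectors; BGE-M3 and E5 are normalised in standard usage. Raw, unnormalised outputs behave differently, so set normalisation explicitly.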
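
The equivalence for unit vectors can be checked on toy data. This is a minimal pure-Python sketch; the three-dimensional vectors are made up for illustration and stand in for real embeddings.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(v):
    norm = math.sqrt(dot(v, v))
    if norm == 0.0:
        raise ValueError("cannot normalise a zero vector")
    return [x / norm for x in v]


def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Toy query and documents, normalised to unit length.
query = l2_normalize([3.0, 4.0, 0.0])
docs = [l2_normalize(d) for d in ([6.0, 8.0, 1.0], [1.0, 0.0, 2.0], [0.0, 5.0, 5.0])]

# Rank document indices under each metric (best match first).
by_dot = sorted(range(len(docs)), key=lambda i: -dot(query, docs[i]))
by_cos = sorted(range(len(docs)), key=lambda i: -cosine(query, docs[i]))
by_l2 = sorted(range(len(docs)), key=lambda i: euclidean(query, docs[i]))

if __name__ == "__main__":
    print(by_dot, by_cos, by_l2)
```

With unnormalised inputs the three orderings are no longer guaranteed to agree, which is why the index metric and the normalisation setting have to be chosen together.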

Matryoshka embeddings (OpenAI 3, Jina v3): trained to remain useful at multiple truncation lengths, so keeping only the first \(d'\) dimensions loses little quality. On a 3072-dim model, \(d'=256\) cuts storage 12× (3072 / 256).
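
When truncation is done client-side rather than by the API, the shortened vector is no longer unit length and needs to be re-normalised before it goes into a cosine or dot-product index. A minimal sketch, with hypothetical function names and a toy vector:

```python
import math


def truncate_and_renormalize(vec, keep_dims):
    """Keep the first keep_dims components, then rescale to unit length."""
    if not 0 < keep_dims <= len(vec):
        raise ValueError("keep_dims must be between 1 and len(vec)")
    head = vec[:keep_dims]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        raise ValueError("truncated vector has zero norm")
    return [x / norm for x in head]


def storage_ratio(full_dim, keep_dims):
    """How many times smaller the truncated index is."""
    return full_dim / keep_dims


if __name__ == "__main__":
    toy = [0.5, 0.5, 0.5, 0.5]          # unit-length toy vector
    print(truncate_and_renormalize(toy, 2))
    print(storage_ratio(3072, 512))     # 3072 -> 512 dims
```

This only makes sense for models trained with a Matryoshka objective; truncating an ordinary embedding this way is not expected to preserve quality.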


6. Walkthrough — Model Card to Ingestion Schema

6.1 Four things to read from a model card

| Item | Decides | Example |
| --- | --- | --- |
| Training input format | ingestion schema | E5: "passage: {text}" / "query: {text}" |
| Max input length | chunk max size | BGE-M3 = 8192, Voyage 3 = 32K |
| Normalisation | similarity function | BGE-M3 recommends normalise → cosine |
| Instruction tuning | query preprocessing | E5-instruct: prepend task instruction |
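
One way to make the four model-card items enforceable is to record them in a small schema object that the ingestion pipeline consults. The sketch below is an assumption about how that could look: ModelCardSchema and its fields are hypothetical names, and the example values (including the prefix-style E5 entry and its 512-token limit) should be checked against the actual model cards.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelCardSchema:
    """Hypothetical record of what one model card dictates for ingestion."""
    name: str
    max_input_tokens: int
    normalize: bool
    passage_template: str  # must contain {text}
    query_template: str    # must contain {text}

    def format_passage(self, text: str) -> str:
        return self.passage_template.format(text=text)

    def format_query(self, text: str) -> str:
        return self.query_template.format(text=text)

    def chunk_token_cap(self, safety: float = 0.8) -> int:
        """Chunk size ceiling, leaving headroom below the model limit."""
        return int(self.max_input_tokens * safety)


# Symmetric model: no prefixes on either side.
BGE_M3 = ModelCardSchema("BAAI/bge-m3", 8192, True, "{text}", "{text}")

# Prefix-style E5 model (assumed values; verify against the card).
E5_PREFIX = ModelCardSchema(
    "intfloat/multilingual-e5-large", 512, True,
    "passage: {text}", "query: {text}",
)

if __name__ == "__main__":
    print(E5_PREFIX.format_query("exception filing procedure?"))
    print(BGE_M3.chunk_token_cap())
```

Keeping the templates next to the model name means a model swap changes one record instead of scattered string literals.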

6.2 BGE-M3 — Standard example

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-m3")

doc_emb = model.encode(
    chunk_text,
    normalize_embeddings=True,   # for cosine
)

query_emb = model.encode(
    "exception filing procedure?",
    normalize_embeddings=True,
)

6.3 E5-mistral — Instruction prefix is decisive

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("intfloat/e5-mistral-7b-instruct")

TASK = "Given a web search query, retrieve relevant passages that answer the query"

query_text = f"Instruct: {TASK}\nQuery: exception filing procedure?"
query_emb = model.encode(query_text, normalize_embeddings=True)

doc_emb = model.encode(chunk_text, normalize_embeddings=True)

Skipping this prefix causes a noticeable quality drop. Follow the model card literally.

6.4 Jina v3 — Task-aware prefix

from transformers import AutoModel

model = AutoModel.from_pretrained("jinaai/jina-embeddings-v3", trust_remote_code=True)

query_emb = model.encode(["exception filing procedure?"], task="retrieval.query")
doc_emb   = model.encode([chunk_text],                    task="retrieval.passage")

Same weights, different task adapters. Retrieval treats query and passage as asymmetric.

6.5 OpenAI — Matryoshka truncation

from openai import OpenAI
client = OpenAI()

resp = client.embeddings.create(
    model="text-embedding-3-large",
    input=chunk_text,
    dimensions=512,        # 3072 → 512, valid because trained that way
)
emb = resp.data[0].embedding

Storage drops to about 1/6 (3072 → 512) and search latency falls roughly in proportion; accuracy loss is typically 1–3 percentage points and task-dependent. A strong option when cost dominates.

6.6 Upstage Solar — Korean-tuned, asymmetric

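import os
from openai import OpenAI

# UPSTAGE_API_KEY is an assumed environment variable name; use whatever your secret store provides.
client = OpenAI(api_key=os.environ["UPSTAGE_API_KEY"], base_url="https://api.upstage.ai/v1/solar")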

resp = client.embeddings.create(
    model="solar-embedding-1-large-passage",   # passage / query models are separate
    input=chunk_text,
)

-passage and -query are different models. Query embeddings must use -query. A canonical asymmetric embedding setup.


7. Variants

7.1 BGE-M3 — 3-in-1 (Dense + Sparse + ColBERT)

  • What changes: a single model emits dense, sparse (BM25-like), and late-interaction (ColBERT) outputs.
  • Why use it: Hybrid Search (Part 12) from one model — operational simplicity.
  • What becomes possible: no separate sparse / ColBERT weights to manage.
  • Where it fits: multilingual + hybrid + ops-simplicity projects.
  • Limits: in pure single-mode comparisons, a specialised model edges it slightly.

7.2 OpenAI 3-large — dimension-reduction trade-off

  • What changes: trained for 3072 / 1024 / 512 / 256-dim truncations.
  • Why use it: storage and search cost scale with \(d'/d\), the ratio of kept to full dimensions.
  • What becomes possible: cost-decisive savings at multi-TB index scale.
  • Where it fits: large indices (>10M chunks), cost-sensitive.
  • Limits: 1–3 percentage points accuracy loss. Measure via Part 14.

7.3 E5-mistral — the 7B embedding cost trade

  • What changes: the embedding is a 7B-parameter LLM. GPU required.
  • Why use it: top MTEB scores with instruction tuning.
  • What becomes possible: re-use the model for many tasks via different instructions.
  • Where it fits: in-house GPU infrastructure with model-control needs.
  • Limits: 100–500 ms latency depending on GPU; chunk cost scales with GPU-hour.

7.4 Voyage 3 — retrieval-tuned, Korean-light

  • What changes: heavily fine-tuned on retrieval objectives.
  • Why use it: often beats OpenAI on English retrieval.
  • What becomes possible: English corpus with balanced cost/quality.
  • Where it fits: English-first RAG.
  • Limits: limited Korean training share → not recommended for Korean corpora. Use BGE-M3 or Upstage for multilingual internal corpora.

7.5 Jina v3 — task adapter split

  • What changes: shared weights, task adapters produce different embeddings.
  • Why use it: retrieve + classify + match from one model.
  • What becomes possible: model count = 1, simpler ops.
  • Where it fits: RAG combined with other NLP tasks.
  • Limits: passing the wrong task value causes a measurable quality drop.

8. Limits and Failure Modes

8.1 Model change = full re-index

  • Why intrinsic: embedding spaces are not interchangeable; you cannot migrate incrementally.
  • Diagnosis: when considering a swap, estimate both re-index cost (API or GPU hours) and availability impact.
  • Mitigation: shortlist via an early evaluation set; if you do swap, run a blue-green index and route queries.
  • Later part: Part 9 (vector-DB collection split).
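
A back-of-envelope estimate of the re-index cost can be scripted before any swap is proposed. This is a rough sketch: the function names are hypothetical, the average tokens per chunk and the GPU throughput are placeholder assumptions, and only the per-token prices come from the table in §4.

```python
def reindex_api_cost_usd(n_chunks, avg_tokens_per_chunk, usd_per_million_tokens):
    """API cost of embedding the whole corpus once."""
    total_tokens = n_chunks * avg_tokens_per_chunk
    return total_tokens / 1_000_000 * usd_per_million_tokens


def reindex_gpu_hours(n_chunks, chunks_per_second):
    """Wall-clock GPU hours for a local model at a measured throughput."""
    return n_chunks / chunks_per_second / 3600


if __name__ == "__main__":
    n = 1_000_000          # chunks in the corpus
    avg_tokens = 500       # assumed average chunk length in tokens
    print(f"3-large: ${reindex_api_cost_usd(n, avg_tokens, 0.13):.2f}")
    print(f"3-small: ${reindex_api_cost_usd(n, avg_tokens, 0.02):.2f}")
    # 100 chunks/s is a placeholder; measure it on your own hardware.
    print(f"local:   {reindex_gpu_hours(n, 100):.1f} GPU-hours")
```

The embedding bill is usually the smaller part; running two indices side by side during a blue-green switch doubles storage for the duration.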

8.2 Asymmetric embedding ignored — query/passage confusion

  • Why intrinsic: E5, Upstage, Jina treat query and passage with different prefixes or weights. Missing one decouples the vector spaces.
  • Diagnosis: document-document similarity exceeds query-document similarity on the same corpus; retrieval recall collapses.
  • Mitigation: split embed_query() and embed_documents() in code. LangChain separates them by default.
  • Later part: Part 10 (dense retrieval — the meaning of asymmetry).
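
The mitigation can be made structural by giving the embedder two entry points that cannot be mixed up. The sketch below is a hypothetical wrapper, not any library's API: AsymmetricEmbedder and fake_backend are illustrative names, and the fake backend stands in for a real embedding call so the routing is visible.

```python
from typing import Callable, List, Sequence

# A backend takes (model_name, texts) and returns one vector per text.
EmbedFn = Callable[[str, Sequence[str]], List[List[float]]]


class AsymmetricEmbedder:
    """Routes passages and queries to their own model names."""

    def __init__(self, backend: EmbedFn, passage_model: str, query_model: str):
        self._backend = backend
        self._passage_model = passage_model
        self._query_model = query_model

    def embed_documents(self, texts: Sequence[str]) -> List[List[float]]:
        return self._backend(self._passage_model, list(texts))

    def embed_query(self, text: str) -> List[float]:
        return self._backend(self._query_model, [text])[0]


def fake_backend(model: str, texts: Sequence[str]) -> List[List[float]]:
    """Stand-in for a real API call; first component marks the query model."""
    flag = 1.0 if model.endswith("-query") else 0.0
    return [[flag, float(len(t))] for t in texts]


embedder = AsymmetricEmbedder(
    fake_backend,
    passage_model="solar-embedding-1-large-passage",
    query_model="solar-embedding-1-large-query",
)

if __name__ == "__main__":
    print(embedder.embed_query("exception filing procedure?"))
    print(embedder.embed_documents(["chunk one", "chunk two"]))
```

For prefix-based models the same wrapper shape works if the backend applies the query or passage template instead of switching model names.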

8.3 Missing normalisation breaks cosine

  • Why intrinsic: unnormalised outputs feeding a cosine index let norm differences dominate; ranking swings on chunk length.
  • Diagnosis: distribution of \(\|x\|\) far from 1 = no normalisation.
  • Mitigation: pass normalize_embeddings=True or do x / np.linalg.norm(x) client-side.
  • Later part: Part 9 (vector-DB metric setting — cosine vs dot).
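
The diagnosis step can be run on a sample of stored vectors before trusting a cosine index. A minimal sketch with hypothetical helper names and toy two-dimensional vectors:

```python
import math
from statistics import mean


def l2_norm(vec):
    return math.sqrt(sum(x * x for x in vec))


def norm_summary(vectors):
    """Min, max, and mean norm of a sample of embeddings."""
    norms = [l2_norm(v) for v in vectors]
    return {"min": min(norms), "max": max(norms), "mean": mean(norms)}


def looks_normalized(vectors, tol=1e-3):
    """True when every vector has norm within tol of 1."""
    return all(abs(l2_norm(v) - 1.0) <= tol for v in vectors)


def normalize(vec):
    norm = l2_norm(vec)
    if norm == 0.0:
        raise ValueError("cannot normalise a zero vector")
    return [x / norm for x in vec]


# Toy sample: same direction, very different norms.
raw = [[3.0, 4.0], [0.3, 0.4], [6.0, 8.0]]
fixed = [normalize(v) for v in raw]

if __name__ == "__main__":
    print(norm_summary(raw), looks_normalized(raw))
    print(norm_summary(fixed), looks_normalized(fixed))
```

The tolerance of 1e-3 is an arbitrary choice for illustration; float16 storage may need a looser bound.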

8.4 Max input overflow — silent truncation

  • Why intrinsic: chunks exceeding max input are silently truncated; tail content never makes it into the embedding.
  • Diagnosis: chunk-token distribution — 95th percentile above 80% of max input is dangerous.
  • Mitigation: cap chunk size at \(0.8 \times \text{max input}\); monitor boundary stats.
  • Related part: Part 5 (chunking), applied together with the max input of the model chosen here in Part 8.
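
The 95th-percentile check can be automated once per-chunk token counts are available. The sketch below assumes the counts were produced with the embedding model's own tokenizer; the function names and the sample distribution are hypothetical.

```python
def percentile_nearest_rank(values, pct):
    """Nearest-rank percentile for an integer pct in 1..100."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    # Integer ceiling of pct * n / 100 avoids float rounding surprises.
    rank = max(1, (pct * len(ordered) + 99) // 100)
    return ordered[rank - 1]


def overflow_report(token_counts, max_input_tokens, safety=0.8):
    """Summarise how close the chunk-length distribution is to the limit."""
    cap = int(max_input_tokens * safety)
    p95 = percentile_nearest_rank(token_counts, 95)
    return {
        "cap": cap,
        "p95": p95,
        "p95_over_cap": p95 > cap,
        "n_truncated": sum(1 for n in token_counts if n > max_input_tokens),
    }


if __name__ == "__main__":
    # Made-up distribution: mostly short chunks plus a long tail.
    counts = [100] * 90 + [9000] * 10
    print(overflow_report(counts, max_input_tokens=8192))
```

Counting n_truncated separately matters because a healthy percentile can still hide a handful of chunks that lose their tail.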

8.5 Trusting benchmark averages

  • Why intrinsic: MTEB averages cover all domains; for legal, medical, or in-house Korean corpora the ranking can flip.
  • Diagnosis: poor retrieval recall after picking the MTEB leader without in-house evaluation.
  • Mitigation: shortlist 3–4 candidates and decide on a 50–200-pair custom evaluation set.
  • Later part: Parts 14 (eval sets) and 15 (search-quality metrics).
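
A custom evaluation set only needs a small scoring loop to rank the shortlist. This is a minimal sketch using recall@k as the metric; the function names, model labels, and document ids are all hypothetical, and the retrieval runs are assumed to have been produced elsewhere.

```python
def recall_at_k(retrieved_ids, relevant_ids, k):
    """Share of relevant documents found in the top k results."""
    relevant = set(relevant_ids)
    if not relevant:
        raise ValueError("each query needs at least one relevant id")
    return len(set(retrieved_ids[:k]) & relevant) / len(relevant)


def mean_recall_at_k(runs, gold, k):
    """runs: {query_id: ranked ids}; gold: {query_id: relevant ids}."""
    scores = [recall_at_k(runs[q], gold[q], k) for q in gold]
    return sum(scores) / len(scores)


def rank_models(runs_by_model, gold, k=5):
    """Sort candidate models by mean recall@k, best first."""
    scored = {m: mean_recall_at_k(r, gold, k) for m, r in runs_by_model.items()}
    return sorted(scored.items(), key=lambda kv: kv[1], reverse=True)


if __name__ == "__main__":
    gold = {"q1": ["d1"], "q2": ["d4", "d5"]}
    runs_by_model = {
        "model_a": {"q1": ["d1", "d2"], "q2": ["d4", "d9"]},
        "model_b": {"q1": ["d3", "d2"], "q2": ["d4", "d5"]},
    }
    for name, score in rank_models(runs_by_model, gold, k=2):
        print(name, round(score, 3))
```

With only 50–200 pairs, small score gaps between models are within noise, so treat near-ties as ties and let cost or operations decide.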

8.6 Common Pitfalls

  • "MTEB #1 = #1 on my corpus." §8.5.
  • "One model for both query and passage." §8.2. E5, Upstage, Jina are asymmetric.
  • "The library normalises automatically." §8.3. Be explicit.
  • "8K max input means 7K chunks are safe." Tokenisers differ, and a metadata prefix can push the chunk over the limit.
  • "OpenAI 3-large is always best." Above 30% Korean share, BGE-M3 or Upstage frequently win.

9. Settled Conclusions

Q1. Top picks for a corpus with Korean share ≥ 30%?

BGE-M3 (open, 1024d, strong multilingual) or Upstage Solar Embeddings (commercial, Korean-tuned). Chapter: §2, §4.

Q2. Why is a model change expensive in one sentence?

Embedding spaces are mutually incompatible — full re-index is required, no incremental migration. Chapter: §8.1.

Q3. What must always be added to E5-mistral queries?

A task instruction prefix in the form Instruct: {task}\nQuery: {q}. Omitting it costs measurable accuracy. Chapter: §6.3, §8.2.

Q4. How does Matryoshka affect storage cost?

\(d=3072 \to d'=512\) reduces storage to roughly 1/6. Accuracy loss is typically 1–3 percentage points. Chapter: §5, §7.2.

Q5. What is the final arbiter of embedding choice?

An in-house evaluation set (Part 14) — 50–200 query-document pairs to compare 3–4 shortlisted models. Chapter: §8.5.


10. Further Reading

Primary

  • BAAI. BGE-M3: Multi-Lingual, Multi-Functionality, Multi-Granularity Text Embeddings. arXiv:2402.03216 (2024).
  • OpenAI. New embedding models and API updates (2024-01 blog) — Matryoshka rollout.
  • Wang, L. et al. Improving Text Embeddings with Large Language Models (E5-mistral). arXiv:2401.00368 (2024).
  • Jina AI. jina-embeddings-v3: Multilingual Embeddings with Task LoRA. arXiv:2409.10173 (2024).
  • Upstage. Solar Embeddings: a Korean-optimised embedding model (2024 tech report).
  • MTEB Leaderboard: https://huggingface.co/spaces/mteb/leaderboard (absolute ranks are corpus-dependent).

Official docs

  • BGE-M3 model card: https://huggingface.co/BAAI/bge-m3
  • OpenAI Embeddings: https://platform.openai.com/docs/guides/embeddings
  • Upstage Embeddings: https://developers.upstage.ai/docs/apis/embeddings
  • Voyage AI: https://docs.voyageai.com/
  • Jina v3: https://huggingface.co/jinaai/jina-embeddings-v3

Supporting

  • Author note Chapter 7 — embeddings.
  • Author note Chapter 35 §2 — Model-aware Ingestion, Title/Body Schema, Field-aware Embedding.

Cheat Sheet

| Scenario | First choice | Second |
| --- | --- | --- |
| Korean share ≥ 30%, open / local | BGE-M3 | E5-mistral |
| Korean share ≥ 30%, commercial API | Upstage Solar | OpenAI 3-large |
| English-centric, API | Voyage 3 or OpenAI 3-large | Jina v3 |
| English-centric, open / local | BGE-M3 | E5-mistral |
| Large index (>10M), cost-first | OpenAI 3-small + Matryoshka 512d | BGE-M3 + sparse |
| RAG + classification / matching combined | Jina v3 (task-aware) | OpenAI 3 |

Selection rule of thumb: language mix · ops form (API vs local) · corpus size → three candidates → eval-set decision.
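
The language and ops rows of the cheat sheet can be encoded as a lookup so the shortlist is reproducible. The sketch below is illustrative: the function name is hypothetical, it covers only the four language × ops rows, and treating anything under 30% Korean as "English-centric" is a simplifying assumption.

```python
# (language bucket, ops form) -> (first choice, second choice)
CHEAT_SHEET = {
    ("korean", "local"): ("BGE-M3", "E5-mistral"),
    ("korean", "api"): ("Upstage Solar", "OpenAI 3-large"),
    ("english", "api"): ("Voyage 3 or OpenAI 3-large", "Jina v3"),
    ("english", "local"): ("BGE-M3", "E5-mistral"),
}


def candidates(korean_share: float, ops: str):
    """Return (first, second) picks for a Korean share in 0..1 and ops form."""
    if not 0.0 <= korean_share <= 1.0:
        raise ValueError("korean_share must be between 0 and 1")
    if ops not in ("api", "local"):
        raise ValueError("ops must be 'api' or 'local'")
    language = "korean" if korean_share >= 0.30 else "english"
    return CHEAT_SHEET[(language, ops)]


if __name__ == "__main__":
    print(candidates(0.45, "local"))
    print(candidates(0.05, "api"))
```

The output is a starting shortlist only; the evaluation set makes the final call.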


Bridge — What's Next

Next — RAG Core Study (9/26) — Vector DB Showdown: FAISS / Chroma / Qdrant / Milvus / Weaviate / Pinecone / pgvector.

With the embedding chosen, the next question is where to store it. Part 9 compares seven mainstream vector DBs on ANN algorithm (HNSW/IVF), metadata-filter expressiveness, update/delete support, and operational cost — and shows how Part 7's metadata design meets each DB's actual support.
