"RAG Core Study (8/26) — Embedding Models: BGE-M3 / OpenAI / Upstage / E5 / Jina / Voyage"

May 17, 2026

Chunks and metadata are ready. The next decision: which embedding model to index with.

Embedding choice is the decision RAG teams regret most. Once a corpus is indexed, switching models means a full re-index, and the real differences in cost, latency, and language quality only surface in production. Part 8 compares the six mainstream models of 2026 (BGE-M3, OpenAI text-embedding-3, Upstage Solar Embeddings, E5-mistral, Jina v3, Voyage 3) across dimension, multilingual quality (especially Korean), cost, local execution, max input, and model-card semantics. It then turns the Model-aware Ingestion principle from author note Chapter 35 §2 into a model-card → schema mapping flow.

0. Prerequisites

Part 7 (metadata). A model change forces full metadata re-copy.
Cosine similarity / dot product basics (the bi-encoder formulation is detailed in Part 10).
Tokens ≠ characters — tokenizer differences matter.

1. Learning Objectives

State each of the six mainstream models in one differentiating line.
Express the impact of vector dimension, normalisation, and similarity choice on quality and cost.
Read a model card and turn it into an ingestion schema.
Estimate the cost of switching embedding models.

2. Key Summary

BGE-M3 (BAAI) is the open-source multilingual standard: 1024d, strong in Korean and English.
OpenAI text-embedding-3-small/large offers API convenience and Matryoshka (variable-dimension) embeddings.
Upstage Solar is a Korean-optimised commercial model.
E5-mistral-7b-instruct is open and instruction-tuned, but requires 7B-class inference.
Jina v3 is task-aware, with separate task settings for retrieval, matching, and classification.
Voyage 3 is a retrieval-tuned commercial model; Korean is its weak point.

Three-line selection rule:

If Korean share ≥ 30%, prefer BGE-M3 or Upstage.
If fully open / on-prem, prefer BGE-M3 or E5.
If API simplicity wins, prefer OpenAI.

The training input format stated on the model card (title/body/query prefix) must be reflected in the ingestion schema to unlock latent quality. This is the core point of author note Chapter 35 §2.

3. Intuition — Same Query, Six Embeddings, Six Top-Ks

Embedding "how do I file an exception to the security policy?" with the six models produces six different vectors. Even on the same corpus and chunking, top-K diverges. Some models are stronger on vocabulary anchors like "exception"; others on semantic chains like "filing → procedure".

Which model is best for your corpus is decided by your evaluation set (Part 14), not by an MTEB average. This article shrinks the candidate set; the right answer needs measurement.

4. Definitions — Six Mainstream Models (2026)

Model	Dim	Max input	Multilingual / Korean (0-3)	Cost (1M tokens)	Local	Note
BGE-M3 (BAAI)	1024	8192	3 / 3	$0 (own GPU)	✅	dense + sparse + colbert in one
OpenAI text-embedding-3-small	1536 (var)	8191	2 / 2	$0.02	❌	Matryoshka, low cost
OpenAI text-embedding-3-large	3072 (var)	8191	3 / 2	$0.13	❌	Highest accuracy
Upstage Solar Embeddings 1.5	4096	4000	2 / 3	$0.10	❌	Korean-tuned
E5-mistral-7b-instruct	4096	32K (model card recommends ≤ 4096)	3 / 2	$0 (7B GPU)	✅ (heavy)	instruction-tuned, MTEB top
Jina embeddings v3	1024 (var)	8192	3 / 2	$0.018	✅ (CC BY-NC 4.0, non-commercial)	task-aware adapters
Voyage 3	1024	32K	2 / 0	$0.06	❌	retrieval-tuned; weak Korean

Scores are relative, on a 0–3 scale (based on MTEB-Korean and MTEB multilingual averages). Absolute scores need an in-house evaluation set (Part 14).

5. Math — Dimension, Normalisation, Similarity

Three variables govern cost and accuracy.

$d$ = vector dimension
$N$ = chunk count
$\|x\|$ = embedding norm

Storage (float32):

$$\text{Storage} = N \cdot d \cdot 4 \text{ bytes}$$

e.g. $N=1M$, $d=1024$ → 4 GB; $d=4096$ → 16 GB. A 4× swing in storage and memory.
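The storage formula is simple enough to check in a few lines. The sketch below is illustrative only: the function name `index_storage_bytes` is hypothetical, and it counts raw float32 vectors without any index overhead such as HNSW graph links.

```python
def index_storage_bytes(n_chunks: int, dim: int, bytes_per_value: int = 4) -> int:
    """Raw vector storage: N * d * bytes per component (float32 = 4 bytes)."""
    return n_chunks * dim * bytes_per_value


GB = 10**9  # decimal gigabytes, matching the rounded figures in the text

for dim in (1024, 4096):
    size = index_storage_bytes(1_000_000, dim)
    print(f"d={dim}: {size / GB:.1f} GB")
```

For one million chunks this prints 4.1 GB at 1024 dims and 16.4 GB at 4096 dims, the same 4× gap as above.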

Search latency (HNSW approximation):

$$\text{Latency} \approx \mathcal{O}(d \cdot \log N)$$

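Distance computation is linear in $d$, so 1024 vs 4096 is roughly a 4× gap in per-query compute. Measured latency also depends on memory bandwidth and index parameters.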

Similarity functions:

Cosine: $\cos(\theta) = \frac{x \cdot y}{\|x\| \|y\|}$. Direction only.
Dot product: $x \cdot y$. Norm-sensitive: low-norm vectors are penalised.
Euclidean: $\|x - y\|$. Distance-based.

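Key rule: if the model outputs L2-normalised vectors, cosine and dot product are identical, and Euclidean distance is a monotonic function of them ($\|x - y\|^2 = 2 - 2\,x \cdot y$), so all three produce the same ranking. Voyage and OpenAI output normalised vectors; BGE-M3 and E5 are normalised in standard usage. Raw outputs without normalisation behave differently, so set normalisation explicitly.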
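The equivalence on unit vectors can be verified numerically. This is a minimal sketch using random vectors as stand-ins for real embeddings; the helper `l2_normalize` and all variable names are illustrative, not part of any embedding library.

```python
import numpy as np

rng = np.random.default_rng(0)


def l2_normalize(x: np.ndarray) -> np.ndarray:
    """Scale each vector (last axis) to unit L2 norm."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


docs = l2_normalize(rng.normal(size=(100, 64)))   # stand-in passage vectors
query = l2_normalize(rng.normal(size=64))         # stand-in query vector

dot = docs @ query
cosine = dot / (np.linalg.norm(docs, axis=1) * np.linalg.norm(query))
euclid = np.linalg.norm(docs - query, axis=1)

# On unit vectors: cosine == dot, and ||x - y||^2 == 2 - 2 * (x . y)
same_cos_dot = np.allclose(cosine, dot)
same_identity = np.allclose(euclid**2, 2 - 2 * dot)
# Descending dot product and ascending distance give the same ranking
same_ranking = np.array_equal(np.argsort(-dot), np.argsort(euclid))
```

Dropping the `l2_normalize` calls breaks the first two identities and can change the ranking, which is why the metric and the normalisation setting have to be chosen together.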

Matryoshka embeddings (OpenAI 3, Jina v3): trained to remain useful at multiple truncations, so keeping only the first $d'$ dims loses minimal quality. $d'=256$ yields 12× storage savings on a 3072-dim model ($3072 / 256 = 12$).
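When truncation is done client-side rather than through an API parameter, the shortened vector has to be re-normalised before it goes into a cosine or dot-product index. A minimal sketch, assuming the embeddings come from a model trained for Matryoshka truncation (the random array below only demonstrates the mechanics, and `truncate_matryoshka` is a hypothetical helper):

```python
import numpy as np


def truncate_matryoshka(emb: np.ndarray, d_prime: int) -> np.ndarray:
    """Keep the first d_prime dims, then re-normalise to unit length."""
    cut = emb[..., :d_prime]
    return cut / np.linalg.norm(cut, axis=-1, keepdims=True)


full = np.random.default_rng(0).normal(size=(4, 3072))  # stand-in vectors
small = truncate_matryoshka(full, 256)
storage_ratio = full.shape[1] / small.shape[1]  # 3072 / 256
```

Truncating a model that was not trained this way still runs, but nothing guarantees the leading dimensions carry most of the signal.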

6. Walkthrough — Model Card to Ingestion Schema

6.1 Four things to read from a model card

Item	Decides	Example
Training input format	ingestion schema	E5: `"passage: {text}"` / `"query: {text}"`
Max input length	chunk max size	BGE-M3 = 8192, Voyage 3 = 32K
Normalisation	similarity function	BGE-M3 recommends normalise → cosine
Instruction tuning	query preprocessing	E5-instruct: prepend task instruction
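One way to carry these four items into the pipeline is a small per-model record that ingestion and query code both read. The sketch below is an assumption about how such a record could look: `EmbeddingSpec` and its fields are hypothetical names, the 0.8 headroom factor is the one used in §8.4, and the token limits should be replaced by whatever you read off the model card.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EmbeddingSpec:
    """Hypothetical record of the model-card facts that shape ingestion."""
    name: str
    max_tokens: int                    # max input length taken from the card
    normalize: bool                    # True -> unit vectors, cosine/dot index
    query_template: str = "{text}"     # training-time query format
    passage_template: str = "{text}"   # training-time passage format

    def format_query(self, text: str) -> str:
        return self.query_template.format(text=text)

    def format_passage(self, text: str) -> str:
        return self.passage_template.format(text=text)

    def chunk_token_cap(self, safety: float = 0.8) -> int:
        """Chunk-size ceiling with headroom below the model's max input."""
        return int(self.max_tokens * safety)


E5_TASK = "Given a web search query, retrieve relevant passages that answer the query"

bge_m3 = EmbeddingSpec(name="BAAI/bge-m3", max_tokens=8192, normalize=True)
e5_mistral = EmbeddingSpec(
    name="intfloat/e5-mistral-7b-instruct",
    max_tokens=4096,  # assumption: conservative cap, adjust to the model card
    normalize=True,
    query_template="Instruct: " + E5_TASK + "\nQuery: {text}",
)
```

With this in place, `e5_mistral.format_query(q)` adds the instruction while `e5_mistral.format_passage(p)` leaves passages untouched, and `bge_m3.chunk_token_cap()` gives the chunker its upper bound.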

6.2 BGE-M3 — Standard example

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-m3")

doc_emb = model.encode(
    chunk_text,
    normalize_embeddings=True,   # for cosine
)

query_emb = model.encode(
    "exception filing procedure?",
    normalize_embeddings=True,
)

6.3 E5-mistral — Instruction prefix is decisive

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("intfloat/e5-mistral-7b-instruct")

TASK = "Given a web search query, retrieve relevant passages that answer the query"

query_text = f"Instruct: {TASK}\nQuery: exception filing procedure?"
query_emb = model.encode(query_text, normalize_embeddings=True)

doc_emb = model.encode(chunk_text, normalize_embeddings=True)

Skipping this prefix causes a noticeable quality drop. Only the query takes the instruction; passages are embedded as-is. Follow the model card literally.

6.4 Jina v3 — Task-aware prefix

from transformers import AutoModel

model = AutoModel.from_pretrained("jinaai/jina-embeddings-v3", trust_remote_code=True)

query_emb = model.encode(["exception filing procedure?"], task="retrieval.query")
doc_emb   = model.encode([chunk_text],                    task="retrieval.passage")

Same weights, different task adapters. Retrieval treats query and passage as asymmetric.

6.5 OpenAI — Matryoshka truncation

from openai import OpenAI
client = OpenAI()

resp = client.embeddings.create(
    model="text-embedding-3-large",
    input=chunk_text,
    dimensions=512,        # 3072 → 512, valid because trained that way
)
emb = resp.data[0].embedding

Storage and latency cut by 6×; accuracy loss is typically 1–3 percentage points and task-dependent. Strong when cost dominates.

6.6 Upstage Solar — Korean-tuned, asymmetric

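import os
from openai import OpenAI

# The environment variable name is an assumption; use whatever your deployment defines.
UPSTAGE_KEY = os.environ["UPSTAGE_API_KEY"]
client = OpenAI(api_key=UPSTAGE_KEY, base_url="https://api.upstage.ai/v1/solar")

# Index time: embed chunks with the passage model.
resp = client.embeddings.create(
    model="solar-embedding-1-large-passage",   # passage / query models are separate
    input=chunk_text,
)
doc_emb = resp.data[0].embedding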

-passage and -query are different models. Query embeddings must use -query. A canonical asymmetric embedding setup.

7. Variants

7.1 BGE-M3 — 3-in-1 (Dense + Sparse + ColBERT)

What changes: a single model emits dense, sparse (BM25-like), and late-interaction (ColBERT) outputs.
Why use it: Hybrid Search (Part 12) from one model — operational simplicity.
What becomes possible: no separate sparse / ColBERT weights to manage.
Where it fits: multilingual + hybrid + ops-simplicity projects.
Limits: in pure single-mode comparisons, a specialised model edges it slightly.

7.2 OpenAI 3-large — dimension-reduction trade-off

What changes: trained for 3072 / 1024 / 512 / 256-dim truncations.
Why use it: storage and search cost scale with the ratio $d'/d$ (e.g. $512 / 3072 = 1/6$).
What becomes possible: cost-decisive savings at multi-TB index scale.
Where it fits: large indices (>10M chunks), cost-sensitive.
Limits: 1–3 percentage points accuracy loss. Measure via Part 14.

7.3 E5-mistral — the 7B embedding cost trade

What changes: the embedding is a 7B-parameter LLM. GPU required.
Why use it: top MTEB scores with instruction tuning.
What becomes possible: re-use the model for many tasks via different instructions.
Where it fits: in-house GPU infrastructure with model-control needs.
Limits: 100–500 ms latency depending on GPU; chunk cost scales with GPU-hour.

7.4 Voyage 3 — retrieval-tuned, Korean-light

What changes: heavily fine-tuned on retrieval objectives.
Why use it: often beats OpenAI on English retrieval.
What becomes possible: English corpus with balanced cost/quality.
Where it fits: English-first RAG.
Limits: limited Korean training share → not recommended for Korean corpora. Use BGE-M3 or Upstage for multilingual internal corpora.

7.5 Jina v3 — task adapter split

What changes: shared weights, task adapters produce different embeddings.
Why use it: retrieve + classify + match from one model.
What becomes possible: model count = 1, simpler ops.
Where it fits: RAG combined with other NLP tasks.
Limits: choosing the wrong task (for example, embedding queries with retrieval.passage) causes a measurable quality drop.

8. Limits and Failure Modes

8.1 Model change = full re-index

Why intrinsic: embedding spaces are not interchangeable; you cannot migrate incrementally.
Diagnosis: when considering a swap, estimate both re-index cost (API or GPU hours) and availability impact.
Mitigation: shortlist via an early evaluation set; if you do swap, run a blue-green index and route queries.
Later part: Part 9 (vector-DB collection split).
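A back-of-envelope estimate of the re-index cost takes only chunk count, average chunk length, and either an API price or a GPU throughput figure. The sketch below uses hypothetical function names; the prices come from the table in §4, while the corpus size, average tokens per chunk, and the 200 chunks/second throughput are made-up inputs for illustration.

```python
def reindex_api_cost_usd(
    n_chunks: int, avg_tokens_per_chunk: float, usd_per_million_tokens: float
) -> float:
    """API cost of embedding the whole corpus once."""
    total_tokens = n_chunks * avg_tokens_per_chunk
    return total_tokens / 1_000_000 * usd_per_million_tokens


def reindex_gpu_hours(n_chunks: int, chunks_per_second: float) -> float:
    """Wall-clock GPU hours for a local model at a measured throughput."""
    return n_chunks / chunks_per_second / 3600


# Assumed corpus: 10M chunks averaging 400 tokens each.
cost_large = reindex_api_cost_usd(10_000_000, 400, 0.13)  # OpenAI 3-large price
cost_small = reindex_api_cost_usd(10_000_000, 400, 0.02)  # OpenAI 3-small price
gpu_hours = reindex_gpu_hours(10_000_000, 200)            # assumed throughput
```

Under these assumptions the API re-index comes to about $520 for 3-large and $80 for 3-small, and the local run to roughly 14 GPU hours. The availability impact of running two indices side by side is not captured here.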

8.2 Asymmetric embedding ignored — query/passage confusion

Why intrinsic: E5, Upstage, Jina treat query and passage with different prefixes or weights. Missing one decouples the vector spaces.
Diagnosis: document-document similarity exceeds query-document similarity on the same corpus; retrieval recall collapses.
Mitigation: split embed_query() and embed_documents() in code. LangChain separates them by default.
Later part: Part 10 (dense retrieval — the meaning of asymmetry).
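Keeping the two paths apart in code makes the asymmetry hard to forget. The wrapper below is a hypothetical sketch, not any library's API: `AsymmetricEmbedder` takes an arbitrary batch-encoding callable, and the `query: ` / `passage: ` templates are the E5-style prefixes from §6.1. A fake encoder stands in for a real model so the example runs without downloads.

```python
from typing import Callable, List, Sequence

Encoder = Callable[[List[str]], List[List[float]]]


class AsymmetricEmbedder:
    """Hypothetical wrapper that keeps query and passage formatting separate."""

    def __init__(self, encode: Encoder, query_template: str, passage_template: str):
        self._encode = encode
        self._query_template = query_template
        self._passage_template = passage_template

    def embed_query(self, text: str) -> List[float]:
        return self._encode([self._query_template.format(text=text)])[0]

    def embed_documents(self, texts: Sequence[str]) -> List[List[float]]:
        return self._encode([self._passage_template.format(text=t) for t in texts])


seen: List[str] = []  # records what the (fake) model actually receives


def fake_encode(batch: List[str]) -> List[List[float]]:
    """Stand-in for a real model call; returns one dummy vector per input."""
    seen.extend(batch)
    return [[float(len(s))] for s in batch]


embedder = AsymmetricEmbedder(fake_encode, "query: {text}", "passage: {text}")
q_vec = embedder.embed_query("exception filing procedure?")
d_vecs = embedder.embed_documents(["chunk one", "chunk two"])
```

For models with separate query and passage endpoints, such as the Upstage pair in §6.6, the same split applies with two encoders instead of two templates.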

8.3 Missing normalisation breaks cosine

Why intrinsic: many indices implement cosine as a plain dot product over vectors assumed to be unit-length. Feed them unnormalised outputs and norm differences dominate, so ranking swings with chunk length.
Diagnosis: distribution of $\|x\|$ far from 1 = no normalisation.
Mitigation: pass normalize_embeddings=True or do x / np.linalg.norm(x) client-side.
Later part: Part 9 (vector-DB metric setting — cosine vs dot).
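The diagnosis and the mitigation fit in one small check that can run on a sample of stored vectors. A minimal sketch with hypothetical helper names (`norm_report`, `l2_normalize`) and an arbitrary tolerance of 1e-3:

```python
import numpy as np


def norm_report(embs: np.ndarray, tol: float = 1e-3) -> dict:
    """Summarise row norms and flag whether all are within tol of 1."""
    norms = np.linalg.norm(embs, axis=1)
    return {
        "min": float(norms.min()),
        "max": float(norms.max()),
        "normalised": bool(np.all(np.abs(norms - 1.0) <= tol)),
    }


def l2_normalize(embs: np.ndarray) -> np.ndarray:
    """Client-side normalisation; the clip guards against zero vectors."""
    norms = np.linalg.norm(embs, axis=1, keepdims=True)
    return embs / np.clip(norms, 1e-12, None)


raw = np.random.default_rng(0).normal(size=(10, 8)) * 5.0  # stand-in raw outputs
before = norm_report(raw)
after = norm_report(l2_normalize(raw))
```

Here `before["normalised"]` is false and `after["normalised"]` is true; running the report at ingestion time catches a missing flag before it reaches the index.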

8.4 Max input overflow — silent truncation

Why intrinsic: chunks exceeding max input are silently truncated; tail content never makes it into the embedding.
Diagnosis: chunk-token distribution — 95th percentile above 80% of max input is dangerous.
Mitigation: cap chunk size at $0.8 \times \text{max input}$; monitor boundary stats.
Related part: Part 5 (chunking), tuned together with the model chosen in this part.
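The 95th-percentile check can be scripted against per-chunk token counts. The sketch below assumes the counts were produced with the embedding model's own tokenizer and include any metadata prefix; `truncation_risk` is a hypothetical helper and the sample counts are invented.

```python
import numpy as np


def truncation_risk(token_counts, max_input: int, safety: float = 0.8, pct: float = 95) -> dict:
    """Compare a percentile of chunk lengths against safety * max_input."""
    counts = np.asarray(token_counts)
    cap = safety * max_input
    p = float(np.percentile(counts, pct))
    return {
        "percentile_value": p,
        "cap": cap,
        "at_risk": p > cap,                              # the §8.4 warning sign
        "over_max": int((counts > max_input).sum()),     # chunks that will be truncated
    }


# Invented distribution: mostly short chunks with a long tail.
sample_counts = [500] * 90 + [7000] * 8 + [9000] * 2
report = truncation_risk(sample_counts, max_input=8192)
```

For this sample the 95th percentile is 7000 tokens against a cap of 6553.6, so the corpus is flagged, and two chunks already exceed the 8192-token limit.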

8.5 Trusting benchmark averages

Why intrinsic: MTEB averages cover all domains; for legal, medical, or in-house Korean corpora the ranking can flip.
Diagnosis: poor retrieval recall after picking the MTEB leader without in-house evaluation.
Mitigation: shortlist 3–4 candidates and decide on a 50–200-pair custom evaluation set.
Later part: Parts 14 (eval sets) and 15 (search-quality metrics).
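Comparing shortlisted models on a custom evaluation set needs little more than a recall function. This is a minimal sketch, assuming each model's ranked document ids per query are already available; `recall_at_k` and the toy ids are illustrative, and Parts 14 and 15 may define the metrics differently.

```python
from typing import Dict, List, Set


def recall_at_k(ranked: Dict[str, List[str]], gold: Dict[str, Set[str]], k: int) -> float:
    """Mean over queries of the fraction of relevant docs found in the top k."""
    scores = []
    for qid, relevant in gold.items():
        top = set(ranked.get(qid, [])[:k])
        scores.append(len(top & relevant) / len(relevant))
    return sum(scores) / len(scores)


# Toy evaluation set: query id -> relevant document ids.
gold = {"q1": {"d1"}, "q2": {"d7", "d9"}}

# Toy rankings from two candidate models.
model_a = {"q1": ["d1", "d2", "d3"], "q2": ["d4", "d7", "d5"]}
model_b = {"q1": ["d2", "d3", "d1"], "q2": ["d7", "d9", "d4"]}

results = {
    name: {k: recall_at_k(ranked, gold, k) for k in (1, 3)}
    for name, ranked in {"model_a": model_a, "model_b": model_b}.items()
}
```

In this toy case model_a wins at k=1 (0.5 vs 0.25) and model_b wins at k=3 (1.0 vs 0.75), a reminder to pick k to match how many chunks the generator actually consumes.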

8.6 Common Pitfalls

"MTEB #1 = #1 on my corpus." §8.5.
"One model for both query and passage." §8.2. E5, Upstage, Jina are asymmetric.
"The library normalises automatically." §8.3. Be explicit.
"8K max input means 7K chunks are safe." Tokenisers differ; metadata prefix can push over.
"OpenAI 3-large is always best." Above 30% Korean share, BGE-M3 or Upstage frequently wins.

9. Settled Conclusions

Q1. Top picks for a corpus with Korean share ≥ 30%?

BGE-M3 (open, 1024d, strong multilingual) or Upstage Solar Embeddings (commercial, Korean-tuned). Chapter: §2, §4.

Q2. Why is a model change expensive in one sentence?

Embedding spaces are mutually incompatible — full re-index is required, no incremental migration. Chapter: §8.1.

Q3. What must always be added to E5-mistral queries?

A task instruction prefix in the form Instruct: {task}\nQuery: {q}. Omitting it costs measurable accuracy. Chapter: §6.3, §8.2.

Q4. How does Matryoshka affect storage cost?

$d=3072 \to d'=512$ reduces storage to roughly 1/6. Accuracy loss is typically 1–3 percentage points. Chapter: §5, §7.2.

Q5. What is the final arbiter of embedding choice?

An in-house evaluation set (Part 14) — 50–200 query-document pairs to compare 3–4 shortlisted models. Chapter: §8.5.

10. Further Reading

Primary

Chen, J. et al. (BAAI). BGE M3-Embedding: Multi-Lingual, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation. arXiv:2402.03216 (2024).
OpenAI. New embedding models and API updates (2024-01 blog) — Matryoshka rollout.
Wang, L. et al. Improving Text Embeddings with Large Language Models (E5-mistral). arXiv:2401.00368 (2024).
Jina AI. jina-embeddings-v3: Multilingual Embeddings with Task LoRA. arXiv:2409.10173 (2024).
Upstage. Solar Embeddings: a Korean-optimised embedding model (2024 tech report).
MTEB Leaderboard: https://huggingface.co/spaces/mteb/leaderboard (absolute ranks are corpus-dependent).

Official docs

BGE-M3 model card: https://huggingface.co/BAAI/bge-m3
OpenAI Embeddings: https://platform.openai.com/docs/guides/embeddings
Upstage Embeddings: https://developers.upstage.ai/docs/apis/embeddings
Voyage AI: https://docs.voyageai.com/
Jina v3: https://huggingface.co/jinaai/jina-embeddings-v3

Supporting

Author note Chapter 7 — embeddings.
Author note Chapter 35 §2 — Model-aware Ingestion, Title/Body Schema, Field-aware Embedding.

Cheat Sheet

Scenario	First choice	Second
Korean share ≥ 30%, open / local	BGE-M3	E5-mistral
Korean share ≥ 30%, commercial API	Upstage Solar	OpenAI 3-large
English-centric, API	Voyage 3 or OpenAI 3-large	Jina v3
English-centric, open / local	BGE-M3	E5-mistral
Large index (>10M), cost-first	OpenAI 3-small + Matryoshka 512d	BGE-M3 + sparse
RAG + classification / matching combined	Jina v3 (task-aware)	OpenAI 3

Selection rule of thumb: language mix · ops form (API vs local) · corpus size → three candidates → eval-set decision.

Bridge — What's Next

Next — RAG Core Study (9/26) — Vector DB Showdown: FAISS / Chroma / Qdrant / Milvus / Weaviate / Pinecone / pgvector.

With the embedding chosen, the next question is where to store it. Part 9 compares seven mainstream vector DBs on ANN algorithm (HNSW/IVF), metadata-filter expressiveness, update/delete support, and operational cost — and shows how Part 7's metadata design meets each DB's actual support.
