"RAG Core Study (8/26) — Embedding Models: BGE-M3 / OpenAI / Upstage / E5 / Jina / Voyage"
Chunks and metadata are ready. The next decision: which embedding model to index with.
Embedding choice is the decision RAG teams regret most. Once indexed, switching models means a full re-index, and the real differences in cost, latency, and language quality only surface in production. Part 8 compares the six mainstream models of 2026 — BGE-M3 / OpenAI text-embedding-3 / Upstage Solar Embeddings / E5-mistral / Jina v3 / Voyage 3 — across dimension, multilingual quality (esp. Korean), cost, local-execution, max input, and model-card semantics. It then turns author note Chapter 35 §2's Model-aware Ingestion principle into a model-card → schema mapping flow.
0. Prerequisites
- Part 7 (metadata). A model change forces full metadata re-copy.
- Cosine similarity / dot product basics (the bi-encoder formulation is detailed in Part 10).
- Tokens ≠ characters — tokenizer differences matter.
1. Learning Objectives
- State each of the six mainstream models in one differentiating line.
- Express the impact of vector dimension, normalisation, and similarity choice on quality and cost.
- Read a model card and turn it into an ingestion schema.
- Estimate the cost of switching embedding models.
2. Key Summary
BGE-M3 (BAAI) is the open-source multilingual standard — 1024d, strong Korean and English. OpenAI text-embedding-3-small/large offers API convenience and Matryoshka (variable dimension). Upstage Solar is Korean-optimised commercial. E5-mistral-7b-instruct is instruction-tuned and open — but requires 7B-class inference. Jina v3 is task-aware (separate prefixes for retrieval, matching, classification). Voyage 3 is retrieval-tuned commercial — Korean is its weak point. Three-line selection rule: if Korean share ≥ 30%, prefer BGE-M3 or Upstage; if fully open / on-prem, BGE-M3 or E5; if API simplicity wins, OpenAI. The model card's training input format (title/body/query prefix) must be reflected in the ingestion schema to unlock latent quality — author note Chapter 35 §2's core point.
3. Intuition — Same Query, Six Embeddings, Six Top-Ks
Embedding "how do I file an exception to the security policy?" with the six models produces six different vectors. Even on the same corpus and chunking, top-K diverges. Some models are stronger on vocabulary anchors like "exception"; others on semantic chains like "filing → procedure".
Which model is best for your corpus is decided by your evaluation set (Part 14), not by an MTEB average. This article shrinks the candidate set; the right answer needs measurement.
4. Definitions — Six Mainstream Models (2026)
| Model | Dim | Max input | Multilingual / Korean (0-3) | Cost (1M tokens) | Local | Note |
|---|---|---|---|---|---|---|
| BGE-M3 (BAAI) | 1024 | 8192 | 3 / 3 | $0 (own GPU) | ✅ | dense + sparse + colbert in one |
| OpenAI text-embedding-3-small | 1536 (var) | 8191 | 2 / 2 | $0.02 | ❌ | Matryoshka, low cost |
| OpenAI text-embedding-3-large | 3072 (var) | 8191 | 3 / 2 | $0.13 | ❌ | Highest accuracy |
| Upstage Solar Embeddings 1.5 | 4096 | 4000 | 2 / 3 | $0.10 | ❌ | Korean-tuned |
| E5-mistral-7b-instruct | 4096 | 32K (model card advises ≤ 4096) | 3 / 2 | $0 (7B GPU) | ✅ (heavy) | instruction-tuned, MTEB top |
| Jina embeddings v3 | 1024 (var) | 8192 | 3 / 2 | $0.018 | ✅ (CC BY-NC 4.0, non-commercial weights) | task-aware prefix |
| Voyage 3 | 1024 | 32K | 2 / 0 | $0.06 | ❌ | retrieval-tuned; weak Korean |
Scores are relative, on a 0–3 scale (based on MTEB-Korean and MTEB multilingual averages). Absolute scores need an in-house evaluation set (Part 14).
5. Math — Dimension, Normalisation, Similarity
Three variables govern cost and accuracy.
- \(d\) = vector dimension
- \(N\) = chunk count
- \(\|x\|\) = embedding norm
Storage (float32):
$$\text{Storage} = N \cdot d \cdot 4 \text{ bytes}$$
e.g. \(N=1M\), \(d=1024\) → 4 GB; \(d=4096\) → 16 GB. A 4× swing in storage and memory.
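As a quick check on the arithmetic above, here is a minimal sketch of the storage formula. The function name `index_storage_bytes` is hypothetical, and the result covers raw float32 vectors only; index structures such as HNSW graph links and stored metadata add overhead on top.

```python
def index_storage_bytes(n_chunks: int, dim: int, bytes_per_value: int = 4) -> int:
    """Raw vector storage: N * d * bytes per value (float32 = 4 bytes)."""
    return n_chunks * dim * bytes_per_value


for dim in (1024, 4096):
    gb = index_storage_bytes(1_000_000, dim) / 1e9
    print(f"d={dim}: {gb:.1f} GB")
# d=1024: 4.1 GB
# d=4096: 16.4 GB
```

Passing `bytes_per_value=2` gives the float16 figure, if the vector store supports it.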
Search latency (HNSW approximation):
$$\text{Latency} \approx \mathcal{O}(d \cdot \log N)$$
Linear in \(d\). 1024 vs 4096 is roughly a 4× latency gap.
Similarity functions:
- Cosine: \(\cos(\theta) = \frac{x \cdot y}{\|x\| \|y\|}\). Direction only.
- Dot product: \(x \cdot y\). Norm-sensitive: vectors with a small norm are penalised.
- Euclidean: \(\|x - y\|\). Distance-based.
Key rule: if the model outputs L2-normalised vectors, cosine = dot = (monotonic in) euclidean. Voyage and OpenAI output normalised; BGE-M3 and E5 also normalise in standard usage. Raw outputs without normalisation behave differently — be explicit.
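The key rule follows from the identity \(\|x - y\|^2 = 2 - 2\,x \cdot y\), which holds for unit vectors. The sketch below checks this on toy data in pure Python; the vectors and helper names are illustrative assumptions, not output from any of the six models.

```python
import math


def l2_normalise(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Toy unit vectors standing in for model output.
query = l2_normalise([0.2, 0.9, 0.1])
docs = {
    "a": l2_normalise([0.1, 0.8, 0.3]),
    "b": l2_normalise([0.9, 0.1, 0.2]),
    "c": l2_normalise([0.3, 0.7, 0.0]),
}

# Higher is better for dot and cosine; lower is better for euclidean.
by_dot = sorted(docs, key=lambda k: dot(query, docs[k]), reverse=True)
by_cos = sorted(docs, key=lambda k: cosine(query, docs[k]), reverse=True)
by_l2 = sorted(docs, key=lambda k: euclidean(query, docs[k]))
print(by_dot, by_cos, by_l2)  # the three rankings are identical
```

Drop the `l2_normalise` calls and the three rankings are no longer guaranteed to agree, which is why the index metric has to match what the model actually outputs.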
Matryoshka embeddings (OpenAI 3, Jina v3): trained to be useful at multiple truncations, so keeping only the first \(d'\) dims loses minimal quality. \(d'=256\) yields 12× storage savings on a 3072-dim model (3072 / 256 = 12).
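When truncation is done client-side rather than by the API, the shortened vector is no longer unit length and should be re-normalised before it goes into a cosine or dot-product index. A minimal sketch, with hypothetical helper names and a toy vector:

```python
import math


def truncate_and_renormalise(vec, target_dim):
    """Keep the first target_dim values, then rescale to unit length."""
    head = vec[:target_dim]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        raise ValueError("truncated vector has zero norm")
    return [x / norm for x in head]


def storage_ratio(full_dim, target_dim):
    """How many times smaller the truncated index is."""
    return full_dim / target_dim


full = [0.5, 0.5, 0.5, 0.5]          # toy unit vector standing in for a model output
short = truncate_and_renormalise(full, 2)
print(short, storage_ratio(3072, 512))
```

This only makes sense for models trained with a Matryoshka objective; truncating an ordinary embedding this way can discard information arbitrarily.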
6. Walkthrough — Model Card to Ingestion Schema
6.1 Four things to read from a model card
| Item | Decides | Example |
|---|---|---|
| Training input format | ingestion schema | E5: "passage: {text}" / "query: {text}" |
| Max input length | chunk max size | BGE-M3 = 8192, Voyage 3 = 32K |
| Normalisation | similarity function | BGE-M3 recommends normalise → cosine |
| Instruction tuning | query preprocessing | E5-instruct: prepend task instruction |
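One way to make the table actionable is to record the four items as a small schema object that ingestion and query code both read. The sketch below is an illustration only: `IngestionSchema`, its fields, and the 0.8 safety factor (borrowed from §8.4) are assumptions, and the `max_input_tokens` values should be taken from the model cards you actually use.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IngestionSchema:
    """Hypothetical container for the four model-card items in the table."""
    model_name: str
    max_input_tokens: int      # max input length -> chunk max size
    normalise: bool            # normalisation -> similarity function
    metric: str
    query_template: str        # training input format / instruction tuning
    passage_template: str

    def chunk_token_cap(self, safety: float = 0.8) -> int:
        return int(self.max_input_tokens * safety)

    def format_query(self, text: str, task: str = "") -> str:
        return self.query_template.format(task=task, text=text)

    def format_passage(self, text: str) -> str:
        return self.passage_template.format(text=text)


BGE_M3 = IngestionSchema(
    model_name="BAAI/bge-m3", max_input_tokens=8192, normalise=True,
    metric="cosine", query_template="{text}", passage_template="{text}",
)
E5_INSTRUCT = IngestionSchema(
    model_name="intfloat/e5-mistral-7b-instruct",
    max_input_tokens=4096,  # assumption: conservative cap, check the model card
    normalise=True, metric="cosine",
    query_template="Instruct: {task}\nQuery: {text}", passage_template="{text}",
)

print(BGE_M3.chunk_token_cap())
print(E5_INSTRUCT.format_query(
    "exception filing procedure?",
    task="Given a web search query, retrieve relevant passages that answer the query",
))
```

Keeping the templates in one place means the indexing job and the query path cannot drift apart silently.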
6.2 BGE-M3 — Standard example
```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-m3")
chunk_text = "..."  # placeholder: one chunk produced by the chunking step

doc_emb = model.encode(
    chunk_text,
    normalize_embeddings=True,  # unit-length output, for cosine
)
query_emb = model.encode(
    "exception filing procedure?",
    normalize_embeddings=True,
)
```
6.3 E5-mistral — Instruction prefix is decisive
```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("intfloat/e5-mistral-7b-instruct")
chunk_text = "..."  # placeholder: one chunk produced by the chunking step

# Queries carry the task instruction; passages are embedded as-is.
TASK = "Given a web search query, retrieve relevant passages that answer the query"
query_text = f"Instruct: {TASK}\nQuery: exception filing procedure?"
query_emb = model.encode(query_text, normalize_embeddings=True)
doc_emb = model.encode(chunk_text, normalize_embeddings=True)
```
Skipping this prefix causes a noticeable quality drop. Follow the model card literally.
6.4 Jina v3 — Task-aware prefix
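```python
from transformers import AutoModel

model = AutoModel.from_pretrained("jinaai/jina-embeddings-v3", trust_remote_code=True)
chunk_text = "..."  # placeholder: one chunk produced by the chunking step
# The task argument selects the adapter; queries and passages use different ones.
query_emb = model.encode(["exception filing procedure?"], task="retrieval.query")
doc_emb = model.encode([chunk_text], task="retrieval.passage")
```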
Same weights, different task adapters. Retrieval treats query and passage as asymmetric.
6.5 OpenAI — Matryoshka truncation
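```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
chunk_text = "..."  # placeholder: one chunk produced by the chunking step

resp = client.embeddings.create(
    model="text-embedding-3-large",
    input=chunk_text,
    dimensions=512,  # 3072 → 512, valid because trained that way
)
emb = resp.data[0].embedding
```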
Storage and latency cut by 6×; accuracy loss is typically 1–3 percentage points and task-dependent. Strong when cost dominates.
6.6 Upstage Solar — Korean-tuned, asymmetric
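```python
import os
from openai import OpenAI

UPSTAGE_KEY = os.environ["UPSTAGE_API_KEY"]  # assumption: key stored in this env var
chunk_text = "..."  # placeholder: one chunk produced by the chunking step
client = OpenAI(api_key=UPSTAGE_KEY, base_url="https://api.upstage.ai/v1/solar")
resp = client.embeddings.create(
    model="solar-embedding-1-large-passage",  # passage / query models are separate
    input=chunk_text,
)
```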
-passage and -query are different models. Query embeddings must use -query. A canonical asymmetric embedding setup.
7. Variants
7.1 BGE-M3 — 3-in-1 (Dense + Sparse + ColBERT)
- What changes: a single model emits dense, sparse (BM25-like), and late-interaction (ColBERT) outputs.
- Why use it: Hybrid Search (Part 12) from one model — operational simplicity.
- What becomes possible: no separate sparse / ColBERT weights to manage.
- Where it fits: multilingual + hybrid + ops-simplicity projects.
- Limits: in pure single-mode comparisons, a specialised model edges it slightly.
7.2 OpenAI 3-large — dimension-reduction trade-off
- What changes: trained for 3072 / 1024 / 512 / 256-dim truncations.
- Why use it: storage and search scale by \(\times 1/d_{\text{ratio}}\).
- What becomes possible: cost-decisive savings at multi-TB index scale.
- Where it fits: large indices (>10M chunks), cost-sensitive.
- Limits: 1–3 percentage points accuracy loss. Measure via Part 14.
7.3 E5-mistral — the 7B embedding cost trade
- What changes: the embedding is a 7B-parameter LLM. GPU required.
- Why use it: top MTEB scores with instruction tuning.
- What becomes possible: re-use the model for many tasks via different instructions.
- Where it fits: in-house GPU infrastructure with model-control needs.
- Limits: 100–500 ms latency depending on GPU; chunk cost scales with GPU-hour.
7.4 Voyage 3 — retrieval-tuned, Korean-light
- What changes: heavily fine-tuned on retrieval objectives.
- Why use it: often beats OpenAI on English retrieval.
- What becomes possible: English corpus with balanced cost/quality.
- Where it fits: English-first RAG.
- Limits: limited Korean training share → not recommended for Korean corpora. Use BGE-M3 or Upstage for multilingual internal corpora.
7.5 Jina v3 — task adapter split
- What changes: shared weights, task adapters produce different embeddings.
- Why use it: retrieve + classify + match from one model.
- What becomes possible: model count = 1, simpler ops.
- Where it fits: RAG combined with other NLP tasks.
- Limits: misclassified task prefix causes a measurable quality drop.
8. Limits and Failure Modes
8.1 Model change = full re-index
- Why intrinsic: embedding spaces are not interchangeable; you cannot migrate incrementally.
- Diagnosis: when considering a swap, estimate both re-index cost (API or GPU hours) and availability impact.
- Mitigation: shortlist via an early evaluation set; if you do swap, run a blue-green index and route queries.
- Later part: Part 9 (vector-DB collection split).
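The re-index estimate in the diagnosis step is simple arithmetic. A rough sketch follows; the function names, the 400-token average chunk, and the 200 chunks/s GPU throughput are hypothetical placeholders to be replaced with your own measurements.

```python
def api_reindex_cost_usd(n_chunks, avg_tokens_per_chunk, usd_per_million_tokens):
    """API path: total tokens re-embedded times the per-1M-token price."""
    total_tokens = n_chunks * avg_tokens_per_chunk
    return total_tokens / 1_000_000 * usd_per_million_tokens


def gpu_reindex_hours(n_chunks, chunks_per_second):
    """Local path: wall-clock hours at a measured embedding throughput."""
    return n_chunks / chunks_per_second / 3600


# 1M chunks of ~400 tokens, re-embedded at $0.13 per 1M tokens
print(f"${api_reindex_cost_usd(1_000_000, 400, 0.13):.2f}")   # $52.00
# Same corpus on a local GPU at an assumed 200 chunks/s
print(f"{gpu_reindex_hours(1_000_000, 200):.2f} h")           # 1.39 h
```

The embedding bill is usually the smaller part. A blue-green swap also holds two full indices at once, so storage doubles for the duration of the migration.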
8.2 Asymmetric embedding ignored — query/passage confusion
- Why intrinsic: E5, Upstage, Jina treat query and passage with different prefixes or weights. Missing one decouples the vector spaces.
- Diagnosis: document-document similarity exceeds query-document similarity on the same corpus; retrieval recall collapses.
- Mitigation: split `embed_query()` and `embed_documents()` in code. LangChain separates them by default.
- Later part: Part 10 (dense retrieval: the meaning of asymmetry).
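A thin wrapper can make the query/passage split hard to get wrong. The sketch below mirrors the `embed_query` / `embed_documents` method names; the `AsymmetricEmbedder` class and the stand-in encoders are hypothetical, and in practice the two callables would wrap the model's query-side and passage-side calls (prefix, task, or separate model).

```python
from typing import Callable, List, Sequence

Vector = List[float]
EncodeFn = Callable[[Sequence[str]], List[Vector]]


class AsymmetricEmbedder:
    """Routes queries and passages to their own encoders."""

    def __init__(self, encode_query: EncodeFn, encode_passage: EncodeFn):
        self._encode_query = encode_query
        self._encode_passage = encode_passage

    def embed_query(self, text: str) -> Vector:
        return self._encode_query([text])[0]

    def embed_documents(self, texts: Sequence[str]) -> List[Vector]:
        return self._encode_passage(list(texts))


# Stand-in encoders so the sketch runs without a model download.
def fake_query_encoder(texts):
    return [[1.0, float(len(t))] for t in texts]


def fake_passage_encoder(texts):
    return [[0.0, float(len(t))] for t in texts]


embedder = AsymmetricEmbedder(fake_query_encoder, fake_passage_encoder)
print(embedder.embed_query("hi"))
print(embedder.embed_documents(["a", "bb"]))
```

With this shape, indexing code only ever sees `embed_documents` and search code only ever sees `embed_query`, so a mix-up needs a deliberate edit rather than a missing prefix.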
8.3 Missing normalisation breaks cosine
- Why intrinsic: unnormalised outputs feeding a cosine index let norm differences dominate; ranking swings on chunk length.
- Diagnosis: distribution of \(\|x\|\) far from 1 = no normalisation.
- Mitigation: pass `normalize_embeddings=True` or do `x / np.linalg.norm(x)` client-side.
- Later part: Part 9 (vector-DB metric setting: cosine vs dot).
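The diagnosis can be run on a sample of stored vectors before trusting a cosine index. A minimal pure-Python sketch follows; `norm_report` and the 1e-3 tolerance are assumptions for illustration.

```python
import math


def norm_report(vectors, tolerance=1e-3):
    """Summarise how far a sample of embeddings is from unit length."""
    norms = [math.sqrt(sum(x * x for x in v)) for v in vectors]
    off_unit = sum(1 for n in norms if abs(n - 1.0) > tolerance)
    return {
        "min": min(norms),
        "max": max(norms),
        "fraction_off_unit": off_unit / len(norms),
    }


sample = [[1.0, 0.0], [0.6, 0.8], [3.0, 4.0]]  # the last vector is unnormalised
print(norm_report(sample))
```

A non-zero `fraction_off_unit` on vectors that were supposed to be normalised points to a missing flag somewhere in the ingestion path.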
8.4 Max input overflow — silent truncation
- Why intrinsic: chunks exceeding max input are silently truncated; tail content never makes it into the embedding.
- Diagnosis: chunk-token distribution — 95th percentile above 80% of max input is dangerous.
- Mitigation: cap chunk size at \(0.8 \times \text{max input}\); monitor boundary stats.
- Later part: Part 5 chunking integrated with Part 8 model.
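The 95th-percentile check can be scripted against the chunk-token distribution. The sketch below assumes token counts were produced with the target model's own tokenizer; `truncation_risk` and the nearest-rank `percentile` helper are hypothetical names.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def truncation_risk(token_counts, max_input_tokens, safety=0.8):
    """Flag a corpus whose p95 chunk length exceeds safety * max input."""
    cap = safety * max_input_tokens
    p95 = percentile(token_counts, 95)
    return {
        "p95": p95,
        "cap": cap,
        "at_risk": p95 > cap,
        "already_truncated": sum(1 for n in token_counts if n > max_input_tokens),
    }


counts = [1000] * 90 + [7000] * 10   # toy distribution with a long tail
print(truncation_risk(counts, max_input_tokens=8192))
```

Count tokens on the full string that is actually embedded, including any metadata or instruction prefix, since that prefix is what tends to push borderline chunks over the limit.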
8.5 Trusting benchmark averages
- Why intrinsic: MTEB averages cover all domains; for legal, medical, or in-house Korean corpora the ranking can flip.
- Diagnosis: poor retrieval recall after picking the MTEB leader without in-house evaluation.
- Mitigation: shortlist 3–4 candidates and decide on a 50–200-pair custom evaluation set.
- Later part: Parts 14 (eval sets) and 15 (search-quality metrics).
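Comparing shortlisted models on a custom evaluation set comes down to a loop over query-document pairs. The sketch below uses recall@k as the metric; the data layout (`gold`, `model_runs`) and the toy IDs are assumptions, and the ranked lists would come from running each candidate model against the same index contents.

```python
def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the relevant documents that appear in the top-k results."""
    relevant = set(relevant_ids)
    if not relevant:
        raise ValueError("each query needs at least one relevant document")
    return len(relevant & set(ranked_ids[:k])) / len(relevant)


def mean_recall_at_k(runs, gold, k):
    """Average recall@k over all queries in the evaluation set."""
    return sum(recall_at_k(runs[q], gold[q], k) for q in gold) / len(gold)


# gold: query id -> relevant document ids (toy data)
gold = {"q1": {"d3"}, "q2": {"d7"}}
# model_runs: model name -> query id -> ranked document ids (toy data)
model_runs = {
    "model_a": {"q1": ["d3", "d1", "d9"], "q2": ["d2", "d7", "d4"]},
    "model_b": {"q1": ["d1", "d9", "d3"], "q2": ["d4", "d2", "d8"]},
}

scores = {name: mean_recall_at_k(runs, gold, k=2) for name, runs in model_runs.items()}
best = max(scores, key=scores.get)
print(scores, best)
```

With only 50 to 200 pairs, small differences between models are within noise, so treat near-ties as ties and let cost or operations decide.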
8.6 Common Pitfalls
- "MTEB #1 = #1 on my corpus." §8.5.
- "One model for both query and passage." §8.2. E5, Upstage, Jina are asymmetric.
- "The library normalises automatically." §8.3. Be explicit.
- "8K max input means 7K chunks are safe." Tokenisers differ; metadata prefix can push over.
- "OpenAI 3-large is always best." Above 30% Korean share, BGE-M3 or Upstage frequently wins.
9. Settled Conclusions
Q1. Top picks for a corpus with Korean share ≥ 30%?
BGE-M3 (open, 1024d, strong multilingual) or Upstage Solar Embeddings (commercial, Korean-tuned). Chapter: §2, §4.
Q2. Why is a model change expensive in one sentence?
Embedding spaces are mutually incompatible — full re-index is required, no incremental migration. Chapter: §8.1.
Q3. What must always be added to E5-mistral queries?
A task instruction prefix in the form `Instruct: {task}\nQuery: {q}`. Omitting it costs measurable accuracy. Chapter: §6.3, §8.2.
Q4. How does Matryoshka affect storage cost?
\(d=3072 \to d'=512\) reduces storage to roughly 1/6. Accuracy loss is typically 1–3 percentage points. Chapter: §5, §7.2.
Q5. What is the final arbiter of embedding choice?
An in-house evaluation set (Part 14) — 50–200 query-document pairs to compare 3–4 shortlisted models. Chapter: §8.5.
10. Further Reading
Primary
- Chen, J. et al. (BAAI). BGE M3-Embedding: Multi-Lingual, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation. arXiv:2402.03216 (2024).
- OpenAI. New embedding models and API updates (2024-01 blog) — Matryoshka rollout.
- Wang, L. et al. Improving Text Embeddings with Large Language Models (E5-mistral). arXiv:2401.00368 (2024).
- Jina AI. jina-embeddings-v3: Multilingual Embeddings with Task LoRA. arXiv:2409.10173 (2024).
- Upstage. Solar Embeddings: a Korean-optimised embedding model (2024 tech report).
- MTEB Leaderboard: https://huggingface.co/spaces/mteb/leaderboard (absolute ranks are corpus-dependent).
Official docs
- BGE-M3 model card: https://huggingface.co/BAAI/bge-m3
- OpenAI Embeddings: https://platform.openai.com/docs/guides/embeddings
- Upstage Embeddings: https://developers.upstage.ai/docs/apis/embeddings
- Voyage AI: https://docs.voyageai.com/
- Jina v3: https://huggingface.co/jinaai/jina-embeddings-v3
Supporting
- Author note Chapter 7 — embeddings.
- Author note Chapter 35 §2 — Model-aware Ingestion, Title/Body Schema, Field-aware Embedding.
Cheat Sheet
| Scenario | First choice | Second |
|---|---|---|
| Korean share ≥ 30%, open / local | BGE-M3 | E5-mistral |
| Korean share ≥ 30%, commercial API | Upstage Solar | OpenAI 3-large |
| English-centric, API | Voyage 3 or OpenAI 3-large | Jina v3 |
| English-centric, open / local | BGE-M3 | E5-mistral |
| Large index (>10M), cost-first | OpenAI 3-small + Matryoshka 512d | BGE-M3 + sparse |
| RAG + classification / matching combined | Jina v3 (task-aware) | OpenAI 3 |
Selection rule of thumb: language mix · ops form (API vs local) · corpus size → three candidates → eval-set decision.
Bridge — What's Next
Next — RAG Core Study (9/26) — Vector DB Showdown: FAISS / Chroma / Qdrant / Milvus / Weaviate / Pinecone / pgvector.
With the embedding chosen, the next question is where to store it. Part 9 compares seven mainstream vector DBs on ANN algorithm (HNSW/IVF), metadata-filter expressiveness, update/delete support, and operational cost — and shows how Part 7's metadata design meets each DB's actual support.
Series overview: Series index