Skip to main content

Embeddings Deep Dive

This page goes one level below RAG: how the embedding layer actually works in production, how to choose and operate an embedding model, and the pitfalls that quietly degrade retrieval quality. If you only need the basics, the RAG page covers them; come here when you are building or debugging a retrieval system.

What an embedding is

An embedding is a dense numeric vector (typically 384--4096 dimensions) representing a piece of text, image, or audio, chosen so that semantically similar inputs land close together in vector space.

  • Dense and semantic. Unlike sparse one-hot or bag-of-words vectors (mostly zeros, no relationships), embeddings are mostly non-zero and meaningful. The classic illustration: king - man + woman ~= queen.
  • Static vs context-dependent. Static embeddings (Word2Vec, GloVe) give a word the same vector everywhere. Modern transformer embeddings are contextual: "bank" gets different vectors in "river bank" versus "financial bank". Static word embeddings are obsolete for new applications; sentence-level transformer embeddings are the 2026 default.

How embedding models are produced

A transformer encoder maps tokenized input through several attention layers and produces a fixed-size vector per input -- often the pooled average over tokens, or a dedicated summary token. The model is trained with contrastive objectives: pairs known to be similar (question + answer, query + click-through, sentence pairs) should produce close vectors; random pairs should be far apart. Hundreds of millions of such pairs teach the model a useful geometry.

The two metrics that matter downstream:

  • Cosine similarity -- the cosine of the angle between two vectors, measuring similarity regardless of length. The standard metric for text. Most LLM embeddings are normalized to unit length, so cosine and inner product rank identically (inner product is just faster).
  • Semantic search -- the capability embeddings unlock: a query for "what's happening with a fast animal?" can retrieve "the quick brown fox jumps over the lazy dog" with no shared keywords.

The 2026 model landscape

A working snapshot -- names and prices change, so check vendor pages before committing.

Proprietary

ModelDimensionsStrengthsWatch out
OpenAI text-embedding-3-smallup to 1536Cheap, Matryoshka-truncatable, well-roundedClosed; no fine-tune
OpenAI text-embedding-3-largeup to 3072Strongest OpenAI option; MatryoshkaMost expensive; closed
Cohere embed-v31024Strong multilingual; input-type awareClosed
Voyage voyage-3 family256--2048Often top of MTEB; long-context; domain variants (code, finance, law)Smaller ecosystem
Google gemini-embedding-001up to 3072Strong multilingual; Gemini stackClosed

Open-source (self-host or via a host)

ModelDimensionsStrengthsWatch out
BAAI bge-m31024Multilingual; dense + sparse + multi-vector in one modelHeavier
BAAI bge-small-en-v1.5384Tiny, fast; common tutorial defaultEnglish-only; modest quality
Microsoft e5-large-v21024Instruction-tuned ("query:" / "passage:" prefixes)Prefix discipline matters
NVIDIA nv-embed-v24096Historically top MTEB scoresLarge; GPU recommended
jina-embeddings-v31024Long-context (8k), task conditioningNewer ecosystem

The MTEB (Massive Text Embedding Benchmark) leaderboard is a useful starting filter, not truth: it averages across 56 tasks, and your task may be nothing like the average.

Choosing a model

Pick along these axes, roughly in this order:

  1. Quality on your task. Build a small (200--1000 row) eval set from real queries and judged-good documents, then measure retrieval recall@k per candidate. Do not trust MTEB blindly.
  2. Cost at scale. Per-1M-token price x corpus size x refresh rate. Self-hosted open-source has very different economics from API pricing.
  3. Latency budget. Every query is re-embedded, so query-embedding latency is part of every request's p99.
  4. Multilingual / domain. Global users need multilingual models; finance/legal/code benefit from domain-specialized variants.
  5. Operability. Managed API (low ops, vendor risk) vs self-hosted (control, GPU cost).
  6. Dimension flexibility. Matryoshka-aware models let you store small and search smaller (see below).

A good 2026 default for English RAG: text-embedding-3-small (cheap, well-behaved) or bge-small-en-v1.5 (self-hostable). Upgrade only when eval shows the small model is the bottleneck.

Dimensions, Matryoshka, and quantization

Higher dimensions are not always better -- for a fixed model, more dimensions means better quality but more storage and more ANN compute. Three production levers:

  • Matryoshka Representation Learning (MRL). Models trained so the first N dimensions are themselves a usable embedding. Slice a 1536-dim vector to 256 dims for ~6x less storage and compute at modest quality cost. OpenAI text-embedding-3 supports this via the dimensions parameter.
  • Vector quantization. Store vectors as int8, binary, or halfvec instead of float32 for 4x--32x storage reduction with minimal recall loss. Note this compresses the embedding vectors, distinct from model weight quantization. A subtle trap: quantizers optimized purely for reconstruction error (MSE) can bias inner-product estimates -- for retrieval, what matters is unbiased similarity preservation, not reconstruction.
  • Product Quantization (PQ / OPQ). The classical ANN compression family in FAISS, OpenSearch, and Milvus, used when partitioning a very large index matters more than per-vector precision.

Rule of thumb: start at float32 and full dimension; once it works, quantize to int8 / halfvec for ~4x storage reduction at a ~1--2% recall hit.

Domain adaptation

When an off-the-shelf model is not good enough on your data, in increasing order of effort:

  1. Instructions / prefixes. Many models accept task instructions ("query:" / "passage:" or "Represent this query for retrieving supporting documents"). Free, often material.
  2. Reranking. Run a cross-encoder reranker (Cohere rerank-3, bge-reranker-large) over the top ~50 candidates. Usually a bigger win than swapping the embedding model.
  3. Fine-tuning. Train the embedding model on (query, positive doc) pairs with sentence-transformers and a contrastive loss. Cost: hours on one GPU; gains: 5--20% recall@k on in-domain queries.
  4. Train a new model. Almost never the right call for application work.

Hybrid retrieval

Dense embeddings miss exact-string queries ("error code ABC-1234"); keyword search misses paraphrases ("the part that won't turn on" vs "ignition failure"). The 2026 answer is hybrid retrieval: embed the query for both dense (semantic) and sparse (BM25 / SPLADE) retrieval, then fuse with reciprocal rank fusion or weighted scoring, and rerank on top. Hybrid is first-class in Pinecone, OpenSearch, Weaviate, Qdrant, and Milvus. If RAG quality is weak and you are embedding-only, add hybrid retrieval before anything else -- it is the single highest-ROI lever in most enterprise deployments.

Production pitfalls

  • Query/document asymmetry. Many models (E5, BGE, Cohere v3) need different prefixes or input-types for queries vs documents. Forgetting this halves recall.
  • Chunk-size mismatch. Each model has a natural chunk size (256--512 tokens for bge-small, 8k for voyage-3). Chunking far outside it degrades quality.
  • Stale index. Embeddings drift from new content over time; schedule periodic re-embedding.
  • Model swap = full re-embed. Changing embedding models means re-embedding the whole corpus. Treat the embedding model as a versioned artifact.
  • Cosine vs inner product. Verify vectors are normalized before relying on inner product; otherwise use cosine.
  • Evaluating at the wrong stage. Embedding recall alone is not the goal; score end-to-end RAG answer quality with a harness (see Evaluation and LLMOps).

See also