Embeddings
Overview
Before I ran this code, I needed to understand: what am I actually creating?
An embedding is a direction in semantic space.
That's it. When you turn "What is the legal drinking age in Uganda?" into an embedding, you're converting it into a vector—a point in 384-dimensional space. But more importantly, it's a direction.
The magic isn't in the numbers themselves. It's in how those directions relate to each other.
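To make "how those directions relate" concrete, here's a minimal sketch of how two embedding directions are compared. For unit-length vectors, the dot product is the cosine similarity; values near 1 mean the directions point the same way. The usage at the bottom is hypothetical and assumes generateEmbedding() returns normalized vectors.

// Minimal sketch: comparing two embedding "directions".
// For unit-length vectors, the dot product IS the cosine similarity.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
  }
  return dot; // ranges from -1 (opposite) to 1 (same direction)
}

// Hypothetical usage, assuming generateEmbedding() returns normalized vectors:
// const a = await generateEmbedding("What is the legal drinking age in Uganda?");
// const b = await generateEmbedding("How old must you be to drink alcohol in Uganda?");
// cosineSimilarity(a, b) → close to 1, because the directions are similar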
The Journey from Text to Vector
Before we dive into similarity, I needed to understand: what's actually happening inside generateEmbedding()?
const embedding = await generateEmbedding("The legal drinking age...");
// Text goes in → [0.023, -0.145, 0.089, ..., 0.234] comes out
// But HOW?
The High-Level Process
Here's what happens when you generate an embedding:
- Tokenization: Convert text into tokens (sub-words or words)
- Token to ID: Map each token to an integer from the model's vocabulary
- Initial Embeddings: Convert each token ID into a dense vector
- Transformer Layers: Process tokens through attention mechanisms to add context
- Pooling: Combine all token vectors into one sentence vector
- Normalization: Scale to unit length
Let's break this down, because each step reveals something important.
Step 1: Tokenization - Breaking Text into Pieces
// Input text
"The legal drinking age in Uganda is 18 years old."

// After tokenization (simplified)
["the", "legal", "drinking", "age", "in", "uganda", "is", "18", "years", "old"]
The model doesn't work with raw text - it works with tokens. These might be whole words, or sub-word pieces for rare/complex words.
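Here's a toy sketch of the idea, covering both tokenization and the token-to-ID step. This is not the real sub-word tokenizer (models like this use learned WordPiece/BPE vocabularies); it just lowercases, strips punctuation, splits on whitespace, and maps each token to an integer ID from a made-up vocabulary.

// Toy tokenizer sketch - real models use learned sub-word tokenizers
// (WordPiece/BPE), not simple whitespace splitting.
const vocab = new Map<string, number>([
  ["the", 1], ["legal", 2], ["drinking", 3], ["age", 4], ["in", 5],
  ["uganda", 6], ["is", 7], ["18", 8], ["years", 9], ["old", 10],
]);

function toyTokenize(text: string): { tokens: string[]; ids: number[] } {
  const tokens = text
    .toLowerCase()
    .replace(/[^a-z0-9\s]/g, "") // drop punctuation
    .split(/\s+/)
    .filter(Boolean);
  const ids = tokens.map((t) => vocab.get(t) ?? 0); // 0 = unknown token
  return { tokens, ids };
}

toyTokenize("The legal drinking age in Uganda is 18 years old.");
// → { tokens: ["the", "legal", ...], ids: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10] }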
Step 2: From Words to Initial Vectors
Each token gets mapped to a pre-learned vector:
// Simplified - real vectors have 384 dimensions
"drinking" → [0.23, -0.45, 0.12, ..., 0.67] // 384 numbers
"age" → [0.67, 0.89, -0.33, ..., 0.12] // 384 numbers
"legal" → [0.91, -0.12, 0.45, ..., 0.88] // 384 numbers
Where did these vectors come from?
The model was pre-trained on billions of words. It learned that words appearing in similar contexts should have similar vectors. This is called distributional semantics: "You shall know a word by the company it keeps."
Words that appear near each other in sentences ("drinking" and "alcohol", "legal" and "law") end up with similar vectors.
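Under the hood this step is just a table lookup: the model stores one learned vector per token ID, and "embedding" a token means indexing into that table. A toy sketch (3 dimensions instead of 384, made-up numbers):

// Toy embedding table: one learned vector per token ID.
// Real tables have tens of thousands of rows × 384 columns; values here are made up.
const embeddingTable: number[][] = [
  [0.0, 0.0, 0.0],     // ID 0: unknown
  [0.11, -0.32, 0.05], // ID 1: "the"
  [0.91, -0.12, 0.45], // ID 2: "legal"
  [0.23, -0.45, 0.12], // ID 3: "drinking"
  // ... one row per vocabulary entry
];

function lookupEmbeddings(ids: number[]): number[][] {
  // "Embedding" a token is just indexing the learned table by its ID.
  return ids.map((id) => embeddingTable[id] ?? embeddingTable[0]);
}

lookupEmbeddings([2, 3]); // → vectors for "legal" and "drinking"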
Step 3: Contextualization Through Transformer Layers
Here's where it gets interesting. The model doesn't just look at words in isolation:
// Without context, "bank" is ambiguous:
"bank" → [0.5, 0.2, ..., 0.8]
// With context, the vector gets updated:
"money in the bank" → [0.8, 0.9, ..., 0.1] // Financial meaning
"sitting by the bank" → [0.1, 0.3, ..., 0.9] // Geographical meaning
How does this work?
The transformer processes the entire sentence at once. Through self-attention, each word can "look at" other words:
- "bank" looks at "money" → shifts toward financial meaning
- "bank" looks at "river" → shifts toward geographical meaning
The word vectors get updated based on their surrounding context. This is why embeddings capture meaning, not just vocabulary.
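Here's a drastically simplified sketch of that "looking at" step: one attention head, tiny vectors, and no learned query/key/value projections. Each word scores every other word with a dot product, turns the scores into weights with softmax, and replaces its own vector with the weighted average. The real model does this many times, across many heads and layers, with learned projection matrices.

// Toy self-attention sketch: no learned projections, one head, tiny vectors.
function softmax(scores: number[]): number[] {
  const max = Math.max(...scores);
  const exps = scores.map((s) => Math.exp(s - max));
  const sum = exps.reduce((a, b) => a + b, 0);
  return exps.map((e) => e / sum);
}

function contextualize(wordVectors: number[][]): number[][] {
  return wordVectors.map((query) => {
    // 1. Score this word against every word in the sentence (dot product).
    const scores = wordVectors.map((key) =>
      key.reduce((sum, k, d) => sum + k * query[d], 0)
    );
    // 2. Turn the scores into attention weights that sum to 1.
    const weights = softmax(scores);
    // 3. New vector = weighted average of all word vectors.
    return query.map((_, d) =>
      wordVectors.reduce((sum, vec, i) => sum + weights[i] * vec[d], 0)
    );
  });
}

// "bank" ends up pulled toward whatever surrounds it (made-up 2D values):
contextualize([[0.5, 0.2], [0.8, 0.9]]); // e.g. "bank", "money"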
Step 4: Pooling - From Word Vectors to Sentence Vector
After contextualization, we have 10 word vectors (one per token). But we need ONE vector for the whole sentence:
// Before pooling: 10 words × 384 dimensions each
const wordVectors = [
[0.1, 0.2, 0.3, ...], // "the"
[0.3, 0.4, 0.5, ...], // "legal"
[0.2, 0.6, 0.1, ...], // "drinking"
// ... 7 more word vectors
];
// After mean pooling: 1 × 384 dimensions
const sentenceVector = [
(0.1 + 0.3 + 0.2 + ...) / 10, // Average dimension 1
(0.2 + 0.4 + 0.6 + ...) / 10, // Average dimension 2
(0.3 + 0.5 + 0.1 + ...) / 10, // Average dimension 3
// ... 381 more averaged dimensions
];
Mean pooling averages all the word vectors. It captures the "overall meaning" of the sentence by combining the contextualized meaning of each word.
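As a sketch, mean pooling is nothing more than a dimension-wise average:

// Mean pooling sketch: average N word vectors into one sentence vector.
function meanPool(wordVectors: number[][]): number[] {
  const dims = wordVectors[0].length;
  const pooled = new Array(dims).fill(0);
  for (const vec of wordVectors) {
    for (let d = 0; d < dims; d++) {
      pooled[d] += vec[d];
    }
  }
  return pooled.map((sum) => sum / wordVectors.length);
}

meanPool([[0.1, 0.2], [0.3, 0.4]]); // → [0.2, 0.3]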
Step 5: Normalization
Finally, the vector gets scaled to unit length:
// Before normalization
[0.5, 0.3, 0.8, ...] // Length = √(0.5² + 0.3² + 0.8² + ...) = some number
// After normalization
[0.25, 0.15, 0.4, ...] // Length = 1.0
This ensures we're comparing direction only, not magnitude. Two sentences can have the same semantic direction regardless of how many words they contain.
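In code, normalization is just dividing every component by the vector's L2 length. A minimal sketch:

// Normalization sketch: divide every component by the vector's L2 length.
function normalize(vector: number[]): number[] {
  const length = Math.sqrt(vector.reduce((sum, v) => sum + v * v, 0));
  return vector.map((v) => v / length);
}

normalize([3, 4]); // → [0.6, 0.8], whose length is exactly 1.0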
Connecting Back to Our Code
Now when we see this:
const output = await extractor(text, {
pooling: "mean", // Average all contextualized word vectors
normalize: true, // Scale to unit length
});
We understand exactly what's happening:
- Input: "What is the legal drinking age in Uganda?"
- Tokenization: Split into tokens
- Initial vectors: Each token gets a pre-learned 384D vector
- Contextualization: Transformer layers update vectors based on context
- Pooling: pooling: "mean" → average all word vectors into one
- Normalization: normalize: true → scale to length 1
The output is a single 384-dimensional vector that encodes the semantic meaning of the entire sentence.
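Putting it all together, here's a hedged sketch of what a generateEmbedding() wrapper might look like on top of the transformers.js pipeline API. The model name (Xenova/all-MiniLM-L6-v2, a common 384-dimensional model) and the caching strategy are my assumptions, not necessarily the original code.

// A sketch of a generateEmbedding() wrapper around transformers.js.
// The model name and caching strategy here are assumptions.
import { pipeline } from "@xenova/transformers";

let extractorPromise: Promise<any> | null = null;

async function generateEmbedding(text: string): Promise<number[]> {
  // Load the feature-extraction pipeline once and reuse it.
  extractorPromise ??= pipeline(
    "feature-extraction",
    "Xenova/all-MiniLM-L6-v2" // assumed model: outputs 384-dimensional vectors
  );
  const extractor = await extractorPromise;

  // Tokenize → contextualize → mean-pool → normalize, as described above.
  const output = await extractor(text, { pooling: "mean", normalize: true });

  // The pipeline returns a Tensor; convert its data to a plain number array.
  return Array.from(output.data as Float32Array);
}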
The Beautiful Insight
The model isn't magic - it's a sophisticated pattern matcher that learned linguistic structure from billions of examples.
It learned that:
- "drinking" and "alcohol" appear in similar contexts → similar vectors
- The word "bank" near "money" means something different than "bank" near "river" → context-dependent vectors
- A sentence's meaning can be captured by averaging its contextualized word meanings → pooling works
This is why the same word in different contexts gets different embeddings, and why the model understands nuanced meaning.
Why This Matters
Instead of thinking:
"The model returns embeddings" (magic)
We now understand:
"The model tokenizes text, gives each token a learned vector, updates those vectors based on context using attention mechanisms, averages them into one vector, and normalizes it to unit length for clean comparison."
No magic. Just math and learned patterns.