graph TD
subgraph Input["Input Layer: One-Hot Encoded Context Words"]
A1["Context Word 1<br/>'the'<br/>[0,0,0,1,0,...,0]<br/>V-dimensional"]
A2["Context Word 2<br/>'economy'<br/>[0,0,0,0,1,...,0]<br/>V-dimensional"]
A3["Context Word 3<br/>'strong'<br/>[0,1,0,0,0,...,0]<br/>V-dimensional"]
A4["Context Word 4<br/>'and'<br/>[0,0,1,0,0,...,0]<br/>V-dimensional"]
end
subgraph Weights1["Weight Matrix W_in (V × N)"]
W1["Each row = input embedding<br/>for one word<br/>Dimensions: V × N<br/>(V = vocab size, N = embedding dim)"]
end
A1 -->|"v₁ = W_in^T × x₁"| W1
A2 -->|"v₂ = W_in^T × x₂"| W1
A3 -->|"v₃ = W_in^T × x₃"| W1
A4 -->|"v₄ = W_in^T × x₄"| W1
subgraph Hidden["Hidden Layer: Average Context Vector"]
H["h = (1/C) Σ vᵢ<br/>Average of context embeddings<br/>N-dimensional vector<br/>(No activation function)"]
end
W1 -->|"Extract & average<br/>word vectors"| H
subgraph Weights2["Weight Matrix W_out (N × V)"]
W2["Each column = output embedding<br/>for one word<br/>Dimensions: N × V"]
end
H -->|"u = W_out^T × h"| W2
subgraph Scores["Score Layer"]
S["u_w for each word w<br/>V scores (one per vocab word)<br/>u_w = W_out[:,w]^T × h"]
end
W2 --> S
subgraph Softmax["Softmax Layer"]
SM["P(w|context) = exp(u_w) / Σ exp(u_w')<br/>Convert scores to probabilities<br/>All probabilities sum to 1"]
end
S --> SM
subgraph Output["Output: Predicted Word Probabilities"]
O1["P('is') = 0.45"]
O2["P('very') = 0.20"]
O3["P('remains') = 0.15"]
O4["P('...') = 0.20"]
end
SM --> O1
SM --> O2
SM --> O3
SM --> O4
subgraph Loss["Loss Function"]
L["L = -log P(w_target|context)<br/>= -u_w_target + log Σ exp(u_w)<br/><br/>Minimize negative log likelihood<br/>of correct target word"]
end
O1 -.->|"Compare to<br/>actual target"| L
subgraph Gradient["Gradient Computation"]
G1["∂L/∂W_out[:,w] = (P(w|context) - y_w) × h<br/>y_w = 1 if w=target, else 0"]
G2["∂L/∂W_in = (1/C) Σ x_c × (Σ(P(w) - y_w) × W_out[:,w])^T"]
end
L --> G1
L --> G2
subgraph Update["Weight Update (Gradient Descent)"]
U["W_new = W_old - η × ∂L/∂W<br/>η = learning rate (typically 0.01-0.025)<br/><br/>Repeat for all training examples"]
end
G1 --> U
G2 --> U
U -.->|"Update weights<br/>iteratively"| W1
U -.->|"Update weights<br/>iteratively"| W2
style Input fill:#e1f5ff
style Hidden fill:#fff4e1
style Output fill:#e8f5e9
style Loss fill:#ffebee
style Gradient fill:#f3e5f5
style Update fill:#fff9c4
style Weights1 fill:#e0e0e0
style Weights2 fill:#e0e0e0
Overview of Word2Vec CBOW Architecture
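The forward pass in the diagram can be sketched in a few lines of NumPy. This is a minimal illustration, not a full implementation: the vocabulary size, context IDs, and target ID are toy assumptions, and the weight matrices are random stand-ins for learned parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
V, N, C = 10, 4, 4                 # vocab size, embedding dim, context size (toy values)
W_in = rng.normal(size=(V, N))     # each row: input embedding of one word
W_out = rng.normal(size=(N, V))    # each column: output embedding of one word

context_ids = [3, 4, 1, 2]             # hypothetical IDs for 'the', 'economy', 'strong', 'and'
h = W_in[context_ids].mean(axis=0)     # hidden layer: average of context embeddings
u = W_out.T @ h                        # one score per vocab word
p = np.exp(u - u.max())
p /= p.sum()                           # softmax: probabilities sum to 1

target = 7                             # hypothetical target word ID
loss = -np.log(p[target])              # negative log likelihood of the target
```

The gradients in the diagram follow from differentiating `loss` with respect to `W_out` and `W_in`; the error signal `p - y` (with `y` the one-hot target) drives both updates.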
Overview of A Simple Multi-Headed Attention Mechanism
Walking Through a Transformer: Overview
Let’s trace our example sentence “The cat sat” through one complete transformer layer. We’ll see how a variable-length input is mapped to a fixed-size vector at every position. The input has 3 words, but could have had 5 words or 50 words. The transformer applies the same learned weights at every word position. Each word becomes a fixed-size vector (embedding), and every transformer operation preserves this fixed size. The key insight: attention lets each word look at all other words, regardless of how many there are, yet the output at each word position always has the same dimension.
Step 1: Tokenization and Embeddings
We start with “The cat sat” and convert each word to a token ID. “The” becomes ID 156, “cat” becomes 892, “sat” becomes 1043. Each ID is converted to an embedding vector—let’s use 4 dimensions for simplicity. “The” becomes [0.2, 0.5, -0.1, 0.3] after adding positional encoding for position 0. “cat” becomes [0.8, -0.2, 0.4, 0.1] at position 1. “sat” becomes [0.3, 0.7, -0.5, 0.2] at position 2. Notice: regardless of sentence length, each word gets exactly 4 numbers. If we had 100 words, we’d have 100 vectors of size 4. The embedding dimension is fixed by the model architecture.
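The lookup can be sketched as follows. Only the token IDs and the embedding dimension (4) come from the example; the embedding table and positional encodings are random stand-ins for learned or sinusoidal ones.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, d_model, max_len = 2000, 4, 512          # hypothetical model sizes
emb_table = rng.normal(size=(vocab_size, d_model))   # stand-in for learned word embeddings
pos_enc = rng.normal(size=(max_len, d_model))        # stand-in for positional encodings

token_ids = [156, 892, 1043]                # "The", "cat", "sat"
E = emb_table[token_ids] + pos_enc[:3]      # one fixed-width vector per position
print(E.shape)                              # 3 tokens, 4 dimensions each
```

A 100-token input would give `E` shape (100, 4): the number of rows varies with the sentence, the number of columns never does.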
Step 2: Multi-Head Attention - Computing Q, K, V
For each attention head, we compute Query, Key, and Value matrices. Head 1 multiplies each embedding by learned weight matrices: \(Q = E × W_Q, K = E × W_K, V = E × W_V\). These weight matrices have fixed dimensions (4×4 in our example), so regardless of input length, each word’s Q, K, V vectors are always 4-dimensional. For “sat” (position 2, i.e. the third token, hence the subscript in Q₃), Head 1 computes Q₃ = [0.5, 0.1, -0.3, 0.4]. Head 2 uses different weight matrices and computes different Q, K, V vectors. The crucial point: the same weight matrices process position 0, position 1, position 2, or position 1000—the operation is identical.
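One head’s projections can be sketched like this, using the embeddings from Step 1. The weight matrices are random stand-ins for learned parameters:

```python
import numpy as np

E = np.array([[0.2, 0.5, -0.1, 0.3],    # "The" (embedding + positional encoding)
              [0.8, -0.2, 0.4, 0.1],    # "cat"
              [0.3, 0.7, -0.5, 0.2]])   # "sat"

# One head's learned 4x4 projections; random stand-ins here
rng = np.random.default_rng(1)
W_Q, W_K, W_V = (rng.normal(size=(4, 4)) for _ in range(3))
Q, K, V = E @ W_Q, E @ W_K, E @ W_V     # same weights applied at every position
print(Q.shape, K.shape, V.shape)        # each (3, 4); a 50-word input would give (50, 4)
```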
Step 3: Computing Attention Scores
Now we compute how much “sat” should attend to each word. We take the query for “sat” (\(Q₃\)) and compute dot products with all keys. For simplicity, the key for “The” is taken to be its embedding, so \(Q₃ · K₁ᵀ\) (for “The”) \(= 0.5×0.2 + 0.1×0.5 + (-0.3)×(-0.1) + 0.4×0.3 = 0.30\). \(Q₃ · K₂ᵀ\) (for “cat”) = 2.1 (high score—strong relationship). \(Q₃ · K₃ᵀ\) (for “sat”) = 0.8 (moderate self-attention). We divide by \(√4 = 2\), the square root of the key dimension, to get scaled scores: \(0.15, 1.05, 0.4\). This works for any number of words. If we had 50 words, we’d compute 50 dot products—the computation scales but the vector dimensions stay fixed at 4.
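The score computation can be checked directly. K₁ reuses “The”’s embedding as in the worked numbers; K₂ and K₃ are hypothetical key vectors chosen so the raw scores come out to 2.1 and 0.8:

```python
import numpy as np

q_sat = np.array([0.5, 0.1, -0.3, 0.4])   # query for "sat"
K = np.array([[0.2, 0.5, -0.1, 0.3],      # K1: "The"'s embedding, as in the text
              [2.0, 1.0, -1.0, 1.75],     # K2: hypothetical, gives raw score 2.1
              [1.0, 1.0, 0.0, 0.5]])      # K3: hypothetical, gives raw score 0.8

scores = K @ q_sat                # one dot product per position: 0.30, 2.1, 0.8
scaled = scores / np.sqrt(4)      # divide by sqrt(d_k) = 2
print(scaled)                     # scaled scores 0.15, 1.05, 0.4
```

For a 50-word input, `K` would have 50 rows and `scores` 50 entries; the query stays 4-dimensional.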
Step 4: Softmax and Weighted Sum
We apply softmax to convert scores to probabilities. For Head 1, “sat” gets attention weights: 0.2 for “The”, 0.6 for “cat”, 0.2 for “sat” (they sum to 1). Now we compute a weighted sum of the Value vectors. \(Output = 0.2×V₁ + 0.6×V₂ + 0.2×V₃.\) If \(V₁=[1.0, 0.5, -0.2, 0.3]\), \(V₂=[0.8, 1.2, 0.4, -0.1]\), \(V₃=[0.5, 0.7, 0.1, 0.6]\), then \(Output₁ = [0.78, 0.96, 0.22, 0.12]\). This output is still 4 dimensions. For a 50-word sentence, we’d weight 50 vectors instead of 3, but the output dimension remains 4. Variable input length, fixed output dimension per word.
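The weighted sum is easy to verify with the value vectors above:

```python
import numpy as np

weights = np.array([0.2, 0.6, 0.2])       # attention weights for "sat" (sum to 1)
V = np.array([[1.0, 0.5, -0.2, 0.3],      # V1 ("The")
              [0.8, 1.2, 0.4, -0.1],      # V2 ("cat")
              [0.5, 0.7, 0.1, 0.6]])      # V3 ("sat")

output = weights @ V                      # weighted sum of the value vectors
print(output.round(2))                    # [0.78 0.96 0.22 0.12]
```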
Step 5: Combining Multiple Heads
Head 2 performs the same process with different weight matrices. It might find different relationships—perhaps “sat” attends more to “The” (0.5) than to “cat” (0.3), with the remaining 0.2 on itself. Head 2 produces \(Output₂ = [0.35, 0.82, -0.15, 0.44]\), also 4 dimensions. We concatenate both heads: \([0.78, 0.96, 0.22, 0.12, 0.35, 0.82, -0.15, 0.44]\), giving us 8 dimensions. We then apply a linear transformation (multiplying by the learned \(W_O\) matrix) to project back to 4 dimensions: \([0.5, 0.3, -0.2, 0.8]\). This is the attention output for “sat”. Each of the 3 words gets its own 4-dimensional attention output through the same process.
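Concatenation and the output projection look like this, with a random stand-in for the learned \(W_O\):

```python
import numpy as np

head1 = np.array([0.78, 0.96, 0.22, 0.12])    # Head 1 output for "sat"
head2 = np.array([0.35, 0.82, -0.15, 0.44])   # Head 2 output for "sat"

concat = np.concatenate([head1, head2])       # 8 dimensions
rng = np.random.default_rng(2)
W_O = rng.normal(size=(8, 4))                 # learned projection (random stand-in)
attn_out = concat @ W_O                       # back to 4 dimensions
print(concat.shape, attn_out.shape)           # (8,) then (4,)
```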
Step 6: Residual Connection and Layer Normalization
We add the attention output back to the original embedding (residual connection). For “sat”: \([0.5, 0.3, -0.2, 0.8] + [0.3, 0.7, -0.5, 0.2] = [0.8, 1.0, -0.7, 1.0]\). Then we apply layer normalization to stabilize training. This computes the mean and standard deviation across the 4 dimensions, rescales to zero mean and unit variance, and applies learned scale and shift parameters. Suppose we get \([0.71, 0.85, -0.64, 0.85]\) after normalization (still 4 dimensions). This normalized output becomes input to the feed-forward network. Residual connections help gradients flow during training and preserve information from earlier layers.
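The residual-plus-normalization step, taking unit scale and zero shift for the layer norm; with learned scale/shift parameters the exact values would differ, which is why the numbers in the text are illustrative:

```python
import numpy as np

attn_out = np.array([0.5, 0.3, -0.2, 0.8])   # attention output for "sat"
embedding = np.array([0.3, 0.7, -0.5, 0.2])  # original embedding of "sat"

x = attn_out + embedding                     # residual -> [0.8, 1.0, -0.7, 1.0]

# Layer norm with unit scale, zero shift: zero mean, unit variance per vector
normed = (x - x.mean()) / x.std()
```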
Step 7: Feed-Forward Network - Expansion
The feed-forward network processes each word position independently. For “sat”’s normalized vector \([0.71, 0.85, -0.64, 0.85]\), we first expand dimensions. Layer 1 computes: \(h = ReLU([0.71, 0.85, -0.64, 0.85] × W₁ + b₁)\), where \(W₁\) is 4×16. This expands to 16 dimensions: \([1.2, 0.0, 2.1, 0.5, 3.2, 0.0, 1.8, ..., 0.5]\). ReLU sets negative values to zero. The expansion allows the network to learn complex non-linear transformations. This same \(W₁\) matrix processes every word position. Whether we have 3 words or 300, each word’s 4-dimensional vector gets expanded to 16 dimensions using the same weights.
Step 8: Feed-Forward Network - Projection
The second layer projects back to the original dimension. Layer 2 computes: \(y = [1.2, 0.0, 2.1, ..., 0.5] × W₂ + b₂\), where \(W₂\) is 16×4. This produces \([0.6, 0.4, -0.1, 0.9]\), back to 4 dimensions. Another residual connection: \([0.6, 0.4, -0.1, 0.9] + [0.71, 0.85, -0.64, 0.85] = [1.31, 1.25, -0.74, 1.75]\). Layer normalization (with unit scale and zero shift) then gives the final output for “sat”: approximately \([0.43, 0.37, -1.70, 0.89]\). This is the hidden state that represents “sat” after passing through one complete transformer layer. All three words now have updated 4-dimensional representations.
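Both feed-forward layers together can be sketched as below. Only the 4 → 16 → 4 shape comes from the example; the weights and biases are random stand-ins for learned parameters:

```python
import numpy as np

x = np.array([0.71, 0.85, -0.64, 0.85])   # normalized vector for "sat"

rng = np.random.default_rng(3)
W1, b1 = rng.normal(size=(4, 16)), np.zeros(16)   # expansion weights (stand-in)
W2, b2 = rng.normal(size=(16, 4)), np.zeros(4)    # projection weights (stand-in)

h = np.maximum(0, x @ W1 + b1)   # expand to 16 dims; ReLU zeroes out negatives
y = x + (h @ W2 + b2)            # project back to 4 dims, plus the second residual
print(h.shape, y.shape)          # (16,) then (4,)
```

The same `W1` and `W2` would be applied to every word position, one vector at a time.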
Step 9: Why Variable Input Produces Fixed Output
Every operation preserves dimensionality per word. We start with 3 words × 4 dimensions = 12 numbers total. After attention: still 3 words × 4 dimensions = 12 numbers. After feed-forward: still 3 words × 4 dimensions = 12 numbers. If we had 50 words, we’d have 50 × 4 = 200 numbers throughout. The learned weight matrices (W_Q, W_K, W_V, W₁, W₂, etc.) have fixed dimensions and apply identically to each word position. Attention allows words to communicate across positions, but each word maintains its 4-dimensional representation. The model can process any sentence length using the same weights.
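The length-invariance claim can be checked by pushing two different-length inputs through the same toy attention layer (a single head with random stand-in weights, no masking or other details):

```python
import numpy as np

d_model = 4
rng = np.random.default_rng(4)
W_Q, W_K, W_V = (rng.normal(size=(d_model, d_model)) for _ in range(3))

def single_head(E):
    """Toy single-head self-attention: the (T, T) score matrix grows with
    length T, but the per-token output width stays d_model."""
    Q, K, V = E @ W_Q, E @ W_K, E @ W_V
    A = np.exp(Q @ K.T / np.sqrt(d_model))
    A /= A.sum(axis=1, keepdims=True)        # softmax over each row
    return A @ V

short = single_head(rng.normal(size=(3, d_model)))
long = single_head(rng.normal(size=(50, d_model)))
print(short.shape, long.shape)               # (3, 4) then (50, 4)
```

The same three weight matrices served both inputs; only the number of rows changed.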