Overview of Word2Vec CBOW Architecture

graph TD
    subgraph Input["Input Layer: One-Hot Encoded Context Words"]
        A1["Context Word 1<br/>'the'<br/>[0,0,0,1,0,...,0]<br/>V-dimensional"]
        A2["Context Word 2<br/>'economy'<br/>[0,0,0,0,1,...,0]<br/>V-dimensional"]
        A3["Context Word 3<br/>'strong'<br/>[0,1,0,0,0,...,0]<br/>V-dimensional"]
        A4["Context Word 4<br/>'and'<br/>[0,0,1,0,0,...,0]<br/>V-dimensional"]
    end
    
    subgraph Weights1["Weight Matrix W_in (V × N)"]
        W1["Each row = input embedding<br/>for one word<br/>Dimensions: V × N<br/>(V = vocab size, N = embedding dim)"]
    end
    
    A1 -->|"v₁ = W_in^T × x₁"| W1
    A2 -->|"v₂ = W_in^T × x₂"| W1
    A3 -->|"v₃ = W_in^T × x₃"| W1
    A4 -->|"v₄ = W_in^T × x₄"| W1
    
    subgraph Hidden["Hidden Layer: Average Context Vector"]
        H["h = (1/C) Σ vᵢ<br/>Average of context embeddings<br/>N-dimensional vector<br/>(No activation function)"]
    end
    
    W1 -->|"Extract & average<br/>word vectors"| H
    
    subgraph Weights2["Weight Matrix W_out (N × V)"]
        W2["Each column = output embedding<br/>for one word<br/>Dimensions: N × V"]
    end
    
    H -->|"u = W_out^T × h"| W2
    
    subgraph Scores["Score Layer"]
        S["u_w for each word w<br/>V scores (one per vocab word)<br/>u_w = W_out[:,w]^T × h"]
    end
    
    W2 --> S
    
    subgraph Softmax["Softmax Layer"]
        SM["P(w|context) = exp(u_w) / Σ exp(u_w')<br/>Convert scores to probabilities<br/>All probabilities sum to 1"]
    end
    
    S --> SM
    
    subgraph Output["Output: Predicted Word Probabilities"]
        O1["P('is') = 0.45"]
        O2["P('very') = 0.20"]
        O3["P('remains') = 0.15"]
        O4["P('...') = 0.20"]
    end
    
    SM --> O1
    SM --> O2
    SM --> O3
    SM --> O4
    
    subgraph Loss["Loss Function"]
        L["L = -log P(w_target|context)<br/>= -u_w_target + log Σ exp(u_w)<br/><br/>Minimize negative log likelihood<br/>of correct target word"]
    end
    
    O1 -.->|"Compare to<br/>actual target"| L
    
    subgraph Gradient["Gradient Computation"]
        G1["∂L/∂W_out[:,w] = (P(w|context) - y_w) × h<br/>y_w = 1 if w=target, else 0"]
        G2["∂L/∂W_in = (1/C) Σ x_c × (Σ(P(w) - y_w) × W_out[:,w])^T"]
    end
    
    L --> G1
    L --> G2
    
    subgraph Update["Weight Update (Gradient Descent)"]
        U["W_new = W_old - η × ∂L/∂W<br/>η = learning rate (typically 0.01-0.025)<br/><br/>Repeat for all training examples"]
    end
    
    G1 --> U
    G2 --> U
    
    U -.->|"Update weights<br/>iteratively"| W1
    U -.->|"Update weights<br/>iteratively"| W2
    
    style Input fill:#e1f5ff
    style Hidden fill:#fff4e1
    style Output fill:#e8f5e9
    style Loss fill:#ffebee
    style Gradient fill:#f3e5f5
    style Update fill:#fff9c4
    style Weights1 fill:#e0e0e0
    style Weights2 fill:#e0e0e0

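The following NumPy sketch mirrors the diagram above: one forward pass, the loss, the two gradients, and the weight update. The vocabulary size, embedding size, word indices, and learning rate are toy values chosen only for illustration; a practical word2vec implementation would typically replace the full softmax with negative sampling or hierarchical softmax.

```python
import numpy as np

V, N, C = 10, 5, 4            # toy vocab size, embedding dim, context size
rng = np.random.default_rng(0)
W_in = rng.normal(scale=0.1, size=(V, N))    # each row: input embedding of one word
W_out = rng.normal(scale=0.1, size=(N, V))   # each column: output embedding of one word

context_ids = [3, 4, 1, 2]    # hypothetical indices of the C context words
target_id = 7                 # hypothetical index of the target word
eta = 0.025                   # learning rate

# Forward pass: average the context embeddings, score every word, softmax.
h = W_in[context_ids].mean(axis=0)       # h = (1/C) Σ v_i, shape (N,)
u = W_out.T @ h                          # scores u_w, shape (V,)
p = np.exp(u - u.max()); p /= p.sum()    # P(w | context)
loss = -np.log(p[target_id])             # L = -log P(target | context)

# Backward pass: the gradients shown in the diagram.
e = p.copy(); e[target_id] -= 1.0        # (P(w) - y_w) for every w
grad_W_out = np.outer(h, e)              # ∂L/∂W_out, shape (N, V)
grad_h = W_out @ e                       # ∂L/∂h, shape (N,)

# Gradient-descent update; each context row receives (1/C) of ∂L/∂h.
W_out -= eta * grad_W_out
for c in context_ids:
    W_in[c] -= eta * grad_h / C
```
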
Overview of a Simple Multi-Head Attention Mechanism

Walking Through a Transformer: Overview

Let’s trace our example sentence “The cat sat” through one complete transformer layer. We’ll see how a variable-length input produces a fixed-size representation at every word position. The input has 3 words, but it could have had 5 words or 50 words. The transformer applies the same learned weights at every word position. Each word becomes a fixed-size vector (embedding), and every transformer operation preserves that size. The key insight: attention allows each word to look at all other words, regardless of how many there are, while the output at each position always has the same dimension.

Step 1: Tokenization and Embeddings

We start with “The cat sat” and convert each word to a token ID. “The” becomes ID 156, “cat” becomes 892, “sat” becomes 1043. Each ID is converted to an embedding vector—let’s use 4 dimensions for simplicity. “The” becomes [0.2, 0.5, -0.1, 0.3] after adding positional encoding for position 0. “cat” becomes [0.8, -0.2, 0.4, 0.1] at position 1. “sat” becomes [0.3, 0.7, -0.5, 0.2] at position 2. Notice: regardless of sentence length, each word gets exactly 4 numbers. If we had 100 words, we’d have 100 vectors of size 4. The embedding dimension is fixed by the model architecture.
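
A minimal sketch of Step 1, using the example vectors from the text (the token IDs and embedding values are illustrative, not from a trained model):

```python
import numpy as np

token_ids = [156, 892, 1043]       # "The", "cat", "sat" (illustrative IDs)

# Embedding + positional encoding, already summed, as in the walkthrough.
E = np.array([
    [0.2,  0.5, -0.1, 0.3],   # "The"  (position 0)
    [0.8, -0.2,  0.4, 0.1],   # "cat"  (position 1)
    [0.3,  0.7, -0.5, 0.2],   # "sat"  (position 2)
])
print(E.shape)                # (3, 4): 3 words, 4 dims each; 100 words would give (100, 4)
```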

Step 2: Multi-Head Attention - Computing Q, K, V

For each attention head, we compute Query, Key, and Value matrices. Head 1 multiplies each embedding by learned weight matrices: \(Q = E × W_Q, K = E × W_K, V = E × W_V\). These weight matrices have fixed dimensions (4×4 in our example), so regardless of input length, each word’s Q, K, V vectors are always 4 dimensions. For “sat” (position 2), Head 1 computes Q₃ = [0.5, 0.1, -0.3, 0.4]. Head 2 uses different weight matrices and computes different Q, K, V vectors. The crucial point: the same weight matrices process position 0, position 1, position 2, or position 1000—the operation is identical.
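
A shape-level sketch of Step 2, assuming random 4×4 matrices in place of one head's learned \(W_Q\), \(W_K\), \(W_V\):

```python
import numpy as np

E = np.array([                 # embeddings from Step 1 (3 words × 4 dims)
    [0.2,  0.5, -0.1, 0.3],
    [0.8, -0.2,  0.4, 0.1],
    [0.3,  0.7, -0.5, 0.2],
])
rng = np.random.default_rng(1)
W_Q = rng.normal(scale=0.5, size=(4, 4))   # stand-ins for one head's learned projections
W_K = rng.normal(scale=0.5, size=(4, 4))
W_V = rng.normal(scale=0.5, size=(4, 4))

Q, K, V = E @ W_Q, E @ W_K, E @ W_V        # each (3, 4): one row per word position
# The same three matrices would be applied to a 50- or 1000-word input;
# only the number of rows changes, never the per-word dimension.
```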

Step 3: Computing Attention Scores

Now we compute how much “sat” should attend to each word. We take the query for “sat” (\(Q₃\)) and compute dot products with all the keys; for this hand calculation we use the embedding of “The” directly as its key. \(Q₃ · K₁ᵀ\) (for “The”) \(= 0.5×0.2 + 0.1×0.5 + (-0.3)×(-0.1) + 0.4×0.3 = 0.30\). \(Q₃ · K₂ᵀ\) (for “cat”) = 2.1 (a high score, indicating a strong relationship). \(Q₃ · K₃ᵀ\) (for “sat”) = 0.8 (moderate self-attention). We divide by \(√4 = 2\) to get scaled scores: \(0.15, 1.05, 0.4\). This works for any number of words. If we had 50 words, we’d compute 50 dot products; the computation scales, but the vector dimensions stay fixed at 4.
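
The same calculation in code, using the query and key values quoted above (the scores for “cat” and “sat” are taken as given, since their keys are not spelled out in the text):

```python
import numpy as np

q_sat = np.array([0.5, 0.1, -0.3, 0.4])    # Q₃, the query for "sat" (from Step 2)
k_the = np.array([0.2, 0.5, -0.1, 0.3])    # key used for "The" in the hand calculation

score_the = float(q_sat @ k_the)           # 0.30
scores = np.array([score_the, 2.1, 0.8])   # vs "The", "cat", "sat" (last two as quoted)
scaled = scores / np.sqrt(4)               # divide by √d_k = 2 → [0.15, 1.05, 0.4]
print(score_the, scaled)
```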

Step 4: Softmax and Weighted Sum

We apply softmax to convert scores to probabilities. For Head 1, “sat” gets attention weights: 0.2 for “The”, 0.6 for “cat”, 0.2 for “sat” (they sum to 1). Now we compute a weighted sum of the Value vectors. \(Output = 0.2×V₁ + 0.6×V₂ + 0.2×V₃.\) If \(V₁=[1.0, 0.5, -0.2, 0.3]\), \(V₂=[0.8, 1.2, 0.4, -0.1]\), \(V₃=[0.5, 0.7, 0.1, 0.6]\), then \(Output₁ = [0.78, 0.96, 0.22, 0.12]\). This output is still 4 dimensions. For a 50-word sentence, we’d weight 50 vectors instead of 3, but the output dimension remains 4. Variable input length, fixed output dimension per word.
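
A quick check of the weighted sum, using the attention weights and Value vectors from the text:

```python
import numpy as np

weights = np.array([0.2, 0.6, 0.2])        # attention weights for "sat" after softmax
values = np.array([
    [1.0, 0.5, -0.2,  0.3],   # V₁ ("The")
    [0.8, 1.2,  0.4, -0.1],   # V₂ ("cat")
    [0.5, 0.7,  0.1,  0.6],   # V₃ ("sat")
])
output_head1 = weights @ values            # [0.78, 0.96, 0.22, 0.12], still 4 dims
print(output_head1)
```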

Step 5: Combining Multiple Heads

Head 2 performs the same process with different weight matrices. It might find different relationships; perhaps “sat” attends more to “The” (0.5) than to “cat” (0.3). Head 2 produces \(Output₂ = [0.35, 0.82, -0.15, 0.44]\), also 4 dimensions. We concatenate both heads: \([0.78, 0.96, 0.22, 0.12, 0.35, 0.82, -0.15, 0.44]\), giving us 8 dimensions. We then apply a linear transformation (multiply by the \(W_O\) matrix) to project back to 4 dimensions: \([0.5, 0.3, -0.2, 0.8]\). This is the attention output for “sat”. Each of the 3 words gets its own 4-dimensional attention output through the same process.
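
A sketch of the concatenation and output projection, with a random matrix standing in for the learned \(W_O\) (a trained \(W_O\) is what would yield the \([0.5, 0.3, -0.2, 0.8]\) quoted above):

```python
import numpy as np

out_head1 = np.array([0.78, 0.96, 0.22, 0.12])    # Head 1 output (Step 4)
out_head2 = np.array([0.35, 0.82, -0.15, 0.44])   # Head 2 output (as quoted)
concat = np.concatenate([out_head1, out_head2])   # 8 dimensions

rng = np.random.default_rng(2)
W_O = rng.normal(scale=0.5, size=(8, 4))          # stand-in for the learned output projection
attn_out = concat @ W_O                           # projected back to 4 dimensions
print(concat.shape, attn_out.shape)               # (8,) (4,)
```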

Step 6: Residual Connection and Layer Normalization

We add the attention output back to the original embedding (residual connection). For “sat”: \([0.5, 0.3, -0.2, 0.8] + [0.3, 0.7, -0.5, 0.2] = [0.8, 1.0, -0.7, 1.0]\). Then we apply layer normalization to stabilize training. This computes the mean and standard deviation across the 4 dimensions, rescales, and applies a learned scale and shift. Suppose we get \([0.71, 0.85, -0.64, 0.85]\) after normalization (still 4 dimensions). This normalized output becomes input to the feed-forward network. Residual connections help gradients flow during training and preserve information from earlier layers.
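
The residual connection and a plain (gain- and bias-free) layer normalization in code; without the learned scale and shift the normalized vector is zero-mean, which is why it differs from the illustrative \([0.71, 0.85, -0.64, 0.85]\) above:

```python
import numpy as np

attn_out = np.array([0.5, 0.3, -0.2, 0.8])   # attention output for "sat" (Step 5)
embedding = np.array([0.3, 0.7, -0.5, 0.2])  # original input vector for "sat"

residual = attn_out + embedding              # [0.8, 1.0, -0.7, 1.0]
normalized = (residual - residual.mean()) / (residual.std() + 1e-5)
print(residual, normalized)                  # a real LayerNorm adds learned gain and bias
```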

Step 7: Feed-Forward Network - Expansion

The feed-forward network processes each word position independently. For “sat”’s normalized vector \([0.71, 0.85, -0.64, 0.85]\), we first expand dimensions. Layer 1 computes: \(h = ReLU([0.71, 0.85, -0.64, 0.85] × W₁ + b₁)\), where \(W₁\) is 4×16. This expands to 16 dimensions: \([1.2, 0.0, 2.1, 0.5, 3.2, 0.0, 1.8, ..., 0.5]\). ReLU sets negative values to zero. The expansion allows the network to learn complex non-linear transformations. This same \(W₁\) matrix processes every word position. Whether we have 3 words or 300, each word’s 4-dimensional vector gets expanded to 16 dimensions using the same weights.
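
A sketch of the expansion layer, with a random matrix standing in for the learned \(W₁\):

```python
import numpy as np

x = np.array([0.71, 0.85, -0.64, 0.85])      # normalized vector for "sat" (Step 6)
rng = np.random.default_rng(3)
W1 = rng.normal(scale=0.5, size=(4, 16))     # stand-in for the learned expansion weights
b1 = np.zeros(16)

h = np.maximum(0.0, x @ W1 + b1)             # ReLU(x·W₁ + b₁): 4 → 16 dimensions
print(h.shape)                               # (16,); negatives have been zeroed by ReLU
```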

Step 8: Feed-Forward Network - Projection

The second layer projects back to the original dimension. Layer 2 computes: \(y = [1.2, 0.0, 2.1, ..., 0.5] × W₂ + b₂\), where \(W₂\) is 16×4. This produces \([0.6, 0.4, -0.1, 0.9]\), back to 4 dimensions. Another residual connection: \([0.6, 0.4, -0.1, 0.9] + [0.71, 0.85, -0.64, 0.85] = [1.31, 1.25, -0.74, 1.75]\). After layer normalization, suppose we get the final output for “sat”: \([0.6, 0.4, -0.1, 0.9]\). This is the hidden state that represents “sat” after passing through one complete transformer layer. All three words now have updated 4-dimensional representations.
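
A sketch of the projection layer plus the second residual connection and layer norm, again with random stand-ins for the learned weights:

```python
import numpy as np

rng = np.random.default_rng(3)
x = np.array([0.71, 0.85, -0.64, 0.85])            # FFN input for "sat"
h = np.maximum(0.0, x @ rng.normal(scale=0.5, size=(4, 16)))   # layer 1, as in Step 7
W2, b2 = rng.normal(scale=0.5, size=(16, 4)), np.zeros(4)      # stand-in projection weights

y = h @ W2 + b2                                    # 16 → 4 dimensions
res = y + x                                        # second residual connection
out = (res - res.mean()) / (res.std() + 1e-5)      # layer norm (gain/bias omitted)
print(out.shape)                                   # (4,): one hidden state per word position
```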

Step 9: Why Variable Input Produces Fixed Output

Every operation preserves dimensionality per word. We start with 3 words × 4 dimensions = 12 numbers total. After attention: still 3 words × 4 dimensions = 12 numbers. After feed-forward: still 3 words × 4 dimensions = 12 numbers. If we had 50 words, we’d have 50 × 4 = 200 numbers throughout. The learned weight matrices (W_Q, W_K, W_V, W₁, W₂, etc.) have fixed dimensions and apply identically to each word position. Attention allows words to communicate across positions, but each word maintains its 4-dimensional representation. The model can process any sentence length using the same weights.
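
A shape-only sketch that makes the point explicit: the same (randomly initialized, hypothetical) weights process a 3-word and a 50-word input, and the per-word dimension never changes:

```python
import numpy as np

def layer_output_shape(seq_len, d_model=4, d_ff=16):
    """Shape-only sketch: the same weights handle any sequence length."""
    rng = np.random.default_rng(0)
    X = rng.normal(size=(seq_len, d_model))                 # seq_len words, d_model each
    W_q, W_k, W_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))
    scores = (X @ W_q) @ (X @ W_k).T / np.sqrt(d_model)     # (seq_len, seq_len)
    w = np.exp(scores); w /= w.sum(axis=1, keepdims=True)   # softmax over positions
    attn = w @ (X @ W_v)                                    # (seq_len, d_model)
    hidden = np.maximum(0, attn @ rng.normal(size=(d_model, d_ff)))
    return (hidden @ rng.normal(size=(d_ff, d_model))).shape

print(layer_output_shape(3))    # (3, 4)
print(layer_output_shape(50))   # (50, 4): more rows, same per-word dimension
```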

Step 10: Final Prediction - From Hidden State to Word

To predict the next word, we take only the last word’s hidden state. “sat” has hidden state \([0.6, 0.4, -0.1, 0.9]\). We multiply by a large vocabulary matrix W_vocab (4 × 50,000 dimensions): \([0.6, 0.4, -0.1, 0.9] × W_{vocab} + b\) gives 50,000 numbers called logits. Each logit represents how likely each vocabulary word is next. Suppose we get: logit(“on”) = 3.2, logit(“down”) = 2.8, logit(“there”) = 2.1. We apply softmax: P(“on”) = exp(3.2) / (exp(3.2) + exp(2.8) + … + exp(others)) ≈ 0.35. The model predicts “on” with 35% probability. During generation, we sample from this distribution and add the chosen word to the sequence, then repeat.
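
A sketch of the prediction head, with a random matrix standing in for the trained \(W_{vocab}\):

```python
import numpy as np

h_sat = np.array([0.6, 0.4, -0.1, 0.9])      # final hidden state for "sat"
vocab_size = 50_000
rng = np.random.default_rng(4)
W_vocab = rng.normal(scale=0.02, size=(4, vocab_size))   # stand-in for the trained matrix
b = np.zeros(vocab_size)

logits = h_sat @ W_vocab + b                 # one score per vocabulary word
probs = np.exp(logits - logits.max())
probs /= probs.sum()                         # softmax over the whole vocabulary
next_id = int(probs.argmax())                # greedy choice; sampling is also common
print(logits.shape, round(probs.sum(), 3), next_id)
```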

graph TD
    Input["Input Sentence: 'The cat sat'"]
    
    subgraph Tokenization
        T1["Token 1: 'The' → ID: 156"]
        T2["Token 2: 'cat' → ID: 892"]
        T3["Token 3: 'sat' → ID: 1043"]
    end
    
    subgraph Embeddings["Word Embeddings + Positional Encoding"]
        E1["Emb₁ + Pos₁ = [0.2, 0.5, -0.1, 0.3]"]
        E2["Emb₂ + Pos₂ = [0.8, -0.2, 0.4, 0.1]"]
        E3["Emb₃ + Pos₃ = [0.3, 0.7, -0.5, 0.2]"]
    end
    
    subgraph Transformer["Transformer Layer (Single Layer)"]
        
        subgraph MultiHead["Multi-Head Self-Attention"]
            
            subgraph Head1["Head 1"]
                QKV1["Q₁ = E × W_Q<br/>K₁ = E × W_K<br/>V₁ = E × W_V"]
                Score1["Scores = Q₁ × K₁ᵀ / √d_k<br/>Example: Q₃ · K₂ᵀ = 2.1"]
                Soft1["Softmax(Scores)<br/>Attention weights:<br/>'sat' → 'The': 0.2<br/>'sat' → 'cat': 0.6<br/>'sat' → 'sat': 0.2"]
                Out1["Output₁ = Weights × V₁<br/>Weighted sum of values"]
            end
            
            subgraph Head2["Head 2"]
                QKV2["Q₂ = E × W_Q'<br/>K₂ = E × W_K'<br/>V₂ = E × W_V'"]
                Score2["Scores = Q₂ × K₂ᵀ / √d_k<br/>Example: Q₃ · K₁ᵀ = 1.5"]
                Soft2["Softmax(Scores)<br/>Attention weights:<br/>'sat' → 'The': 0.5<br/>'sat' → 'cat': 0.3<br/>'sat' → 'sat': 0.2"]
                Out2["Output₂ = Weights × V₂<br/>Weighted sum of values"]
            end
            
            Concat["Concatenate Heads:<br/>[Output₁ || Output₂]"]
            Linear["Linear: Concat × W_O<br/>→ [0.5, 0.3, -0.2, 0.8]"]
        end
        
        AddNorm1["Add & Normalize:<br/>Output = LayerNorm(E + Attention)"]
        
        subgraph FFN["Feed-Forward Network"]
            FF1["Dense Layer 1 (expand):<br/>h = ReLU(x × W₁ + b₁)<br/>4 → 16 dimensions<br/>Example: [0.5, 0.3, -0.2, 0.8]<br/>→ [1.2, 0.0, 2.1, ..., 0.5]"]
            FF2["Dense Layer 2 (project):<br/>y = h × W₂ + b₂<br/>16 → 4 dimensions<br/>Example: [...16 values...]<br/>→ [0.6, 0.4, -0.1, 0.9]"]
        end
        
        AddNorm2["Add & Normalize:<br/>Output = LayerNorm(Input + FFN)"]
    end
    
    subgraph OutputLayer["Output Processing"]
        H1["Hidden₁: [0.3, 0.6, -0.2, 0.4]"]
        H2["Hidden₂: [0.7, 0.2, 0.3, 0.1]"]
        H3["Hidden₃: [0.6, 0.4, -0.1, 0.9]"]
    end
    
    Predict["Prediction Head:<br/>Logits = H₃ × W_vocab + b<br/>[0.6, 0.4, -0.1, 0.9] × W<br/>→ [50,000 values]"]
    
    Softmax["Softmax → Probabilities:<br/>P('on') = exp(3.2)/Z = 0.35<br/>P('down') = exp(2.8)/Z = 0.25<br/>P('there') = exp(2.1)/Z = 0.15<br/>..."]
    
    Final["Predicted Next Word: 'on'"]
    
    Input --> T1 & T2 & T3
    
    T1 --> E1
    T2 --> E2
    T3 --> E3
    
    E1 & E2 & E3 --> QKV1 & QKV2
    
    QKV1 --> Score1 --> Soft1 --> Out1
    QKV2 --> Score2 --> Soft2 --> Out2
    
    Out1 & Out2 --> Concat --> Linear
    
    Linear --> AddNorm1
    E1 & E2 & E3 -.residual.-> AddNorm1
    
    AddNorm1 --> FF1 --> FF2
    
    FF2 --> AddNorm2
    AddNorm1 -.residual.-> AddNorm2
    
    AddNorm2 --> H1 & H2 & H3
    
    H3 --> Predict --> Softmax --> Final
    
    style Input fill:#e1f5ff
    style Final fill:#fff4e1
    style Transformer fill:#f5f5f5
    style MultiHead fill:#ffe1f5
    style Head1 fill:#ffe8f0
    style Head2 fill:#ffe8f0
    style FFN fill:#e1ffe1
    style Predict fill:#fff9e1
    style Softmax fill:#fff9e1