Transformers Architecture Explained

Beginner 11 min read

What you'll learn

✓ How tokens become embeddings
✓ How self-attention scores relationships
✓ Why multi-head attention matters
✓ The role of residual streams and layer norm
✓ How decoders produce next-token logits

Prerequisites

• Basic Python familiarity

The transformer is the architecture behind almost every modern large language model. It looks intimidating in diagrams, but the core idea is simple: every layer lets each position in the sequence look at every other position, mix in what is useful, and pass the result forward. Once you internalize that loop, the rest is bookkeeping.

From text to tokens to vectors

Models do not read characters or words directly. A tokenizer splits text into subword pieces and assigns each piece an integer id. The string “transformers” might become a single token, while “Codeloom” might split into two or three. Each id then indexes a learned embedding table, producing a dense vector for every position.
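To make the splitting step concrete, here is a minimal sketch of a greedy longest-match tokenizer. The vocabulary and its ids are hypothetical and chosen only so the two example strings behave as described above. Real tokenizers (byte-pair encoding and similar schemes) learn their pieces from data and work differently in detail.

```python
# Toy greedy longest-match subword tokenizer (illustrative only).
# VOCAB is a hypothetical vocabulary, not taken from any real model.
VOCAB = {"transformers": 0, "code": 1, "loom": 2, "lo": 3, "om": 4,
         "c": 5, "o": 6, "d": 7, "e": 8, "l": 9, "m": 10}

def tokenize(text, vocab=VOCAB):
    """Split text into the longest vocabulary pieces, left to right."""
    text = text.lower()
    ids = []
    i = 0
    while i < len(text):
        # Try the longest possible piece first, then shrink.
        for j in range(len(text), i, -1):
            piece = text[i:j]
            if piece in vocab:
                ids.append(vocab[piece])
                i = j
                break
        else:
            raise ValueError(f"no vocabulary piece matches {text[i]!r}")
    return ids

print(tokenize("transformers"))  # [0]
print(tokenize("Codeloom"))      # [1, 2]
```

Each id in the returned list is what later indexes the embedding table.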

To that vector we add a positional encoding. Self-attention on its own has no notion of order: shuffle the input tokens and the outputs are simply shuffled the same way. Without position information the model could not tell “dog bites man” from “man bites dog”. Older models used sinusoidal encodings added to the embeddings; modern ones often use rotary embeddings (RoPE), which are applied to the queries and keys inside attention rather than added to the input.

import torch
import torch.nn as nn

vocab_size = 32000
d_model = 512
seq_len = 8

tok_emb = nn.Embedding(vocab_size, d_model)
pos_emb = nn.Embedding(seq_len, d_model)

ids = torch.randint(0, vocab_size, (1, seq_len))
positions = torch.arange(seq_len).unsqueeze(0)
x = tok_emb(ids) + pos_emb(positions)
print(x.shape)  # torch.Size([1, 8, 512])
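The snippet above uses a learned position table. For comparison, the fixed sinusoidal encoding from the original transformer paper can be sketched in plain Python. The function name and the toy sizes are assumptions made for illustration.

```python
import math

def sinusoidal_encoding(seq_len, d_model):
    """Return a seq_len x d_model table of fixed sinusoidal position encodings."""
    table = []
    for pos in range(seq_len):
        row = []
        for i in range(d_model):
            # Dimensions 2k and 2k+1 share one frequency: 1 / 10000^(2k / d_model).
            freq = 1.0 / (10000 ** ((i // 2 * 2) / d_model))
            angle = pos * freq
            # Even dimensions use sine, odd dimensions use cosine.
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        table.append(row)
    return table

pe = sinusoidal_encoding(seq_len=8, d_model=16)
print(len(pe), len(pe[0]))  # 8 16
print(pe[0][:4])            # [0.0, 1.0, 0.0, 1.0]
```

Because the table is computed rather than learned, it needs no parameters and can be generated for any sequence length.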

Self-attention in one paragraph

Inside each layer the model projects every vector into three views: a query, a key, and a value. Think of queries as questions, keys as labels on filing cabinets, and values as the documents inside. For each position the model takes its query and dot-products it with every key in the sequence, scales the result, applies softmax to get weights, and then takes a weighted average of the values. The output is a new vector that contains information gathered from the rest of the sequence, weighted by relevance.

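The “scaled” part of “scaled dot product” is a numerical safeguard. If the components of a query and a key are roughly independent with unit variance, their dot product has variance d_k, so raw scores grow with the dimension. Large scores push softmax into saturated regions where one weight is close to 1 and gradients vanish. Dividing by the square root of d_k brings the variance back to about 1.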
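A quick experiment illustrates the effect of scaling. This is a rough sketch that assumes queries and keys are drawn from a standard normal distribution, which real activations only approximate. The helper name dot_variance is made up for this example.

```python
import math
import random

def dot_variance(d_k, scale, trials=1000, seed=0):
    """Estimate the variance of q.k for random unit-variance vectors of size d_k."""
    rng = random.Random(seed)
    dots = []
    for _ in range(trials):
        q = [rng.gauss(0, 1) for _ in range(d_k)]
        k = [rng.gauss(0, 1) for _ in range(d_k)]
        dot = sum(a * b for a, b in zip(q, k))
        dots.append(dot / math.sqrt(d_k) if scale else dot)
    mean = sum(dots) / trials
    return sum((x - mean) ** 2 for x in dots) / trials

for d_k in (16, 64, 256):
    raw = dot_variance(d_k, scale=False)
    scaled = dot_variance(d_k, scale=True)
    print(f"d_k={d_k:4d}  raw variance ~ {raw:7.1f}  scaled variance ~ {scaled:.2f}")
```

The raw variance should come out close to d_k in each row, while the scaled variance should stay near 1 regardless of dimension.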

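import math
import torch.nn.functional as F

def attention(q, k, v, mask=None):
    """Scaled dot-product attention over the last two dimensions of q, k, v."""
    d_k = q.size(-1)
    # Similarity of every query with every key, scaled by sqrt(d_k).
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)
    if mask is not None:
        # Positions where mask == 0 get -inf, so softmax gives them weight 0.
        scores = scores.masked_fill(mask == 0, float('-inf'))
    weights = F.softmax(scores, dim=-1)  # each row sums to 1
    return weights @ v  # weighted average of the values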
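To see the numbers behind the function, the same computation can be worked through for a single query on plain Python lists. This is a dependency-free sketch. The helper names and the toy keys and values are invented for illustration.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Single-query scaled dot-product attention on plain Python lists."""
    d_k = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d_k) for key in keys]
    weights = softmax(scores)
    # Weighted average of the value vectors, one output feature at a time.
    out = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(len(values[0]))]
    return weights, out

keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]
weights, out = attend([2.0, 0.0], keys, values)
print([round(w, 3) for w in weights])
print([round(o, 3) for o in out])
```

The query points along the first axis, so the first and third keys receive equal, larger weights than the second, and the output leans toward the first feature of the values.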

Why multi-head attention

A single attention operation produces one set of attention weights per position, so it can express only one mixing pattern at a time. Multi-head attention projects the queries, keys, and values into several lower-dimensional subspaces (in practice, by splitting the model dimension into equal chunks), runs attention independently in each, then concatenates the results and passes them through a final linear projection. One head might learn to track subject-verb agreement, another might focus on long-range coreference, another on local punctuation. The model never explicitly assigns these jobs, but with enough data and capacity heads tend to specialize.

Typical configurations use 8 to 96 heads, with each head working in a dimension of 64 or 128.
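The head arithmetic is easy to check by hand. The sketch below only shows the split and the concatenation on a single vector. It leaves out the learned projections, and real implementations typically do the split with a tensor reshape rather than list slicing. The function names are assumptions for this example.

```python
def split_heads(vector, n_heads):
    """Split one d_model-sized vector into n_heads equal chunks."""
    d_model = len(vector)
    if d_model % n_heads != 0:
        raise ValueError("d_model must be divisible by n_heads")
    d_head = d_model // n_heads
    return [vector[h * d_head:(h + 1) * d_head] for h in range(n_heads)]

def merge_heads(chunks):
    """Concatenate per-head outputs back into one d_model-sized vector."""
    return [x for chunk in chunks for x in chunk]

d_model, n_heads = 512, 8
vector = list(range(d_model))
chunks = split_heads(vector, n_heads)
print(len(chunks), len(chunks[0]))    # 8 64
print(merge_heads(chunks) == vector)  # True
```

With a model dimension of 512 and 8 heads, each head works in 64 dimensions, which matches the typical head size mentioned above.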

Feed-forward layers and residuals

After attention mixes information across positions, a per-position feed-forward network mixes information across features. In its classic form it is two linear layers with a nonlinearity between them, usually GELU, and the hidden dimension is typically four times the model dimension. Many recent models use SwiGLU instead, a gated variant that adds a third projection and usually a somewhat smaller hidden size. Much of the model’s “knowledge” appears to be stored here, even though attention gets most of the attention.
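A minimal sketch of the two-layer form with GELU, applied to a single position, might look like the following. The weights are random and the sizes are toy values chosen for readability. Nothing here is tuned or taken from a real model.

```python
import math
import random

def gelu(x):
    """Exact GELU: x * Phi(x), with Phi the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def linear(x, weight, bias):
    """y = W x + b for a single vector x; weight is a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weight, bias)]

def feed_forward(x, w1, b1, w2, b2):
    """Expand to the hidden size, apply GELU, project back to d_model."""
    hidden = [gelu(h) for h in linear(x, w1, b1)]
    return linear(hidden, w2, b2)

rng = random.Random(0)
d_model, d_hidden = 4, 16  # toy sizes; hidden = 4 * d_model
w1 = [[rng.gauss(0, 0.1) for _ in range(d_model)] for _ in range(d_hidden)]
b1 = [0.0] * d_hidden
w2 = [[rng.gauss(0, 0.1) for _ in range(d_hidden)] for _ in range(d_model)]
b2 = [0.0] * d_model

y = feed_forward([1.0, -2.0, 0.5, 3.0], w1, b1, w2, b2)
print(len(y))  # 4
```

The same function is applied to every position independently, which is why this sublayer mixes features but never mixes positions.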

Every sublayer is wrapped in a residual connection and a layer normalization. The residual stream is the highway down which information flows; each layer reads from it, transforms a piece of what it read, and writes a delta back. Layer norm keeps activations well-scaled so the model trains stably.
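The read, transform, and write-back pattern can be sketched in a few lines. This version normalizes before the sublayer (often called pre-norm), which is an assumption on our part since the text above does not fix the ordering. It also omits the learned scale and shift that layer norm normally carries.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize one vector to zero mean and unit variance (no learned scale or shift)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def residual_block(x, sublayer):
    """Pre-norm residual: normalize, transform, then add the delta back to the stream."""
    delta = sublayer(layer_norm(x))
    return [xi + di for xi, di in zip(x, delta)]

# Hypothetical stand-in for attention or the feed-forward network.
def toy_sublayer(v):
    return [0.1 * vi for vi in v]

stream = [1.0, 2.0, 3.0, 4.0]
stream = residual_block(stream, toy_sublayer)
print([round(v, 3) for v in stream])
```

Note that the original values are still present in the output. The sublayer only nudges them, which is what keeps the residual stream usable as a shared highway across many layers.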

Causal masking and the decoder

For a language model that predicts the next token, position t must not see positions greater than t. A causal mask is a lower-triangular matrix marking which positions each query may attend to. In practice the scores for future positions are set to negative infinity before the softmax, which makes their attention weights exactly zero. With the mask in place, the same architecture trains in parallel across the whole sequence yet behaves autoregressively at inference time.
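Here is a small sketch of building the mask and applying it to one row of scores. It uses plain Python lists rather than tensors, and the helper names are made up for this example.

```python
import math

def causal_mask(seq_len):
    """Lower-triangular matrix: row t has 1s for positions <= t and 0s for the future."""
    return [[1 if j <= i else 0 for j in range(seq_len)] for i in range(seq_len)]

def masked_softmax(scores, mask_row):
    """Softmax over one row of scores, with masked positions set to -inf first."""
    masked = [s if m else float("-inf") for s, m in zip(scores, mask_row)]
    top = max(masked)
    exps = [math.exp(s - top) for s in masked]  # exp(-inf) is exactly 0.0
    total = sum(exps)
    return [e / total for e in exps]

mask = causal_mask(4)
for row in mask:
    print(row)
# [1, 0, 0, 0]
# [1, 1, 0, 0]
# [1, 1, 1, 0]
# [1, 1, 1, 1]

# With equal scores, position 1 splits its attention over positions 0 and 1 only.
print(masked_softmax([0.0, 0.0, 0.0, 0.0], mask[1]))  # [0.5, 0.5, 0.0, 0.0]
```

Every row keeps at least its own position unmasked, so the softmax never sees a row made entirely of negative infinity.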

After the final transformer block, a layer norm and a linear projection map each position back to a vector of size vocab_size. A softmax turns those logits into a probability distribution over the next token. During training we compare against the true next token using cross-entropy loss. At inference we sample or pick the argmax and feed it back as input.
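The last step, from logits to a probability distribution, a loss, and a greedy pick, can be sketched for a single position. The four logits below stand in for a hypothetical four-token vocabulary and are invented for illustration.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    top = max(logits)
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, target_id):
    """Negative log-probability assigned to the true next token."""
    return -math.log(softmax(logits)[target_id])

# Toy logits over a hypothetical 4-token vocabulary for a single position.
logits = [2.0, 0.5, -1.0, 0.1]
probs = softmax(logits)
greedy_id = max(range(len(probs)), key=probs.__getitem__)

print(greedy_id)  # 0
# The loss is lower when the true token is the one the model favors.
print(cross_entropy(logits, target_id=0) < cross_entropy(logits, target_id=2))  # True
```

At inference the chosen id would be appended to the input and the model run again to produce the next position's logits.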

How encoder, decoder, and encoder-decoder differ

The original 2017 paper described an encoder-decoder model for translation. Encoder layers use bidirectional attention; decoder layers use causal self-attention plus a cross-attention block that reads from encoder outputs. Modern chat models like Claude and GPT are decoder-only: they treat the prompt and the response as one long sequence and use causal masking throughout. BERT-style models are encoder-only and used for classification or embeddings, not generation.

What to remember

Embeddings turn ids into vectors. Attention lets every position read from every other position. Multi-head attention runs several mixing patterns in parallel. Feed-forward layers store features. Residuals and norms keep training stable. Causal masks make the model autoregressive. Stack a few dozen of these blocks and you have a model that can write code, summarize papers, and hold a conversation. Everything beyond that (mixture-of-experts, longer context, better tokenizers) is engineering on top of this core.