Attention: The Only Thing You Need

2023-01-13

[[Pilot, crackling]] Attention, please! Fasten your seatbelts.

Attention Is All You Need [1] introduced the Transformer, an architecture built entirely on attention, enabling massive parallelism and state-of-the-art quality on sequence transduction tasks. Below are concise notes you can use as a primer or a refresher.

Table of Contents

  1. Sequence Transduction Models 🤖
  2. Paper Review 📎
  3. Summary 📝
  4. Transformer Families 🏛️

1. Sequence Transduction Models 🤖

Transduce has two relevant meanings: (i) convert one form of energy/message into another; (ii) transfer genetic material via a vector. [2]
In electronics, transducers convert physical input to electrical signals (e.g., microphones, thermometers). [3]

Sequence transduction converts input sequences to output sequences: speech-to-text, text-to-speech, machine translation, protein structure prediction, and more. [4]
The Transformer is the first transduction model that relies entirely on self-attention, without recurrence or convolution. [1]

2. Paper Review 📎

2.1 Model Architecture

Many competitive sequence models use an encoder–decoder design. The Transformer keeps that shape but replaces recurrence and convolution with attention. [1]

2.1.1 Encoder

Stack N = 6 identical layers. Each layer has two sublayers:

  • Multi-Head Self-Attention → Add & LayerNorm (residual connection around the sublayer)
  • Position-wise Feed-Forward → Add & LayerNorm
    All sublayers and the embedding layers produce outputs of dimension d_model = 512, so the residual additions are shape-compatible. Residual connections + LayerNorm stabilize training (a minimal sketch of this wrapper follows). [1, 5]
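
To make the Add & LayerNorm wrapping concrete, here is a minimal numpy sketch of the pattern LayerNorm(x + Sublayer(x)); the toy sublayer and random weights are placeholders, not the paper's trained parameters, and the learnable LayerNorm gain/bias are omitted.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    """Normalize each position's d_model-dimensional vector (last axis)."""
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def add_and_norm(x, sublayer):
    """Residual connection around a sublayer, then LayerNorm: LayerNorm(x + Sublayer(x))."""
    return layer_norm(x + sublayer(x))

# Toy example: 10 positions, d_model = 512; the lambda stands in for
# self-attention or the position-wise feed-forward sublayer.
d_model = 512
x = np.random.randn(10, d_model)
W = np.random.randn(d_model, d_model) * 0.01
out = add_and_norm(x, lambda h: h @ W)
print(out.shape)  # (10, 512): same shape, so sublayers can be stacked freely
```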

2.1.2 Decoder

Also N = 6 layers, but with three sublayers:

  • Masked Multi-Head Self-Attention (prevents peeking ahead)
  • Multi-Head Encoder–Decoder Attention (attend to encoder outputs)
  • Position-wise Feed-Forward
    Each sublayer is again wrapped with residual + LayerNorm. The mask ensures that predictions for position i depend only on the known outputs at positions before i (a small mask sketch follows). [1]
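
A minimal sketch of the causal mask, assuming numpy and toy attention scores: strictly future positions get a large negative value before the softmax, so they receive (near-)zero attention weight.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def causal_mask(seq_len):
    """True where attention is NOT allowed (strictly future positions)."""
    return np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)

seq_len = 5
scores = np.random.randn(seq_len, seq_len)              # toy Q·Kᵀ/√d_k scores
scores = np.where(causal_mask(seq_len), -1e9, scores)   # block future positions
weights = softmax(scores, axis=-1)
print(np.round(weights, 2))  # upper triangle ~0: row i attends only to positions <= i
```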

2.1.3 Attention

Attention maps a query q and a set of key–value pairs (K, V) to an output: a weighted sum of the values, where the weight on each value is computed by a compatibility function of q with the corresponding key k_i. [1]

2.1.4 Scaled Dot-Product Attention

With key dimension d_k and value dimension d_v, attention is: Attention(Q, K, V) = softmax(Q Kᵀ / √d_k) V

Dot-product attention is fast and memory-efficient in practice because it relies on highly optimized matrix multiplication. Additive (Bahdanau) and dot-product attention are similar in theoretical complexity, and without scaling, additive attention outperforms dot-product for larger d_k; the √d_k scaling compensates by keeping the logits from growing so large that the softmax saturates into regions with tiny gradients (a numpy sketch follows). [1]
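
A minimal numpy sketch of the formula above; rows are positions, and the random Q, K, V are placeholders.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q Kᵀ / √d_k) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.swapaxes(-2, -1) / np.sqrt(d_k)  # compatibility of each query with each key
    weights = softmax(scores, axis=-1)              # one distribution over keys per query
    return weights @ V

# Toy shapes: 4 queries, 6 key/value positions, d_k = d_v = 64
Q, K, V = np.random.randn(4, 64), np.random.randn(6, 64), np.random.randn(6, 64)
print(scaled_dot_product_attention(Q, K, V).shape)  # (4, 64)
```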

2.1.5 Multi-Head Attention

Instead of a single attention function over d_model-dimensional Q/K/V, linearly project the queries, keys, and values h times into lower-dimensional subspaces, run attention in parallel in each head, then concatenate the heads and apply a final linear projection (a sketch follows the bullet):

  • Typical paper setting: h = 8, each head uses d_k = d_v = 64, keeping total cost similar to single-head. [1]
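
A sketch of multi-head self-attention with the paper's h = 8, d_k = d_v = 64 setting; the random projection matrices stand in for the learned W_i^Q, W_i^K, W_i^V and W^O.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    return softmax(Q @ K.T / np.sqrt(Q.shape[-1]), axis=-1) @ V

def multi_head_self_attention(x, h=8, d_model=512, seed=0):
    """Project x into h subspaces, attend in each head, concatenate, project back."""
    rng = np.random.default_rng(seed)
    d_k = d_model // h                                   # 64 per head
    heads = []
    for _ in range(h):
        Wq, Wk, Wv = (rng.normal(scale=0.02, size=(d_model, d_k)) for _ in range(3))
        heads.append(attention(x @ Wq, x @ Wk, x @ Wv))  # self-attention: Q = K = V = x
    Wo = rng.normal(scale=0.02, size=(d_model, d_model))
    return np.concatenate(heads, axis=-1) @ Wo           # back to (seq_len, d_model)

x = np.random.randn(10, 512)   # 10 positions, d_model = 512
print(multi_head_self_attention(x).shape)  # (10, 512)
```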

2.1.6 Where Attention Is Used

  • Encoder self-attention: Q, K, V from the previous encoder layer.
  • Decoder self-attention: Q, K, V from the previous decoder layer (with causal mask).
  • Encoder–decoder attention: Q from decoder; K, V from encoder outputs—lets the decoder attend to the source sequence. [1]

2.1.7 Position-wise Feed-Forward Networks

Applied to each position independently and identically: FFN(x) = max(0, xW₁ + b₁) W₂ + b₂. The paper used d_model = 512 and d_ff = 2048. It can also be viewed as two convolutions with kernel size 1. [1]
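
The same FFN in a few lines of numpy; the random W₁, W₂ are placeholders for learned parameters.

```python
import numpy as np

def position_wise_ffn(x, W1, b1, W2, b2):
    """FFN(x) = max(0, x·W1 + b1)·W2 + b2, applied to each position independently."""
    return np.maximum(0, x @ W1 + b1) @ W2 + b2

d_model, d_ff = 512, 2048
W1, b1 = np.random.randn(d_model, d_ff) * 0.02, np.zeros(d_ff)
W2, b2 = np.random.randn(d_ff, d_model) * 0.02, np.zeros(d_model)
x = np.random.randn(10, d_model)                    # 10 positions
print(position_wise_ffn(x, W1, b1, W2, b2).shape)   # (10, 512)
```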

2.1.8 Embeddings and Softmax

Use learned input and output embeddings (both of dimension d_model) and a final linear + softmax to produce next-token probabilities. The paper shares the same weight matrix between the two embedding layers and the pre-softmax linear transformation, and multiplies the embedding weights by √d_model. [1]
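
A small sketch of the weight tying, assuming a toy vocabulary: one matrix E is used for the input lookup (scaled by √d_model) and, transposed, as the pre-softmax projection.

```python
import numpy as np

vocab_size, d_model = 32000, 512
E = np.random.randn(vocab_size, d_model) * 0.02    # shared embedding matrix

def embed(token_ids):
    """Input embedding, multiplied by sqrt(d_model) as in the paper."""
    return E[token_ids] * np.sqrt(d_model)

def output_logits(hidden):
    """Pre-softmax linear layer reuses E transposed: (seq_len, vocab_size) logits."""
    return hidden @ E.T

ids = np.array([1, 42, 7])          # toy token ids
hidden = embed(ids)                 # stand-in for the decoder's final hidden states
print(output_logits(hidden).shape)  # (3, 32000)
```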

2.1.9 Positional Encoding

Because there’s no recurrence/convolution, add positional encodings to embeddings. The paper used sinusoidal encodings with geometrically increasing wavelengths (from ~2π to 10000·2π). These allow modeling relative positions and generalize to longer sequences. Learned positional encodings performed similarly; sinusoids were preferred for extrapolation. [1]
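
A numpy sketch of the sinusoidal encodings: PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)).

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model=512):
    """Rows are positions; columns alternate sin/cos over geometrically increasing wavelengths."""
    positions = np.arange(seq_len)[:, None]                   # (seq_len, 1)
    i = np.arange(d_model // 2)[None, :]                      # (1, d_model/2)
    angles = positions / np.power(10000.0, 2 * i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

pe = sinusoidal_positional_encoding(50)
print(pe.shape)   # (50, 512); added element-wise to the (scaled) token embeddings
```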

2.1.10 Why Self-Attention?

Targets: (i) lower per-layer complexity, (ii) more parallelism, (iii) shorter path lengths for long-range dependencies. Shorter paths ease learning long-term interactions. [6] In the paper’s comparisons, attention layers were both faster and more parallelizable than recurrent alternatives. [1]

2.1.11 Training Setup

  • Data: WMT14 En–De (≈4.5M sentence pairs, shared BPE vocabulary of ≈37k tokens) and WMT14 En–Fr (≈36M pairs, 32k word-piece vocabulary).
  • Hardware: single machine, 8× NVIDIA P100.
  • Optimizer: Adam (β₁ = 0.9, β₂ = 0.98, ε = 10⁻⁹) with a warm-up learning-rate schedule (sketched below).
  • Regularization: dropout on sublayer outputs and on the sums of embeddings + positional encodings, plus label smoothing (improves accuracy and BLEU at some cost in perplexity). [1]
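
The paper's learning-rate schedule is compact enough to show directly: the rate grows linearly for warmup_steps (4000 in the paper) and then decays with the inverse square root of the step number.

```python
def transformer_lr(step, d_model=512, warmup_steps=4000):
    """lrate = d_model^-0.5 * min(step^-0.5, step * warmup_steps^-1.5), step >= 1."""
    step = max(step, 1)
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

for s in (1, 1000, 4000, 100000):
    print(s, f"{transformer_lr(s):.2e}")   # rises during warm-up, peaks at step 4000, then decays
```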

2.1.12 Results (BLEU)

BLEU evaluates machine translation via modified n-gram precision combined with a brevity penalty; many variants exist. [7–9] The Transformer achieved state-of-the-art BLEU on the WMT'14 benchmarks at the time (e.g., 28.4 on En→De for the big model) while training faster thanks to full parallelism; a compact BLEU sketch follows. See also the gentle walkthrough in The Illustrated Transformer. [11]
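
A compact, single-reference BLEU sketch (clipped n-gram precision up to 4-grams plus the brevity penalty); real toolkits add smoothing, tokenization rules, and corpus-level aggregation, which is exactly why scores from different variants differ.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(candidate, reference, max_n=4):
    """Geometric mean of clipped n-gram precisions, times the brevity penalty."""
    cand, ref = candidate.split(), reference.split()
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts, ref_counts = Counter(ngrams(cand, n)), Counter(ngrams(ref, n))
        clipped = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        total = max(sum(cand_counts.values()), 1)
        log_precisions.append(math.log(max(clipped, 1e-9) / total))  # crude zero-count guard
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / max(len(cand), 1))
    return bp * math.exp(sum(log_precisions) / max_n)

print(bleu("the cat sat on the mat", "the cat sat on the mat"))           # 1.0, perfect match
print(round(bleu("the cat sat on a mat", "the cat sat on the mat"), 3))   # < 1.0, one word off
```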



3. Summary 📝

  • Self-attention compares each token to all others, computes attention weights, and rewrites token representations as weighted mixtures. Multi-head runs several attention projections in parallel and combines them. [1, 11, 12]
  • Feed-forward networks then transform these representations non-linearly for the next layer or final output.

4. Transformer Families 🏛️

Transformer variants differ by architecture, pre-training objective, and data. A simple taxonomy:

  • Encoder-only (bi-directional, auto-encoding): BERT, RoBERTa, ALBERT, ELECTRA. Great for understanding: classification, NER, extractive QA. Pre-training often masks or corrupts input and reconstructs. [14]
  • Decoder-only (uni-directional, auto-regressive): GPT/GPT-2/GPT-3, CTRL, Transformer-XL, XLNet (permutation-based objective). Next-token prediction; excels at generation. [15]
  • Encoder–decoder (seq2seq): T5, BART, Marian. Suited for tasks that map input text → output text (translation, summarization, generative QA); a small usage sketch of all three families follows the list. [16, 13]
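
To connect the taxonomy to practice, here is a hedged sketch using Hugging Face transformers pipelines (the library behind the course cited above); the checkpoint names bert-base-uncased, gpt2, and t5-small are common small defaults, not recommendations.

```python
# pip install transformers  (plus a backend such as torch)
from transformers import pipeline

# Encoder-only: bidirectional "understanding" tasks, e.g. masked-token filling
fill = pipeline("fill-mask", model="bert-base-uncased")
print(fill("Attention is all you [MASK].")[0]["token_str"])

# Decoder-only: auto-regressive text generation
generate = pipeline("text-generation", model="gpt2")
print(generate("Attention is all you", max_new_tokens=5)[0]["generated_text"])

# Encoder-decoder: sequence-to-sequence mapping, e.g. translation
translate = pipeline("translation_en_to_de", model="t5-small")
print(translate("Attention is all you need.")[0]["translation_text"])
```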

[[Pilot, again]] Attention! You’ve reached your destination.



References

[1] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., … & Polosukhin, I. (2017). Attention Is All You Need. NeurIPS 30.
[2] transduce, Merriam-Webster Dictionary. https://www.merriam-webster.com/dictionary/transduce
[3] Transducer — Definition, Parts, Types, Applications. BYJU’S. https://byjus.com/physics/transducer/
[4] Graves, A. (2012). Sequence Transduction with Recurrent Neural Networks. ICML Representation Learning Workshop (slides).
[5] He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep Residual Learning for Image Recognition. CVPR.
[6] Hochreiter, S., Bengio, Y., Frasconi, P., & Schmidhuber, J. (2001). Gradient Flow in Recurrent Nets: the difficulty of learning long-term dependencies.
[7] Papineni, K., Roukos, S., Ward, T., & Zhu, W.-J. (2002). BLEU: a method for automatic evaluation of machine translation. ACL ’02.
[8] “Bleu Score,” Neurotic Networking (blog). https://necromuralist.github.io/Neurotic-Networking/posts/nlp/bleu-score/
[9] BLEU — Wikipedia. https://en.wikipedia.org/wiki/BLEU
[10] Evaluating models, Google Cloud AutoML Translation docs. https://cloud.google.com/translate/automl/docs/evaluate
[11] Alammar, J. (2018). The Illustrated Transformer. https://jalammar.github.io/illustrated-transformer/
[12] Wolfe, C. R. Vision Transformers (explainer). https://cameronrwolfe.substack.com/
[13] How do Transformers work? — Hugging Face Course. https://huggingface.co/course/chapter1/4
[14] Transformer models (encoder-only) — Hugging Face Course. https://huggingface.co/course/chapter1/5
[15] Decoder models — Hugging Face Course. https://huggingface.co/course/chapter1/6
[16] Sequence-to-sequence models — Hugging Face Course. https://huggingface.co/course/chapter1/7