Attention Is All You Need
Introduction
This paper review follows Jay Alammar’s blog post, The Illustrated Transformer. The blog can be found here.
Paper Introduction

A new architecture based solely on attention mechanisms, called the Transformer. It gets rid of recurrent and convolutional networks completely.

Generally, RNNs are used for sequence-to-sequence tasks such as translation, language modelling, etc.

Transformer allows for significant parallelization and relies only on attention.
Background
Self-attention: attending to different positions of a single sequence in order to compute a representation of that sequence.
Model Architecture

Transformer uses the following:

Encoder-decoder mechanism

Stacked self-attention

Point-wise, fully connected layers for both the encoder and decoder

Encoder and decoder stacks

Encoder: 6 identical layers, with 2 sub-layers per layer

First: multi-head self-attention mechanism

Second: fully connected feed-forward network

Apply a residual connection around each of the two sub-layers

Apply layer normalization

Decoder: 6 identical layers, with the 2 sub-layers as above + 1 more which performs multi-head attention over the output of the encoder stack

Residual connections: present around all 3 sub-layers

Layer normalization: normalizes the input across features instead of normalizing input features across the batch dimension (i.e., as in batch normalization). There is a great overview of normalization layers available by Akash Bindal here.

Modify the self-attention sub-layer in the decoder to prevent positions from attending to subsequent positions. This ensures that the output at position i depends only on the words before i.
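A minimal NumPy sketch of what such a mask could look like (the function name is mine, not from the paper); the $-\infty$ entries become zero weights after the softmax:

```python
import numpy as np

def causal_mask(seq_len):
    # Mask added to the attention scores before the softmax.
    # Entries above the diagonal are -inf, so position i cannot
    # attend to any position j > i (the weight becomes 0 after softmax).
    return np.triu(np.full((seq_len, seq_len), -np.inf), k=1)

print(causal_mask(4))
```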
Attention

3 vectors: Query (Q), Key (K) and Value (V)

Output = weighted sum of the values, where the weight assigned to each value is computed as a function of the query with the corresponding key.

Scaled dot-product attention and multi-head attention

Attention is calculated as:
$Attention(Q,K,V) = softmax(\frac{QK^T}{\sqrt{d_k}})V$ 
Dot-product attention is faster and more space-efficient than additive attention.
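The formula above translates almost directly into NumPy. A minimal sketch (function names are mine):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V, mask=None):
    d_k = Q.shape[-1]
    scores = Q @ K.swapaxes(-1, -2) / np.sqrt(d_k)  # query-key compatibility
    if mask is not None:
        scores = scores + mask  # e.g. the causal mask from earlier
    return softmax(scores, axis=-1) @ V  # weighted sum of the values
```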
Multi-head attention

Use multiple sets of Q, K and V projections (heads). Compute attention for each head, concatenate the resulting $d_{v}$-dimensional outputs, and apply a final linear projection.
$MultiHead(Q,K,V) = Concat(head_1,...,head_h)W^O \\ \text{where } head_i = Attention(QW_{i}^{Q}, KW_{i}^{K},VW_{i}^{V})$ 
Dimensions of the keys and values per head: $d_{k} = d_{v} = d_{model}/h = 64$ (with $d_{model} = 512$ and $h = 8$ in the paper)
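A rough sketch with these dimensions, reusing scaled_dot_product_attention from the sketch above (the projection matrices here are random, purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, h = 512, 8
d_k = d_v = d_model // h  # 64

# Per-head projection matrices, randomly initialized for illustration
W_Q = rng.normal(size=(h, d_model, d_k))
W_K = rng.normal(size=(h, d_model, d_k))
W_V = rng.normal(size=(h, d_model, d_v))
W_O = rng.normal(size=(h * d_v, d_model))

def multi_head(Q, K, V):
    # Project Q, K, V for each head, attend, concatenate, project back
    heads = [scaled_dot_product_attention(Q @ W_Q[i], K @ W_K[i], V @ W_V[i])
             for i in range(h)]
    return np.concatenate(heads, axis=-1) @ W_O

x = rng.normal(size=(10, d_model))
out = multi_head(x, x, x)  # self-attention: Q, K, V are all the same input
print(out.shape)           # (10, 512)
```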
Applications of attention

Encoder-decoder attention: Q comes from the previous decoder layer; K and V come from the output of the encoder. This lets every position in the decoder attend to all positions in the input sequence.

Encoder: self-attention layers. Q, K and V come from the output of the previous layer in the encoder. There is some talk about preventing leftward information flow; I didn’t really understand this bit and will come back to it in some time.
Position-wise Feed-Forward Networks

Each layer contains a feed-forward network, applied to each position separately and identically.
$FFN(x) = \max(0, xW_1 + b_1)W_2 + b_2$
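A one-function NumPy sketch of this (in the paper the inner layer has dimensionality $d_{ff} = 2048$, with input and output of size $d_{model} = 512$):

```python
import numpy as np

def ffn(x, W1, b1, W2, b2):
    # Linear -> ReLU -> Linear, applied to each position independently
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2
```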
Embeddings and Softmax

Convert input and output tokens to vectors of dimension $d_{model}$

Share the weight matrix between the two embedding layers and the pre-softmax linear transformation
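A sketch of how this sharing might look (variable names are mine; the $\sqrt{d_{model}}$ scaling of the embedding weights is from the paper):

```python
import numpy as np

d_model, vocab_size = 512, 10000
E = np.random.default_rng(0).normal(size=(vocab_size, d_model))  # shared matrix

def embed(token_ids):
    # Embedding lookup; the paper multiplies these weights by sqrt(d_model)
    return E[token_ids] * np.sqrt(d_model)

def logits(decoder_output):
    # Pre-softmax linear transformation reuses the same matrix, transposed
    return decoder_output @ E.T
```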
Positional Encoding

Encode positions of the tokens for the input and output.

Same vector size, i.e. $d_{model}$, so the encodings can be summed with the embeddings
$PE_{(pos, 2i)} = sin(pos/10000^{2i/d_{model}}) \\ PE_{(pos, 2i+1)} = cos(pos/10000^{2i/d_{model}})$ 
Might allow extrapolation to sequence lengths longer than those seen in the training set
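A NumPy sketch of these encodings (the function name is mine):

```python
import numpy as np

def positional_encoding(max_len, d_model):
    pos = np.arange(max_len)[:, None]      # token positions
    i = np.arange(0, d_model, 2)[None, :]  # even embedding dimensions (2i)
    angles = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)  # even dimensions use sine
    pe[:, 1::2] = np.cos(angles)  # odd dimensions use cosine
    return pe
```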
Why self-attention?

Total computational complexity per layer

Parallel Computation

Path length between long-range dependencies in the network.
Training
Optimizer

Use Adam. Vary the learning rate according to the formula: $lrate = d_{model}^{-0.5} \cdot \min(step\_num^{-0.5},\ step\_num \cdot warmup\_steps^{-1.5})$

Increase the LR linearly for the warmup steps, then decrease it proportionally to the inverse square root of the step number. Warmup steps = 4000
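The schedule is simple enough to write out directly. A sketch in Python:

```python
def lrate(step, d_model=512, warmup_steps=4000):
    step = max(step, 1)  # avoid dividing by zero at step 0
    # Linear warmup, then decay with the inverse square root of the step
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

print(lrate(100), lrate(4000), lrate(100000))  # rises, peaks, decays
```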
Regularization

Residual dropout: dropout applied to the output of each sub-layer, before the residual addition and layer normalization

Label smoothing: instead of using hard 0 and 1 class labels, allow for some uncertainty in the prediction, and use soft values like 0.1 and 0.9 for the classes
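A sketch of one common formulation of label smoothing (with $\epsilon = 0.1$ and two classes this gives 0.05/0.95 rather than exactly 0.1/0.9; names are mine):

```python
import numpy as np

def smooth_labels(one_hot, eps=0.1):
    # Take eps of the probability mass from the true class and
    # spread it uniformly over all classes
    n_classes = one_hot.shape[-1]
    return one_hot * (1.0 - eps) + eps / n_classes

print(smooth_labels(np.array([0.0, 1.0])))  # [0.05, 0.95]
```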
Conclusion

This was the first model based entirely on attention. It achieved SOTA results on machine translation and English constituency parsing.

Admittedly, there are still a lot of bits I don’t really understand, especially around self-attention. I will give this paper another read after going through Jay Alammar’s blog.