Attention Is All You Need
Introduction
This paper review follows Jay Alammar’s blog post, The Illustrated Transformer. The blog can be found here.
Paper Introduction

A new architecture based solely on attention mechanisms, called the Transformer. It gets rid of recurrent and convolutional networks completely.

Generally, RNNs are used for sequence-to-sequence tasks such as translation, language modelling, etc.

Transformer allows for significant parallelization and relies only on attention.
Background
Self-attention: attending to different positions of a single sequence in order to compute a representation of that sequence.
Model Architecture

Transformer uses the following:

Encoder-decoder mechanism

Stacked self-attention

Point-wise, fully connected layers for both the encoder and decoder

Encoder and decoder stacks

Encoder: 6 identical layers, with 2 sub-layers per layer

First: multi-head self-attention mechanism

Second: fully connected feed-forward network

Apply a residual connection around each of the two sub-layers

Apply layer normalization

Decoder: 6 identical layers, with the 2 sub-layers as above + 1 more which performs multi-head attention over the output of the encoder stack

Residual connections: present around all 3 sub-layers

Layer normalization: normalizes the input across features instead of normalizing input features across the batch dimension (i.e., as in batch normalization). There is a great overview of normalization layers available by Akash Bindal here.

Modify the self-attention sub-layer in the decoder to prevent positions from attending to subsequent positions. This ensures that the output at position i depends only on the words before i.
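A minimal NumPy sketch of what such a mask could look like (the function name is mine, not from the paper); the $-\infty$ entries become zero weights after the softmax:

```python
import numpy as np

def causal_mask(seq_len):
    # Mask added to the attention scores before the softmax.
    # Entries above the diagonal are -inf, so position i cannot
    # attend to any position j > i (the weight becomes 0 after softmax).
    return np.triu(np.full((seq_len, seq_len), -np.inf), k=1)

print(causal_mask(4))
```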
Attention

3 vectors: Query (Q), Key (K) and Value (V)

Output = weighted sum of the values, where the weight assigned to each value is computed as a function of the query with the corresponding key.

Scaled dot-product attention and multi-head attention

Attention is calculated as:
$Attention(Q,K,V) = softmax(\frac{QK^T}{\sqrt{d_k}})V$ 
Dot-product attention is faster and more space-efficient than additive attention.
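The formula above translates almost directly into NumPy. A minimal sketch (function names are mine):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V, mask=None):
    d_k = Q.shape[-1]
    scores = Q @ K.swapaxes(-1, -2) / np.sqrt(d_k)  # query-key compatibility
    if mask is not None:
        scores = scores + mask  # e.g. the causal mask from earlier
    return softmax(scores, axis=-1) @ V  # weighted sum of the values
```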
Multi-head attention

Use multiple sets of Q, K and V projections (heads). Compute attention for each head, concatenate the resulting $d_{v}$-dimensional outputs, and apply a final linear projection.
$MultiHead(Q,K,V) = Concat(head_1,...,head_h)W^O \\ \text{where } head_i = Attention(QW_{i}^{Q}, KW_{i}^{K},VW_{i}^{V})$ 
Dimensions of the keys and values per head: $d_{k} = d_{v} = d_{model}/h = 64$ (with $d_{model} = 512$ and $h = 8$ in the paper)
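A rough sketch with these dimensions, reusing scaled_dot_product_attention from the sketch above (the projection matrices here are random, purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, h = 512, 8
d_k = d_v = d_model // h  # 64

# Per-head projection matrices, randomly initialized for illustration
W_Q = rng.normal(size=(h, d_model, d_k))
W_K = rng.normal(size=(h, d_model, d_k))
W_V = rng.normal(size=(h, d_model, d_v))
W_O = rng.normal(size=(h * d_v, d_model))

def multi_head(Q, K, V):
    # Project Q, K, V for each head, attend, concatenate, project back
    heads = [scaled_dot_product_attention(Q @ W_Q[i], K @ W_K[i], V @ W_V[i])
             for i in range(h)]
    return np.concatenate(heads, axis=-1) @ W_O

x = rng.normal(size=(10, d_model))
out = multi_head(x, x, x)  # self-attention: Q, K, V are all the same input
print(out.shape)           # (10, 512)
```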
Applications of attention

Encoder-decoder attention: Q comes from the previous decoder layer; K and V come from the output of the encoder. This lets every position in the decoder attend to all positions in the input sequence.

Encoder: self-attention layers. Q, K and V come from the output of the previous layer in the encoder. There is some talk about preventing leftward information flow; I didn’t really understand this bit and will come back to it in some time.
Position-wise Feed-Forward Networks

Each layer contains a feed-forward network, applied to each position separately and identically.
$FFN(x) = \max(0, xW_1 + b_1)W_2 + b_2$
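A one-function NumPy sketch of this (in the paper the inner layer has dimensionality $d_{ff} = 2048$, with input and output of size $d_{model} = 512$):

```python
import numpy as np

def ffn(x, W1, b1, W2, b2):
    # Linear -> ReLU -> Linear, applied to each position independently
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2
```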
Embeddings and Softmax

Convert input and output tokens to vectors of dimension $d_{model}$

Share the weight matrix between the two embedding layers and the pre-softmax linear transformation
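A sketch of how this sharing might look (variable names are mine; the $\sqrt{d_{model}}$ scaling of the embedding weights is from the paper):

```python
import numpy as np

d_model, vocab_size = 512, 10000
E = np.random.default_rng(0).normal(size=(vocab_size, d_model))  # shared matrix

def embed(token_ids):
    # Embedding lookup; the paper multiplies these weights by sqrt(d_model)
    return E[token_ids] * np.sqrt(d_model)

def logits(decoder_output):
    # Pre-softmax linear transformation reuses the same matrix, transposed
    return decoder_output @ E.T
```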
Positional Encoding

Encode positions of the tokens for the input and output.

Same vector size, i.e. $d_{model}$, so the encodings can be summed with the embeddings
$PE_{(pos, 2i)} = sin(pos/10000^{2i/d_{model}}) \\ PE_{(pos, 2i+1)} = cos(pos/10000^{2i/d_{model}})$ 
Might allow extrapolation to sequence lengths longer than those seen in the training set
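A NumPy sketch of these encodings (the function name is mine):

```python
import numpy as np

def positional_encoding(max_len, d_model):
    pos = np.arange(max_len)[:, None]      # token positions
    i = np.arange(0, d_model, 2)[None, :]  # even embedding dimensions (2i)
    angles = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)  # even dimensions use sine
    pe[:, 1::2] = np.cos(angles)  # odd dimensions use cosine
    return pe
```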
Why self-attention?

Total computational complexity per layer

Parallel Computation

Path length between long-range dependencies in the network.
Training
Optimizer

Use Adam. Vary the learning rate according to the formula: $lrate = d_{model}^{-0.5} \cdot \min(step\_num^{-0.5},\ step\_num \cdot warmup\_steps^{-1.5})$

Increase the LR linearly for the warmup steps, then decrease it proportionally to the inverse square root of the step number. Warmup steps = 4000
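The schedule is simple enough to write out directly. A sketch in Python:

```python
def lrate(step, d_model=512, warmup_steps=4000):
    step = max(step, 1)  # avoid dividing by zero at step 0
    # Linear warmup, then decay with the inverse square root of the step
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

print(lrate(100), lrate(4000), lrate(100000))  # rises, peaks, decays
```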
Regularization

Residual dropout: dropout applied to the output of each sub-layer, before the residual addition and layer normalization

Label smoothing: instead of using hard 0 and 1 class labels, allow for some uncertainty in the prediction, and use soft values like 0.1 and 0.9 for the classes
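A sketch of one common formulation of label smoothing (with $\epsilon = 0.1$ and two classes this gives 0.05/0.95 rather than exactly 0.1/0.9; names are mine):

```python
import numpy as np

def smooth_labels(one_hot, eps=0.1):
    # Take eps of the probability mass from the true class and
    # spread it uniformly over all classes
    n_classes = one_hot.shape[-1]
    return one_hot * (1.0 - eps) + eps / n_classes

print(smooth_labels(np.array([0.0, 1.0])))  # [0.05, 0.95]
```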
Conclusion

This was the first model based entirely on attention. It achieved SOTA results on machine translation and English constituency parsing.

Admittedly, there are still a lot of bits I don’t really understand, especially around self-attention. I will give this paper another read after going through Jay Alammar’s blog.