# Introduction

- This paper review follows Jay Alammar’s blog post on the
**Illustrated Transformer**. The blog can be found [here](https://jalammar.github.io/illustrated-transformer/).

# Paper Introduction

The paper proposes a new architecture based solely on attention mechanisms, called the **Transformer**. It gets rid of recurrent and convolutional networks completely. Previously, RNNs were generally used for seq-to-seq tasks such as translation, language modelling, etc.

The Transformer allows for significant parallelization and relies only on attention.

# Background

*Self-attention*: attention over different positions of a single sequence in order to compute a representation of that sequence.

# Model Architecture

The Transformer uses the following:

- Encoder-decoder mechanism
- Stacked self-attention
- Point-wise fully connected layers for both the encoder and decoder

## Encoder and decoder stacks

**Encoder**: 6 identical layers, with 2 sub-layers per layer.

- *First*: multi-head self-attention mechanism
- *Second*: fully connected feed-forward network

A residual connection is applied around each of the two sub-layers, followed by layer normalization.
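
A minimal NumPy sketch of this residual + layer-norm pattern (the function names are my own, and I omit layer norm’s learnable gain and bias):

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # Normalize each position's feature vector (the last axis),
    # unlike batch norm, which normalizes each feature across the batch.
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def residual_sublayer(x, sublayer):
    # LayerNorm(x + Sublayer(x)), where sublayer is either
    # self-attention or the feed-forward network.
    return layer_norm(x + sublayer(x))
```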

**Decoder**: 6 identical layers, with the 2 sub-layers as above plus 1 more which performs multi-head attention over the output of the encoder stack.

- *Residual connections*: present around all 3 sub-layers
- *Layer normalization*: normalizes the input across features, instead of normalizing the input features across the batch dimension (as in *batch normalization*). There is a great overview of normalization layers available by Akash Bindal here.

The self-attention sub-layer is also modified to prevent positions from attending to subsequent positions. This masking ensures that the output at position *i* depends only on the words before *i*.
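
A minimal sketch of that masking, assuming the usual trick of setting disallowed scores to \(-\infty\) before the softmax (the helper name is mine):

```python
import numpy as np

def causal_mask(n):
    # True above the diagonal = pairs (i, j) with j > i,
    # i.e. positions that must NOT be attended to.
    return np.triu(np.ones((n, n), dtype=bool), k=1)

# Applied to a matrix of attention scores: masked entries become -inf,
# so their softmax weight is 0 and position i only sees positions <= i.
scores = np.random.randn(5, 5)
scores[causal_mask(5)] = -np.inf
```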

## Attention

3 vectors: Query (Q), Key (K) and Value (V).

Output = weighted sum of the values, where the weight assigned to each value is computed as a function of the query with the corresponding key.

The paper uses scaled dot-product attention, extended to multi-head attention.

Attention is calculated as:

\[ \mathrm{Attention}(Q,K,V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V \]

Dot-product attention is **faster and more space-efficient** than additive attention.
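
A minimal NumPy sketch of scaled dot-product attention for a single sequence (2-D Q, K, V; the names are mine):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # Scores compare every query with every key; dividing by sqrt(d_k)
    # keeps the dot products from growing with dimension and pushing
    # the softmax into regions with tiny gradients.
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    return softmax(scores) @ V  # weighted sum of the values

# Example: 4 positions, d_k = d_v = 64.
Q = K = V = np.random.randn(4, 64)
out = attention(Q, K, V)  # shape (4, 64)
```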

## Multi head attention

Use multiple sets of Q, K and V projections (“heads”). Run attention for each head in parallel, concatenate the \(d_{v}\)-dimensional outputs, and apply one final linear projection.

$$
\mathrm{MultiHead}(Q,K,V) = \mathrm{Concat}(head_1, \dots, head_h)W^O \\
\text{where } head_i = \mathrm{Attention}(QW_{i}^{Q}, KW_{i}^{K}, VW_{i}^{V})
$$

With \(h = 8\) heads, the dimensions of the key and value projections are \(d_{k} = d_{v} = d_{model}/h = 64\).
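
A minimal sketch building on the `attention` and `softmax` functions from above, assuming per-head projection matrices stacked into arrays (the shapes and names are my own choices):

```python
import numpy as np

d_model, h = 512, 8
d_k = d_model // h  # 64

def multi_head_attention(Q, K, V, WQ, WK, WV, WO):
    # WQ, WK, WV: (h, d_model, d_k) per-head projections; WO: (h*d_k, d_model).
    # Reuses attention() from the scaled dot-product sketch above.
    heads = [attention(Q @ WQ[i], K @ WK[i], V @ WV[i]) for i in range(h)]
    return np.concatenate(heads, axis=-1) @ WO

# Random weights, just to check shapes: (seq_len, d_model) in and out.
x = np.random.randn(10, d_model)
WQ, WK, WV = (np.random.randn(h, d_model, d_k) for _ in range(3))
WO = np.random.randn(h * d_k, d_model)
y = multi_head_attention(x, x, x, WQ, WK, WV, WO)  # shape (10, 512)
```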

## Applications of attention

**Encoder-decoder attention**: Q from the previous decoder layer; K and V from the output of the encoder. Every position in the decoder can attend to all positions in the input sequence.

**Encoder**: self-attention layers, with Q, K and V all coming from the output of the previous layer in the encoder. The paper also talks about preventing leftward information flow in the decoder; I didn’t really understand this bit and will come back to it.

## Position-wise Feed-Forward Networks

Each layer in the encoder and decoder also contains a fully connected feed-forward network, applied to each position separately and identically.

\[ \mathrm{FFN}(x) = \max(0, xW_1 + b_1)W_2 + b_2 \]
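
A minimal position-wise sketch, using the paper’s dimensions (\(d_{model} = 512\), inner dimension \(d_{ff} = 2048\)); the weights here are random placeholders:

```python
import numpy as np

d_model, d_ff = 512, 2048

def ffn(x, W1, b1, W2, b2):
    # Two linear transformations with a ReLU in between, applied
    # independently to every position (row) of x.
    return np.maximum(0, x @ W1 + b1) @ W2 + b2

W1, b1 = np.random.randn(d_model, d_ff), np.zeros(d_ff)
W2, b2 = np.random.randn(d_ff, d_model), np.zeros(d_model)
y = ffn(np.random.randn(10, d_model), W1, b1, W2, b2)  # shape (10, 512)
```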

## Embeddings and Softmax

Convert the input and output tokens to vectors of dimension \(d_{model}\) using learned embeddings.

The same weight matrix is shared between the two embedding layers and the pre-softmax linear transformation.

## Positional Encoding

Since the model contains no recurrence and no convolution, the positions of the tokens in the input and output have to be encoded explicitly.

The positional encodings have the same vector size as the embeddings, i.e. \(d_{model}\), so the two can be summed.

$$
PE_{(pos, 2i)} = \sin(pos/10000^{2i/d_{model}}) \\
PE_{(pos, 2i+1)} = \cos(pos/10000^{2i/d_{model}})
$$

The sinusoidal form might allow the model to extrapolate to sequence lengths longer than those seen in the training set.
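
A minimal sketch of the sinusoidal encoding (assuming an even \(d_{model}\); the function name is mine):

```python
import numpy as np

def positional_encoding(max_len, d_model):
    # pos indexes the position, i indexes the sin/cos pair.
    pos = np.arange(max_len)[:, None]
    i = np.arange(d_model // 2)[None, :]
    angles = pos / np.power(10000, 2 * i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)  # even dimensions
    pe[:, 1::2] = np.cos(angles)  # odd dimensions
    return pe

pe = positional_encoding(100, 512)  # added to the embeddings, shape (100, 512)
```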

## Why self attention?

The paper motivates self-attention over recurrent and convolutional layers on three criteria:

- Total computational complexity per layer
- Amount of computation that can be parallelized
- Path length between long-range dependencies in the network

# Training

## Optimizer

Use Adam. Vary the learning rate over training according to the formula: \(lrate = d_{model}^{-0.5} \cdot \min(step\_num^{-0.5},\ step\_num \cdot warmup\_steps^{-1.5})\)

Increase the LR linearly for the first warmup steps, then decrease it proportionally to the inverse square root of the step number. Warmup steps = 4000.
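
The schedule is simple enough to sketch directly (assuming 1-indexed steps):

```python
def lrate(step, d_model=512, warmup_steps=4000):
    # Linear warmup: step * warmup_steps**-1.5 is the smaller term while
    # step < warmup_steps; afterwards step**-0.5 takes over, giving
    # inverse-square-root decay.
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

peak = lrate(4000)  # the two terms cross at step == warmup_steps
```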

## Regularization

**Residual Dropout**: dropout applied to the output of each sub-layer, before it is added to the sub-layer input and normalized.

**Label Smoothing**: instead of using 0 and 1 as class labels, allow for some uncertainty in the prediction and use values like 0.1 and 0.9 for the classes (see the sketch below).
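
A minimal sketch of label smoothing over one-hot targets (smoothing value \(\epsilon = 0.1\) as in the paper; the helper name is mine):

```python
import numpy as np

def smooth_labels(one_hot, eps=0.1):
    # Take eps probability mass from the true class and spread it
    # uniformly over all classes.
    n_classes = one_hot.shape[-1]
    return one_hot * (1 - eps) + eps / n_classes

targets = np.eye(4)[[2]]        # one-hot for class 2 of 4
print(smooth_labels(targets))   # [[0.025 0.025 0.925 0.025]]
```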

# Conclusion

This was the first model based entirely on attention. It achieved SOTA results on machine translation and English constituency parsing.

Admittedly, there are still a lot of bits I don’t really understand, especially around self-attention. I will give this paper another read after going through Jay Alammar’s blog.