• REALM (Retrieval-Augmented Language Model Pre-Training) is a paper I came across through the T5 paper titled: How Much Knowledge Can You Pack Into The Parameters of a Language Model?

  • TLDR: This paper augments a language model with a retriever that fetches documents containing the relevant information while solving Question-Answering type problems.

    NOTE: This post is more like my running notes while reading the paper than a comprehensive blog. I will update this blog once I learn a little more about the transformer architecture.

  • Introduces a latent knowledge retriever, which can attend over and retrieve documents from a large corpus. It can be trained in an unsupervised manner using the masked language modelling technique, backpropagating through a retrieval step that considers a large number of docs.

    Figure: training process for REALM (from the paper).

  • Key point: Train retriever using a performance-based signal from unsupervised text.

  • Retrieval-based LM => Moar computational resources => Moar money

    • Solution: The computation performed for each doc is cached and can be reused. The best doc is selected using Maximum Inner Product Search (MIPS). Read the paper here.
  • REALM retriever can be used on downstream tasks via transfer learning.

  • REALM is SOTA on NQ-Open, WQ and CuratedTrec.


Retrieve-then-predict generative process

  • Training: Masked-LM. Fine-tuning: Open QA task

  • Computing the probability of an answer given a question is decomposed into two steps:

    • Function to be computed: $p(y|x)$

    • Given $x$, retrieve documents $z$ from corpus $Z$. Modelled as $p(z|x)$

    • Condition on both $z$ and $x$ to generate the output $y$, i.e. $p(y|z, x)$

    • The overall likelihood of $y$ is obtained by treating $z$ as a latent variable and marginalizing over all documents $z$:

      $$p(y|x) = \sum_{z \in Z} p(y|z, x)\, p(z|x)$$


  • Neural Knowledge Retriever which models the distribution: $p(z|x)$

  • Knowledge-Augmented Encoder which models the distribution: $p(y|z, x)$
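The marginalization above can be sketched in a few lines of numpy, assuming we already have the two component distributions as arrays (a toy sketch with made-up numbers, not the paper's implementation):

```python
import numpy as np

# Toy sketch of p(y|x) = sum_z p(y|z,x) * p(z|x), marginalizing over a
# (tiny) corpus of documents z. All probabilities here are made up.

p_z_given_x = np.array([0.7, 0.2, 0.1])    # retriever: p(z|x) over 3 docs
p_y_given_zx = np.array([0.9, 0.5, 0.05])  # encoder: p(y|z,x) for each doc

# Treat z as a latent variable and marginalize it out.
p_y_given_x = np.sum(p_y_given_zx * p_z_given_x)
print(p_y_given_x)  # -> 0.735
```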

Neural Knowledge Retriever

  • Dense inner product model.

    $$\begin{aligned} p(z|x) &= \frac{\exp(f(x,z))}{\sum_{z'}{\exp(f(x,z'))}} \\ f(x,z) &= Embed_{input}(x)^{\top} Embed_{doc}(z) \end{aligned}$$
  • $Embed_{input}$ and $Embed_{doc}$ are embedding functions

  • $f(x,z)$ is called the relevance score. It is the inner product of the vector embeddings.

  • The retrieval distribution is the softmax over all relevance scores

  • Embeddings are implemented using BERT-style transformers. Segments are joined with [SEP], prefixed with [CLS], and a final [SEP] is appended as the end token:

    $$\begin{aligned} join_{BERT}(x) &= \mathrm{[CLS]}\,x\,\mathrm{[SEP]} \\ join_{BERT}(x_1, x_2) &= \mathrm{[CLS]}\,x_1\,\mathrm{[SEP]}\,x_2\,\mathrm{[SEP]} \end{aligned}$$

  • Pass the above into a transformer, which produces one vector for each token. A linear projection then reduces the dimensionality of the [CLS] vector:

    $$\begin{aligned} Embed_{input}(x) &= W_{input} BERT_{CLS}(join_{BERT}(x)) \\ Embed_{doc}(z) &= W_{doc} BERT_{CLS}(join_{BERT}(z_{title}, z_{body})) \end{aligned}$$
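The retrieval side reduces to inner products plus a softmax. A minimal numpy sketch, with hypothetical hand-picked embeddings standing in for the BERT-plus-projection pipeline:

```python
import numpy as np

# Hypothetical d=4 embeddings standing in for Embed_input(x) / Embed_doc(z);
# in REALM these come from BERT [CLS] vectors plus a linear projection.
query_emb = np.array([0.1, 0.3, -0.2, 0.5])
doc_embs = np.array([
    [0.2, 0.1, 0.0, 0.4],   # doc 0
    [-0.1, 0.5, 0.3, 0.1],  # doc 1
    [0.0, -0.2, 0.1, 0.2],  # doc 2
])

# Relevance score f(x, z) = Embed_input(x)^T Embed_doc(z)
scores = doc_embs @ query_emb

# Retrieval distribution p(z|x) = softmax over all relevance scores
exp_scores = np.exp(scores - scores.max())  # subtract max for stability
p_z_given_x = exp_scores / exp_scores.sum()
print(p_z_given_x)
```

With these toy vectors, doc 0 has the highest inner product and therefore the highest retrieval probability.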

Knowledge-Augmented Encoder

  • Given input $x$ and relevant doc $z$, this defines $p(y|z,x)$

  • Join $x$ and $z$ into a single sequence and feed it into a transformer

  • Here, training is different for pre-training vs fine-tuning

    • For pre-training, predict the [MASK] tokens. Use the same Masked LM (MLM) loss as in BERT (Devlin et al.)

    • For Open-QA, we need to produce the answer string $y$.
    • Assumption: $y$ occurs as a contiguous sequence of tokens in some document in the corpus.


  • Compute gradients with respect to $\theta$ (retriever parameters) and $\phi$ (encoder parameters) and optimize using SGD.

  • Challenge: Computing $p(y|x)$ involves a marginalization over every document in the corpus

  • Approximate by summing over the top $k$ documents with the highest probability under $p(z|x)$

  • Question: How to find the top $k$ docs? Answer: Use MIPS

  • Need to precompute $Embed_{doc}(z)$ for all docs. Problem? The embeddings change with each step of SGD.

  • Solution: Async refresh $Embed_{doc}$ every 500 steps

  • Use MIPS to select top $k$ docs. For these docs, recompute $p(z|x)$ using new $\theta$.
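Real MIPS systems use specialized index structures, but the top-$k$ selection they approximate can be sketched with brute-force inner products (a toy sketch with random embeddings; real corpora are far too large for this):

```python
import numpy as np

rng = np.random.default_rng(0)
query_emb = rng.normal(size=128)
doc_embs = rng.normal(size=(10_000, 128))  # toy "corpus" of 10k doc embeddings

k = 5
scores = doc_embs @ query_emb  # inner products f(x, z) for every doc

# argpartition finds the k largest scores without a full sort,
# then we order just those k winners by score (descending).
top_k = np.argpartition(scores, -k)[-k:]
top_k = top_k[np.argsort(scores[top_k])[::-1]]

# Recompute p(z|x) over just the top-k docs (the approximation above)
exp_s = np.exp(scores[top_k] - scores[top_k].max())
p_z_given_x = exp_s / exp_s.sum()
```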

Implementing async MIPS refreshes

  • Two jobs running in parallel:

    • Primary trainer: Perform gradient updates on parameters

    • Secondary index builder: Embeds and indexes the docs

      (Figure: async MIPS refresh, from the paper.)

    • Async refresh used only for pre-training

    • For fine tuning, build index once from pre-trained $\theta$ and use it.
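The refresh schedule can be sketched as a training loop that rebuilds the index every N steps. This is a serial toy stand-in for the two parallel jobs; `rebuild_index` and the commented-out `train_step` are hypothetical placeholders:

```python
REFRESH_INTERVAL = 500  # steps between index refreshes (value from the paper)

def rebuild_index(step):
    """Hypothetical placeholder: re-embed all docs with the current
    parameters and rebuild the MIPS index."""
    return f"index@step{step}"

refreshes = []
index = rebuild_index(0)
for step in range(1, 2001):
    # train_step(index) would go here: retrieve with the (slightly stale)
    # index, compute gradients, and update the parameters.
    if step % REFRESH_INTERVAL == 0:
        index = rebuild_index(step)  # swap in a fresh index
        refreshes.append(step)

print(refreshes)  # -> [500, 1000, 1500, 2000]
```

In REALM proper the rebuild runs asynchronously on a second job, so the trainer never blocks on it; the index is simply a few hundred steps stale.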

What does retriever learn?

  • Retriever promotes docs that improve accuracy

  • This can be seen by analyzing the gradient with respect to the retriever parameters

Injecting inductive biases into pre-training

  • Salient span masking: Some questions require only local context. Select named entities and dates and mask one of them entirely. This performs better than random masking.

  • Null document: Add a null document to the top $k$ documents to allow answers even when no context is required

  • Prohibiting trivial retrievals: If the knowledge corpus $Z$ is the same as the pre-training corpus $X$, the model can predict $y$ trivially by retrieving the document $z$ that $x$ came from. Such trivial candidates are excluded.

  • Initialization: Warm up $Embed_{input}$ and $Embed_{doc}$ using the Inverse Cloze Task (ICT), i.e. the model is trained to retrieve the document a sentence came from.
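The salient span masking idea above can be sketched as: find entity/date spans (via a NER tagger and a date regex in the paper; hard-coded here for illustration) and mask one whole span rather than random tokens:

```python
import random

# Toy sketch of salient span masking. In REALM a tagger finds salient
# spans; here the spans are hypothetical and hard-coded for illustration.
tokens = "REALM was introduced by Google in 2020".split()
salient_spans = [(4, 5), (6, 7)]  # ("Google",) and ("2020",): entity, date

random.seed(0)
start, end = random.choice(salient_spans)  # pick one salient span
masked = tokens[:start] + ["[MASK]"] * (end - start) + tokens[end:]
print(" ".join(masked))
```

The model must then recover the masked entity or date, which forces it to retrieve a document containing that fact rather than rely on local syntax.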


  • REALM outperforms all approaches by a big margin.

Future Work

  • Structured knowledge, where we learn which entities are informative

  • Multi-lingual setting: Retrieving knowledge in a high-resource language to better represent text in a low-resource language

  • Multi-modal setting: Retrieving images or videos that can provide knowledge not present in text


Overall, I enjoyed reading this paper. However, there are two key points that concern me:

  • The authors mention using MIPS for selecting the top kk documents, in order to simplify the task. However, would selecting only these documents from the entire dataset not lead to some information loss? I would like to see more experiments around this area.
  • There are no experiments around trying out larger models. While I agree that T5 is the largest model available right now, there is no evidence given that a model larger than T5-large would not perform better than the current REALM model. I would like to see some more exploration around this area.


There are a number of other resources you can use to learn more about this paper such as:

  • The original paper available here
  • Tweet summary by Adam Roberts available here
  • Video summary by Václav Košař available here
  • Huggingface Reading group summary by Joe Davidson available here