Beginner’s Guide to Transformer Models

In our previous post, we presented an introduction to Seq2Seq models – models that take a sequence as input and produce a sequence as output. Back in 2014, they revolutionized the field of Natural Language Processing (NLP), especially in translation applications. However, Machine Learning is a quickly developing discipline, and it wasn’t long before even more groundbreaking work, built on the shoulders of giants, came along. Today we will be talking about Transformer models, as well as all the hype surrounding BERT and, most recently, GPT-3.

The defining feature of Seq2Seq models is that they are built on recurrent networks, which analyze input data sequentially. This leads to the vanishing gradient problem, where most of the context is lost by the end of a long sentence or paragraph. Even “Long” Short-Term Memory isn’t long enough to overcome this problem. What truly allowed Seq2Seq models to shine was the attention mechanism. The Transformer model, roughly speaking, is based solely on attention, taking this mechanism to the next level. Let’s dive in.


Structure of a Transformer Model

At the surface level, a Transformer model starts similarly to a Seq2Seq model. You have a black box that consists of an encoder and a decoder, with data passing between them. In the original paper, the “encoder” itself is a stack of six encoder blocks – each of them identical, but not sharing any weights. Likewise, the decoder is a stack of six identical decoder blocks.

Each encoder block is built from two layers – a self-attention layer, followed by a feed-forward neural network that receives its output. This network, in turn, passes its output to the next encoder block. The output of the last encoder block is transferred to the decoder.
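To make this data flow concrete, here is a minimal NumPy sketch of an encoder stack: six blocks with the same structure but separate weights, each applying self-attention and then a feed-forward network. The dimensions, the random weights, and the names (make_block, encoder_block, encode) are illustrative assumptions rather than anything from the original paper, and the residual connections and layer normalization that the paper uses are left out for brevity. The attention step itself is explained further down in this article.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def make_block(rng, d_model, d_ff):
    """Create the weights of one encoder block (random here, learned in practice)."""
    s_model = 1.0 / np.sqrt(d_model)
    s_ff = 1.0 / np.sqrt(d_ff)
    return {
        "Wq": rng.normal(0.0, s_model, (d_model, d_model)),
        "Wk": rng.normal(0.0, s_model, (d_model, d_model)),
        "Wv": rng.normal(0.0, s_model, (d_model, d_model)),
        "W1": rng.normal(0.0, s_model, (d_model, d_ff)),
        "W2": rng.normal(0.0, s_ff, (d_ff, d_model)),
    }

def encoder_block(x, p):
    # Layer 1: self-attention, every position looks at every other position
    q, k, v = x @ p["Wq"], x @ p["Wk"], x @ p["Wv"]
    weights = softmax(q @ k.T / np.sqrt(k.shape[-1]))
    attended = weights @ v
    # Layer 2: feed-forward network with a ReLU in between
    return np.maximum(0.0, attended @ p["W1"]) @ p["W2"]

def encode(x, blocks):
    # The output of each block is the input of the next one
    for p in blocks:
        x = encoder_block(x, p)
    return x

rng = np.random.default_rng(0)
d_model, d_ff, seq_len = 16, 64, 5
blocks = [make_block(rng, d_model, d_ff) for _ in range(6)]  # six blocks, no shared weights
x = rng.normal(size=(seq_len, d_model))  # stand-in for embedded input words
encoded = encode(x, blocks)
print(encoded.shape)  # (5, 16)
```

The output keeps the shape of the input (one vector per word), which is what allows the blocks to be stacked.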

The decoder blocks are structured similarly, although they have an additional layer in the middle. There’s a self-attention layer, an “encoder-decoder attention” layer, and a feed-forward layer.

As you’ve probably noticed, this model has a lot of different attention layers. In fact, there are three types that we’ve mentioned:

  1. Encoder self-attention. It helps the encoder look at (or “pay attention to”) other words in the input sequence.
  2. Decoder self-attention. Same idea, only this time it’s the decoder looking at other words in the output.
  3. Encoder-decoder attention. This is similar to what we saw in Seq2Seq models – a mechanism for the decoder to look at words in the input for better context.

The original paper takes its title seriously – attention really is all you need.

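Mathematically, attention is calculated via a few simple matrix multiplications and the softmax function. In the original paper, with d_k denoting the dimension of the key vectors:

  Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
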
  1. Q stands for “queries” – each row is a representation of the single word that attention is being computed for.
  2. K stands for “keys” – vector representations of all the words in the sequence.
  3. V stands for “values” – the values in each “key-value” pair, one per word.

The product of Q and the transpose of K scores how strongly each word is influenced by every other word in the sequence; after normalization with softmax, these scores become weights that are applied to V. To let the model attend to several kinds of relationships at once, the paper introduces “multi-head attention”, which runs the attention mechanism multiple times in parallel on different linear projections of Q, K, and V.
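The computation just described can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the division by sqrt(d_k) and the final output projection W_o follow the original paper rather than the description above, and the sizes, the random projection matrices, and the function names are assumptions made for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # how much each word relates to every other word
    weights = softmax(scores, axis=-1)  # normalize each row into weights that sum to 1
    return weights @ V, weights

def multi_head_attention(X, heads, W_o):
    """Run attention once per head on separate projections of X, then concatenate."""
    outputs = []
    for W_q, W_k, W_v in heads:
        out, _ = attention(X @ W_q, X @ W_k, X @ W_v)
        outputs.append(out)
    return np.concatenate(outputs, axis=-1) @ W_o

rng = np.random.default_rng(42)
seq_len, d_model, n_heads = 4, 8, 2
d_head = d_model // n_heads
X = rng.normal(size=(seq_len, d_model))  # stand-in for four embedded words

# Single-head self-attention: queries, keys and values all come from X
out, weights = attention(X, X, X)
print(out.shape)  # (4, 8)

# Multi-head: each head gets its own (W_q, W_k, W_v) projection matrices
heads = [tuple(rng.normal(size=(d_model, d_head)) for _ in range(3))
         for _ in range(n_heads)]
W_o = rng.normal(size=(n_heads * d_head, d_model))
mh = multi_head_attention(X, heads, W_o)
print(mh.shape)  # (4, 8)
```

Each row of weights is a probability distribution over the words in the sequence, so every output vector is a weighted average of the value vectors.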

One other important thing to point out is that since we’re not using recurrent networks, we’re missing information about the position of our elements – the order of words in a sentence. Just because we’re using a different approach doesn’t mean that we can ignore sentence structure and produce utter gibberish as a translation. To avoid this, an encoding of each word’s position is added to its embedded representation, like a time-stamp.
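As an illustration of such a “time-stamp”, here is a sketch of the sinusoidal positional encoding used in the original Transformer paper. The choice of sines and cosines is one option taken from that paper (learned position embeddings are another), and the function name, the sizes, and the random stand-in embeddings are assumptions for this example.

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Sinusoidal positional encoding; assumes d_model is even."""
    positions = np.arange(seq_len)[:, None]   # shape (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]  # even dimension indices 0, 2, 4, ...
    angles = positions / np.power(10000.0, dims / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)  # even dimensions use sine
    pe[:, 1::2] = np.cos(angles)  # odd dimensions use cosine
    return pe

seq_len, d_model = 6, 8
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(seq_len, d_model))  # stand-in for word embeddings
pe = positional_encoding(seq_len, d_model)
model_input = embeddings + pe  # the position "time-stamp" is simply added
print(model_input.shape)  # (6, 8)
```

Because every position gets a different vector, the same word at two different places in the sentence ends up with two different inputs.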


Introducing BERT

Bidirectional Encoder Representations from Transformers, or BERT for short, is a state-of-the-art language model introduced by Google in 2018. It caused quite a stir in the community because of its impressive results in tasks like Question Answering, Natural Language Inference, and others.

The “bidirectional” in BERT’s name comes from the fact that by using the Transformer model, BERT doesn’t look at text left-to-right or right-to-left, but takes it in all at the same time. Perhaps calling it “non-directional” would be more technically accurate, but BERT sounds pretty cool so nobody’s complaining.

Building on the attention mechanism introduced by the Transformer model, BERT sets out to build a language model – for this, only the encoder is necessary. The original paper also introduced two new training strategies:

  • Masked Language Model (MLM). When an input is fed into the model, roughly 15% of the words are replaced with a [MASK] token. The model then attempts to predict the words hidden behind those masks based on the context of the remaining words. This approach leads to slower convergence during training but a higher context awareness.
  • Next Sentence Prediction (NSP). This technique focuses on figuring out whether one sentence follows another. During the training process, the model receives pairs of sentences. In 50% of the cases, the second sentence really does follow the first, while in the other 50% a random sentence is chosen instead. The model then attempts to figure out whether the sentences are connected or not.
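The two strategies above mostly come down to how the training examples are prepared, which can be sketched with the Python standard library alone. This is a simplified illustration: the function names and the toy sentences are made up for the example, whitespace splitting stands in for a real tokenizer, and the actual BERT recipe is more involved (for instance, it does not always substitute the mask token for a selected word, and it draws the random sentence from a different document).

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, rng, mask_prob=0.15):
    """Replace roughly mask_prob of the tokens with MASK.

    Returns the masked sequence and a dict {position: original token}
    holding the words the model would have to predict.
    """
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            masked.append(MASK)
            targets[i] = tok
        else:
            masked.append(tok)
    return masked, targets

def make_nsp_pair(sentences, index, rng):
    """Build one (sentence_a, sentence_b, is_next) example.

    Half of the time sentence_b is the real next sentence, otherwise
    it is a random other sentence. Needs at least three sentences.
    """
    first = sentences[index]
    if rng.random() < 0.5:
        return first, sentences[index + 1], True
    candidates = [s for j, s in enumerate(sentences) if j not in (index, index + 1)]
    return first, rng.choice(candidates), False

rng = random.Random(0)  # seeded so that the example is repeatable
sentences = [
    "the cat sat on the mat",
    "then it fell asleep",
    "stock prices rose sharply today",
    "the weather was cold",
]
tokens = sentences[0].split()
masked, targets = mask_tokens(tokens, rng)
sentence_a, sentence_b, is_next = make_nsp_pair(sentences, 0, rng)
print(masked, targets)
print(sentence_a, "|", sentence_b, "|", is_next)
```

During training, the model would see masked together with the sentence pair, and would be scored both on recovering the words in targets and on guessing is_next.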

The BERT model trains to minimize the combined loss function of these two strategies. If you’re interested in this project, you can find out more on its source page.

GPT-3

OpenAI has been blowing people’s minds with their achievements over the last few years, and recently they presented what could possibly be their biggest hit yet. GPT-3 is a state-of-the-art language model made up of 175 billion (yes, you read that correctly) parameters. Its predecessor, GPT-2, had only 1.5 billion parameters, and the largest language model released before GPT-3, by Microsoft, had 17 billion.

As stated by the researchers themselves: “GPT-3 achieves strong performance on many NLP datasets, including translation, question-answering, and cloze tasks, as well as several tasks that require on-the-fly reasoning or domain adaptation, such as unscrambling words, using a novel word in a sentence, or performing 3-digit arithmetic.” The researchers also added: “We find that GPT-3 can generate samples of news articles which human evaluators have difficulty distinguishing from articles written by humans.”

Because GPT-3 is so recent, it’s hard to gauge the exact impact it will have on the field of NLP and machine learning in general. For now, we can say that the community is filled with heated discussions, with opinions varying from praise to concerns about what GPT-3 means for society.

You can also follow the developments of the GPT-3 project on its source page.


Learn more about Machine Learning with Abacus.AI

Be sure to check out our other articles where we cover all sorts of machine learning topics – Meta-Learning, Generative Adversarial Networks, Anomaly detection, and more. You can also contact us if you want to see how these techniques work in practice and how they can help you grow your business today.
