
Understanding Seq2Seq Models



Most of the machine learning applications you’ve likely heard about deal with data such as images or database records – their key characteristic being that a learning model can “take them in” all at once, with no temporal structure. Today we’ll be talking about a different case: models that handle data that is sequential by nature, text and voice being two examples. These models take a sequence as input and produce a different sequence as output – sequence-to-sequence (Seq2Seq) models, first introduced in 2014.

The most obvious example of a Seq2Seq application is translation, and Google has been using this approach in its translator since 2016. Translating text word-for-word with a dictionary can give you a rough idea of the original, but it only gets you so far. In translation, words shift position, disappear, or appear out of nowhere, and the meaning of a word often depends heavily on its context. Seq2Seq models with the so-called attention mechanism are a way to deal with this problem, among many others.

“Lost in Sequences”

When working with data like text that comes in sentences and paragraphs, it’s almost impossible to know in advance what the exact length of the data will be. While this variety and richness is a wonderful characteristic of language, it spells trouble for deep learning models that need input and output vectors of fixed length. You have to pick the size of your layers, and training a separate model for every possible combination of input and output length is out of the question.

Seq2Seq models deal with this problem in a manner similar to how backpropagation through time solves the task of training recurrent networks. There, we take a temporal, self-referential network and unroll it into a spatial, non-self-referential one. We can do something similar for sequential data by reinterpreting a spatial problem (a sequence of variable length) as a temporal one (data generated over time). In other words, we feed data into our “black box” one element at a time and pull elements out at the other end until we get a marker indicating that the sequence is finished.
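
To make the idea concrete, here is a minimal sketch of that loop in Python. The `step` callables, the `<sos>`/`<eos>` markers, and the `max_len` cutoff are illustrative assumptions for this example, not part of any particular library:

```python
EOS = "<eos>"  # marker that tells us the sequence is finished

def encode(tokens, step, state=None):
    """Feed a variable-length sequence into the model one element at a time."""
    for token in tokens:
        state = step(token, state)   # each step sees the current token and the previous state
    return state                     # the final state summarizes the whole input

def decode(step, start_token="<sos>", max_len=50):
    """Pull elements out one at a time until the model emits the end marker."""
    outputs, token, state = [], start_token, None
    for _ in range(max_len):
        token, state = step(token, state)
        if token == EOS:             # stop when the model says the sequence is done
            break
        outputs.append(token)
    return outputs
```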

[Figure: a Seq2Seq model consuming an input sequence and emitting an output sequence over time. Source: https://medium.com/@devnag/seq2seq-the-clown-car-of-deep-learning-f88e1204]

The Structure of a Seq2Seq Model

Let’s move on from abstraction to the specifics of how a Seq2Seq model works. Like most sequence-transduction models, it consists of an encoder and a decoder. The encoder stores the context of the input in a hidden state vector and passes it on to the decoder. Because we’re working with sequential data, the encoder and decoder usually use some form of Recurrent Neural Network (RNN), Long Short-Term Memory (LSTM), or Gated Recurrent Unit (GRU). The size of this hidden state vector is usually a rather large power of 2 – 256, 512, or 1024.
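
As a rough illustration, a minimal PyTorch sketch of such an encoder/decoder pair might look like the following. The class names, the choice of a GRU, and the hidden size of 256 are assumptions for the example, not a prescribed implementation:

```python
import torch
import torch.nn as nn

HIDDEN_SIZE = 256  # a typical power-of-two hidden state size

class Encoder(nn.Module):
    def __init__(self, vocab_size, hidden_size=HIDDEN_SIZE):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, hidden_size)
        self.rnn = nn.GRU(hidden_size, hidden_size, batch_first=True)

    def forward(self, src):                    # src: (batch, src_len) token ids
        embedded = self.embedding(src)         # (batch, src_len, hidden)
        outputs, hidden = self.rnn(embedded)   # hidden: (1, batch, hidden)
        return outputs, hidden                 # hidden is the "context" handed to the decoder

class Decoder(nn.Module):
    def __init__(self, vocab_size, hidden_size=HIDDEN_SIZE):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, hidden_size)
        self.rnn = nn.GRU(hidden_size, hidden_size, batch_first=True)
        self.out = nn.Linear(hidden_size, vocab_size)

    def forward(self, token, hidden):          # token: (batch, 1), previous output word
        embedded = self.embedding(token)
        output, hidden = self.rnn(embedded, hidden)
        return self.out(output), hidden        # logits over the target vocabulary
```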

[Figure: encoder-decoder structure of a Seq2Seq model, with the encoder’s hidden state passed to the decoder. Source: https://towardsdatascience.com/day-1-2-attention-seq2seq-models-65df3f49e263]

RNNs are well suited to working with sequential data, but without any modifications we run into a problem similar to the vanishing gradient during backpropagation. At each step, an RNN takes the current input and its previous hidden state – this way it accumulates information about the sequence and passes it on to the next step. However, after the encoder is finished, we pass only the final hidden state vector to the decoder. The output therefore relies very heavily on the most recent context and loses a lot of information from the beginning of the sequence. This may be perfectly fine for short sentences, since most of the context is still preserved, but remember that we want our Seq2Seq model to work for sequences of any length, so we need a workaround. This is where the concept of “Attention” comes in.

Introducing Attention

The mechanism of attention, at its core, is similar to how we process information by paying attention to different regions of an image or by correlating words within a sentence. For example, in the following sentence, the word “eating” is strongly correlated with the word “apple” in our minds, despite the two being several words apart.

[Figure: attention linking the word “eating” to the word “apple” within a sentence. Source: https://lilianweng.github.io/lil-log/2018/06/24/attention-attention.html]

The implementation of attention is rather simple. Since passing a single hidden state vector (HSV) to the decoder means missing out on a lot of potential information, we instead pass on as many HSVs as there were steps in the input sequence.

Let’s figure out how our decoder will use all these vectors. The most straightforward idea is to combine all the HSVs into a single “context vector” as a weighted sum. At each decoding step, we concatenate this context vector with the decoder’s current hidden state, and as a result the decoder works with a single vector that carries the entire context. Attention allows the decoder to “look” at the input sequence selectively and pick up on the more important parts instead of just the most recent ones. One final question that needs answering is how to decide on the weights with which we add up the HSVs – the attention scores.
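
Here is a minimal sketch of that combination step, assuming the attention scores are already available; the function name and tensor shapes are illustrative:

```python
import torch

def attention_context(encoder_states, scores, decoder_hidden):
    """Combine all encoder hidden states into a context vector and join it
    with the decoder's current hidden state.

    encoder_states: (batch, src_len, hidden)  - one HSV per input step
    scores:         (batch, src_len)          - attention weights, summing to 1
    decoder_hidden: (batch, hidden)           - current decoder state
    """
    # Weighted sum of the encoder hidden states -> a single context vector.
    context = torch.bmm(scores.unsqueeze(1), encoder_states).squeeze(1)  # (batch, hidden)
    # Concatenate context and decoder state; a downstream layer would project
    # this back to the hidden size before predicting the next word.
    return torch.cat([context, decoder_hidden], dim=1)                   # (batch, 2*hidden)
```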

[Figure: an attention-based Seq2Seq model, with the decoder attending over all encoder hidden states. Source: https://www.researchgate.net/figure/An-attention-based-seq2seq-model_fig3_329464533]

These attention scores are produced by a separate neural network called the alignment model, which is trained jointly with the encoder/decoder pair. At each decoding step, the alignment model compares every encoder HSV with the decoder’s current hidden state (the context that has been collected so far). Doing this comparison for every step of the input sequence gives us raw weights, which become attention scores once they are normalized with a softmax function. With these scores, the decoder can decide which parts of the input matter most for the output currently being predicted.
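
One common way to build such an alignment model is the additive scoring introduced by Bahdanau et al.; the sketch below assumes that variant, and the layer names are illustrative:

```python
import torch
import torch.nn as nn

class AlignmentModel(nn.Module):
    """Small feed-forward network that scores how well each encoder hidden
    state matches the decoder's current state (additive, Bahdanau-style)."""

    def __init__(self, hidden_size):
        super().__init__()
        self.W_enc = nn.Linear(hidden_size, hidden_size, bias=False)
        self.W_dec = nn.Linear(hidden_size, hidden_size, bias=False)
        self.v = nn.Linear(hidden_size, 1, bias=False)

    def forward(self, encoder_states, decoder_hidden):
        # encoder_states: (batch, src_len, hidden), decoder_hidden: (batch, hidden)
        energy = torch.tanh(self.W_enc(encoder_states)
                            + self.W_dec(decoder_hidden).unsqueeze(1))  # (batch, src_len, hidden)
        raw_scores = self.v(energy).squeeze(-1)                         # (batch, src_len)
        return torch.softmax(raw_scores, dim=-1)                        # normalized attention scores
```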

[Figure from Bahdanau et al. (2014), the paper that introduced the attention/alignment mechanism for translation. Source: https://arxiv.org/pdf/1409.0473.pdf]

Seq2Seq Modifications

What we’ve talked about works for sequential input data, but with a few modifications we can build models for applications such as image captioning. The output is still a sequence of words, so the decoder needn’t change, but the encoder has to operate on a different principle. To transform an arbitrary image into a sequence of words, all we have to do is replace the RNN in the encoder with an appropriate Convolutional Neural Network (CNN), the type of network typically used in image processing. The CNN encodes the image into a hidden state vector, and the decoder uses that vector to output a corresponding description.
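
As a sketch of this swap, assuming torchvision’s ResNet-18 as the image backbone (the class name, backbone choice, and projection size are assumptions for the example):

```python
import torch
import torch.nn as nn
from torchvision import models

class CNNEncoder(nn.Module):
    """Replace the recurrent encoder with a CNN: the image is encoded into a
    single vector that plays the role of the hidden state for the decoder."""

    def __init__(self, hidden_size=256):
        super().__init__()
        backbone = models.resnet18(weights=None)                         # any image CNN works here
        self.features = nn.Sequential(*list(backbone.children())[:-1])   # drop the classification head
        self.project = nn.Linear(backbone.fc.in_features, hidden_size)

    def forward(self, images):                     # images: (batch, 3, 224, 224)
        feats = self.features(images).flatten(1)   # (batch, 512) pooled image features
        return self.project(feats)                 # (batch, hidden) -> initial decoder state
```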

[Figure: image captioning with a CNN encoder and an RNN decoder. Source: https://towardsdatascience.com/image-captioning-in-deep-learning-9cd23fb4d8d2]

Learn more about Machine Learning with RealityEngines

Be sure to check out our other articles where we cover all sorts of machine learning topics – Meta-Learning, Generative Adversarial Networks, Anomaly detection, and more. You can also contact us if you want to see how these techniques work in practice and how they can help you grow your business today.
