
Contextualized Embeddings with ELMo

ELMo stands for Embeddings from Language Models. So first, what is a language model? A language model (LM) is a probabilistic model that estimates the probability of linguistic units (words, sequences). Very briefly, in Natural Language Processing we use it for predicting the next word.
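
Concretely, a forward LM factorizes the probability of a sequence of $N$ tokens using only the preceding context (this is the standard formulation, not specific to ELMo):

$$p(t_1, t_2, \dots, t_N) = \prod_{k=1}^{N} p(t_k \mid t_1, \dots, t_{k-1})$$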

ELMo

There are three kinds (two + one bonus) of language models:

  • Forward LM : predict the next word given the preceding ones.
  • Backward LM : predict a word given the ones that follow it.
  • Forward - Backward : predict a word given both the preceding and the following ones (the corresponding factorizations are sketched below).
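
In formulas (following the standard biLM formulation), the backward LM factorizes the same sequence probability in the opposite direction, and the forward and backward models are trained jointly by maximizing the sum of their log-likelihoods:

$$p(t_1, t_2, \dots, t_N) = \prod_{k=1}^{N} p(t_k \mid t_{k+1}, \dots, t_N)$$

$$\sum_{k=1}^{N} \Big( \log p(t_k \mid t_1, \dots, t_{k-1}) + \log p(t_k \mid t_{k+1}, \dots, t_N) \Big)$$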

The main goal of ELMo is to obtain contextualized word representations, capturing the meaning of a word from its surrounding context. It employs a Deep Multi Layer Bidirectional Language Model, so let's break that down:

  • Deep : we are dealing with deep architectures, in particular LSTMs (Long Short-Term Memory networks) with residual connections that improve the learning process and mitigate the vanishing gradient problem.
  • Multi Layer : the architecture has multiple layers that process the input sentence at different levels of abstraction. The first layer (from the bottom) is a context-independent, character-level CNN that produces word-level distributed vectors. Moving upward, the layers shift from syntax-aware to semantics-aware representations.
  • Bidirectional LM : we are dealing with a Bidirectional Language Model, so the "next" predicted word is based on both the preceding and the following tokens (a simplified sketch of the architecture follows this list).
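
To make this more tangible, here is a minimal PyTorch-style sketch (my own simplification, not the official implementation): stacked bidirectional LSTM layers with residual connections, returning the representation of every layer, where layer 0 is the context-independent token embedding (e.g. from a character-level CNN).

```python
import torch
import torch.nn as nn

class SimplifiedBiLM(nn.Module):
    """Sketch of a deep bidirectional LM: stacked bidirectional LSTM layers
    with residual connections between LSTM layers, returning all layer outputs.
    (The real ELMo also has a character-level CNN and projection layers.)"""
    def __init__(self, input_dim=1024, hidden_dim=512, num_layers=2):
        super().__init__()
        # each bidirectional layer outputs 2 * hidden_dim features
        self.layers = nn.ModuleList([
            nn.LSTM(input_dim if i == 0 else 2 * hidden_dim,
                    hidden_dim, batch_first=True, bidirectional=True)
            for i in range(num_layers)
        ])

    def forward(self, token_embeddings):
        # token_embeddings: (batch, seq_len, input_dim), e.g. from a char-CNN
        outputs, x = [token_embeddings], token_embeddings
        for i, lstm in enumerate(self.layers):
            h, _ = lstm(x)
            if i > 0:          # residual connection between LSTM layers
                h = h + x
            outputs.append(h)
            x = h
        return outputs         # layer 0 (context-independent) + L LSTM layers
```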

Once both the forward and backward language models have been trained, we freeze the biLM's parameters; for each new task, we plug the frozen ELMo representations into a task-specific model, which combines the layer-wise representations into a single vector.
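
As a rough illustration (a hypothetical sketch with invented names, not the paper's exact setup), the frozen ELMo vector of each token can simply be concatenated with the task model's own context-independent embeddings before the task-specific encoder:

```python
import torch
import torch.nn as nn

class TaskModelWithELMo(nn.Module):
    """Hypothetical task model (e.g. for NER): concatenates a frozen ELMo
    vector with its own context-independent word embeddings."""
    def __init__(self, vocab_size, emb_dim, elmo_dim, hidden_dim, num_labels):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, emb_dim)
        self.encoder = nn.LSTM(emb_dim + elmo_dim, hidden_dim,
                               batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * hidden_dim, num_labels)

    def forward(self, token_ids, elmo_vectors):
        # elmo_vectors: (batch, seq_len, elmo_dim), produced by the frozen biLM
        x = torch.cat([self.word_emb(token_ids), elmo_vectors], dim=-1)
        out, _ = self.encoder(x)
        return self.classifier(out)   # per-token logits
```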


Then, for fine-tuning, we specialize the word embeddings to the task we're interested in (sentiment analysis, NER, ...). Finally, the combined representation is multiplied by a task-dependent weight ($\gamma$). Note that fine-tuning the biLM itself on domain-specific data led, in the original paper, to significant drops in perplexity and, in some cases, to better downstream performance.

Unlike earlier approaches, here we employ all the hidden layers and not only the last one. To combine the forward and backward language models, we proceed as follows:

  1. Concatenate the forward and backward internal states at each layer and normalize them.
  2. Multiply the normalized vectors by the weights learned during training.
  3. Sum these weighted internal states.
  4. Multiply by the learnable parameter $\gamma$.

To sum up:

$$\text{ELMo}_k^{task} = \gamma^{task} \sum_{j=0}^{L} s_j^{task} \, h_{k,j}^{LM}$$

This final $\gamma$ parameter is a learnable scalar used to scale the whole ELMo vector. It is important from an optimization point of view, because the biLM's internal representations follow a different distribution than the task-specific ones.
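
A minimal sketch of this combination (often called a scalar mix), assuming the layer representations have already been extracted from the frozen biLM:

```python
import torch
import torch.nn as nn

class ScalarMix(nn.Module):
    """Task-specific combination: softmax-normalized weights s_j over the
    biLM layers, then a global scale gamma (both learned with the task)."""
    def __init__(self, num_layers):
        super().__init__()
        self.scalar_weights = nn.Parameter(torch.zeros(num_layers))  # s_j (pre-softmax)
        self.gamma = nn.Parameter(torch.ones(1))                     # gamma^task

    def forward(self, layer_representations):
        # layer_representations: list of L+1 tensors, each (batch, seq_len, dim)
        s = torch.softmax(self.scalar_weights, dim=0)                 # normalize weights
        mixed = sum(w * h for w, h in zip(s, layer_representations))  # weighted sum over layers
        return self.gamma * mixed                                     # ELMo_k^{task}
```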

Conclusions

In conclusion, ELMo is a powerful deep learning model that generates contextualized word representations, which can greatly improve the performance of many natural language processing tasks.
