What are embeddings?

alex_ber
5 min read · Jun 21, 2024


Embedding is a technique used in natural language processing (NLP) and machine learning to represent words, phrases, or other discrete data as points in a continuous vector space R^n (more rigorously, we don’t use the real numbers R themselves but a discrete approximation such as float32 or float16; see the IEEE 754 single-precision and half-precision floating-point formats). The goal is to capture the semantic meaning of the data so that similar items end up close to each other in this vector space. Embeddings are particularly useful because they allow algorithms to work with high-dimensional discrete data in a more efficient and meaningful way.

I will try to demonstrate this with a toy example. The specific algorithm is called Word2Vec.

Word2Vec

Suppose we have a toy corpus of sentences:

  1. “I love machine learning”.
  2. “Machine learning is great”.
  3. “I love coding”.
  4. “Coding is fun”.

We tokenize by splitting each sentence into words:

  • Sentence 1: [“I”, “love”, “machine”, “learning”]
  • Sentence 2: [“Machine”, “learning”, “is”, “great”]
  • Sentence 3: [“I”, “love”, “coding”]
  • Sentence 4: [“Coding”, “is”, “fun”]

Now we join all the words, removing duplicates (capitalization is ignored, so “Machine” and “machine” count as the same word):

  • Vocabulary: [“I”, “love”, “machine”, “learning”, “is”, “great”, “coding”, “fun”]
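
These two steps can be reproduced with a few lines of Python. This is just an illustrative sketch; the case-insensitive de-duplication simply mirrors the vocabulary above.

```python
corpus = [
    "I love machine learning",
    "Machine learning is great",
    "I love coding",
    "Coding is fun",
]

# Tokenize: split each sentence into words.
tokenized = [sentence.split() for sentence in corpus]

# Build the vocabulary: keep the first spelling of each word,
# treating different capitalizations as the same word.
vocabulary = []
seen = set()
for sentence in tokenized:
    for word in sentence:
        key = word.lower()
        if key not in seen:
            seen.add(key)
            vocabulary.append(word)

print(vocabulary)
# ['I', 'love', 'machine', 'learning', 'is', 'great', 'coding', 'fun']
```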

Now we apply what is called one-hot encoding: we turn words into numbers so that a computer can work with them. Every word is associated with the index at which it appears in the vocabulary. Our vocabulary has size 8, so we can use an 8-dimensional boolean vector. Each word gets a unique code in which exactly one position is 1 (True) and all the others are 0 (False).

For example:

“I” has index 0 in our vocabulary, so its 8-dimensional representation vector has 1 (True) at index 0 and 0 (False) at every other index.

  • “I” will be [1, 0, 0, 0, 0, 0, 0, 0]

“love” has index 1 in our vocabulary, so its 8-dimensional representation vector has 1 (True) at index 1 and 0 (False) at every other index.

  • “love” will be [0, 1, 0, 0, 0, 0, 0, 0]

“machine” has index 2 in our vocabulary, so its 8-dimensional representation vector has 1 (True) at index 2 and 0 (False) at every other index.

  • “machine” will be [0, 0, 1, 0, 0, 0, 0, 0]
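
A minimal sketch of one-hot encoding for our vocabulary; the word_to_index mapping and the one_hot helper are names I made up for illustration.

```python
vocabulary = ["I", "love", "machine", "learning", "is", "great", "coding", "fun"]

# Map each word to its index in the vocabulary (case-insensitive).
word_to_index = {word.lower(): i for i, word in enumerate(vocabulary)}

def one_hot(word, vocab_size=len(vocabulary)):
    """Return a vector with 1 at the word's index and 0 everywhere else."""
    vector = [0] * vocab_size
    vector[word_to_index[word.lower()]] = 1
    return vector

print(one_hot("love"))     # [0, 1, 0, 0, 0, 0, 0, 0]
print(one_hot("machine"))  # [0, 0, 1, 0, 0, 0, 0, 0]
```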

Word2Vec uses the Skip-gram model to learn word associations from a large corpus of text (much larger than the toy example above). The Skip-gram model tries to learn which words usually appear near each other. So, it will learn that “I” and “machine” often appear near “love.” You can think of it as a magic book that knows which words go together. When you see the word “love,” the book tells you that “I” and “machine” are often nearby. The more you read, the better the book gets at guessing which words are friends and like to hang out together.

The Skip-gram model works by predicting the context words given a target word. For example, in the sentence “I love machine learning,” if “love” is the target word, the model will try to predict “I” and “machine” as the context words.
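
To make this concrete, here is a small Python sketch (my own illustration, not the original Word2Vec code) that generates (target, context) training pairs with a context window of 1:

```python
def skipgram_pairs(tokens, window=1):
    """Yield (target, context) pairs for every word and its neighbors."""
    pairs = []
    for i, target in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

print(skipgram_pairs(["I", "love", "machine", "learning"]))
# [('I', 'love'), ('love', 'I'), ('love', 'machine'),
#  ('machine', 'love'), ('machine', 'learning'), ('learning', 'machine')]
```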

How Skip-gram Works

  1. Input Layer: The input to the model is the one-hot encoded vector of the target word. For simplicity, let’s restrict the vocabulary to the four words of sentence 1: [“I”, “love”, “machine”, “learning”]. If the target word is “love,” the input is the row vector [0, 1, 0, 0] with dimension 4; it can be seen as a 1×4 matrix (1 row and 4 columns).
  2. Hidden Layer: The input vector is multiplied by a weight matrix to produce the hidden layer. This weight matrix is what we are trying to learn. If our embedding dimension is 2, the weight matrix might look like this (randomly initialized):
    [[0.8, 0.1],
    [0.9, 0.2],
    [0.4, 0.7],
    [0.3, 0.8]]
    It is a 4×2 matrix (4 rows and 2 columns). Multiplying the 1×4 input by this 4×2 matrix gives the hidden layer, a 1×2 matrix, i.e. a row vector with dimension 2. Because the input is one-hot, this multiplication simply selects the row of the weight matrix that corresponds to the target word; for “love” the hidden layer is [0.9, 0.2].
  3. Output Layer: The hidden layer is then multiplied by another weight matrix, called the context matrix (I will call it the “second matrix”), which here has dimension 2×4. After applying softmax, the result is a vector of probabilities for each word in the vocabulary: for each target word, the model tries to predict the words that are likely to appear near it (the context words).
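
To make these three steps concrete, here is a minimal NumPy sketch of a single forward pass. The first matrix is the 4×2 example above, while the 2×4 second (context) matrix and the softmax helper are my own illustrative additions.

```python
import numpy as np

vocab = ["I", "love", "machine", "learning"]

# First weight matrix (input -> hidden), shape 4x2: one row per word.
W1 = np.array([[0.8, 0.1],
               [0.9, 0.2],
               [0.4, 0.7],
               [0.3, 0.8]])

# Second (context) weight matrix (hidden -> output), shape 2x4, made-up values.
W2 = np.array([[0.5, 0.6, 0.1, 0.3],
               [0.2, 0.4, 0.9, 0.7]])

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Target word "love" as a one-hot row vector (1x4).
x = np.array([0.0, 1.0, 0.0, 0.0])

hidden = x @ W1          # 1x2: simply selects row 1 of W1 -> [0.9, 0.2]
scores = hidden @ W2     # 1x4: one score per vocabulary word
probs = softmax(scores)  # probability of each word being a context word

print(hidden)                              # [0.9 0.2]
print(dict(zip(vocab, probs.round(3))))    # predicted context-word probabilities
```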

Learning the Vectors

When the model is trained, it gradually adjusts its internal parameters (the weight matrices) to improve prediction accuracy. This process occurs as follows:

  1. Training iterations: The model repeatedly makes predictions, compares them with actual data, and updates its weights. Through thousands or even millions of such iterations, the model finds optimal weight values that minimize prediction errors on all training data.
  2. Understanding context: During training, the model starts to recognize patterns in the data. It notices which words frequently appear together and adjusts its weights to better predict these words in the future. Words that often appear in similar contexts receive similar numerical representations (vectors). For example, the words “king” and “queen” might frequently appear alongside words like “throne,” “castle,” “royal.” Therefore, their vectors will be similar because the model learns to predict them based on the same surrounding words.
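
The following sketch shows what one such training update could look like for the toy example, using plain gradient descent on a full-softmax Skip-gram loss. The random initialization, learning rate, number of epochs, and training pairs are my own choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["I", "love", "machine", "learning"]
V, D = len(vocab), 2               # vocabulary size, embedding dimension
W1 = rng.normal(0, 0.1, (V, D))    # first matrix: input -> hidden
W2 = rng.normal(0, 0.1, (D, V))    # second (context) matrix: hidden -> output
lr = 0.05                          # learning rate

def one_hot(i, n=V):
    v = np.zeros(n)
    v[i] = 1.0
    return v

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# (target index, context index) pairs from "I love machine learning", window = 1.
pairs = [(0, 1), (1, 0), (1, 2), (2, 1), (2, 3), (3, 2)]

for epoch in range(500):
    for t, c in pairs:
        x, y = one_hot(t), one_hot(c)
        h = x @ W1                     # hidden layer: selects row t of W1
        p = softmax(h @ W2)            # predicted context-word probabilities
        err = p - y                    # gradient of cross-entropy w.r.t. the scores
        grad_h = err @ W2.T            # gradient w.r.t. the hidden layer
        W2 -= lr * np.outer(h, err)    # adjust the context matrix
        W1[t] -= lr * grad_h           # adjust only the target word's row

print(dict(zip(vocab, np.round(W1, 2))))   # the learned word vectors
```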

Extracting the Vectors

After training, the rows of the first weight matrix (from the input layer to the hidden layer) become the word vectors. For example, if our weight matrix is:

[[0.8, 0.1],
[0.9, 0.2],
[0.4, 0.7],
[0.3, 0.8]]

The word vectors are:

  • “I”: [0.8, 0.1]
  • “love”: [0.9, 0.2]
  • “machine”: [0.4, 0.7]
  • “learning”: [0.3, 0.8]

Note: Each row directly corresponds to a word in the vocabulary.

These vectors are dense representations of the words, capturing their meanings based on the contexts in which they appear. The dense vectors are not just about direct word co-occurrence but also capture deeper semantic similarities. For instance, “king” and “queen” might have similar vectors because they both relate to royalty, even if they don’t always appear together in text.
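
To see what “similar vectors” means in practice, here is a small sketch that treats the rows above as trained embeddings and compares them with cosine similarity (a standard measure; the helper function is my own):

```python
import numpy as np

embeddings = {
    "I":        np.array([0.8, 0.1]),
    "love":     np.array([0.9, 0.2]),
    "machine":  np.array([0.4, 0.7]),
    "learning": np.array([0.3, 0.8]),
}

def cosine_similarity(a, b):
    """1.0 means the vectors point in the same direction, 0.0 means orthogonal."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine_similarity(embeddings["I"], embeddings["love"]))            # ~1.00
print(cosine_similarity(embeddings["machine"], embeddings["learning"]))  # ~0.99
print(cosine_similarity(embeddings["I"], embeddings["learning"]))        # ~0.47
```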

Note: While typically the weight matrix from the input layer to the hidden layer, as described above, is used for embeddings, the weight matrix from the hidden layer to the output layer can also be used to obtain word embeddings, and some studies have explored combining both matrices to enhance the quality of the embeddings. Each column of the second weight matrix can be interpreted as an embedding, but it represents the word in the context of predicting surrounding words rather than as a direct word embedding. This is a more indirect approach.
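
For illustration only, here is one simple way the two matrices could be combined, by averaging the rows of the first matrix with the corresponding columns of the second. The second matrix values are made up, and this averaging is just one possible scheme, not a prescribed method.

```python
import numpy as np

# First matrix (input -> hidden): one row per word, shape 4x2.
W1 = np.array([[0.8, 0.1],
               [0.9, 0.2],
               [0.4, 0.7],
               [0.3, 0.8]])

# Second (context) matrix (hidden -> output): one column per word, shape 2x4.
# Made-up values for illustration.
W2 = np.array([[0.5, 0.7, 0.2, 0.1],
               [0.1, 0.3, 0.8, 0.9]])

row_embeddings = W1            # the usual choice: rows of the first matrix
col_embeddings = W2.T          # columns of the second matrix, one per word
combined = (W1 + W2.T) / 2     # one simple way to combine both

print(combined)
```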

Word2Vec Summary

The Skip-gram model is trained to predict the context words given a target word. By doing so, it learns to capture the relationships between words based on their co-occurrence patterns in the corpus. This predictive task is what drives the learning of meaningful word embeddings, which can then be used for various natural language processing tasks.
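
In practice you would rarely implement this by hand; a library such as Gensim can train a Skip-gram model on our toy corpus in a few lines. The hyperparameter values below are my own choices for this sketch.

```python
from gensim.models import Word2Vec

sentences = [["i", "love", "machine", "learning"],
             ["machine", "learning", "is", "great"],
             ["i", "love", "coding"],
             ["coding", "is", "fun"]]

# sg=1 selects the Skip-gram architecture; vector_size=2 matches the toy example.
model = Word2Vec(sentences, vector_size=2, window=2, min_count=1,
                 sg=1, epochs=200, seed=1)

print(model.wv["love"])               # the learned 2-dimensional vector for "love"
print(model.wv.most_similar("love"))  # nearest words by cosine similarity
```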

In general, to extract embeddings from a deep learning network, the network typically needs to be designed for a task that involves learning meaningful representations of the input data, such as classification (assigning input data to one of several predefined categories), language modeling (predicting the next word in a sequence given the previous words), autoencoders (reconstructing the input data from a compressed representation), and more. As we saw above, different layers of the neural network can be used as embeddings.
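
As a sketch of this general idea, here is a tiny, hypothetical PyTorch classifier whose hidden layer can be reused as an embedding once the model has been trained on its classification task; the architecture, names, and dimensions are invented for illustration.

```python
import torch
import torch.nn as nn

class TinyClassifier(nn.Module):
    """A small classifier; the hidden (encoder) layer doubles as an embedding."""
    def __init__(self, in_dim=8, embed_dim=2, num_classes=3):
        super().__init__()
        self.encoder = nn.Linear(in_dim, embed_dim)   # produces the embedding
        self.head = nn.Linear(embed_dim, num_classes) # produces class scores

    def forward(self, x):
        return self.head(torch.relu(self.encoder(x)))

model = TinyClassifier()
# ... train the model on a classification task here ...

x = torch.rand(1, 8)                  # some input example
with torch.no_grad():
    embedding = model.encoder(x)      # reuse the hidden-layer output as the embedding
print(embedding)
```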

Summary

While employing a suitable deep learning network is a common approach, it’s not the only option. For instance, you can create embeddings based on character counts, which doesn’t require any model. However, such embeddings are generally not recommended in practice.
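
For completeness, here is a minimal sketch of the character-count idea: each text is mapped to a vector of letter counts, with no trained model involved (restricting to the 26 lowercase ASCII letters is my own simplification).

```python
from collections import Counter
import string

def char_count_embedding(text):
    """26-dimensional vector of lowercase letter counts."""
    counts = Counter(c for c in text.lower() if c in string.ascii_lowercase)
    return [counts.get(letter, 0) for letter in string.ascii_lowercase]

print(char_count_embedding("I love coding"))
```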
