In memory similarity search

alex_ber

Finding the most similar string using cosine similarity

Informal definition

In its general form, the problem is:

Given some text corpus, find information in it that is relevant to some query (question).

There are many enterprise-grade (which implies heavyweight) solutions to this problem. In general, they split the process into two parts:

  • Vector store initialization: splitting the text corpus into chunks and storing it somehow as text plus the corresponding embeddings (a numeric representation of the text; I will come back to this point below). The storage can be a database, a file, a Redis cluster, etc.
  • Querying the vector store: given a query string, we want to find “similar” text in our corpus that can hopefully be used by the end user. For example, the query can be a question and what we are retrieving is an answer to it. Naïvely, we would check every chunk against the query string.

For a big text corpus, however, it is not feasible to run the “similarity function” on all chunks, so there are many heuristics for reducing the number of times the “similarity function” is applied, such as k-Nearest Neighbors (k-NN), k-Means++, Approximate Nearest Neighbor (ANN) search, and many more.

I want to solve a much less ambitious problem:

Given an input string and a small list of strings, find the one string that is most “similar”.

You can find the source code here. It is available as part of my AlexBerUtils project.

You can install AlexBerUtils from PyPI:

python -m pip install -U alex-ber-utils[np]

You need to install numpy. You also need to use some embeddings class; more on this a bit later.

See here for a more detailed explanation of how to install.

My use case

To clarify the problem I’m addressing, imagine a team-based game. Let’s say the team consists of 10 people. The host asks a question and provides a set amount of time, say one minute, for the team to come up with an answer. During this minute, all team members are allowed to discuss and propose different answers. The team captain is responsible for selecting the final answer. In my game, the captain cannot create an answer “out of the blue” on their own; any potential answer must have been discussed within the given minute. However, if the captain mentions their answer during the discussion, it is acceptable. The game is broadcast live. So, imagine the captain answers the question. For the viewers it is not immediately clear who the author of the answer was. First of all, when 10 people talk in a short period of time, it is hard to hear all the versions. Second, the captain will almost never say the answer literally, but it will be pretty close, “similar” to the original answer.

We want to develop an assistant for my game that can extract all proposed answers and their authors from the discussion. Let’s assume this is a solvable problem in near-real-time. The assistant will then take the captain’s answer, compare it with the proposed answers and their respective authors, and output whose version the captain just mentioned.

Now, refer back to the description of my problem, and you will see that this is exactly what I am addressing.

As a bonus, putting aside the embeddings (the vector representation of the text), all calculations in my solution occur in-memory.

Embeddings

I’ve extracted the machine-learning theory about embeddings into a separate post.

In practice, in order to use good embeddings you need to add some heavyweight dependency. There are two problems with this: I don’t want to mandate one particular model or another for calculating embeddings, and I want my project to rely on as few dependencies as possible. And lightweight ones, if possible.

Note: I’m getting ahead of myself a little bit, but I want to clarify the importance of numpy. To perform calculations like cosine similarity, numpy is indispensable. Interestingly, numpy has no external dependencies, and there are wheels available with precompiled C/Fortran code. This means you don’t need to compile anything during installation, and the wheel size is quite small.

I don’t want to include any specific implementation of embeddings; instead, I want the embedding method to be supplied by the client who uses my library.

But I need to use it in the code in some concrete way. So, I have two problems to solve:

  1. How exactly to use some unknown-in-advance class that should calculate embeddings.
  2. How to communicate this contract to the clients of my library.

The first problem is pretty easy to solve. Just look at some well-established library.

Langchain is an open-source library that offers an abstraction layer for using different LLM-related tools from different vendors. The project is only about 1.5 years old, but it streamlines many processes that would otherwise be cumbersome.

"""**Embeddings** interface."""
from abc import ABC, abstractmethod
from typing import List

from langchain_core.runnables.config import run_in_executor


class Embeddings(ABC):
"""Interface for embedding models."""

@abstractmethod
def embed_documents(self, texts: List[str]) -> List[List[float]]:
"""Embed search docs."""

@abstractmethod
def embed_query(self, text: str) -> List[float]:
"""Embed query text."""

async def aembed_documents(self, texts: List[str]) -> List[List[float]]:
"""Asynchronous Embed search docs."""
return await run_in_executor(None, self.embed_documents, texts)

async def aembed_query(self, text: str) -> List[float]:
"""Asynchronous Embed query text."""
return await run_in_executor(None, self.embed_query, text)

https://github.com/langchain-ai/langchain/blob/master/libs/core/langchain_core/embeddings/embeddings.py

One of the concrete implementations is OpenAIEmbeddings; see https://github.com/langchain-ai/langchain/blob/master/libs/partners/openai/langchain_openai/embeddings/base.py

Let’s look closely at the interface above. It has 2 abstract sync methods and 2 async methods with default implementations.

Side note: the run_in_executor method mentioned in the source code does the same as asyncio.to_thread(). You can read here for more details.

While there are async methods to get embeddings for a text, in my concrete case I don’t need them. So, we are left with only the 2 regular methods.

  • Let’s examine embed_query(). It receives text as a string and returns a list of floats (Python’s approximation of a vector of real numbers).
  • Let’s examine embed_documents(). It receives a list of texts and returns, for each text, a list of floats.

When I examined real usage, I noticed that embed_query() is almost never used. Almost all the code that I see prefers to use embed_documents(), passing a list of strings and taking the result at index 0. embed_query() can be implemented with the more general embed_documents(), like this:

v = embeddings.embed_documents([text])[0]

So, I decided to use embed_documents().

Ok. This solves the first problem: how exactly to use some unknown-in-advance class that should calculate embeddings.

Now, what about how to communicate this contract to the clients of my library? I see the following options:

  1. Write it in the documentation (in the docstrings of the publicly available methods).
  2. Create an interface based on the one from langchain, but with only the one method that is actually used.
  3. Use a type hint and benefit from duck typing.

Quote from my post about Structured Polymorphism (Duck Typing):

In [Python] Protocols formalize the concept of duck typing by allowing you to define expected behaviors (methods and properties) that a type must implement. This provides several benefits:

  1. Type Checking: Protocols enable static type checkers like mypy to verify that objects conform to the expected interface, catching errors at compile time rather than runtime. Side note: in Python, adherence to an interface is checked at runtime.
  2. Documentation: Protocols serve as a form of documentation, making it clear what methods and properties an object must implement to be used in a particular context.
  3. Flexibility: Like duck typing, protocols allow for flexible and interchangeable use of different types, as long as they adhere to the defined interface.

https://alex-ber.medium.com/structured-polymorphism-duck-typing-6ac004a3f631

So it is clear that Protocol is the way to go:

from typing import List, Protocol

# for example
# from langchain_openai import OpenAIEmbeddings
class Embeddings(Protocol):
    def embed_documents(self, texts: List[str]) -> List[List[float]]:
        ...
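
To illustrate structural typing in action, here is a minimal sketch (MyEmbeddings and embed_all below are made-up names, not part of any library): any class with a matching embed_documents() method satisfies this Protocol without inheriting from it.

class MyEmbeddings:  # note: does NOT inherit from Embeddings
    def embed_documents(self, texts: List[str]) -> List[List[float]]:
        # Toy implementation: a one-dimensional "embedding" -- just the text length.
        return [[float(len(text))] for text in texts]

def embed_all(embeddings: Embeddings, texts: List[str]) -> List[List[float]]:
    # mypy accepts MyEmbeddings here because it structurally matches the Embeddings Protocol.
    return embeddings.embed_documents(texts)

print(embed_all(MyEmbeddings(), ["hello", "hi"]))  # [[5.0], [2.0]]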

SimpleEmbeddings

Quote from above:

I don’t want to include any specific implementation of embeddings; instead, I want the embedding method to be supplied by the client who uses my library.

Well, this isn’t entirely accurate. I don’t intend to include a production-grade implementation of embeddings. However, I do want to provide some implementation. It serves two purposes: unit-testing and educational purposes.

Let’s look at the code first:

class SimpleEmbeddings:
    def __init__(self, dims: int = 1536):
        self.dims = dims

    def embed_documents(self, texts: List[str]) -> List[List[float]]:
        embeddings = []
        for text in texts:
            embedding_vector = [0] * self.dims
            # This effectively counts the occurrences of each character in the text, mapped to a fixed-size vector.
            for char in text:
                index = hash(char) % self.dims
                embedding_vector[index] += 1
            embeddings.append(embedding_vector)
        return embeddings

  • __init__() method. Initializes an instance of the class with a specified dimension size for the embeddings. This dimension size (self.dims) determines the length of the embedding vectors. The default value of 1536 is chosen to correspond to the vector dimensions in OpenAIEmbeddings. You can change it, of course.
  • embed_documents() receives texts (a list of strings) and returns a list of lists of floats (each inner list is an embedding vector).
  • For each text in texts:
  • An embedding vector of length self.dims (1536 by default) is initialized with all zeros.
  • The method iterates over each character in the text.
  • For each character, it computes an index using the hash function and the modulo operation (% self.dims). This ensures that the index is within the bounds of the embedding vector.
  • The value at the computed index in the embedding vector is incremented by 1. This effectively counts the occurrences of each character, mapped to a fixed-size vector.
  • After processing all characters in the text, the resulting embedding vector is appended to the embeddings list.
  • The method returns the list of embedding vectors, one for each input text.

The SimpleEmbeddings class provides a simple way to generate fixed-size embedding vectors for a list of text documents. The embedding vectors are created by counting the occurrences of each character in the text and mapping these counts to a fixed-size vector using the hash function. This approach ensures that the resulting embeddings have a consistent size (self.dims), regardless of the length of the input texts.

This method is quite rudimentary and primarily serves as an example of how to create fixed-size embeddings. In practice, more sophisticated methods (e.g., Word2Vec, GloVe, BERT) are used to generate embeddings that capture semantic meaning and context.
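
A tiny usage illustration (note that Python randomizes hash() for strings per process, so the exact bucket positions vary between runs, while the vector length is always dims):

emb = SimpleEmbeddings(dims=8)  # small dims just to keep the output readable
vectors = emb.embed_documents(["goal", "gooal"])
print(len(vectors), len(vectors[0]))     # 2 8 -- two texts, each mapped to an 8-dimensional vector
print(sum(vectors[0]), sum(vectors[1]))  # 4 5 -- each character contributes one count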

You can find my unit tests here. I tested it on a set of questions related to soccer. It did well, but one of the answers was a variant of “I don’t know” and the other was the actual answer, so it should be relatively easy to distinguish between them. Note, however, that in some cases the difference was very small.

find_most_similar

Now, let’s look at the API.

Note: For details about Embeddings see above.

from alexber.utils.in_memory_similarity_search import find_most_similar, find_most_similar_with_scores

First, I want to show and explain the implementation of the find_most_similar function:

def find_most_similar(
    embeddings: Embeddings,
    input_text: str,
    /,
    *args: List[str],
    verbose=True
) -> Tuple[int, str]:
    logger.info("find_most_similar()")
    sorted_similarities_l: List[Tuple[Tuple[int, str], float]] = \
        find_most_similar_with_scores(embeddings, input_text, *args, verbose=verbose)
    return sorted_similarities_l[0][0]

This is a kind of wrapper around find_most_similar_with_scores(): all parameters are passed into it, and it receives back a list of elements sorted in descending order, from most similar to least similar. sorted_similarities_l[0] takes the most similar element, which has the form ((idx, example), score). Then it takes only the (idx, example) part and returns it as Tuple[int, str].
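
For example, with the SimpleEmbeddings class from above (the scores, and therefore the winner, depend on the embeddings implementation you supply), a call could look like this:

embeddings = SimpleEmbeddings()
idx, text = find_most_similar(
    embeddings,
    "no idea, sorry",         # the captain's answer
    "I don't know",           # proposed answer 0
    "It was Diego Maradona",  # proposed answer 1
)
print(idx, text)  # index and text of the proposed answer most similar to the captain's answer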

Cosine similarity in Euclidean space

For two vectors $\mathbf{a} = (x_1, x_2, \dots, x_n) \in \mathbb{R}^n$ and $\mathbf{b} = (y_1, y_2, \dots, y_n) \in \mathbb{R}^n$, the cosine similarity is

$$\cos(\theta) = \frac{\mathbf{a} \cdot \mathbf{b}}{\lVert \mathbf{a} \rVert \, \lVert \mathbf{b} \rVert}$$

where the dot product is

$$\mathbf{a} \cdot \mathbf{b} = \sum_{i=1}^{n} x_i y_i$$

and the norm is

$$\lVert \mathbf{a} \rVert = \sqrt{\mathbf{a} \cdot \mathbf{a}} = \sqrt{\sum_{i=1}^{n} x_i^2}$$

(the $\mathbf{a} \cdot \mathbf{a}$ in the norm definition is the dot product defined above).

Note: The value of cosine similarity ranges from -1 to 1.

* 1 indicates that the vectors are identical in direction.

* 0 indicates that the vectors are orthogonal (perpendicular).

* -1 indicates that the vectors are diametrically opposed.

For a very detailed explanation, see my cosine similarity post.
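
As a quick sanity check, here is the same formula in plain numpy for a single pair of vectors (just an illustration, not library code):

import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])

cosine_similarity = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
print(cosine_similarity)  # 1.0 -- b points in exactly the same direction as a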

Implementation details of find_most_similar_with_scores

Before I delve into the details, I want to give a high-level description of the implementation:

  1. For the input text and the list of texts, their corresponding embeddings are calculated.
  2. Cosine similarity [see above] is calculated for all pairs where the first element is the embedding of the input text and the second element is an embedding from the list of texts.
  3. The results are sorted in descending order (recall that 1 means identical in direction and -1 means diametrically opposed).

A few notes: as described above, I assume that the number of strings in the list is low, so there is no memory or time constraint on computing cosine similarity against every element in the list.

Also, I remind you that it is up to the caller to supply the Embeddings object.

Now, let’s look at this closely (a sketch of the implementation follows the list):

  • Embedding Calculation: Calls _calc_embedding_as_matrix() to convert the input_text into an embedding vector. This function uses the embed_documents() method of the Embeddings instance and reshapes the result into a 2D numpy array. Numpy prefers matrices over vectors. For example, the vector v=[1, 2, 3, 4] becomes the numpy 2D array [[1 2 3 4]].
  • Creates a dictionary where each key is a tuple (index, text) and each value is the corresponding embedding vector. This is done for all texts in args.
  • Uses np.vstack() to stack all embedding vectors vertically into a single 2D numpy array. This creates a matrix where each row corresponds to an embedding vector of a comparison text. This is done to make calculations in one go.
  • Calculates the norm of the input_v vector using np.linalg.norm. The axis=1 parameter ensures the norm is calculated along the rows, and keepdims=True retains the dimensions. Because we reshaped everything into matrices, this means that we’ll get a numpy 2D array.
  • Calculates the norm of the example_matrix using np.linalg.norm in the same way: axis=1 calculates the norm along the rows, and keepdims=True retains the dimensions, so again we get a numpy 2D array.
  • The dot product np.dot(input_v, example_matrix.T) computes the dot product between each vector in input_v and each vector in example_matrix. The result of the dot product is a matrix where each element (i, j) is the dot product of the i-th vector in input_v and the j-th vector of example_matrix.
  • To normalize the dot product matrix, we need to divide each element (i, j) of the dot product matrix by the product of the norms of the corresponding vectors. This can be done efficiently using np.outer() to compute the outer product of input_norm and example_norms. This results in a matrix where each element is the product of the corresponding norms.
  • We perform element-wise division to obtain the cosine similarities. Divides the dot product matrix by the outer product of norms to get the cosine similarity matrix. Cosine similarity measures the cosine of the angle between two vectors, which is a measure of their similarity.
  • Replaces any NaN (Not a Number) or Inf (Infinity) values in the similarity matrix with 0.0 to avoid numerical issues.
  • Creates a dictionary where each key is a tuple (index, text) and each value is the corresponding similarity score from the similarity matrix.
  • Sorts the dictionary items by similarity score in descending order.
  • If verbose is True, logs the input text and the sorted similarity scores.
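
Putting these steps together, here is a minimal sketch of such a function. It is not the actual library code (the helper names, the function name with the _sketch suffix, and the logging details are simplified away); it only illustrates the numpy recipe described above, assuming the Embeddings Protocol defined earlier in this post:

import logging
from typing import Dict, List, Tuple

import numpy as np

logger = logging.getLogger(__name__)

def find_most_similar_with_scores_sketch(
    embeddings: Embeddings,
    input_text: str,
    /,
    *args: str,
    verbose=True
) -> List[Tuple[Tuple[int, str], float]]:
    # Embedding calculation: every text becomes one row of a 2D numpy array.
    input_v = np.array(embeddings.embed_documents([input_text]))           # shape (1, dims)
    examples: Dict[Tuple[int, str], np.ndarray] = {
        (i, text): np.array(embeddings.embed_documents([text]))
        for i, text in enumerate(args)
    }
    example_matrix = np.vstack(list(examples.values()))                    # shape (len(args), dims)

    # Norms along the rows; keepdims=True keeps the results 2D.
    input_norm = np.linalg.norm(input_v, axis=1, keepdims=True)            # shape (1, 1)
    example_norms = np.linalg.norm(example_matrix, axis=1, keepdims=True)  # shape (len(args), 1)

    # Dot products of the input vector with every example vector, in one go.
    dot_products = np.dot(input_v, example_matrix.T)                       # shape (1, len(args))

    # Element-wise division by the outer product of the norms gives cosine similarities;
    # NaN/Inf values (e.g. for all-zero vectors) are replaced with 0.0.
    similarity_matrix = dot_products / np.outer(input_norm, example_norms)
    similarity_matrix = np.nan_to_num(similarity_matrix, nan=0.0, posinf=0.0, neginf=0.0)

    # Map every (index, text) pair to its score and sort by score, descending.
    scores = {key: float(similarity_matrix[0, j]) for j, key in enumerate(examples)}
    sorted_similarities_l = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

    if verbose:
        logger.info("input=%s, similarities=%s", input_text, sorted_similarities_l)
    return sorted_similarities_l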
