Vector Databases are Awesome 🤩, but why?
Vector databases use AI embeddings and efficient schemas to store data in a structured format that makes querying easier, faster, more accurate, and more intuitive.
Such a database has insight into the context of the data, which allows for fuzzy, semantic search (a search for 'salary' would also return 'compensation'-related results).
Vector databases combine advances in AI and database systems, letting unstructured data be stored, indexed, and searched efficiently at scale.
Vector databases such as Pinecone, Weaviate, and ChromaDB are a hot topic in the tech field today. I decided to do some research into the concept and understand the excitement behind them. I read many different sources, but I found that Weaviate was the most transparent about its architecture and implementation, so I will focus on describing the Weaviate database.
Inefficiencies of traditional DBs
Take, for example, a MySQL database serving data to the user through a REST API, and consider a simple schema: a books table with a book_name column and a book_cover column holding the cover image.
We will use a real, free REST API endpoint that returns book covers. We can send an API request from a bash (or zsh) shell using curl as follows:
curl https://archive.org/services/img/theworksofplato01platiala --output cover.png
Once the server receives the request, it uses SQL to query its database on the server side:
SELECT book_cover FROM books WHERE book_name="theworksofplato01platiala" LIMIT 1;
The server then sends back the binary data corresponding to the book cover; on the client side we write it to cover.png, and we get the cover of Plato's book.
Pretty cool, right? But what if we want to find "a book that talks about the soul, epistemology and ethics"? This is a bit more complicated.
Also, what if we wanted to pose the query "Find me a book cover that looks similar to this one: INSERT_IMAGE"?
In the current setup, we would need someone to describe each book or image and assign search keywords to it (such as "epistemology" or "green patterns"), then determine keywords from the query, and finally perform a keyword search. Even after all the effort of converting the data to keywords, this search is suboptimal.
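To make this concrete, here is a toy sketch (with made-up documents) of how naive keyword search misses synonyms:

import re

# Toy illustration (hypothetical data): naive keyword search misses synonyms.
documents = [
    "Our compensation packages are reviewed annually.",
    "The salary negotiation workshop is on Friday.",
]

def keyword_search(query, docs):
    # Return only documents that literally contain the query word.
    return [doc for doc in docs if re.search(query, doc, re.IGNORECASE)]

print(keyword_search("salary", documents))
# -> only the second document; the 'compensation' document is missed,
#    even though it is semantically relevant.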
Another problem with the traditional approach is reflected in the query "Give me the author of 'Crime and Punishment' and all the books he has written." The problems are:
- SQL speed: to fetch this information, multiple joins might be needed
- complex API design: the API developer would need to design an endpoint specifically for this type of query, and every new query type would need a new endpoint
The Weaviate vector database solves these problems using AI embeddings and GraphQL.
AI embeddings
Machine learning techniques allow us to represent a set of data points in a lower-dimensional vector space that captures their underlying relationships and patterns. The vector in that lower-dimensional space corresponding to a data point is called its vector embedding. Vector embeddings should ideally be:
- semantically meaningful: data points with similar meaning should be close in the vector space (dist(cat, dog) < dist(cat, airplane); see the sketch after this list)
- robust: small perturbations of a vector should lead to small changes in meaning
- linear: the vector space should capture analogies (e.g., king - man + woman ≈ queen)
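As a toy illustration of the first property, here are hand-picked (not model-generated) 3-dimensional embeddings:

import numpy as np

# Hypothetical 3-dimensional embeddings, chosen by hand for illustration only.
cat = np.array([0.9, 0.8, 0.1])
dog = np.array([0.8, 0.9, 0.2])
airplane = np.array([0.1, 0.2, 0.9])

def dist(a, b):
    # Euclidean distance between two embeddings.
    return np.linalg.norm(a - b)

print(dist(cat, dog) < dist(cat, airplane))  # True: cat is closer to dog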
For example, imagine that you have a function f that maps an image x to a point f(x) in a d-dimensional vector space. Say that we have an image x1 of a Harry Potter book cover and an image x2 of a Dr. Strange comics cover, such that the pixelwise difference between x1 and x2 is high. Still, f(x1) and f(x2) are close in the d-dimensional vector space, because both covers show the image of a wizard. f(x1) and f(x2) are the vector embeddings of the original data points.
With this example, we can see how to solve the problem with the query "Find me a book cover that looks similar to Dr. Strange comics cover". Namely, we use f to find the embedding f(x2) corresponding to the Dr. Strange comics cover, find the embeddings in its neighbourhood, look up the book covers corresponding to those neighbouring embeddings, and output them as the result of the search query.
Additionally, we could answer the query "a book that talks about the soul, epistemology, and ethics" as follows: represent each book in the database as a point in the text latent space; when searching, embed the query in the same latent space and find the closest book-points.
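A minimal sketch of this procedure, assuming a hypothetical embed() function standing in for a real text encoder (the vectors below are fake, so the ranking is meaningless; it only shows the mechanics):

import numpy as np

def embed(text):
    # Placeholder for a real text encoder (e.g. a sentence-embedding model);
    # here we fake deterministic unit vectors purely for illustration.
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.normal(size=384)
    return v / np.linalg.norm(v)

books = ["The Republic", "Meditations", "Thus Spoke Zarathustra"]
book_vecs = {title: embed(title) for title in books}

query_vec = embed("a book that talks about the soul, epistemology and ethics")

# Rank books by cosine similarity of their embedding to the query embedding.
ranked = sorted(books, key=lambda t: -(book_vecs[t] @ query_vec))
print(ranked[0])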
In reality, what we have been calling a function f is called an encoder. Weaviate provides many different encoders (text2vec-openai, img2vec-neural, multi2vec-clip), but we can also plug in our own. Using different encoders, we can answer a wide variety of queries.
For instance, the CLIP architecture makes it possible to embed both images and text in the same vector space. If we choose to use a CLIP encoder, we can solve another problem that appears in traditional search approaches: we can answer the query "a green book cover with mysterious circular patterns" using an approach similar to the two previous examples (do you see how?).
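As a sketch, the sentence-transformers library ships a CLIP checkpoint that can embed both images and text into the same space; the cover file names below are hypothetical:

from PIL import Image
from sentence_transformers import SentenceTransformer, util

# CLIP embeds images and text into a shared vector space.
model = SentenceTransformer("clip-ViT-B-32")

covers = [Image.open("cover1.png"), Image.open("cover2.png")]  # hypothetical files
cover_vecs = model.encode(covers)

query_vec = model.encode("a green book cover with mysterious circular patterns")

# The cover whose embedding is most similar to the text embedding wins.
scores = util.cos_sim(query_vec, cover_vecs)
print(scores.argmax())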
GraphQL and RDF-like database
GraphQL is a query language that aims to make API queries faster and more intuitive. It is important to note that GraphQL is a query language for APIs, not for databases, and it can be used with any database. To understand how GraphQL works, let's take a look at the Star Wars GraphQL API. We can pose a complicated query to get all kinds of data with one request:
query Query {
  allFilms(first: 2) {
    films {
      title
      director
      releaseDate
      speciesConnection {
        species {
          name
          classification
          homeworld {
            name
          }
        }
      }
    }
  }
}
In curl, this looks like:
curl -g \
-X POST \
-H "Content-Type: application/json" \
-d '{"query":"query Query { allFilms(first:2) { films { title director releaseDate speciesConnection { species { name classification homeworld { name } } } } }}"}' \
https://swapi-graphql.netlify.app/.netlify/functions/index
And the output is:
{
  "data": {
    "allFilms": {
      "films": [
        {
          "title": "A New Hope",
          "director": "George Lucas",
          "releaseDate": "1977-05-25",
          "speciesConnection": {
            "species": [
              {
                "name": "Human",
                "classification": "mammal",
                "homeworld": { "name": "Coruscant" }
              },
              {
                "name": "Droid",
                "classification": "artificial",
                "homeworld": null
              },
              {
                "name": "Wookie",
                "classification": "mammal",
                "homeworld": { "name": "Kashyyyk" }
              },
              {
                "name": "Rodian",
                "classification": "sentient",
                "homeworld": { "name": "Rodia" }
              },
              {
                "name": "Hutt",
                "classification": "gastropod",
                "homeworld": { "name": "Nal Hutta" }
              }
            ]
          }
        },
        {
          "title": "The Empire Strikes Back",
          "director": "Irvin Kershner",
          "releaseDate": "1980-05-17",
          "speciesConnection": {
            "species": [
              {
                "name": "Human",
                "classification": "mammal",
                "homeworld": { "name": "Coruscant" }
              },
              {
                "name": "Droid",
                "classification": "artificial",
                "homeworld": null
              },
              {
                "name": "Wookie",
                "classification": "mammal",
                "homeworld": { "name": "Kashyyyk" }
              },
              {
                "name": "Yoda's species",
                "classification": "mammal",
                "homeworld": { "name": "unknown" }
              },
              {
                "name": "Trandoshan",
                "classification": "reptile",
                "homeworld": { "name": "Trandosha" }
              }
            ]
          }
        }
      ]
    }
  }
}
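For completeness, the same request can be sent from Python; here is a small sketch using the requests library:

import requests

# The same GraphQL query as above, trimmed to the film fields.
query = """
query Query {
  allFilms(first: 2) {
    films { title director releaseDate }
  }
}
"""

# GraphQL requests are POSTs with the query in a JSON body.
resp = requests.post(
    "https://swapi-graphql.netlify.app/.netlify/functions/index",
    json={"query": query},
)
print(resp.json()["data"]["allFilms"]["films"][0]["title"])  # "A New Hope"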
Now we have solved the API-design problem posed by the query "Give me the author of 'Crime and Punishment' and all the books he has written." But on the backend, SQL will still have to perform complicated and time-consuming joins. To solve this, we can use a graph database instead, such as an RDF database. RDF stores data in triples of the form subject-predicate-object ("Fyodor Dostoevsky wrote Crime and Punishment"). Such a structure allows for faster retrieval of triples and solves the speed issue with our latest query example.
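As an illustration, here is a minimal sketch of such triples using Python's rdflib library (the namespace and identifiers are made up):

from rdflib import Graph, Namespace

ex = Namespace("http://example.org/")  # hypothetical namespace
g = Graph()

# "Fyodor Dostoevsky wrote Crime and Punishment" as subject-predicate-object.
g.add((ex.Fyodor_Dostoevsky, ex.wrote, ex.Crime_and_Punishment))
g.add((ex.Fyodor_Dostoevsky, ex.wrote, ex.The_Brothers_Karamazov))

# All books written by Dostoevsky: a single pattern match, no joins.
for _, _, book in g.triples((ex.Fyodor_Dostoevsky, ex.wrote, None)):
    print(book)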
While Weaviate doesn't explicitly use RDF, its graph data model has RDF-like features.
Use case
In this section I will essentially copy some code from Weaviate's official quickstart Google Colab tutorial and comment on it. Weaviate also has great guides on how to build apps that use a vector database, such as an image search app, an advanced drug search app, and chatting with PDFs.
First, we install the Weaviate Python client using pip:
pip install -U weaviate-client
Then we can go to the Weaviate web console at https://console.weaviate.cloud/, create an account, and create a free sandbox instance. From there we can copy the API key (YOUR-WEAVIATE-API-KEY) and the endpoint (CLUSTER_URL). Additionally, because we will use OpenAI embeddings in this example, we need to obtain an API key (OPENAI_API_KEY) from https://platform.openai.com/account/api-keys. Then, in Python, we connect to the sandbox instance:
import weaviate
import json

client = weaviate.Client(
    url="CLUSTER_URL",  # your sandbox endpoint
    auth_client_secret=weaviate.AuthApiKey(api_key="YOUR-WEAVIATE-API-KEY"),
    additional_headers={"X-OpenAI-Api-Key": "OPENAI_API_KEY"},
)
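To verify that the connection works, we can ask the client whether the instance is ready (this should print True):

print(client.is_ready())  # True if the sandbox instance is reachable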
Next, we define a class that tells Weaviate how to store and process the data we upload. The following code says: make a placeholder for "Question" objects, and get their embeddings using the text2vec-openai encoder.
class_obj = {
    "class": "Question",
    "vectorizer": "text2vec-openai",
    "moduleConfig": {
        "text2vec-openai": {},
    },
}

client.schema.create_class(class_obj)
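As a quick sanity check, we can fetch the schema back and confirm the class is there:

print(json.dumps(client.schema.get(), indent=4))  # should include the "Question" class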
Then we download a small set of Jeopardy questions and upload it to Weaviate as follows:
import requests

url = 'https://raw.githubusercontent.com/weaviate-tutorials/quickstart/main/data/jeopardy_tiny.json'
resp = requests.get(url)
data = json.loads(resp.text)

# Configure a batch process
with client.batch(batch_size=100) as batch:
    # Batch import all Questions
    for i, d in enumerate(data):
        print(f"importing question: {i+1}")
        properties = {
            "answer": d["Answer"],
            "question": d["Question"],
            "category": d["Category"],
        }
        batch.add_data_object(properties, "Question")
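To confirm the upload succeeded, one option is an aggregate query that counts the stored objects:

result = client.query.aggregate("Question").with_meta_count().do()
print(result)  # the meta count should equal the number of imported questions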
And finally, we can pose a query asking for the 2 questions closest to the concept "biology":
nearText = {"concepts": ["biology"]}

response = (
    client.query
    .get("Question", ["question", "answer", "category"])
    .with_near_text(nearText)
    .with_limit(2)
    .do()
)

print(json.dumps(response, indent=4))
The output I get is:
{
    "data": {
        "Get": {
            "Question": [
                {
                    "answer": "DNA",
                    "category": "SCIENCE",
                    "question": "In 1953 Watson & Crick built a model of the molecular structure of this, the gene-carrying substance"
                },
                {
                    "answer": "species",
                    "category": "SCIENCE",
                    "question": "2000 news: the Gunnison sage grouse isn't just another northern sage grouse, but a new one of this classification"
                }
            ]
        }
    }
}
That wraps up my journey of learning about vector databases. Thank you for reading until the end; I hope you find this post helpful!