Decoding the Mystery🧙‍♂- How Computers Interpret Human Language

Milos Vukadinovic,Sun Aug 27 2023•paper reviews

How can a computer understand human language? At the most basic level, everything in the computer is represented as 1s and 0s. That is why our operating system does not really 'understand' the words we type/give it. On the high level the OS just has a giant lookup table (hashtable), and translates each letter (or symbol) to a corresponding binary code that the computer will understand.

ℹ️

Tangent: If you want to know more about the optimal way to assign binary codes to a set of signals see Huffman Coding (opens in a new tab), and read about Shannon Entropy (opens in a new tab). The most optimal way to assign codes to a set of predetermined signals is a solved problem, but picking what signals (letters, words, tokens?) to encode is unsolved (it is believed to be AI hard problem, see Hutter Prize (opens in a new tab))

However, if a computer just assigns unique codes to unique letters, that does not mean that it can understand the human language. It will not know that the code for the word "dog", and the code for the word "puppy" are synonymous, and it wouldn't know how to structure a sentence without breaking grammer rules (E.g. prepositions must be followed by an object). But, modern LLMs like ChatGPT and BARD have a very good understanding of human language and grammar structure, so how do they do it? In a nutshell, the idea is to assign unique $N$ dimensional vectors to each word, such that this $N$ dimensional vector space can capture the structure of the human language. For example, in the structured vector space the distance between vectors corresponding to words dog and puppy will be small but the distance between dog and train will be big.

I will review the paper GloVE: Global Vectors for Word Representations (opens in a new tab) that explains how to properly assign vectors to words and have a nicely structured latent vector space. The main goal of the approach is to embed analogy in the latent vector space, in other words the authors want the model to be able to solve the questions like "a is to b as c is to ". For example: "AIDS is to HIV as coronavirus disease is to (___)" Answer:SARS-COV-2, in this case, the model should know virus-disease relationship.

At the time of writing this paper, there were two major approaches for vectorizing text: from co-occurence matrix (LSE) and using sliding window (word-to-vec was state-of-art). GloVE approach lies somewhere in between these two approaches. The authors argue that it's important to get ratios of co-occurences, and not only the co-occurence probabilities. So, instead of just computing $P(i \; | \; j)$ , we should ask what is $R = \frac{P(i \; | \; k)}{P(j \; | \; k)}$ . This way we can get more information from the text:

$R >> 1$ means that the word $k$ is related to the word $i$ but not to the word $j$ .
$R << 1$ means that the word $k$ is related to the word $j$ but not to the word $i$ .
$R \sim 1$ means that $k$ is either related to both words $i$ and $j$ , or to neither

Using this logic, the authors develop the following model for learning vector embeddings (you can read more about how they came up with the model in their paper).

$J = \sum_{i,j} f(X_{ij}) (w_{i}^T \tilde{w}^{T}_{j} + b_i + \tilde{b}_j - log X_{ij})^2$

Where $X$ is a co-occurence matrix. $X_{ij}$ is the number of times word $i$ and word $j$ appear in the same context window.

$w_{i}$ is a latent vector corresponding to the word $i$ , and $b_{i}$ is learnable bias.

$\tilde{w_{i}}$ is a context latent vector corresponding to word $i$

I attempted minimalistic (small dataset size, context window and number of epochs) implementation of GloVE and got good initial results (loss is going down and latent vector show some meaning). You can find the code here: https://colab.research.google.com/drive/1UBGoZFBIBX3BmfZwXrxmC_n_zMOnsdCK?usp=sharing (opens in a new tab) Training setup should look something like this.

V, D = 5000, 100
W = torch.randn(V, D, requires_grad=True)
Wcontext = torch.randn(V, D, requires_grad=True)
b = torch.randn(V, requires_grad=True)
bcontext = torch.randn(V, requires_grad=True)
 
X = torch.from_numpy(co_occur_mat)
# apply smoothing
X = X+1e-5
 
xmax=100
alpha=3/4
def f_tensor(x):
    condition = x < xmax
    return torch.where(condition, (x / xmax) ** alpha, torch.ones_like(x))
 
# Hyperparameters
## The training loop is very basic and it needs to be improved
## Ideally the loss will be close to 0
learning_rate = 0.0001
num_epochs = 3000
 
for epoch in tqdm(range(num_epochs)):
 
    J = torch.sum(f_tensor(X) * (torch.mm(W, Wcontext.t()) + b.unsqueeze(1) + bcontext.unsqueeze(0) - torch.log(X))**2)
    print(J)
    J.backward()
 
    with torch.no_grad():
        W -= learning_rate * W.grad
        Wcontext -= learning_rate * Wcontext.grad
        b -= learning_rate * b.grad
        bcontext -= learning_rate * bcontext.grad
 
    W.grad.zero_()
    Wcontext.grad.zero_()
    b.grad.zero_()
    bcontext.grad.zero_()

In the google colab notebook above, I also loaded the official GloVE 6B weights, and showed example of word similarity and analogy.

In summary, computers can learn meaning of the words from the data by assigning to each word a vector in the structured latent space. GloVE and word-to-vec are methods describing how to assign that vector.