Scenario
Exploration
- In practice, autoencoders can be used for dimensionality reduction, and they work as follows.
- Encoder block takes a high dimension vector and projects it to a lower dimension
- Decoder block takes the output of the Encoder and attempts to reconstruct the original vector
- After training, the encoder output can be used for compressing the high dimensional vectors.
- However, this exercise puts a slight twist on the idea and removes the decoder block altogether.
- Because the objective is to get vectors whose relative similarities in the lower-dimensional
space match the relative similarities of the corresponding vectors in the higher-dimensional
space, why not define the loss criterion the same way?
- Hence, we will define the loss criterion as \(MSE(a, b)\), where
\(a = batch\_cosine\_similarity\_matrix(high\_dimension)\) and
\(b = batch\_cosine\_similarity\_matrix(projected\_dimension)\)
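As a minimal sketch of this decoder-free setup in PyTorch (the layer sizes, optimizer settings, and random stand-in batch below are assumptions for illustration, not the notebook's actual code):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def batch_cosine_similarity_matrix(x):
    # Row-normalize, then the Gram matrix holds all pairwise cosine similarities.
    x = F.normalize(x, dim=1)
    return x @ x.T

class Encoder(nn.Module):
    # A small MLP projector; there is no decoder block at all.
    def __init__(self, in_dim=2048, out_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 512), nn.ReLU(), nn.Linear(512, out_dim)
        )

    def forward(self, x):
        return self.net(x)

encoder = Encoder()
opt = torch.optim.Adam(encoder.parameters(), lr=1e-3)
batch = torch.randn(32, 2048)  # stand-in for a batch of high-dimensional vectors

# One training step: match similarity structure, not the vectors themselves.
a = batch_cosine_similarity_matrix(batch)            # similarities in high dimensions
b = batch_cosine_similarity_matrix(encoder(batch))   # similarities after projection
loss = F.mse_loss(b, a)
opt.zero_grad()
loss.backward()
opt.step()
```

The loss never asks the encoder to reconstruct anything; it only penalizes distortion of the pairwise similarity matrix.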
Sources
Same as the ones used in DimensionalityReduction.
Vectors are sourced from the model deepseek-coder-1.3b-base
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "deepseek-ai/deepseek-coder-1.3b-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
print(f"DeepSeek-Coder model '{model_name}' and tokenizer loaded.")
The top 512 words of the English language have been encoded using the model above.
How to reduce dimensions?
How to determine the best candidate?
- Calculate pairwise similarities for all combinations of a subset of the vectors
- Pick a ballpark output dimension size
- Experiment with the available algorithms and various output dimensions
- Calculate the similarities in the lower-dimensional space
- Subtract the corresponding similarity pairs of the higher and lower dimensions
- Plot the distributions
- Pick the method whose distribution is normally shaped, with a mean closest to 0 and the least
variance
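The selection procedure above can be sketched as follows. Here a random Gaussian projection stands in for whichever candidate reduction method is being evaluated (an assumption for illustration); in practice each trained encoder would be scored the same way.

```python
import numpy as np

rng = np.random.default_rng(0)

def pairwise_cosine(x):
    # Cosine similarity for every unordered pair of rows.
    x = x / np.linalg.norm(x, axis=1, keepdims=True)
    sims = x @ x.T
    iu = np.triu_indices(len(x), k=1)  # upper triangle: each pair once
    return sims[iu]

high = rng.normal(size=(512, 2048))          # subset of high-dimensional vectors
proj = high @ rng.normal(size=(2048, 128))   # one candidate reduction (stand-in)

# Difference of corresponding similarity pairs; the best candidate is the one
# whose differences are centered near 0 with the smallest spread.
diff = pairwise_cosine(high) - pairwise_cosine(proj)
print(f"mean={diff.mean():.4f}  std={diff.std():.4f}")
```

Plotting a histogram of `diff` per candidate then makes the "centered at 0, least variance" comparison visual.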
Keys
There are several abbreviations here; the guide below will help with reading them.
- Original 2048 \(\rightarrow\) Raw vectors from the deepseek-coder-1.3b-base model
- Cosine Similarity Loss Encoder | 0016 \(\rightarrow\) Encoder with output dimensions of 0016
- Cosine Similarity Loss Encoder | 0032 \(\rightarrow\) Encoder with output dimensions of 0032
- Cosine Similarity Loss Encoder | 0064 \(\rightarrow\) Encoder with output dimensions of 0064
- Cosine Similarity Loss Encoder | 0128 \(\rightarrow\) Encoder with output dimensions of 0128
- Cosine Similarity Loss Encoder | 0256 \(\rightarrow\) Encoder with output dimensions of 0256
- Cosine Similarity Loss Encoder | 0512 \(\rightarrow\) Encoder with output dimensions of 0512
- Cosine Similarity Loss Encoder | 1024 \(\rightarrow\) Encoder with output dimensions of 1024
Observations
From the distributions below, we can observe the following in the similarity space.
- Since all of the encoders are trained with the same loss (cosine similarity loss), the
histograms of the projected vectors look very similar to the histogram of the raw data.
- There is little difference between the histogram profiles.
- So the next decision-making aspect would be the size of the vector that is practical for the use case.
Victors?
- Since the histogram profiles are similar, we can use either
- Cosine Similarity Loss Encoder | 0128
- Cosine Similarity Loss Encoder | 0256
Distribution Charts
1. Base Distributions in Cosine Similarity Space
2. Difference distributions in Cosine Similarity Space
These are obtained by subtracting pairwise cosine similarities between the original and reduced
dimensions. For example, a given data point would be \(a_{2048} - a_{E\ 0064}\), where \(a_{2048}\) is the
cosine similarity between a pair of words and \(a_{E\ 0064}\) is the cosine similarity between the exact
same words, but with the vectors obtained from the encoder with an output dimension of 64.
Each distribution is made up of nearly 130,000 datapoints.
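This count is consistent with the number of unordered pairs among the 512 encoded words:

\[\binom{512}{2} = \frac{512 \times 511}{2} = 130816\]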
3. Base and Difference distributions in Cosine Similarity Space
Both the above in the same plot :)