Machine Learning | Dimensionality Reduction


Scenario
Exploration
Sources

The original vectors are sourced from the model deepseek-coder-1.3b-base.


        from transformers import AutoTokenizer, AutoModelForCausalLM

        model_name = "deepseek-ai/deepseek-coder-1.3b-base"
        tokenizer = AutoTokenizer.from_pretrained(model_name)
        model = AutoModelForCausalLM.from_pretrained(model_name)
        print(f"DeepSeek-Coder model '{model_name}' and tokenizer loaded.")
        

The top 512 words of the English language have been encoded using the model above.
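The source does not show how the words were encoded, so the sketch below is one plausible approach, not the document's exact method: run each word through the model, take the last hidden layer, and mean-pool over tokens to get a single 2048-dimensional vector per word. The helper names (`mean_pool`, `encode_words`) are hypothetical.

```python
import torch


def mean_pool(hidden_states, attention_mask):
    # Average token embeddings, ignoring padding positions.
    mask = attention_mask.unsqueeze(-1).float()
    summed = (hidden_states * mask).sum(dim=1)
    counts = mask.sum(dim=1).clamp(min=1.0)
    return summed / counts


def encode_words(words, tokenizer, model, batch_size=32):
    # Hypothetical helper: returns one vector per word from the
    # model's final hidden layer (2048-d for deepseek-coder-1.3b-base).
    if tokenizer.pad_token is None:
        # Causal-LM tokenizers often lack a pad token; reuse EOS.
        tokenizer.pad_token = tokenizer.eos_token
    vectors = []
    model.eval()
    with torch.no_grad():
        for i in range(0, len(words), batch_size):
            batch = tokenizer(words[i:i + batch_size],
                              return_tensors="pt", padding=True)
            out = model(**batch, output_hidden_states=True)
            vectors.append(mean_pool(out.hidden_states[-1],
                                     batch["attention_mask"]))
    return torch.cat(vectors)
```

Calling `encode_words(top_512_words, tokenizer, model)` would then yield a `(512, 2048)` matrix of word vectors.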

How to reduce dimensions?
How to determine the best candidate?
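One way to frame both questions together: fit a candidate reducer and score it by how well it preserves pairwise cosine similarities between the word vectors. The sketch below uses PCA as a stand-in reducer (the document's experiments include an autoencoder); the random data, the target dimension of 64, and the `preservation_error` helper are all illustrative assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.metrics.pairwise import cosine_similarity

rng = np.random.default_rng(0)
# Stand-in for the 512 word vectors of dimension 2048.
X = rng.normal(size=(512, 2048))


def preservation_error(X, X_reduced):
    # Mean absolute difference between the pairwise cosine
    # similarities before and after reduction (lower is better).
    iu = np.triu_indices(len(X), k=1)
    return np.abs(cosine_similarity(X)[iu]
                  - cosine_similarity(X_reduced)[iu]).mean()


# One candidate reducer: PCA down to 64 dimensions.
pca = PCA(n_components=64)
X64 = pca.fit_transform(X)
err = preservation_error(X, X64)
```

The best candidate would then be the reducer/dimension pair with the smallest preservation error across the word pairs.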
Keys

Several abbreviations are used here; the guide below will help with reading them.

Observations

From the distributions below, we can observe the following in the similarity space.

Victors?

Distribution Charts

1. Base Distributions in Cosine Similarity Space

2. Difference Distributions in Cosine Similarity Space

These are obtained by subtracting the pairwise cosine similarities of the reduced vectors from those of the original vectors. For example, a given data point is \(a_{2048} - a_{AE\ 0064}\), where \(a_{2048}\) is the cosine similarity between a pair of words and \(a_{AE\ 0064}\) is the cosine similarity between the exact same pair of words, but with the vectors obtained from the autoencoder whose encoder output dimension is 64. Each distribution is made up of nearly 130,000 data points.
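The difference distribution described above can be computed as follows. The random matrices are stand-ins for the real original and AE-0064 vectors; note that 512 words yield 512 × 511 / 2 = 130,816 unique pairs, matching the "nearly 130,000 data points" figure.

```python
import numpy as np


def cosine_sim_matrix(X):
    # Row-normalise, then the dot products are cosine similarities.
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    return Xn @ Xn.T


rng = np.random.default_rng(1)
orig = rng.normal(size=(512, 2048))     # stand-in for the original vectors
reduced = rng.normal(size=(512, 64))    # stand-in for the AE-0064 outputs

# Upper-triangle indices select each unordered word pair exactly once.
iu = np.triu_indices(512, k=1)
diff = cosine_sim_matrix(orig)[iu] - cosine_sim_matrix(reduced)[iu]
# diff holds 130,816 values: one a_2048 - a_AE0064 per word pair.
```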


3. Base and Difference Distributions in Cosine Similarity Space


Full Code