Scenario
- Principal component analysis is a concept developed using techniques from Linear Algebra.
- This technique is widely used in Statistics and the Sciences to map higher dimensional data into a lower
dimensional space while retaining as much of the variance of the respective features as possible.
- Dimensionality reduction makes large data structures, such as embeddings from foundation
models, easier to store and work with. Let us explore the algorithm here.
Algorithm
- Dimensions \((n_{samples}, n_{features})\) | Start with the raw embeddings matrix that has to be
projected, called \(X\)
- Dimensions \((n_{samples}, n_{features})\) | Normalize every feature in this dataset \(X_{std} = \frac{X - \mu}{\sigma}\) where \(\mu\) is the per-feature mean and \(\sigma\) is the per-feature standard deviation.
- Dimensions \((n_{features}, n_{features})\) | Calculate the covariance matrix of the standardized data as \[Cov =
\frac{X_{std}^T \cdot X_{std}}{n_{samples}-1}\]
- Dimensions \((n_{features}, n_{features})\) | Decompose the \(Cov\) matrix into its eigenvalues and
eigenvectors as \(Cov = W \cdot \Lambda \cdot W^{-1}\) where
  - \(\Lambda\) = diagonal matrix of eigenvalues \((\lambda_1,\lambda_2,\lambda_3,...)\)
  - \(W\) = matrix of eigenvectors (the columns are the principal components)
- Dimensions \((n_{features}, n_{features})\) | Sort the eigenvectors in descending order of their eigenvalues, so that
\(\lambda_{x1} \ge \lambda_{x2} \ge \lambda_{x3} \ge ...\)
- Dimensions \((n_{features}, k)\) | Select the top \(k\) eigenvectors, where \(k\) is the dimension of
the preferred projection space; these columns form the projection matrix \(W_k\)
- Dimensions \((n_{samples}, k)\) | Multiply \(X_{std}\) with \(W_k\) to create the projected matrix (see the NumPy sketch after this list)
\[X_{projected} = X_{std} \cdot W_k\]
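The steps above map directly onto a few lines of NumPy. The sketch below is a minimal illustration of the algorithm, not the exact code behind these experiments; the function name pca_project and the variable names are assumptions.

import numpy as np

def pca_project(X, k):
    # X holds the raw embeddings, shape (n_samples, n_features)
    # Standardize each feature: subtract the per-feature mean, divide by the per-feature std
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    X_std = (X - mu) / sigma

    # Covariance matrix of the standardized data, shape (n_features, n_features)
    cov = (X_std.T @ X_std) / (X.shape[0] - 1)

    # Eigendecomposition; eigh suits the symmetric covariance matrix
    eigenvalues, eigenvectors = np.linalg.eigh(cov)

    # Sort the eigenvectors by descending eigenvalue and keep the top k columns as W_k
    order = np.argsort(eigenvalues)[::-1]
    W_k = eigenvectors[:, order[:k]]

    # Project the standardized data into the k-dimensional space
    return X_std @ W_k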
Sources
Original vectors were sourced from the model deepseek-coder-1.3b-base:
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "deepseek-ai/deepseek-coder-1.3b-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
print(f"DeepSeek-Coder model '{model_name}' and tokenizer loaded.")
The top 512 words of the English language have been encoded using the model above.
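The exact pooling used to turn each word into a single vector is not shown here; the sketch below, which reuses the tokenizer and model loaded above, is one plausible way to do it, assuming the 2048-dimensional vector comes from mean-pooling the last hidden layer (the embed_word helper and the pooling choice are assumptions, not the original code).

import torch

def embed_word(word):
    # Tokenize the word and run a forward pass, asking the model for its hidden states
    inputs = tokenizer(word, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs, output_hidden_states=True)
    # Mean-pool the last hidden layer over the tokens to get one 2048-dimensional vector
    return outputs.hidden_states[-1].mean(dim=1).squeeze(0).numpy()

# 'words' is assumed to be the list of the 512 most common English words
# embeddings = np.stack([embed_word(w) for w in words])   # shape (512, 2048)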
How to determine the best candidate?
- Calculate pairwise similarities for all combinations within a subset of the vectors. This exercise samples 1024
vectors
- Pick a ballpark output dimension size
- Experiment with the available algorithms and various output dimensions
- Calculate the similarities in the lower dimensional space
- Subtract the corresponding similarity pairs of the higher and lower dimensions
- Plot the distributions
- Pick the method whose distribution is centered normally with a mean closest to 0 and with the least
variance (a sketch of this scoring step follows this list)
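A minimal sketch of the scoring step described above, assuming the original and reduced vectors sit in NumPy arrays with one row per word; pairwise_cosine and score_candidate are illustrative names rather than the code used for the charts.

import numpy as np

def pairwise_cosine(X):
    # Cosine similarity for every unique pair of row vectors in X
    X_norm = X / np.linalg.norm(X, axis=1, keepdims=True)
    sims = X_norm @ X_norm.T
    i, j = np.triu_indices(len(X), k=1)    # upper triangle, diagonal excluded
    return sims[i, j]

def score_candidate(X_original, X_reduced):
    # Difference distribution between the higher and lower dimensional similarity spaces
    diffs = pairwise_cosine(X_original) - pairwise_cosine(X_reduced)
    # The preferred candidate has a mean closest to 0 and the smallest variance
    return diffs.mean(), diffs.var()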
Keys
There are several abbreviations here, and the guide below will help with reading the charts.
- Original | 2048 \(\rightarrow\) Raw vectors from the deepseek-coder-1.3b-base model
- PCA | 0016 \(\rightarrow\) Principal Component Analysis used to project the vectors down to 16 dimensions
- PCA | 0128 \(\rightarrow\) Principal Component Analysis used to project the vectors down to 128 dimensions
Observations
From the distributions below, we can observe the following in the similarity space.
- The distribution charts show that, just like the Original vectors, the projected vectors
exhibit a similar skewness in the cosine similarity space.
- The real usefulness of these projections, however, shows up in the second distribution chart.
- As the output dimension grows, the differences get smaller and their peaks move very close to 0.
Victors?
- From the looks of it, any of the projections above 64 dimensions can be used as approximations
- All of them have been highlighted with a green box in the histogram charts
- The histogram outlined in red is the original distribution, i.e. the one we are not trying to use.
Distribution Charts
1. Base Distributions in Cosine Similarity Space
2. Difference Distributions in Cosine Similarity Space
These are obtained by subtracting pairwise cosine similarities between the original and reduced
dimensions. For example, a given data point would be \(a_{2048} - a_{PCA\ 0064}\), where \(a_{2048}\) is the
cosine similarity between a pair of words and \(a_{PCA\ 0064}\) is the cosine similarity between the exact
same words, but with the vectors obtained from the PCA projection with an output dimension of 64. Each
distribution is made up of nearly 130,000 datapoints.
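In symbols (my notation, only restating the prose above), the quantity plotted for a projection with output dimension \(k\) is the per-pair difference
\[ d_{ij}^{(k)} = \cos\left(x_i^{(2048)}, x_j^{(2048)}\right) - \cos\left(x_i^{(k)}, x_j^{(k)}\right), \quad i < j \]
where \(x_i^{(2048)}\) is the original embedding of word \(i\) and \(x_i^{(k)}\) its PCA projection. If the pairs are drawn from the 512 encoded words, that gives \(512 \cdot 511 / 2 = 130{,}816\) unique pairs, which lines up with the "nearly 130,000 datapoints" figure.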
3. Base and Difference Distributions in Cosine Similarity Space