Scenario
- Principal component analysis is a concept developed using techniques from Linear Algebra.
- This technique is widely used in Statistics and the Sciences to map higher dimensional data into a lower
dimensional space while retaining as much of the variance of the respective features as possible.
- Dimensionality reduction makes large data structures, such as embeddings from foundation
models, easier to store and work with. Let us explore the algorithm here.
Algorithm
- Dimensions \((n_{samples}, n_{features})\) | Start with the raw embeddings matrix that has to be
projected, called \(X\)
- Dimensions \((n_{samples}, n_{features})\) | Normalize every feature in this dataset \(X_{std} = \frac{X - \mu}{\sigma}\) where \(\mu\) is the per-feature mean and \(\sigma\) is the per-feature standard deviation.
- Dimensions \((n_{features}, n_{features})\) | Calculate the covariance matrix of the standardized data as \[Cov =
\frac{X_{std}^T \cdot X_{std}}{n_{samples}-1}\]
- Dimensions \((n_{features}, n_{features})\) | Decompose the \(Cov\) matrix into its eigenvalues and
eigenvectors as \(Cov = W \cdot \Lambda \cdot W^{-1}\) where
  - \(\Lambda\) = diagonal matrix of eigenvalues \((\lambda_1,\lambda_2,\lambda_3,...)\)
  - \(W\) = matrix of eigenvectors (the columns are the principal components)
- Dimensions \((n_{features}, n_{features})\) | Sort the eigenvectors in descending order of their eigenvalues, so that
\(\lambda_{x1} \ge \lambda_{x2} \ge \lambda_{x3} \ge ...\)
- Dimensions \((n_{features}, k)\) | Select the top \(k\) eigenvectors, where \(k\) is the dimension of
the preferred projection space; these columns form the projection matrix \(W_k\)
- Dimensions \((n_{samples}, k)\) | Multiply \(X_{std}\) with \(W_k\) to create the projected matrix (see the NumPy sketch after this list)
\[X_{projected} = X_{std} \cdot W_k\]
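The steps above map directly onto a few lines of NumPy. The sketch below is a minimal illustration of the algorithm, not the exact code behind these experiments; the function name pca_project and the variable names are assumptions.

import numpy as np

def pca_project(X, k):
    # X holds the raw embeddings, shape (n_samples, n_features)
    # Standardize each feature: subtract the per-feature mean, divide by the per-feature std
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    X_std = (X - mu) / sigma

    # Covariance matrix of the standardized data, shape (n_features, n_features)
    cov = (X_std.T @ X_std) / (X.shape[0] - 1)

    # Eigendecomposition; eigh suits the symmetric covariance matrix
    eigenvalues, eigenvectors = np.linalg.eigh(cov)

    # Sort the eigenvectors by descending eigenvalue and keep the top k columns as W_k
    order = np.argsort(eigenvalues)[::-1]
    W_k = eigenvectors[:, order[:k]]

    # Project the standardized data into the k-dimensional space
    return X_std @ W_k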
Sources
Original vectors were sourced from the model deepseek-coder-1.3b-base:
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "deepseek-ai/deepseek-coder-1.3b-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
print(f"DeepSeek-Coder model '{model_name}' and tokenizer loaded.")
The top 512 words of the English language have been encoded using the model above.
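The exact pooling used to turn each word into a single vector is not shown here; the sketch below, which reuses the tokenizer and model loaded above, is one plausible way to do it, assuming the 2048-dimensional vector comes from mean-pooling the last hidden layer (the embed_word helper and the pooling choice are assumptions, not the original code).

import torch

def embed_word(word):
    # Tokenize the word and run a forward pass, asking the model for its hidden states
    inputs = tokenizer(word, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs, output_hidden_states=True)
    # Mean-pool the last hidden layer over the tokens to get one 2048-dimensional vector
    return outputs.hidden_states[-1].mean(dim=1).squeeze(0).numpy()

# 'words' is assumed to be the list of the 512 most common English words
# embeddings = np.stack([embed_word(w) for w in words])   # shape (512, 2048)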
How to determine the best candidate?
- Calculate pairwise similarities for all combinations within a subset of the vectors. This exercise samples 1024
vectors
- Pick a ballpark output dimension size
- Experiment with the available algorithms and various output dimensions
- Calculate the similarities in the lower dimensional space
- Subtract the corresponding similarity pairs of the higher and lower dimensions
- Plot the distributions
- Pick the method whose distribution is centered normally with a mean closest to 0 and with the least
variance (a sketch of this scoring step follows this list)
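A minimal sketch of the scoring step described above, assuming the original and reduced vectors sit in NumPy arrays with one row per word; pairwise_cosine and score_candidate are illustrative names rather than the code used for the charts.

import numpy as np

def pairwise_cosine(X):
    # Cosine similarity for every unique pair of row vectors in X
    X_norm = X / np.linalg.norm(X, axis=1, keepdims=True)
    sims = X_norm @ X_norm.T
    i, j = np.triu_indices(len(X), k=1)    # upper triangle, diagonal excluded
    return sims[i, j]

def score_candidate(X_original, X_reduced):
    # Difference distribution between the higher and lower dimensional similarity spaces
    diffs = pairwise_cosine(X_original) - pairwise_cosine(X_reduced)
    # The preferred candidate has a mean closest to 0 and the smallest variance
    return diffs.mean(), diffs.var()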
Keys
There are several abbreviations here, and the guide below will help with reading the charts.
- Original | 2048 \(\rightarrow\) Raw vectors from the deepseek-coder-1.3b-base model
- PCA | 0016 \(\rightarrow\) Principal Component Analysis used to project the vectors down to 16 dimensions
- PCA | 0128 \(\rightarrow\) Principal Component Analysis used to project the vectors down to 128 dimensions
Observations
From the distributions below, we can observe the following in the similarity space.
- The distribution charts show that, just like the Original vectors, the projected vectors
exhibit a similar skewness in the cosine similarity space.
- The real usefulness of these projections, however, shows up in the second distribution chart.
- As the output dimension grows, the differences get smaller and their peaks move very close to 0.
Victors?
- From the looks of it, any of the projections above 64 dimensions can be used as approximations
- All of them have been highlighted with a green box in the histogram charts
- The histogram outlined in red is the original distribution, i.e. the one we are not trying to use.
Distribution Charts
1. Base Distributions in Cosine Similarity Space
2. Difference Distributions in Cosine Similarity Space
These are obtained by subtracting pairwise cosine similarities between the original and reduced
dimensions. For example, a given data point would be \(a_{2048} - a_{PCA\ 0064}\), where \(a_{2048}\) is the
cosine similarity between a pair of words and \(a_{PCA\ 0064}\) is the cosine similarity between the exact
same words, but with the vectors obtained from the PCA projection with an output dimension of 64. Each
distribution is made up of nearly 130,000 datapoints.
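In symbols (my notation, only restating the prose above), the quantity plotted for a projection with output dimension \(k\) is the per-pair difference
\[ d_{ij}^{(k)} = \cos\left(x_i^{(2048)}, x_j^{(2048)}\right) - \cos\left(x_i^{(k)}, x_j^{(k)}\right), \quad i < j \]
where \(x_i^{(2048)}\) is the original embedding of word \(i\) and \(x_i^{(k)}\) its PCA projection. If the pairs are drawn from the 512 encoded words, that gives \(512 \cdot 511 / 2 = 130{,}816\) unique pairs, which lines up with the "nearly 130,000 datapoints" figure.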
3. Base and Difference Distributions in Cosine Similarity Space