A small project on algorithms for piecewise orthogonal alignment of vectors, with an application to vector embeddings. The main contribution is a dynamic-programming algorithm that finds the optimal clustering (along a given 1-dimensional ordering of the data) together with cluster-wise orthogonal transformations minimizing the squared alignment error. See the writeup for details.
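To make the recurrence concrete, here is a minimal sketch of the idea (an illustration under my own assumptions, not the repo's `piecewise_procrustes` implementation): each segment's optimal orthogonal map is the classic Procrustes solution, whose cost reduces to a nuclear norm, and the DP minimizes total cost plus a hypothetical per-segment penalty `lam` (the actual solver may instead fix the number of segments).

```python
import numpy as np

def segment_cost(X, Y, i, j):
    # Optimal Procrustes cost for aligning rows X[i:j] to Y[i:j]:
    #   min_{R orthogonal} sum_t ||X[t] @ R - Y[t]||^2
    # = ||X[i:j]||_F^2 + ||Y[i:j]||_F^2 - 2 * ||X[i:j].T @ Y[i:j]||_*
    Xs, Ys = X[i:j], Y[i:j]
    nuc = np.linalg.norm(Xs.T @ Ys, ord="nuc")  # nuclear norm via SVD
    return (Xs ** 2).sum() + (Ys ** 2).sum() - 2.0 * nuc

def piecewise_alignment(X, Y, lam=0.0):
    """DP over prefixes: best[j] = min_{i<j} best[i] + cost(i, j) + lam."""
    n = len(X)
    best = np.full(n + 1, np.inf)
    best[0] = 0.0
    back = np.zeros(n + 1, dtype=int)
    for j in range(1, n + 1):
        for i in range(j):
            c = best[i] + segment_cost(X, Y, i, j) + lam
            if c < best[j]:
                best[j], back[j] = c, i
    # Recover segment boundaries from the back-pointers.
    cuts, j = [], n
    while j > 0:
        cuts.append(int(j))
        j = back[j]
    return best[n], sorted(cuts)
```

Note the naive double loop evaluates a segment cost for every pair $(i, j)$, which is exactly the quadratic number of nuclear-norm computations mentioned below.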
Visualization of the DP algorithm:
Install dependencies via `pip install -r requirements.txt`.
`experiments/upstream/main_evaluation.py` generates a synthetic temporal-shift dataset that is favorable to piecewise alignment and compares baseline approaches (`global_procrustes`, `kmeans`) with the DP solver. Run it with `python experiments/upstream/main_evaluation.py`.

`experiments/downstream/partial_upgrade.py` evaluates a partial upgrade: documents are indexed with Model A, queries come from Model B, and we use vector alignment to learn a map from Model B to Model A that is applied before retrieval. The script reports Recall on held-out pairs along with train/test residuals. Run it with `python experiments/downstream/partial_upgrade.py`.

The DP solver performs a quadratic number of nuclear-norm computations. By default, the nuclear norm is computed exactly, but we also include a fast approximation based on Stochastic Lanczos Quadrature (SLQ). Pass `--nvecs` or `--steps` to tune the approximation.
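For reference, SLQ estimates $\|M\|_* = \mathrm{tr}\big((M^\top M)^{1/2}\big)$ by combining Hutchinson-style random probes with Gauss quadrature from the Lanczos tridiagonalization. A self-contained sketch (the parameter names mirror the `--nvecs`/`--steps` flags, but this is not the repo's `fast_nuclear_norm` implementation):

```python
import numpy as np

def slq_nuclear_norm(M, nvecs=10, steps=20, seed=0):
    """Estimate ||M||_* = tr((M^T M)^{1/2}) via Stochastic Lanczos Quadrature."""
    rng = np.random.default_rng(seed)
    d = M.shape[1]
    A = M.T @ M  # for large M, apply A implicitly as M.T @ (M @ q)
    total = 0.0
    for _ in range(nvecs):
        v = rng.choice([-1.0, 1.0], size=d)  # Rademacher probe
        v /= np.linalg.norm(v)
        # Lanczos tridiagonalization of A started at v
        alphas, betas = [], []
        q_prev, q, beta = np.zeros(d), v, 0.0
        for _ in range(steps):
            w = A @ q - beta * q_prev
            alpha = q @ w
            w -= alpha * q
            alphas.append(alpha)
            beta = np.linalg.norm(w)
            if beta < 1e-12:  # invariant subspace found; quadrature is exact
                break
            betas.append(beta)
            q_prev, q = q, w / beta
        k = len(alphas)
        T = np.diag(alphas) + np.diag(betas[:k - 1], 1) + np.diag(betas[:k - 1], -1)
        theta, U = np.linalg.eigh(T)  # Ritz values and weights
        tau2 = U[0, :] ** 2
        # Gauss quadrature for v^T sqrt(A) v; clip guards tiny negative Ritz values
        total += np.sum(tau2 * np.sqrt(np.clip(theta, 0.0, None)))
    return d / nvecs * total
```

More probes (`nvecs`) reduce the variance of the trace estimate; more Lanczos steps (`steps`) sharpen the quadrature for each probe.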
- `algorithms/`: implementations of `baseline_procrustes`, `piecewise_procrustes`, and `fast_nuclear_norm`
- `experiments/upstream/`: alignment error and runtime on a synthetic dataset, plus the algorithm visualization
- `experiments/downstream/`: a retrieval task where vector alignment improves accuracy
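The downstream partial-upgrade setup can be sketched on toy data (all names and data here are hypothetical stand-ins, not the code in `experiments/downstream/partial_upgrade.py`): fit an orthogonal map from Model B's space to Model A's on paired training embeddings, then rotate queries before nearest-neighbor retrieval.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins: Model A (index) and Model B (query) embeddings of the same
# 100 documents, related by an unknown rotation plus a little noise.
A = rng.standard_normal((100, 16))
R_true, _ = np.linalg.qr(rng.standard_normal((16, 16)))
B = A @ R_true.T + 0.01 * rng.standard_normal((100, 16))

train, test = np.arange(80), np.arange(80, 100)

# Orthogonal Procrustes fit: map Model B -> Model A using the training pairs.
U, _, Vt = np.linalg.svd(B[train].T @ A[train])
R = U @ Vt

# Retrieval: each held-out "query" (Model B) should find its own document
# in the Model A index.
queries = B[test] @ R
sims = queries @ A.T                      # inner-product scores
recall_at_1 = (sims.argmax(axis=1) == test).mean()
```

With low noise the learned `R` recovers the true rotation and recall is near perfect; the interesting regime is when the two models are *not* related by a single global rotation, which is where piecewise alignment is meant to help.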
- Runtime. Save on the $n^2$ runtime factor with a better algorithmic idea (e.g., sketching the Frobenius norm as a proxy for the nuclear norm).
- Quality. Find a setting or downstream task where `piecewise_procrustes` can yield substantially better results than naive methods; currently, the results do not look too promising.
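One possible reading of the runtime idea (a hypothetical sketch, not implemented in this repo): the Frobenius norm satisfies $\|A\|_F \le \|A\|_* \le \sqrt{\mathrm{rank}(A)}\,\|A\|_F$, and unlike the nuclear norm it can be estimated cheaply with a Gaussian sketch, since $\mathbb{E}\,\|SA\|_F^2 = \|A\|_F^2$ when $S$ has i.i.d. $\mathcal{N}(0, 1/k)$ entries.

```python
import numpy as np

rng = np.random.default_rng(1)
m, n, k = 200, 50, 32  # k-row sketch of an m x n matrix

A = rng.standard_normal((m, n))

# Gaussian sketch: E ||S A||_F^2 = ||A||_F^2 for S with i.i.d. N(0, 1/k) entries.
S = rng.standard_normal((k, m)) / np.sqrt(k)
est = np.linalg.norm(S @ A, "fro") ** 2
exact = np.linalg.norm(A, "fro") ** 2
```

Whether this proxy is accurate enough to prune segment-cost evaluations in the DP is exactly the open question above.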
