Improve code documentation #10

Merged: 8 commits, Nov 24, 2024
30 changes: 28 additions & 2 deletions README.md
@@ -30,6 +30,8 @@ The library currently implements the following methods:
- [CLP-Transfer: Efficient language model training through cross-lingual and progressive transfer learning.](https://arxiv.org/abs/2301.09626) Ostendorff, Malte, and Georg Rehm. arXiv preprint arXiv:2301.09626 (2023).
- [FOCUS: Effective Embedding Initialization for Specializing Pretrained Multilingual Models on a Single Language.](https://arxiv.org/abs/2305.14481) Dobler, Konstantin, and Gerard de Melo. arXiv preprint arXiv:2305.14481 (2023).

Langsfer is flexible enough to allow mixing and matching strategies between different embedding initialization schemes. For example, you can combine fuzzy token overlap with the CLP-Transfer method to refine the initialization process based on fuzzy matches between source and target tokens. This flexibility enables you to experiment with a variety of strategies for different language transfer tasks, making it easier to fine-tune models for your specific use case.
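For instance, a combined setup might look like the following sketch; the `CLPTransferInitialization` and `FuzzyTokenOverlap` class names here are illustrative placeholders rather than the library's exact API:

```python
# Illustrative sketch only: the class names below are hypothetical placeholders.
# The point is that the token-overlap strategy is an independent, swappable
# component of the overall initialization scheme.
embedding_initializer = CLPTransferInitialization(
    source_embeddings_matrix=source_embeddings_matrix,
    source_auxiliary_embeddings=source_auxiliary_embeddings,
    target_auxiliary_embeddings=target_auxiliary_embeddings,
    token_overlap_strategy=FuzzyTokenOverlap(),  # fuzzy matching between source and target tokens
)
target_embeddings_matrix = embedding_initializer.initialize(seed=16, show_progress=True)
```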

## Quick Start

### Installation
@@ -79,6 +81,7 @@ source_tokenizer = AutoTokenizer.from_pretrained("roberta-base")
target_tokenizer = AutoTokenizer.from_pretrained("benjamin/roberta-base-wechsel-german")

source_model = AutoModel.from_pretrained("roberta-base")
# For models with untied embeddings, you can choose to transfer the input and output embeddings separately.
source_embeddings_matrix = source_model.get_input_embeddings().weight.detach().numpy()

source_auxiliary_embeddings = FastTextEmbeddings.from_model_name_or_path("en")
@@ -105,8 +108,7 @@ To initialize the target embeddings you would then use:
target_embeddings_matrix = embedding_initializer.initialize(seed=16, show_progress=True)
```

The result is a 2D array that contains the initialized embeddings matrix for the target language model.

We can then replace the source model's embeddings matrix with this newly initialized embeddings matrix:

@@ -123,6 +125,30 @@ target_model.get_input_embeddings().weight.data = torch.as_tensor(target_embeddi
target_model.save_pretrained("path/to/target_model")
```

## Roadmap

Here are some of the planned developments for Langsfer:

- **Performance Optimization**: Improve the efficiency and usability of the library to streamline workflows
and speed up computation.

- **Model Training & Hugging Face Hub Publishing**: Train both small and large models with embeddings initialized using Langsfer
and publish the resulting models to the Hugging Face Hub for public access and use.

- **Parameter-Efficient Fine-Tuning**: Investigate using techniques such as LoRA (Low-Rank Adaptation)
to enable parameter-efficient fine-tuning, making it easier to adapt models to specific languages with minimal overhead.

- **Implement New Methods**: Extend Langsfer with additional language transfer methods, including:

- [Ofa: A framework of initializing unseen subword embeddings for efficient large-scale multilingual continued pretraining.](https://arxiv.org/abs/2311.08849)
Liu, Y., Lin, P., Wang, M. and Schütze, H., 2023. arXiv preprint arXiv:2311.08849.
- [Zero-Shot Tokenizer Transfer.](https://arxiv.org/abs/2405.07883)
Minixhofer, B., Ponti, E.M. and Vulić, I., 2024. arXiv preprint arXiv:2405.07883.

- **Comprehensive Benchmarking**: Run extensive benchmarks across all implemented methods to evaluate their performance, identify strengths
and weaknesses, and compare results to establish best practices for language transfer.


## Contributing

Refer to the [contributing guide](CONTRIBUTING.md) for instructions on how you can contribute to this repository.
79 changes: 75 additions & 4 deletions src/langsfer/alignment.py
@@ -1,3 +1,8 @@
"""This module provides strategies for aligning embedding matrices using different techniques.

The `AlignmentStrategy` class is an abstract base class that defines the interface for embedding alignment strategies.
"""

import logging
import os
import warnings
@@ -16,18 +21,47 @@


class AlignmentStrategy(ABC):
"""Abstract base class for defining strategies to align embedding matrices.

Subclasses must implement the `apply` method to define the logic for aligning
the embedding matrix based on their specific alignment technique.
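
Example:
A toy subclass (for illustration only) that scales every embedding vector:

>>> class ScaledAlignment(AlignmentStrategy):
...     def apply(self, embedding_matrix: NDArray) -> NDArray:
...         return 2.0 * embedding_matrix
>>> aligned = ScaledAlignment().apply(embedding_matrix)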
"""

@abstractmethod
def apply(self, embedding_matrix: NDArray) -> NDArray: ...


class IdentityAlignment(AlignmentStrategy):
"""Alignment strategy that does not alter the input embedding matrix.

This strategy simply returns the input embedding matrix unchanged.

Example:
>>> identity_alignment = IdentityAlignment()
>>> aligned_embeddings = identity_alignment.apply(embedding_matrix)
>>> # aligned_embeddings will be the same as embedding_matrix
"""

def apply(self, embedding_matrix: NDArray) -> NDArray:
"""Returns the input embedding matrix unchanged.

Args:
embedding_matrix: 2D embedding matrix to be aligned.

Returns:
The same embedding matrix as the output, without any modifications.
"""
return embedding_matrix


class BilingualDictionaryAlignment(AlignmentStrategy):
"""Alignment strategy that uses a bilingual dictionary to compute the alignment matrix.

This strategy uses word pairs from a bilingual dictionary to compute an alignment
matrix between the source and target embedding matrices. The dictionary maps words in the
source language to words in the target language. The alignment matrix is computed by
applying orthogonal Procrustes analysis to the word vector correspondences.

The bilingual dictionary maps words in the source language to words in the target language
and is expected to be of the form:

@@ -39,10 +73,10 @@ class BilingualDictionaryAlignment(AlignmentStrategy):
```

Args:
source_word_embeddings: Word embeddings of the source language.
target_word_embeddings: Word embeddings of the target language.
bilingual_dictionary: Dictionary mapping words in source language to words in target language.
bilingual_dictionary_file: Path to a bilingual dictionary file containing word pairs.
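
Example:
A minimal sketch, assuming `source_word_embeddings` and `target_word_embeddings`
are already-loaded word embeddings:

>>> alignment = BilingualDictionaryAlignment(
...     source_word_embeddings,
...     target_word_embeddings,
...     bilingual_dictionary={"hello": ["hallo"]},
... )
>>> aligned_embeddings = alignment.apply(source_embedding_matrix)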
"""

def __init__(
@@ -75,6 +109,23 @@ def __init__(
def _load_bilingual_dictionary(
file_path: str | os.PathLike,
) -> dict[str, list[str]]:
"""Loads a bilingual dictionary from a file.

The file is expected to contain word pairs, one per line, separated by tabs, e.g.:

```
english_word1 \t target_word1\n
english_word2 \t target_word2\n
...
english_wordn \t target_wordn\n
```

Args:
file_path: Path to the bilingual dictionary file.

Returns:
A dictionary where the keys are source language words, and the values are lists of target language words.
"""
bilingual_dictionary: dict[str, list[str]] = {}

for line in open(file_path):
@@ -91,6 +142,15 @@ def _load_bilingual_dictionary(
return bilingual_dictionary

def _compute_alignment_matrix(self) -> NDArray:
"""Computes the alignment matrix using the bilingual dictionary.

The method iterates over the bilingual dictionary, retrieving word vector correspondences from the
source and target language embeddings. It uses orthogonal Procrustes analysis to compute the
transformation matrix that aligns the source word embeddings with the target word embeddings.
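
Concretely, if `X` stacks the source word vectors and `Y` the corresponding
target word vectors, the alignment matrix `W` solves the orthogonal Procrustes
problem: minimize ||X @ W - Y||_F subject to W.T @ W = I.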

Returns:
A 2D array representing the alignment matrix.
"""
logger.info(
"Computing word embedding alignment matrix from bilingual dictionary"
)
@@ -145,6 +205,17 @@ def apply(self, embedding_matrix: NDArray) -> NDArray:
return alignment_matrix

def apply(self, embedding_matrix: NDArray) -> NDArray:
"""Applies the computed alignment matrix to the given embedding matrix.

The embedding matrix is transformed by multiplying it with the alignment matrix
obtained from the bilingual dictionary.

Args:
embedding_matrix: 2D embedding matrix to be aligned.

Returns:
Aligned embedding matrix.
"""
alignment_matrix = self._compute_alignment_matrix()
aligned_embedding_matrix = embedding_matrix @ alignment_matrix
return aligned_embedding_matrix