Improve code documentation #10

Merged: 8 commits, Nov 24, 2024
30 changes: 28 additions & 2 deletions README.md
@@ -30,6 +30,8 @@ The library currently implements the following methods:
- [CLP-Transfer: Efficient language model training through cross-lingual and progressive transfer learning.](https://arxiv.org/abs/2301.09626) Ostendorff, Malte, and Georg Rehm. arXiv preprint arXiv:2301.09626 (2023).
- [FOCUS: Effective Embedding Initialization for Specializing Pretrained Multilingual Models on a Single Language.](https://arxiv.org/abs/2305.14481) Dobler, Konstantin, and Gerard de Melo. arXiv preprint arXiv:2305.14481 (2023).

Langsfer is flexible enough to allow mixing and matching strategies between different embedding initialization schemes. For example, you can combine fuzzy token overlap with the CLP-Transfer method to refine the initialization process based on fuzzy matches between source and target tokens. This flexibility enables you to experiment with a variety of strategies for different language transfer tasks, making it easier to fine-tune models for your specific use case.
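For instance, a combined setup might look like the following sketch; the `CLPTransferInitialization` and `FuzzyTokenOverlap` class names here are illustrative placeholders rather than the library's exact API:

```python
# Illustrative sketch only: the class names below are hypothetical placeholders.
# The point is that the token-overlap strategy is an independent, swappable
# component of the overall initialization scheme.
embedding_initializer = CLPTransferInitialization(
    source_embeddings_matrix=source_embeddings_matrix,
    source_auxiliary_embeddings=source_auxiliary_embeddings,
    target_auxiliary_embeddings=target_auxiliary_embeddings,
    token_overlap_strategy=FuzzyTokenOverlap(),  # fuzzy matching between source and target tokens
)
target_embeddings_matrix = embedding_initializer.initialize(seed=16, show_progress=True)
```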

## Quick Start

### Installation
@@ -79,6 +81,7 @@ source_tokenizer = AutoTokenizer.from_pretrained("roberta-base")
target_tokenizer = AutoTokenizer.from_pretrained("benjamin/roberta-base-wechsel-german")

source_model = AutoModel.from_pretrained("roberta-base")
# For models with untied embeddings, you can choose to transfer the input and output embeddings separately.
source_embeddings_matrix = source_model.get_input_embeddings().weight.detach().numpy()

source_auxiliary_embeddings = FastTextEmbeddings.from_model_name_or_path("en")
@@ -105,8 +108,7 @@ To initialize the target embeddings you would then use:
target_embeddings_matrix = embedding_initializer.initialize(seed=16, show_progress=True)
```

The result is a 2D array that contains the initialized embeddings matrix for the target language model.

We can then replace the source model's embeddings matrix with this newly initialized embeddings matrix:

@@ -123,6 +125,30 @@ target_model.get_input_embeddings().weight.data = torch.as_tensor(target_embeddi
target_model.save_pretrained("path/to/target_model")
```

## Roadmap

Here are some of the planned developments for Langsfer:

- **Performance Optimization**: Improve the efficiency and usability of the library to streamline workflows
and speed up computation.

- **Model Training & Hugging Face Hub Publishing**: Train both small and large models with embeddings initialized using Langsfer
and publish the resulting models to the Hugging Face Hub for public access and use.

- **Parameter-Efficient Fine-Tuning**: Investigate using techniques such as LoRA (Low-Rank Adaptation)
to enable parameter-efficient fine-tuning, making it easier to adapt models to specific languages with minimal overhead.

- **Implement New Methods**: Extend Langsfer with additional language transfer methods, including:

- [Ofa: A framework of initializing unseen subword embeddings for efficient large-scale multilingual continued pretraining.](https://arxiv.org/abs/2311.08849)
Liu, Y., Lin, P., Wang, M. and Schütze, H., 2023. arXiv preprint arXiv:2311.08849.
- [Zero-Shot Tokenizer Transfer.](https://arxiv.org/abs/2405.07883)
Minixhofer, B., Ponti, E.M. and Vulić, I., 2024. arXiv preprint arXiv:2405.07883.

- **Comprehensive Benchmarking**: Run extensive benchmarks across all implemented methods to evaluate their performance, identify strengths
and weaknesses, and compare results to establish best practices for language transfer.


## Contributing

Refer to the [contributing guide](CONTRIBUTING.md) for instructions on how you can contribute to this repository.
79 changes: 75 additions & 4 deletions src/langsfer/alignment.py
@@ -1,3 +1,8 @@
"""This module provides strategies for aligning embedding matrices using different techniques.

The `AlignmentStrategy` class is an abstract base class that defines the interface for embedding alignment strategies.
"""

import logging
import os
import warnings
@@ -16,18 +21,47 @@


class AlignmentStrategy(ABC):
"""Abstract base class for defining strategies to align embedding matrices.

Subclasses must implement the `apply` method to define the logic for aligning
the embedding matrix based on their specific alignment technique.
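
Example:
A toy subclass (for illustration only) that scales every embedding vector:

>>> class ScaledAlignment(AlignmentStrategy):
...     def apply(self, embedding_matrix: NDArray) -> NDArray:
...         return 2.0 * embedding_matrix
>>> aligned = ScaledAlignment().apply(embedding_matrix)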
"""

@abstractmethod
def apply(self, embedding_matrix: NDArray) -> NDArray: ...


class IdentityAlignment(AlignmentStrategy):
"""Alignment strategy that does not alter the input embedding matrix.

This strategy simply returns the input embedding matrix unchanged.

Example:
>>> identity_alignment = IdentityAlignment()
>>> aligned_embeddings = identity_alignment.apply(embedding_matrix)
>>> # aligned_embeddings will be the same as embedding_matrix
"""

def apply(self, embedding_matrix: NDArray) -> NDArray:
"""Returns the input embedding matrix unchanged.

Args:
embedding_matrix: 2D embedding matrix to be aligned.

Returns:
The same embedding matrix as the output, without any modifications.
"""
return embedding_matrix


class BilingualDictionaryAlignment(AlignmentStrategy):
"""Alignment strategy that uses a bilingual dictionary to compute the alignment matrix.

This strategy uses word pairs from a bilingual dictionary to compute an alignment
matrix between the source and target embedding matrices. The dictionary maps words in the
source language to words in the target language. The alignment matrix is computed by
applying orthogonal Procrustes analysis to the word vector correspondences.

The bilingual dictionary maps words in the source language to words in the target language
and is expected to be of the form:

@@ -39,10 +73,10 @@ class BilingualDictionaryAlignment(AlignmentStrategy):
```

Args:
source_word_embeddings: Word embeddings of the source language.
target_word_embeddings: Word embeddings of the target language.
bilingual_dictionary: Dictionary mapping words in source language to words in target language.
bilingual_dictionary_file: Path to a bilingual dictionary file containing word pairs.
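
Example:
A minimal sketch, assuming `source_word_embeddings` and `target_word_embeddings`
are already-loaded word embeddings:

>>> alignment = BilingualDictionaryAlignment(
...     source_word_embeddings,
...     target_word_embeddings,
...     bilingual_dictionary={"hello": ["hallo"]},
... )
>>> aligned_embeddings = alignment.apply(source_embedding_matrix)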
"""

def __init__(
@@ -75,6 +109,23 @@ def __init__(
def _load_bilingual_dictionary(
file_path: str | os.PathLike,
) -> dict[str, list[str]]:
"""Loads a bilingual dictionary from a file.

The file is expected to contain word pairs, one per line, separated by tabs, e.g.:

```
english_word1 \t target_word1\n
english_word2 \t target_word2\n
...
english_wordn \t target_wordn\n
```

Args:
file_path: Path to the bilingual dictionary file.

Returns:
A dictionary where the keys are source language words, and the values are lists of target language words.
"""
bilingual_dictionary: dict[str, list[str]] = {}

for line in open(file_path):
@@ -91,6 +142,15 @@ def _load_bilingual_dictionary(
return bilingual_dictionary

def _compute_alignment_matrix(self) -> NDArray:
"""Computes the alignment matrix using the bilingual dictionary.

The method iterates over the bilingual dictionary, retrieving word vector correspondences from the
source and target language embeddings. It uses orthogonal Procrustes analysis to compute the
transformation matrix that aligns the source word embeddings with the target word embeddings.
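
Concretely, if `X` stacks the source word vectors and `Y` the corresponding
target word vectors, the alignment matrix `W` solves the orthogonal Procrustes
problem: minimize ||X @ W - Y||_F subject to W.T @ W = I.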

Returns:
A 2D array representing the alignment matrix.
"""
logger.info(
"Computing word embedding alignment matrix from bilingual dictionary"
)
@@ -145,6 +205,17 @@ def apply(self, embedding_matrix: NDArray) -> NDArray:
return alignment_matrix

def apply(self, embedding_matrix: NDArray) -> NDArray:
"""Applies the computed alignment matrix to the given embedding matrix.

The embedding matrix is transformed by multiplying it with the alignment matrix
obtained from the bilingual dictionary.

Args:
embedding_matrix: 2D embedding matrix to be aligned.

Returns:
Aligned embedding matrix.
"""
alignment_matrix = self._compute_alignment_matrix()
aligned_embedding_matrix = embedding_matrix @ alignment_matrix
return aligned_embedding_matrix