
Langsfer, a library for language transfer methods and algorithms.


Language transfer refers to a few related things:

  • initializing a Large Language Model (LLM) in a new, typically low-resource, target language (e.g. German, Arabic) from another LLM trained in a high-resource source language (e.g. English),
  • extending the vocabulary of an LLM by adding new tokens and initializing their embeddings in a manner that allows them to be used with little to no extra training,
  • specializing the vocabulary of a multilingual LLM to one of its supported languages.

The library currently implements the following methods:

Langsfer is flexible enough to allow mixing and matching strategies between different embedding initialization schemes. For example, you can combine fuzzy token overlap with the CLP-Transfer method to refine the initialization process based on fuzzy matches between source and target tokens. This flexibility enables you to experiment with a variety of strategies for different language transfer tasks, making it easier to fine-tune models for your specific use case.
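To make the difference between exact and fuzzy token overlap more concrete, here is a toy sketch in plain Python. It is not Langsfer's implementation: the `normalize` rule (strip common subword markers, then lowercase) and the function names are assumptions made purely for illustration.

```python
def normalize(token: str) -> str:
    """Strip common subword markers and lowercase (an assumed notion of 'fuzzy')."""
    for marker in ("Ġ", "▁", "##"):
        token = token.replace(marker, "")
    return token.lower()


def exact_overlap(source_vocab, target_vocab):
    """Tokens that appear verbatim in both vocabularies."""
    return sorted(set(source_vocab) & set(target_vocab))


def fuzzy_overlap(source_vocab, target_vocab):
    """Map each target token to a source token with the same normalized form."""
    source_by_form = {}
    for token in source_vocab:
        # Keep the first source token seen for each normalized form.
        source_by_form.setdefault(normalize(token), token)
    matches = {}
    for token in target_vocab:
        form = normalize(token)
        if form and form in source_by_form:
            matches[token] = source_by_form[form]
    return matches


# Hypothetical vocabularies using two different subword marker conventions.
source_vocab = ["Ġhouse", "Ġthe", "ing", "Berlin"]
target_vocab = ["▁House", "▁das", "ing", "▁berlin"]

print(exact_overlap(source_vocab, target_vocab))
print(fuzzy_overlap(source_vocab, target_vocab))
```

In this toy example only one token matches exactly, while the fuzzy rule also pairs tokens that differ only in casing or subword marker, so more target tokens could reuse a source embedding directly.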

Quick Start

Installation

To install the latest stable version from PyPI use:

pip install langsfer

To install the latest development version from TestPyPI use:

pip install -i https://test.pypi.org/simple/ langsfer

To install the latest development version from the repository use:

git clone git@github.com:AnesBenmerzoug/langsfer.git
cd langsfer
pip install .
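Whichever installation route you choose, you may want to confirm which version ended up in your environment. The following is a small sketch using only the Python standard library; the helper name `installed_version` is made up for this example and is not part of Langsfer.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package: str):
    """Return the installed version string of `package`, or None if it is not installed."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Prints the version string if langsfer is installed, otherwise None.
print(installed_version("langsfer"))
```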

Tutorials

The following notebooks serve as tutorials for users of the package:

Simple Example

The package provides high-level interfaces to instantiate each of the methods without having to worry about the package's internals.

For example, to use the WECHSEL method, you would use:

from langsfer.high_level import wechsel
from langsfer.embeddings import FastTextEmbeddings
from langsfer.utils import download_file
from transformers import AutoModel, AutoTokenizer

source_tokenizer = AutoTokenizer.from_pretrained("roberta-base")
target_tokenizer = AutoTokenizer.from_pretrained("benjamin/roberta-base-wechsel-german")

source_model = AutoModel.from_pretrained("roberta-base")
# For models with non-tied embeddings, you can choose whether to transfer the input and output embeddings separately.
source_embeddings_matrix = source_model.get_input_embeddings().weight.detach().numpy()

source_auxiliary_embeddings = FastTextEmbeddings.from_model_name_or_path("en")
target_auxiliary_embeddings = FastTextEmbeddings.from_model_name_or_path("de")

bilingual_dictionary_file = download_file(
    "https://raw.githubusercontent.com/CPJKU/wechsel/main/dicts/data/german.txt",
    "german.txt",
)

embedding_initializer = wechsel(
    source_tokenizer=source_tokenizer,
    source_embeddings_matrix=source_embeddings_matrix,
    target_tokenizer=target_tokenizer,
    target_auxiliary_embeddings=target_auxiliary_embeddings,
    source_auxiliary_embeddings=source_auxiliary_embeddings,
    bilingual_dictionary_file=bilingual_dictionary_file,
)

To initialize the target embeddings you would then use:

target_embeddings_matrix = embedding_initializer.initialize(seed=16, show_progress=True)

The result is a 2D array containing the initialized embeddings matrix for the target language model.
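As a rough intuition for what a row of this matrix holds: in WECHSEL-style methods, the embedding of a target token that has no direct counterpart is built as a weighted average of source token embeddings. The sketch below shows that arithmetic in plain Python on toy data. The function name and the numbers are illustrative assumptions, and Langsfer's actual computation is not shown here.

```python
def weighted_average(source_embeddings, weights):
    """Combine source embedding rows into one vector using the given weights."""
    if len(source_embeddings) != len(weights):
        raise ValueError("need exactly one weight per source embedding")
    total = sum(weights)
    if total == 0:
        raise ValueError("weights must not sum to zero")
    dim = len(source_embeddings[0])
    # Normalize by the total so the weights do not need to sum to 1.
    return [
        sum(w * row[d] for w, row in zip(weights, source_embeddings)) / total
        for d in range(dim)
    ]


# Three toy source embeddings of dimension 2, and one weight per source token.
source_rows = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights = [0.5, 0.5, 0.0]

print(weighted_average(source_rows, weights))
```

A weight of zero removes a source token from the average entirely, which is why the choice of weighting scheme matters for the result.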

We can then replace the source model's embeddings matrix with this newly initialized embeddings matrix:

import torch
from transformers import AutoModel

target_model = AutoModel.from_pretrained("roberta-base")
# Resize its embedding layer
target_model.resize_token_embeddings(len(target_tokenizer))
# Replace the source embeddings matrix with the target embeddings matrix
target_model.get_input_embeddings().weight.data = torch.as_tensor(target_embeddings_matrix)
# Save the new model
target_model.save_pretrained("path/to/target_model")

Advanced Example

Langsfer also provides lower-level interfaces that let you tweak many of the components of the embedding initialization. To use them, however, you need to know a bit more about the package's internals.

For example, if you want to replace the WECHSEL method's weight strategy and token overlap strategy with Sparsemax and fuzzy token overlap, respectively, you would use:

from langsfer.initialization import WeightedAverageEmbeddingsInitialization
from langsfer.alignment import BilingualDictionaryAlignment
from langsfer.embeddings import FastTextEmbeddings
from langsfer.weights import SparsemaxWeights
from langsfer.token_overlap import FuzzyMatchTokenOverlap

embeddings_initializer = WeightedAverageEmbeddingsInitialization(
  source_tokenizer=source_tokenizer,
  source_embeddings_matrix=source_embeddings_matrix,
  target_tokenizer=target_tokenizer,
  target_auxiliary_embeddings=target_auxiliary_embeddings,
  source_auxiliary_embeddings=source_auxiliary_embeddings,
  alignment_strategy=BilingualDictionaryAlignment(
      source_auxiliary_embeddings,
      target_auxiliary_embeddings,
      bilingual_dictionary_file=bilingual_dictionary_file,
  ),
  weights_strategy=SparsemaxWeights(),
  token_overlap_strategy=FuzzyMatchTokenOverlap(),
)
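For readers unfamiliar with Sparsemax, the following plain-Python sketch shows the standard sparsemax transformation (Martins and Astudillo, 2016), which maps a list of scores to non-negative weights that sum to 1 and can be exactly zero. It is only meant to illustrate the idea. It makes no claim about how `SparsemaxWeights` is implemented inside Langsfer.

```python
def sparsemax(scores):
    """Project a list of scores onto the probability simplex."""
    if not scores:
        return []
    ordered = sorted(scores, reverse=True)
    cumulative = 0.0
    support_size = 0
    support_sum = 0.0
    for k, value in enumerate(ordered, start=1):
        cumulative += value
        # The k-th largest score belongs to the support if this condition holds.
        if 1.0 + k * value > cumulative:
            support_size = k
            support_sum = cumulative
    # Threshold subtracted from every score; scores below it get weight zero.
    threshold = (support_sum - 1.0) / support_size
    return [max(score - threshold, 0.0) for score in scores]


print(sparsemax([2.0, 0.0]))
print(sparsemax([0.5, 0.3, 0.1]))
```

Unlike softmax, which assigns every score a strictly positive weight, sparsemax can drop low-scoring source tokens entirely, so fewer source embeddings contribute to each target embedding.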

You could even implement your own strategies for token overlap computation, embedding alignment, similarity score computation and weight computation.

Roadmap

Here are some of the planned developments for Langsfer:

  • Performance Optimization: Improve the efficiency and usability of the library to streamline workflows and improve computational performance.

  • Model Training & Hugging Face Hub Publishing: Train both small and large models with embeddings initialized using Langsfer and publish the resulting models to the Hugging Face Hub for public access and use.

  • Parameter-Efficient Fine-Tuning: Investigate using techniques such as LoRA (Low-Rank Adaptation) to enable parameter-efficient fine-tuning, making it easier to adapt models to specific languages with minimal overhead.

  • Implement New Methods: Extend Langsfer with additional language transfer methods, including:

  • Comprehensive Benchmarking: Run extensive benchmarks across all implemented methods to evaluate their performance, identify strengths and weaknesses, and compare results to establish best practices for language transfer.

Contributing

Refer to the contributing guide for instructions on how you can contribute to this repository.

Logo

The Langsfer logo was created by my good friend Zakaria Taleb Hacine, a 3D artist with industry experience and a packed portfolio.

The logo contains the Latin letters A and I, an acronym for Artificial Intelligence, and the Arabic letters أ and ذ, an acronym for ذكاء اصطناعي, which means Artificial Intelligence in Arabic.

The fonts used are Ethnocentric Regular and Readex Pro.

License

This package is licensed under the LGPL-2.1 license.