
Nootka-io/nooForge


NooForge

Efficient tokenization and training of Hugging Face Transformers models.

The core of nooForge is its packing algorithm.

NOTE: This is an alpha release (and I would barely call it that). The code works; however, it lacks documentation, packaging, and a requirements file. Some more details and a usage example are available here: https://github.com/getorca/stock_price_chat

Packing & Tokenization

This is the core innovation of NooForge: packing minimizes the number of padding tokens in each batch, making significantly better use of compute during training.

Tokenization is done before training, in an isolated step. This makes it possible to efficiently pack the training samples and creates a cleaner separation of concerns. It's largely handled by DuckDB and uses Jinja2 templates for constructing samples.
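To illustrate the packing idea, here is a minimal greedy sketch (not nooForge's actual implementation): variable-length tokenized samples are concatenated into fixed-length sequences so that far fewer positions are spent on padding than with one sample per sequence.

```python
# Greedy sample packing sketch: fill each training sequence with as many
# tokenized samples as fit, padding only the remainder.

def pack_samples(tokenized, max_len, pad_id=0):
    """Pack variable-length token lists into sequences of exactly max_len."""
    packed, current = [], []
    # Sorting longest-first tends to leave less unused space per sequence.
    for tokens in sorted(tokenized, key=len, reverse=True):
        if len(tokens) > max_len:
            tokens = tokens[:max_len]  # truncate over-long samples
        if len(current) + len(tokens) > max_len:
            # Current sequence is full: pad it out and start a new one.
            packed.append(current + [pad_id] * (max_len - len(current)))
            current = []
        current.extend(tokens)
    if current:
        packed.append(current + [pad_id] * (max_len - len(current)))
    return packed

samples = [[1, 2, 3], [4, 5], [6, 7, 8, 9], [10]]
sequences = pack_samples(samples, max_len=5)
```

With these toy samples, 10 tokens that would need four padded sequences (20 positions) fit into three packed sequences of length 5. A production packer would also track attention-mask boundaries between samples, which this sketch omits.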

python noo_tokenizer.py --config ./training_scripts/stc_config.yml

example of tokenizer config: https://github.com/getorca/stock_price_chat/blob/main/training_scripts/stc_config.yml

Templating

Uses PyYaml (YAML 1.1 support) to parse the YAML file and Jinja2 to render the templates.

See https://yaml.org/spec/1.1/ for more info on yaml syntax.

See https://jinja.palletsprojects.com/en/3.1.x/api/ for more on Jinja2 syntax.
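The templating flow can be sketched as follows. This is a hedged example, not nooForge's actual schema: the `template` key and the `instruction`/`response` field names are hypothetical, chosen only to show PyYAML parsing the config and Jinja2 rendering a sample from a data row.

```python
import yaml                # PyYAML: YAML 1.1 parser
from jinja2 import Template

# Hypothetical config fragment; nooForge's real keys may differ.
config_yaml = """
template: |
  Instruction: {{ instruction }}
  Response: {{ response }}
"""

config = yaml.safe_load(config_yaml)        # parse the YAML config
template = Template(config["template"])     # compile the Jinja2 template

# One data row becomes one rendered training sample.
row = {"instruction": "What is AAPL's price?", "response": "182.52"}
rendered = template.render(**row)
```

Each rendered string would then be tokenized and packed in the step described above.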

Training

Usage example: https://github.com/getorca/stock_price_chat/blob/main/training_scripts/finetune_spc_01.sh

Citations / Credits

ToDos

  • training
  • tokenization
    • more flexibility
