Production-ready pipeline for extracting Lean theorem/proof pairs, augmenting inputs, and fine-tuning GPT-2 with LoRA adapters.
- Fetches Mathlib4, Lean 4 stdlib, MiniF2F (optional LeanDojo) with caching and license annotations.
- Robust Lean parser that captures theorem/lemma/example statements, proofs, imports, and metadata.
- Input augmentation producing formal headers, math-form reductions, and configurable natural-language paraphrases.
- Hugging Face `DatasetDict` builder with leakage-safe file hashing and GPT-2 compatible tokenization/masking.
- LoRA fine-tuning loop using PEFT + Hugging Face `Trainer` with early stopping, mixed precision, and rich logging.
- Evaluation script reporting perplexity/token F1 and saving qualitative proof generations.
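The leakage-safe file hashing mentioned above can be understood through a small sketch (the helper name `split_for_file` is illustrative, not the repo's actual API): every pair extracted from the same `.lean` file is routed to the same split, so near-duplicate statements never straddle train and eval.

```python
import hashlib


def split_for_file(path: str, val_pct: int = 5, test_pct: int = 5) -> str:
    """Deterministically assign a source file to a split by hashing its path.

    Because the bucket depends only on the file path, all theorem/proof
    pairs from one .lean file land in the same split, preventing leakage.
    """
    digest = hashlib.sha256(path.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100
    if bucket < test_pct:
        return "test"
    if bucket < test_pct + val_pct:
        return "validation"
    return "train"
```

Hashing the path (rather than shuffling rows) keeps the assignment stable across re-runs of the pipeline.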
- Create environment (Python 3.10–3.12):

  ```bash
  conda create -n leanft python=3.11
  conda activate leanft
  ```

  If `pyarrow` fails via pip, install it first with `conda install -c conda-forge pyarrow`.
- Install the project:

  ```bash
  make setup
  ```
- Run the end-to-end CPU-friendly pipeline on a small subset:

  ```bash
  make fetch
  make extract
  make augment
  make build
  make train
  ```

  Each command has overridable environment variables (see `Makefile`).
- `scripts/fetch_data.py`: clones Lean sources into `data/raw`, skipping ones already present.
- `scripts/extract_pairs.py`: parses `.lean` files and writes `data/pairs.jsonl`.
- `scripts/augment_inputs.py`: expands each theorem into multiple `(input, target)` variants stored in `data/aug.jsonl`.
- `scripts/build_hf_dataset.py`: splits data with file-level hashing and saves a Hugging Face dataset to `data/hf`.
- `scripts/train_lora.py`: loads `configs/base.yaml`, tokenizes for GPT-2, applies LoRA, and trains.
- `scripts/evaluate.py`: loads saved adapters, reports perplexity/token F1, and writes generations to `samples.txt`.
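As a rough illustration of the extraction step, a deliberately simplified sketch is shown below (`extract_pairs` here is hypothetical; the real parser in `scripts/extract_pairs.py` also records imports and metadata, and handles cases this version ignores):

```python
import re


def extract_pairs(source: str):
    """Split a .lean source into (statement, proof) pairs.

    Simplifying assumptions (the real parser is more careful):
    declarations start at column 0, and the proof begins at the
    first `:=` in the declaration.
    """
    # Split the file at lines that begin a new declaration keyword.
    chunks = re.split(r"(?m)^(?=(?:theorem|lemma|example)\b)", source)
    pairs = []
    for chunk in chunks:
        if not chunk.startswith(("theorem", "lemma", "example")) or ":=" not in chunk:
            continue
        statement, _, proof = chunk.partition(":=")
        pairs.append({"statement": statement.strip(), "proof": proof.strip()})
    return pairs
```

A `:=` nested inside a statement (e.g. a `let` binding) would break this naive split, which is one reason the real parser needs to be more robust.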
- `configs/base.yaml` captures LoRA hyperparameters, dataset paths, device map, and logging backends.
- Enable Weights & Biases or TensorBoard by setting `report_to` (e.g., `["wandb"]`) and `use_wandb: true`.
- Mixed precision defaults to `null` for portability; set `bf16` on Apple Silicon/Ampere or `fp16` on CUDA.
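For orientation, a config along these lines might look as follows (field names and values are illustrative, not the repo's exact schema — consult `configs/base.yaml` itself):

```yaml
model_name: gpt2
dataset_dir: data/hf
lora:
  r: 16
  alpha: 32
  dropout: 0.05
training:
  learning_rate: 2.0e-4
  num_train_epochs: 3
  precision: null    # or bf16 / fp16, per the note above
report_to: []        # e.g. ["wandb"]
use_wandb: false
```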
- `checkpoints/best`: best-performing adapter checkpoint (by perplexity).
- `checkpoints/samples.txt`: proof generations for qualitative inspection.
- Metrics (perplexity, token F1) printed via `scripts/evaluate.py`.
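The token F1 metric reported above can be sketched as follows (`token_f1` is illustrative; the actual script may tokenize differently, e.g. with GPT-2 BPE tokens rather than whitespace):

```python
from collections import Counter


def token_f1(prediction: str, reference: str) -> float:
    """Bag-of-tokens F1 between a generated proof and the reference.

    Counts multiset overlap of whitespace tokens, then combines
    precision and recall in the usual harmonic mean.
    """
    pred, ref = prediction.split(), reference.split()
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)
```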
- `pytest` covers Lean parsing edge cases and dataset masking behaviour (`tests/`).
- Run `make setup && pytest` before pushing changes or running large jobs.
- See `LICENSES.md` for upstream licenses (Mathlib4, Lean stdlib, MiniF2F, LeanDojo).