This project investigates whether "Inverse Scaling" datasets are resistant to model finetuning. We focused on the redefine-math task, where models must ignore mathematical priors to perform textual operations.
- Not finetuning-proof: Fine-tuning TinyLlama-1.1B improved accuracy from 46% (zero-shot) to 80% (LoRA).
- Stronger than ICL: Fine-tuning significantly outperformed 5-shot in-context learning (57%).
- Conclusion: Sufficient gradient updates can override strong pre-trained priors.
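Since `peft` is in the dependency list, the LoRA setup likely looks something like the sketch below. The specific hyperparameters (`r`, `lora_alpha`, dropout, target modules) and the model ID are illustrative assumptions, not values taken from `run_finetuning_experiment.py`:

```python
# Hypothetical LoRA configuration sketch; hyperparameter values are
# assumptions, not the ones used by run_finetuning_experiment.py.
from peft import LoraConfig, TaskType

lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,                       # low-rank adapter dimension (assumed)
    lora_alpha=16,             # adapter scaling factor (assumed)
    lora_dropout=0.05,         # (assumed)
    target_modules=["q_proj", "v_proj"],  # attention projections in Llama-style models
)
# The adapted model would then be built roughly as:
#   model = get_peft_model(
#       AutoModelForCausalLM.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0"),
#       lora_config,
#   )
```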
- Setup:
  `uv pip install torch transformers peft datasets bitsandbytes accelerate scikit-learn matplotlib pandas`
- Run Experiment:
  `python run_finetuning_experiment.py`
  (Note: this script now includes the robust evaluation logic.)
- Verify: Check `results/redefine_math_robust.json`.
- `run_finetuning_experiment.py`: Main training script.
- `evaluate_robust.py`: Evaluation script using log-likelihoods.
- `datasets/`: Inverse scaling data.
- `results/`: Checkpoints and JSON logs.
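The core idea behind scoring with log-likelihoods, as in `evaluate_robust.py`, is to sum each answer option's per-token log-probabilities under the model and pick the highest-scoring option. The sketch below is an illustrative stand-in: it assumes the per-token log-probs have already been computed (the real script would obtain them from the model, which is omitted here):

```python
import math

def pick_option(option_token_logprobs):
    """Given a dict mapping each answer option to its list of per-token
    log-probabilities, return the option with the highest total
    log-likelihood."""
    return max(option_token_logprobs, key=lambda opt: sum(option_token_logprobs[opt]))

# Two candidate answers with hypothetical per-token log-probs:
scores = {
    "100": [math.log(0.5), math.log(0.4)],  # total ~ -1.61
    "2":   [math.log(0.2), math.log(0.1)],  # total ~ -3.91
}
# pick_option(scores) -> "100"
```

This length-summed comparison is more robust than string-matching generated text, which is presumably why the README calls it the "robust evaluation logic."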