From e9c36422d85dce7faf1132961b7030f65f17ba94 Mon Sep 17 00:00:00 2001
From: Jan Provaznik
Date: Thu, 9 May 2024 16:50:21 +0200
Subject: [PATCH] reorganize repo for thesis submission
---
 README.md | 55 +++++++++++++------
 ...s_analysis.ipynb => evo_correlation.ipynb} | 0
 .../21_vignere3_noisy_random_news_en.ipynb | 2 +-
 .../22_vignere3_noisy_random_news_de.ipynb | 2 +-
 .../23_vignere3_noisy_random_news_cs.ipynb | 2 +-
 .../24_const_noisy_enigma_news_cs.ipynb | 2 +-
 .../25_const_noisy_enigma_news_de.ipynb | 2 +-
 .../26_const_noisy_enigma_news_en.ipynb | 2 +-
 .../{ => unused}/01_copy_random_text.ipynb | 0
 reproducible/{ => unused}/02_copy_news.ipynb | 0
 .../{ => unused}/03_caesar_random_text.ipynb | 0
 .../{ => unused}/04_caesar_news.ipynb | 0
 .../{ => unused}/05_triple_caesar_news.ipynb | 0
 .../06_all_caesar_hint_random_text.ipynb | 0
 .../07_all_caesar_hint_news.ipynb | 0
 .../{ => unused}/08_all_caesar_news.ipynb | 0
 .../{ => unused}/09_vignere2_news.ipynb | 0
 .../{ => unused}/10_vignere3_news.ipynb | 0
 .../{ => unused}/11_vignere_long_news.ipynb | 0
 .../12_vignere_multiple_news.ipynb | 0
 .../{ => unused}/13_vignere_random_news.ipynb | 0
 .../{ => unused}/14_const_enigma_news.ipynb | 0
 .../{ => unused}/15_rigged_caesar_news.ipynb | 0
 .../16_const_enigma_news_cs.ipynb | 0
 .../17_const_enigma_news_de.ipynb | 0
 .../18_const_noisy_enigma_news_de.ipynb | 0
 .../19_const_noisy_enigma_news_en.ipynb | 0
 .../20_vignere_noisy_random_news_en.ipynb | 0
 .../{ => unused}/load_and_explore_model.ipynb | 0
 29 files changed, 43 insertions(+), 24 deletions(-)
 rename analysis/{train_dynamics_analysis.ipynb => evo_correlation.ipynb} (100%)
 rename reproducible/{ => unused}/01_copy_random_text.ipynb (100%)
 rename reproducible/{ => unused}/02_copy_news.ipynb (100%)
 rename reproducible/{ => unused}/03_caesar_random_text.ipynb (100%)
 rename reproducible/{ => unused}/04_caesar_news.ipynb (100%)
 rename reproducible/{ => unused}/05_triple_caesar_news.ipynb (100%)
 rename reproducible/{ => unused}/06_all_caesar_hint_random_text.ipynb (100%)
 rename reproducible/{ => unused}/07_all_caesar_hint_news.ipynb (100%)
 rename reproducible/{ => unused}/08_all_caesar_news.ipynb (100%)
 rename reproducible/{ => unused}/09_vignere2_news.ipynb (100%)
 rename reproducible/{ => unused}/10_vignere3_news.ipynb (100%)
 rename reproducible/{ => unused}/11_vignere_long_news.ipynb (100%)
 rename reproducible/{ => unused}/12_vignere_multiple_news.ipynb (100%)
 rename reproducible/{ => unused}/13_vignere_random_news.ipynb (100%)
 rename reproducible/{ => unused}/14_const_enigma_news.ipynb (100%)
 rename reproducible/{ => unused}/15_rigged_caesar_news.ipynb (100%)
 rename reproducible/{ => unused}/16_const_enigma_news_cs.ipynb (100%)
 rename reproducible/{ => unused}/17_const_enigma_news_de.ipynb (100%)
 rename reproducible/{ => unused}/18_const_noisy_enigma_news_de.ipynb (100%)
 rename reproducible/{ => unused}/19_const_noisy_enigma_news_en.ipynb (100%)
 rename reproducible/{ => unused}/20_vignere_noisy_random_news_en.ipynb (100%)
 rename reproducible/{ => unused}/load_and_explore_model.ipynb (100%)
diff --git a/README.md b/README.md
index 52ab057..ab03981 100644
--- a/README.md
+++ b/README.md
@@ -1,9 +1,11 @@
 # Enigma Transformed
 ## Abstract
-This project explores the possibility of using a pretrained language model to decrypt ciphers. The aim is also to discover what linguistic features of a text the model learns to use by varying the test set and measuring accuracy.
-
+We explore the possibility of using a pre-trained Transformer language model to decrypt ciphers. The aim is also to discover what linguistic features of a text the model learns to use by measuring correlations of error rates.
+1. create evaluation dataset with linguistic properties
+2. train model on decipherment
+3. evaluate correlations and predictability from linguistic properties
 ## Docs
 ### How to run
@@ -14,7 +16,7 @@ pip install -e .
 ```
 #### Slurm cluster
 - basic setting: `sbatch -p gpu -c1 --gpus=1 --mem=16G `
-- use `run_notebook.sh ` to run a Jupyter notebook on a slurm cluster
+- use `./run_notebook.sh ` to run a Jupyter notebook on a slurm cluster
 #### Colab
 - clone this repo and use the desired `.ipynb` files
@@ -23,17 +25,16 @@ pip install -e .
 !git clone https://github.com/JanProvaznik/enigma-transformed
 !pip install transformers[torch] Levenshtein py-enigma
 ```
-### Meta info
-- uses lowercased letters in all experiments
-- usually preserving spaces, punctuation in some
-- using the [Levenshtein distance](https://en.wikipedia.org/wiki/Levenshtein_distance) to measure the error rate of the model on an evaluation dataset
-- using the [statmt newscrawl](https://statmt.org/) dataset to obtain real world text for training and evaluation
-- using the [Huggingface Transformers library](https://huggingface.co/transformers/) running on [PyTorch](https://pytorch.org/)
-- using pretrained [ByT5](https://arxiv.org/abs/2105.13626) character level models and fine-tuning them on ciphers
 ### Source code
 #### reproducible/
-- for each experiment contians a notebook that can be used to reproduce it in a readable manner
+- scripts for fine-tuning ByT5 on ciphers
+- Used in thesis:
+  - `21_vignere3_noisy_random_news_en.ipynb`, `22_vignere3_noisy_random_news_de.ipynb`, `23_vignere3_noisy_random_news_cs.ipynb` finetuning ByT5 to decrypt a random 3-letter key Vignere cipher on news sentences
+  - `24_const_noisy_enigma_news_cs.ipynb`, `25_const_noisy_enigma_news_de.ipynb`, `26_const_noisy_enigma_news_en.ipynb` finetuning ByT5 to decrypt a simplified Enigma cipher on news sentences
+
+
+- old experiments in `unused/`
   - `01_copy_random_text.ipynb` - trains model to copy on random strings
   - `02_copy_news.ipynb` - trains model to copy on news sentences
   - `03_caesar_random_text.ipynb` - trains model to decrypt constant caesar cipher (only one setting) on random strings
@@ -47,10 +48,22 @@ pip install -e .
   - `10_vignere3_news` - trains model to decrypt constant 3 letter vignere cipher on news sentences
   - `11_vignere_long_news` - trains model to decrypt constant vignere cipher with key 'helloworld' on news sentences
   - `12_vignere_multiple_news` - trains model to decrypt 2 letter vignere cipher, with 3 settings on news sentences
+  - ... and more
+
+#### data/
+- `weird_classify.ipynb` and `lang_classify.ipynb` - filter out sentences
+- `measure_dataset(cs,de).ipynb` - annotate linguistic properties of a dataset
+- `evaluation_batchedgpuevaluate_other_models.ipynb` - inference decipherments by different model checkpoints
+#### analysis/
+- `loss_curves.ipynb` - visualize loss curves of training with error density at checkpoints
+- `corr_matrices.ipynb` - to create correlation matrices of error rates and linguistic properties
+- `evo_correlation.ipynb` - to graph of evolution of correlations of error rates and linguistic properties
+- `pred_shap.ipynb` - predict error rates with simple ML and analyze with shap
-#### run_notebook.sh
-- script for running a notebook on a slurm cluster
+
+#### run_notebook.sh and run_notebook4gpu.sh
+- script for running training or inference notebooks on a slurm cluster with GPUs
 #### src/
@@ -77,8 +90,7 @@ pip install -e .
 ##### `lens_train.py`
 - script to replicate reproducible/03 with [TransformerLens](https://github.com/neelnanda-io/TransformerLens) library and minimal amount of resources (only 1 layer transformer)
-
-### Usual experiment pipeline
+### What happens when training
 0. get data from the internet or generate it
 1. filter the data for the given experiment (e.g. only sentences 100-200 characters long)
 2. preprocess the data: only a-z + spaces, trim/pad to desired length
@@ -87,19 +99,26 @@ pip install -e .
 5. train the model on the training pairs
 6. save the model
 7. evaluate the performance of the model (during training and after training)
-  - e.g. edit distances
+
+### Meta info
+- uses lowercased letters in all experiments
+- vigenere preserves spaces, enigma replaces them with X
+- using the [Levenshtein distance](https://en.wikipedia.org/wiki/Levenshtein_distance) to measure the error rate of the model on an evaluation dataset
+- using the [statmt newscrawl](https://statmt.org/) dataset to obtain real world text for training and evaluation
+- using the [Huggingface Transformers library](https://huggingface.co/transformers/) running on [PyTorch](https://pytorch.org/)
+- using pretrained [ByT5](https://arxiv.org/abs/2105.13626) character level models and fine-tuning them on ciphers
 ### Training hyperparameters:
 #### number of training examples
 - the more the better (if model sees all cipher configuations it won't have to generalize the cipher procedure, but only detect which configuation is used and apply it)
 #### trainable parameters in model
-- the more the better, but we're limited by the GPU memory (and time), bigger models will use have harder time to use big batch sizes
+- the more the better, but we're limited by the GPU memory (and time), bigger models will use have harder time using big batch sizes
 #### epochs
 - the more the better, but we're limited by the time we have
 #### batch size
-- if too low, model won't be able to learn any patterns
+- if too low, models won't be able to learn any patterns
 - generally the higher the better, but we're limited by the GPU memory
 - trick: use gradient accumulation
   - e.g. if we have batch size 16 and gradient accumulation 16 -> the effective batch size is 256
diff --git a/analysis/train_dynamics_analysis.ipynb b/analysis/evo_correlation.ipynb
similarity index 100%
rename from analysis/train_dynamics_analysis.ipynb
rename to analysis/evo_correlation.ipynb
diff --git a/reproducible/21_vignere3_noisy_random_news_en.ipynb b/reproducible/21_vignere3_noisy_random_news_en.ipynb
index c19d9d4..3aa21f8 100644
--- a/reproducible/21_vignere3_noisy_random_news_en.ipynb
+++ b/reproducible/21_vignere3_noisy_random_news_en.ipynb
@@ -6,7 +6,7 @@
    "metadata": {},
    "source": [
     "\n",
-    "# Vignere cipher (all possible settings, length 3) on news dataset"
+    "# Vignere cipher (all possible settings, length 3) on EN news dataset"
    ]
   },
   {
diff --git a/reproducible/22_vignere3_noisy_random_news_de.ipynb b/reproducible/22_vignere3_noisy_random_news_de.ipynb
index b034a12..08558a3 100644
--- a/reproducible/22_vignere3_noisy_random_news_de.ipynb
+++ b/reproducible/22_vignere3_noisy_random_news_de.ipynb
@@ -6,7 +6,7 @@
    "metadata": {},
    "source": [
     "\n",
-    "# Vignere cipher (all possible settings, length 3) on news dataset"
+    "# Vignere cipher (all possible settings, length 3) on DE news dataset"
    ]
   },
   {
diff --git a/reproducible/23_vignere3_noisy_random_news_cs.ipynb b/reproducible/23_vignere3_noisy_random_news_cs.ipynb
index 7cd6950..4b24934 100644
--- a/reproducible/23_vignere3_noisy_random_news_cs.ipynb
+++ b/reproducible/23_vignere3_noisy_random_news_cs.ipynb
@@ -6,7 +6,7 @@
    "metadata": {},
    "source": [
     "\n",
-    "# Vignere cipher (all possible settings, length 3) on news dataset"
+    "# Vignere cipher (all possible settings, length 3) on CS news dataset"
    ]
   },
   {
diff --git a/reproducible/24_const_noisy_enigma_news_cs.ipynb b/reproducible/24_const_noisy_enigma_news_cs.ipynb
index 8716a59..41071d3 100644
--- a/reproducible/24_const_noisy_enigma_news_cs.ipynb
+++ b/reproducible/24_const_noisy_enigma_news_cs.ipynb
@@ -6,7 +6,7 @@
    "metadata": {},
    "source": [
     "\n",
-    "# Vignere cipher (all possible settings, length 3) on news dataset"
+    "# Enigma cipher on CS news dataset"
    ]
   },
   {
diff --git a/reproducible/25_const_noisy_enigma_news_de.ipynb b/reproducible/25_const_noisy_enigma_news_de.ipynb
index c2e2cff..ac4ff97 100644
--- a/reproducible/25_const_noisy_enigma_news_de.ipynb
+++ b/reproducible/25_const_noisy_enigma_news_de.ipynb
@@ -6,7 +6,7 @@
    "metadata": {},
    "source": [
     "\n",
-    "# Vignere cipher (all possible settings, length 3) on news dataset"
+    "# Enigma cipher on DE news dataset"
    ]
   },
   {
diff --git a/reproducible/26_const_noisy_enigma_news_en.ipynb b/reproducible/26_const_noisy_enigma_news_en.ipynb
index 384cbdb..c803b58 100644
--- a/reproducible/26_const_noisy_enigma_news_en.ipynb
+++ b/reproducible/26_const_noisy_enigma_news_en.ipynb
@@ -6,7 +6,7 @@
    "metadata": {},
    "source": [
     "\n",
-    "# Const noisy enigma on english"
+    "# Const noisy enigma on english news dataset"
    ]
   },
   {
diff --git a/reproducible/01_copy_random_text.ipynb b/reproducible/unused/01_copy_random_text.ipynb
similarity index 100%
rename from reproducible/01_copy_random_text.ipynb
rename to reproducible/unused/01_copy_random_text.ipynb
diff --git a/reproducible/02_copy_news.ipynb b/reproducible/unused/02_copy_news.ipynb
similarity index 100%
rename from reproducible/02_copy_news.ipynb
rename to reproducible/unused/02_copy_news.ipynb
diff --git a/reproducible/03_caesar_random_text.ipynb b/reproducible/unused/03_caesar_random_text.ipynb
similarity index 100%
rename from reproducible/03_caesar_random_text.ipynb
rename to reproducible/unused/03_caesar_random_text.ipynb
diff --git a/reproducible/04_caesar_news.ipynb b/reproducible/unused/04_caesar_news.ipynb
similarity index 100%
rename from reproducible/04_caesar_news.ipynb
rename to reproducible/unused/04_caesar_news.ipynb
diff --git a/reproducible/05_triple_caesar_news.ipynb b/reproducible/unused/05_triple_caesar_news.ipynb
similarity index 100%
rename from reproducible/05_triple_caesar_news.ipynb
rename to reproducible/unused/05_triple_caesar_news.ipynb
diff --git a/reproducible/06_all_caesar_hint_random_text.ipynb b/reproducible/unused/06_all_caesar_hint_random_text.ipynb
similarity index 100%
rename from reproducible/06_all_caesar_hint_random_text.ipynb
rename to reproducible/unused/06_all_caesar_hint_random_text.ipynb
diff --git a/reproducible/07_all_caesar_hint_news.ipynb b/reproducible/unused/07_all_caesar_hint_news.ipynb
similarity index 100%
rename from reproducible/07_all_caesar_hint_news.ipynb
rename to reproducible/unused/07_all_caesar_hint_news.ipynb
diff --git a/reproducible/08_all_caesar_news.ipynb b/reproducible/unused/08_all_caesar_news.ipynb
similarity index 100%
rename from reproducible/08_all_caesar_news.ipynb
rename to reproducible/unused/08_all_caesar_news.ipynb
diff --git a/reproducible/09_vignere2_news.ipynb b/reproducible/unused/09_vignere2_news.ipynb
similarity index 100%
rename from reproducible/09_vignere2_news.ipynb
rename to reproducible/unused/09_vignere2_news.ipynb
diff --git a/reproducible/10_vignere3_news.ipynb b/reproducible/unused/10_vignere3_news.ipynb
similarity index 100%
rename from reproducible/10_vignere3_news.ipynb
rename to reproducible/unused/10_vignere3_news.ipynb
diff --git a/reproducible/11_vignere_long_news.ipynb b/reproducible/unused/11_vignere_long_news.ipynb
similarity index 100%
rename from reproducible/11_vignere_long_news.ipynb
rename to reproducible/unused/11_vignere_long_news.ipynb
diff --git a/reproducible/12_vignere_multiple_news.ipynb b/reproducible/unused/12_vignere_multiple_news.ipynb
similarity index 100%
rename from reproducible/12_vignere_multiple_news.ipynb
rename to reproducible/unused/12_vignere_multiple_news.ipynb
diff --git a/reproducible/13_vignere_random_news.ipynb b/reproducible/unused/13_vignere_random_news.ipynb
similarity index 100%
rename from reproducible/13_vignere_random_news.ipynb
rename to reproducible/unused/13_vignere_random_news.ipynb
diff --git a/reproducible/14_const_enigma_news.ipynb b/reproducible/unused/14_const_enigma_news.ipynb
similarity index 100%
rename from reproducible/14_const_enigma_news.ipynb
rename to reproducible/unused/14_const_enigma_news.ipynb
diff --git a/reproducible/15_rigged_caesar_news.ipynb b/reproducible/unused/15_rigged_caesar_news.ipynb
similarity index 100%
rename from reproducible/15_rigged_caesar_news.ipynb
rename to reproducible/unused/15_rigged_caesar_news.ipynb
diff --git a/reproducible/16_const_enigma_news_cs.ipynb b/reproducible/unused/16_const_enigma_news_cs.ipynb
similarity index 100%
rename from reproducible/16_const_enigma_news_cs.ipynb
rename to reproducible/unused/16_const_enigma_news_cs.ipynb
diff --git a/reproducible/17_const_enigma_news_de.ipynb b/reproducible/unused/17_const_enigma_news_de.ipynb
similarity index 100%
rename from reproducible/17_const_enigma_news_de.ipynb
rename to reproducible/unused/17_const_enigma_news_de.ipynb
diff --git a/reproducible/18_const_noisy_enigma_news_de.ipynb b/reproducible/unused/18_const_noisy_enigma_news_de.ipynb
similarity index 100%
rename from reproducible/18_const_noisy_enigma_news_de.ipynb
rename to reproducible/unused/18_const_noisy_enigma_news_de.ipynb
diff --git a/reproducible/19_const_noisy_enigma_news_en.ipynb b/reproducible/unused/19_const_noisy_enigma_news_en.ipynb
similarity index 100%
rename from reproducible/19_const_noisy_enigma_news_en.ipynb
rename to reproducible/unused/19_const_noisy_enigma_news_en.ipynb
diff --git a/reproducible/20_vignere_noisy_random_news_en.ipynb b/reproducible/unused/20_vignere_noisy_random_news_en.ipynb
similarity index 100%
rename from reproducible/20_vignere_noisy_random_news_en.ipynb
rename to reproducible/unused/20_vignere_noisy_random_news_en.ipynb
diff --git a/reproducible/load_and_explore_model.ipynb b/reproducible/unused/load_and_explore_model.ipynb
similarity index 100%
rename from reproducible/load_and_explore_model.ipynb
rename to reproducible/unused/load_and_explore_model.ipynb