From e9c36422d85dce7faf1132961b7030f65f17ba94 Mon Sep 17 00:00:00 2001
From: Jan Provaznik <janpro@janpro.dev>
Date: Thu, 9 May 2024 16:50:21 +0200
Subject: [PATCH] reorganize repo for thesis submission

---
 README.md                                     | 55 +++++++++++++------
 ...s_analysis.ipynb => evo_correlation.ipynb} |  0
 .../21_vignere3_noisy_random_news_en.ipynb    |  2 +-
 .../22_vignere3_noisy_random_news_de.ipynb    |  2 +-
 .../23_vignere3_noisy_random_news_cs.ipynb    |  2 +-
 .../24_const_noisy_enigma_news_cs.ipynb       |  2 +-
 .../25_const_noisy_enigma_news_de.ipynb       |  2 +-
 .../26_const_noisy_enigma_news_en.ipynb       |  2 +-
 .../{ => unused}/01_copy_random_text.ipynb    |  0
 reproducible/{ => unused}/02_copy_news.ipynb  |  0
 .../{ => unused}/03_caesar_random_text.ipynb  |  0
 .../{ => unused}/04_caesar_news.ipynb         |  0
 .../{ => unused}/05_triple_caesar_news.ipynb  |  0
 .../06_all_caesar_hint_random_text.ipynb      |  0
 .../07_all_caesar_hint_news.ipynb             |  0
 .../{ => unused}/08_all_caesar_news.ipynb     |  0
 .../{ => unused}/09_vignere2_news.ipynb       |  0
 .../{ => unused}/10_vignere3_news.ipynb       |  0
 .../{ => unused}/11_vignere_long_news.ipynb   |  0
 .../12_vignere_multiple_news.ipynb            |  0
 .../{ => unused}/13_vignere_random_news.ipynb |  0
 .../{ => unused}/14_const_enigma_news.ipynb   |  0
 .../{ => unused}/15_rigged_caesar_news.ipynb  |  0
 .../16_const_enigma_news_cs.ipynb             |  0
 .../17_const_enigma_news_de.ipynb             |  0
 .../18_const_noisy_enigma_news_de.ipynb       |  0
 .../19_const_noisy_enigma_news_en.ipynb       |  0
 .../20_vignere_noisy_random_news_en.ipynb     |  0
 .../{ => unused}/load_and_explore_model.ipynb |  0
 29 files changed, 43 insertions(+), 24 deletions(-)
 rename analysis/{train_dynamics_analysis.ipynb => evo_correlation.ipynb} (100%)
 rename reproducible/{ => unused}/01_copy_random_text.ipynb (100%)
 rename reproducible/{ => unused}/02_copy_news.ipynb (100%)
 rename reproducible/{ => unused}/03_caesar_random_text.ipynb (100%)
 rename reproducible/{ => unused}/04_caesar_news.ipynb (100%)
 rename reproducible/{ => unused}/05_triple_caesar_news.ipynb (100%)
 rename reproducible/{ => unused}/06_all_caesar_hint_random_text.ipynb (100%)
 rename reproducible/{ => unused}/07_all_caesar_hint_news.ipynb (100%)
 rename reproducible/{ => unused}/08_all_caesar_news.ipynb (100%)
 rename reproducible/{ => unused}/09_vignere2_news.ipynb (100%)
 rename reproducible/{ => unused}/10_vignere3_news.ipynb (100%)
 rename reproducible/{ => unused}/11_vignere_long_news.ipynb (100%)
 rename reproducible/{ => unused}/12_vignere_multiple_news.ipynb (100%)
 rename reproducible/{ => unused}/13_vignere_random_news.ipynb (100%)
 rename reproducible/{ => unused}/14_const_enigma_news.ipynb (100%)
 rename reproducible/{ => unused}/15_rigged_caesar_news.ipynb (100%)
 rename reproducible/{ => unused}/16_const_enigma_news_cs.ipynb (100%)
 rename reproducible/{ => unused}/17_const_enigma_news_de.ipynb (100%)
 rename reproducible/{ => unused}/18_const_noisy_enigma_news_de.ipynb (100%)
 rename reproducible/{ => unused}/19_const_noisy_enigma_news_en.ipynb (100%)
 rename reproducible/{ => unused}/20_vignere_noisy_random_news_en.ipynb (100%)
 rename reproducible/{ => unused}/load_and_explore_model.ipynb (100%)

diff --git a/README.md b/README.md
index 52ab057..ab03981 100644
--- a/README.md
+++ b/README.md
@@ -1,9 +1,11 @@
 # Enigma Transformed 
 
 ## Abstract
-This project explores the possibility of using a pretrained language model to decrypt ciphers. The aim is also to discover what linguistic features of a text the model learns to use by varying the test set and measuring accuracy.
-
+We explore the possibility of using a pre-trained Transformer language model to decrypt ciphers. A further aim is to discover which linguistic features of a text the model learns to use, by measuring how error rates correlate with those features.
 
+1. create an evaluation dataset annotated with linguistic properties
+2. train a model on decipherment
+3. evaluate correlations between error rates and linguistic properties, and how well the properties predict errors
 
 ## Docs
 ### How to run 
@@ -14,7 +16,7 @@ pip install -e .
 ```
 #### Slurm cluster
 - basic setting: `sbatch -p gpu -c1 --gpus=1 --mem=16G <bash_script_path>`
-- use `run_notebook.sh <notebook_path>` to run a Jupyter notebook on a slurm cluster
+- use `./run_notebook.sh <notebook_path>` to run a Jupyter notebook on a Slurm cluster
 
 #### Colab
 - clone this repo and use the desired `.ipynb` files
@@ -23,17 +25,16 @@ pip install -e .
 !git clone https://github.com/JanProvaznik/enigma-transformed
 !pip install transformers[torch] Levenshtein py-enigma
 ```
-### Meta info
-- uses lowercased letters in all experiments
-- usually preserving spaces, punctuation in some
-- using the [Levenshtein distance](https://en.wikipedia.org/wiki/Levenshtein_distance) to measure the error rate of the model on an evaluation dataset
-- using the [statmt newscrawl](https://statmt.org/) dataset to obtain real world text for training and evaluation
-- using the [Huggingface Transformers library](https://huggingface.co/transformers/) running on [PyTorch](https://pytorch.org/)
-- using pretrained [ByT5](https://arxiv.org/abs/2105.13626) character level models and fine-tuning them on ciphers
 
 ### Source code
 #### reproducible/
-- for each experiment contians a notebook that can be used to reproduce it in a readable manner
+- notebooks for fine-tuning ByT5 on ciphers
+- used in the thesis:
+    - `21_vignere3_noisy_random_news_en.ipynb`, `22_vignere3_noisy_random_news_de.ipynb`, `23_vignere3_noisy_random_news_cs.ipynb` - fine-tune ByT5 to decrypt a Vigenère cipher with a random 3-letter key on news sentences
+    - `24_const_noisy_enigma_news_cs.ipynb`, `25_const_noisy_enigma_news_de.ipynb`, `26_const_noisy_enigma_news_en.ipynb` - fine-tune ByT5 to decrypt a simplified Enigma cipher on news sentences
+
+
+- old experiments in `unused/` 
     - `01_copy_random_text.ipynb` - trains model to copy on random strings
     - `02_copy_news.ipynb` - trains model to copy on news sentences
     - `03_caesar_random_text.ipynb` - trains model to decrypt constant caesar cipher (only one setting) on random strings
@@ -47,10 +48,22 @@ pip install -e .
     - `10_vignere3_news` - trains model to decrypt constant 3 letter vignere cipher on news sentences
     - `11_vignere_long_news` - trains model to decrypt constant vignere cipher with key 'helloworld' on news sentences
     - `12_vignere_multiple_news` - trains model to decrypt 2 letter vignere cipher, with 3 settings on news sentences
+    - ... and more
+
+#### data/
+- `weird_classify.ipynb` and `lang_classify.ipynb` - filter out sentences
+- `measure_dataset(cs,de).ipynb` - annotate linguistic properties of a dataset
+- `evaluation_batchedgpuevaluate_other_models.ipynb` - run inference to produce decipherments from different model checkpoints
 
+#### analysis/
+- `loss_curves.ipynb` - visualize training loss curves with error density at checkpoints
+- `corr_matrices.ipynb` - create correlation matrices of error rates and linguistic properties
+- `evo_correlation.ipynb` - graph the evolution of correlations between error rates and linguistic properties
+- `pred_shap.ipynb` - predict error rates with simple ML models and analyze them with SHAP
 
-#### run_notebook.sh
-- script for running a notebook on a slurm cluster 
+
+#### run_notebook.sh and run_notebook4gpu.sh
+- scripts for running training or inference notebooks on a Slurm cluster with GPUs
 
 
 #### src/
@@ -77,8 +90,7 @@ pip install -e .
 ##### `lens_train.py`
 - script to replicate reproducible/03 with [TransformerLens](https://github.com/neelnanda-io/TransformerLens) library and minimal amount of resources (only 1 layer transformer)
 
-
-### Usual experiment pipeline
+### What happens when training
 0. get data from the internet or generate it
 1. filter the data for the given experiment (e.g. only sentences 100-200 characters long)
 2. preprocess the data: only a-z + spaces, trim/pad to desired length
@@ -87,19 +99,26 @@ pip install -e .
 5. train the model on the training pairs
 6. save the model
 7. evaluate the performance of the model (during training and after training)
-    - e.g. edit distances
+
+### Meta info
+- uses lowercased letters in all experiments
+- Vigenère preserves spaces; Enigma replaces them with `X`
+- using the [Levenshtein distance](https://en.wikipedia.org/wiki/Levenshtein_distance) to measure the error rate of the model on an evaluation dataset
+- using the [statmt newscrawl](https://statmt.org/) dataset to obtain real world text for training and evaluation
+- using the [Huggingface Transformers library](https://huggingface.co/transformers/) running on [PyTorch](https://pytorch.org/)
+- using pretrained [ByT5](https://arxiv.org/abs/2105.13626) character level models and fine-tuning them on ciphers
 
 ### Training hyperparameters:
 #### number of training examples
 - the more the better (if model sees all cipher configuations it won't have to generalize the cipher procedure, but only detect which configuation is used and apply it)
 
 #### trainable parameters in model
-- the more the better, but we're limited by the GPU memory (and time), bigger models will use have harder time to use big batch sizes
+- the more the better, but we're limited by GPU memory (and time); bigger models will have a harder time using big batch sizes
 #### epochs
 - the more the better, but we're limited by the time we have
 
 #### batch size
-- if too low, model won't be able to learn any patterns
+- if too low, models won't be able to learn any patterns
 - generally the higher the better, but we're limited by the GPU memory 
     - trick: use gradient accumulation 
         - e.g. if we have batch size 16 and gradient accumulation 16 -> the effective batch size is 256
diff --git a/analysis/train_dynamics_analysis.ipynb b/analysis/evo_correlation.ipynb
similarity index 100%
rename from analysis/train_dynamics_analysis.ipynb
rename to analysis/evo_correlation.ipynb
diff --git a/reproducible/21_vignere3_noisy_random_news_en.ipynb b/reproducible/21_vignere3_noisy_random_news_en.ipynb
index c19d9d4..3aa21f8 100644
--- a/reproducible/21_vignere3_noisy_random_news_en.ipynb
+++ b/reproducible/21_vignere3_noisy_random_news_en.ipynb
@@ -6,7 +6,7 @@
    "metadata": {},
    "source": [
     "\n",
-    "# Vignere cipher (all possible settings, length 3) on news dataset"
+    "# Vigenère cipher (all possible settings, length 3) on EN news dataset"
    ]
   },
   {
diff --git a/reproducible/22_vignere3_noisy_random_news_de.ipynb b/reproducible/22_vignere3_noisy_random_news_de.ipynb
index b034a12..08558a3 100644
--- a/reproducible/22_vignere3_noisy_random_news_de.ipynb
+++ b/reproducible/22_vignere3_noisy_random_news_de.ipynb
@@ -6,7 +6,7 @@
    "metadata": {},
    "source": [
     "\n",
-    "# Vignere cipher (all possible settings, length 3) on news dataset"
+    "# Vigenère cipher (all possible settings, length 3) on DE news dataset"
    ]
   },
   {
diff --git a/reproducible/23_vignere3_noisy_random_news_cs.ipynb b/reproducible/23_vignere3_noisy_random_news_cs.ipynb
index 7cd6950..4b24934 100644
--- a/reproducible/23_vignere3_noisy_random_news_cs.ipynb
+++ b/reproducible/23_vignere3_noisy_random_news_cs.ipynb
@@ -6,7 +6,7 @@
    "metadata": {},
    "source": [
     "\n",
-    "# Vignere cipher (all possible settings, length 3) on news dataset"
+    "# Vigenère cipher (all possible settings, length 3) on CS news dataset"
    ]
   },
   {
diff --git a/reproducible/24_const_noisy_enigma_news_cs.ipynb b/reproducible/24_const_noisy_enigma_news_cs.ipynb
index 8716a59..41071d3 100644
--- a/reproducible/24_const_noisy_enigma_news_cs.ipynb
+++ b/reproducible/24_const_noisy_enigma_news_cs.ipynb
@@ -6,7 +6,7 @@
    "metadata": {},
    "source": [
     "\n",
-    "# Vignere cipher (all possible settings, length 3) on news dataset"
+    "# Enigma cipher on CS news dataset"
    ]
   },
   {
diff --git a/reproducible/25_const_noisy_enigma_news_de.ipynb b/reproducible/25_const_noisy_enigma_news_de.ipynb
index c2e2cff..ac4ff97 100644
--- a/reproducible/25_const_noisy_enigma_news_de.ipynb
+++ b/reproducible/25_const_noisy_enigma_news_de.ipynb
@@ -6,7 +6,7 @@
    "metadata": {},
    "source": [
     "\n",
-    "# Vignere cipher (all possible settings, length 3) on news dataset"
+    "# Enigma cipher on DE news dataset"
    ]
   },
   {
diff --git a/reproducible/26_const_noisy_enigma_news_en.ipynb b/reproducible/26_const_noisy_enigma_news_en.ipynb
index 384cbdb..c803b58 100644
--- a/reproducible/26_const_noisy_enigma_news_en.ipynb
+++ b/reproducible/26_const_noisy_enigma_news_en.ipynb
@@ -6,7 +6,7 @@
    "metadata": {},
    "source": [
     "\n",
-    "# Const noisy enigma on english"
+    "# Const noisy Enigma on English news dataset"
    ]
   },
   {
diff --git a/reproducible/01_copy_random_text.ipynb b/reproducible/unused/01_copy_random_text.ipynb
similarity index 100%
rename from reproducible/01_copy_random_text.ipynb
rename to reproducible/unused/01_copy_random_text.ipynb
diff --git a/reproducible/02_copy_news.ipynb b/reproducible/unused/02_copy_news.ipynb
similarity index 100%
rename from reproducible/02_copy_news.ipynb
rename to reproducible/unused/02_copy_news.ipynb
diff --git a/reproducible/03_caesar_random_text.ipynb b/reproducible/unused/03_caesar_random_text.ipynb
similarity index 100%
rename from reproducible/03_caesar_random_text.ipynb
rename to reproducible/unused/03_caesar_random_text.ipynb
diff --git a/reproducible/04_caesar_news.ipynb b/reproducible/unused/04_caesar_news.ipynb
similarity index 100%
rename from reproducible/04_caesar_news.ipynb
rename to reproducible/unused/04_caesar_news.ipynb
diff --git a/reproducible/05_triple_caesar_news.ipynb b/reproducible/unused/05_triple_caesar_news.ipynb
similarity index 100%
rename from reproducible/05_triple_caesar_news.ipynb
rename to reproducible/unused/05_triple_caesar_news.ipynb
diff --git a/reproducible/06_all_caesar_hint_random_text.ipynb b/reproducible/unused/06_all_caesar_hint_random_text.ipynb
similarity index 100%
rename from reproducible/06_all_caesar_hint_random_text.ipynb
rename to reproducible/unused/06_all_caesar_hint_random_text.ipynb
diff --git a/reproducible/07_all_caesar_hint_news.ipynb b/reproducible/unused/07_all_caesar_hint_news.ipynb
similarity index 100%
rename from reproducible/07_all_caesar_hint_news.ipynb
rename to reproducible/unused/07_all_caesar_hint_news.ipynb
diff --git a/reproducible/08_all_caesar_news.ipynb b/reproducible/unused/08_all_caesar_news.ipynb
similarity index 100%
rename from reproducible/08_all_caesar_news.ipynb
rename to reproducible/unused/08_all_caesar_news.ipynb
diff --git a/reproducible/09_vignere2_news.ipynb b/reproducible/unused/09_vignere2_news.ipynb
similarity index 100%
rename from reproducible/09_vignere2_news.ipynb
rename to reproducible/unused/09_vignere2_news.ipynb
diff --git a/reproducible/10_vignere3_news.ipynb b/reproducible/unused/10_vignere3_news.ipynb
similarity index 100%
rename from reproducible/10_vignere3_news.ipynb
rename to reproducible/unused/10_vignere3_news.ipynb
diff --git a/reproducible/11_vignere_long_news.ipynb b/reproducible/unused/11_vignere_long_news.ipynb
similarity index 100%
rename from reproducible/11_vignere_long_news.ipynb
rename to reproducible/unused/11_vignere_long_news.ipynb
diff --git a/reproducible/12_vignere_multiple_news.ipynb b/reproducible/unused/12_vignere_multiple_news.ipynb
similarity index 100%
rename from reproducible/12_vignere_multiple_news.ipynb
rename to reproducible/unused/12_vignere_multiple_news.ipynb
diff --git a/reproducible/13_vignere_random_news.ipynb b/reproducible/unused/13_vignere_random_news.ipynb
similarity index 100%
rename from reproducible/13_vignere_random_news.ipynb
rename to reproducible/unused/13_vignere_random_news.ipynb
diff --git a/reproducible/14_const_enigma_news.ipynb b/reproducible/unused/14_const_enigma_news.ipynb
similarity index 100%
rename from reproducible/14_const_enigma_news.ipynb
rename to reproducible/unused/14_const_enigma_news.ipynb
diff --git a/reproducible/15_rigged_caesar_news.ipynb b/reproducible/unused/15_rigged_caesar_news.ipynb
similarity index 100%
rename from reproducible/15_rigged_caesar_news.ipynb
rename to reproducible/unused/15_rigged_caesar_news.ipynb
diff --git a/reproducible/16_const_enigma_news_cs.ipynb b/reproducible/unused/16_const_enigma_news_cs.ipynb
similarity index 100%
rename from reproducible/16_const_enigma_news_cs.ipynb
rename to reproducible/unused/16_const_enigma_news_cs.ipynb
diff --git a/reproducible/17_const_enigma_news_de.ipynb b/reproducible/unused/17_const_enigma_news_de.ipynb
similarity index 100%
rename from reproducible/17_const_enigma_news_de.ipynb
rename to reproducible/unused/17_const_enigma_news_de.ipynb
diff --git a/reproducible/18_const_noisy_enigma_news_de.ipynb b/reproducible/unused/18_const_noisy_enigma_news_de.ipynb
similarity index 100%
rename from reproducible/18_const_noisy_enigma_news_de.ipynb
rename to reproducible/unused/18_const_noisy_enigma_news_de.ipynb
diff --git a/reproducible/19_const_noisy_enigma_news_en.ipynb b/reproducible/unused/19_const_noisy_enigma_news_en.ipynb
similarity index 100%
rename from reproducible/19_const_noisy_enigma_news_en.ipynb
rename to reproducible/unused/19_const_noisy_enigma_news_en.ipynb
diff --git a/reproducible/20_vignere_noisy_random_news_en.ipynb b/reproducible/unused/20_vignere_noisy_random_news_en.ipynb
similarity index 100%
rename from reproducible/20_vignere_noisy_random_news_en.ipynb
rename to reproducible/unused/20_vignere_noisy_random_news_en.ipynb
diff --git a/reproducible/load_and_explore_model.ipynb b/reproducible/unused/load_and_explore_model.ipynb
similarity index 100%
rename from reproducible/load_and_explore_model.ipynb
rename to reproducible/unused/load_and_explore_model.ipynb