The Mosaic memory of Large Language Models

Here we describe the code used to generate the results discussed in our manuscript.

(1) Setting up the environment

  • conda create --name mosaic python=3.9
  • conda activate mosaic
  • pip install -r requirements.txt

Some models used in this repository require you to be authenticated on Hugging Face (and to have accepted certain model licenses). If you wish to use these models, you will need to log in to Hugging Face by running huggingface-cli login and providing your token.

If you wish to report model training results to Weights and Biases (as is done by default in this repository), you will also need to provide a wandb token when prompted.

(2) Main functionality

2.1 Generating reference canaries.

Reference canaries are generated by sampling synthetic data from Llama-2-7B with a certain temperature. The code used to do so can be found in ./src/generate_canaries.py, alongside the script ./scripts/generate_canaries.sh.

Note that we use the same script to generate 'member' and 'non-member' canaries, using random seeds of 42 and 420, respectively. The temperature used to sample from the model's predicted probabilities can be set with the variable temp, and is set to 1 for our main results.

The resulting canaries are saved in a pickle file in the desired directory.
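
For intuition, a minimal sketch of this sampling step with the transformers library is shown below. The checkpoint name, number of canaries, canary length and output path are assumptions for illustration; ./src/generate_canaries.py is the actual implementation.

# Minimal sketch of temperature sampling for reference canaries (assumed
# parameters; see ./src/generate_canaries.py for the real script).
import pickle
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "meta-llama/Llama-2-7b-hf"  # requires Hugging Face login and an accepted license
SEED, TEMP, N_CANARIES, CANARY_LEN = 42, 1.0, 10, 100  # assumed values

torch.manual_seed(SEED)
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.float16, device_map="auto")

canaries = []
for _ in range(N_CANARIES):
    inputs = tokenizer("", return_tensors="pt").to(model.device)  # start from the BOS token only
    out = model.generate(
        **inputs,
        do_sample=True,          # sample from the predicted probabilities
        temperature=TEMP,        # temperature of 1 for the main results
        max_new_tokens=CANARY_LEN,
    )
    canaries.append(tokenizer.decode(out[0], skip_special_tokens=True))

with open("canaries_members.pkl", "wb") as f:  # assumed output path
    pickle.dump(canaries, f)

Non-member canaries are generated the same way with the seed set to 420.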

2.2 Generating fuzzy duplicates

Fuzzy duplicates can be created using ./src/generate_variations.py, as in the script ./scripts/generate_near_duplicates.sh.

The main parameters can be passed as arguments to the script:

  • 'candidate-gen-strategy' determines the strategy used to create the fuzzy duplicates. All results in the paper use the strategy mlm (replacing tokens by one sampled from the top-k predictions of a masked language model; see the sketch after this list) or mlm_random, i.e. mlm with k set to the vocabulary size.
  • 'topk' determines the pool from which the replacement tokens are sampled, where lower values correspond to more semantically meaningful replacements. Throughout most experiments we set topk to 10.
  • when 'no-replace-same-indices' is passed as an argument, the tokens are replaced at different indices in the reference canary for each of the 'num-variations'.
  • 'num-injection-points' determines the number of tokens to be replaced.
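
As a rough, hedged illustration of the mlm strategy: the masked language model below (roberta-base), the uniform sampling over the top-k candidates, and all parameter values are assumptions; the repository's actual logic lives in ./src/generate_variations.py.

# Sketch of MLM-based token replacement (assumed model and parameters;
# see ./src/generate_variations.py for the real implementation).
import random
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

MLM = "roberta-base"   # assumed masked language model
TOPK = 10              # pool of candidate replacements

tokenizer = AutoTokenizer.from_pretrained(MLM)
model = AutoModelForMaskedLM.from_pretrained(MLM)

def fuzzy_duplicate(text, num_injection_points=1):
    ids = tokenizer(text, return_tensors="pt")["input_ids"][0]
    # pick positions to replace (excluding the special tokens at both ends)
    positions = random.sample(range(1, len(ids) - 1), num_injection_points)
    for pos in positions:
        masked = ids.clone()
        masked[pos] = tokenizer.mask_token_id
        with torch.no_grad():
            logits = model(masked.unsqueeze(0)).logits[0, pos]
        topk_ids = torch.topk(logits, TOPK).indices
        ids[pos] = topk_ids[random.randrange(TOPK)]  # sample one of the top-k predictions (uniformly here, an assumption)
    return tokenizer.decode(ids, skip_special_tokens=True)

print(fuzzy_duplicate("The quick brown fox jumps over the lazy dog."))

With topk set to the vocabulary size, the same loop corresponds to the mlm_random strategy.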

2.3 Getting the books for finetuning.

We randomly sample 100 books from the dataset containing books from Project Gutenberg that are not part of PG-19, to be downloaded here.

For reproducibility, we also provide the exact books we used in this work in ./notebooks/recover_books.ipynb.

2.4 Injecting canaries into the dataset

We inject the reference canaries and their fuzzy duplicates into all the books. We provide the code to do so in ./notebooks/NearDuplicatesInjection.ipynb.
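
A toy, hedged illustration of what the injection amounts to (the insertion positions and formatting are assumptions; the notebook above is the reference implementation):

# Toy sketch of injecting canary sequences into book text at random positions
# (assumed details; see ./notebooks/NearDuplicatesInjection.ipynb for the actual procedure).
import random

def inject_canaries(book_text, canaries, seed=42):
    random.seed(seed)
    paragraphs = book_text.split("\n\n")
    for canary in canaries:
        pos = random.randrange(len(paragraphs) + 1)
        paragraphs.insert(pos, canary)   # drop the canary between two paragraphs
    return "\n\n".join(paragraphs)

book = "First paragraph of the book.\n\nSecond paragraph of the book."
print(inject_canaries(book, ["This is a reference canary."]))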

2.5 Model training

Now that we have the reference canaries injected into the dataset, we can finetune the target model. For each set of fuzzy duplicates, we repeat the finetuning to obtain a different target model to be used for membership inference.

The main code to finetune the target model is provided in ./src/fine_tune_model.py. We provide scripts to finetune all models for exact duplicates and fuzzy duplicates sequentially in ./scripts/finetune_gptneo_exact_dupls.sh and ./scripts/finetune_gptneo_near_dupls.sh.

The script takes as arguments the training hyperparameters, the training data (i.e. the books with injected duplicates), and the paths to the member and non-member reference canaries. It uses the latter to monitor the MIA AUC (using the Loss attack) during training.

By default, the script reports training metrics to Weights and Biases, which will require a login if you'd like to use this as well.

Other setups considered in the paper (different base model, learning rate, etc.) can easily be achieved by updating the parameters.

At the end of training, the final model is saved to the desired directory, which we will then use for final membership inference results.
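
For orientation, a minimal sketch of such a finetuning run with the Hugging Face Trainer is shown below. The base checkpoint, data file name and hyperparameters are assumptions, not the repository's exact settings (those live in ./src/fine_tune_model.py and the scripts).

# Minimal sketch of causal-LM finetuning on the canary-injected books
# (assumed checkpoint, file names and hyperparameters).
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

MODEL = "EleutherAI/gpt-neo-1.3B"   # assumed GPT-Neo size
tokenizer = AutoTokenizer.from_pretrained(MODEL)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(MODEL)

# books_with_canaries.txt: the canary-injected training data (assumed file name)
dataset = load_dataset("text", data_files={"train": "books_with_canaries.txt"})["train"]
dataset = dataset.map(lambda x: tokenizer(x["text"], truncation=True, max_length=512),
                      batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="./target_model", num_train_epochs=1,
                           per_device_train_batch_size=4, learning_rate=5e-5,
                           report_to="wandb"),   # Weights and Biases logging, as in the repo
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
trainer.save_model("./target_model")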

2.6 Membership inference

Now that we have the finetuned models, we can evaluate the MIA. We run this for each target model individually and sequentially, with example code provided in ./notebooks/membership_inference_example.ipynb.

The results are then saved as a pickle file, to be used for plotting.
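
For reference, the Loss attack boils down to using the sequence-level loss under the finetuned model as a membership score and computing the AUC over member and non-member canaries. A minimal, hedged sketch (model directory and pickle file names are assumptions; the notebook above is the reference):

# Sketch of the Loss attack: lower loss on a canary => more likely a member
# (assumed paths; see ./notebooks/membership_inference_example.ipynb).
import pickle
import torch
from sklearn.metrics import roc_auc_score
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("./target_model").eval()   # assumed path
tokenizer = AutoTokenizer.from_pretrained("./target_model")

def sequence_loss(text):
    ids = tokenizer(text, return_tensors="pt")["input_ids"]
    with torch.no_grad():
        return model(ids, labels=ids).loss.item()   # mean cross-entropy over tokens

members = pickle.load(open("canaries_members.pkl", "rb"))          # assumed path
non_members = pickle.load(open("canaries_non_members.pkl", "rb"))  # assumed path

# Use the negative loss as membership score: members should score higher.
scores = [-sequence_loss(t) for t in members + non_members]
labels = [1] * len(members) + [0] * len(non_members)
print("MIA AUC (Loss attack):", roc_auc_score(labels, scores))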

2.7 Plotting results

Example plotting results (for the main figures), including the computation of the custom metric rho, can be found in ./notebooks/metric_figure.ipynb.

(3) Secondary experiments

In our paper, we also provide a range of secondary experiments and results. We elaborate on the code used for that below.

3.1 Insertion of random tokens

The code to create the dataset and analyze the results is in ./notebooks/insertion.ipynb and ./notebooks/insertion_results.ipynb, respectively, while the associated script is ./scripts/finetune_gptneo_insertions.sh.

3.2 Shuffling of tokens

The code to create the dataset and analyze the results is in kendall_tau.ipynb and kendall_tau_results.ipynb, respectively, while the associated script is ./scripts/finetune_gptneo_shuffle.sh.
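
To give an idea of how a degree of shuffling can be quantified, here is a small illustration of Kendall's tau between the original and shuffled token positions. This is only an illustration of the metric the notebooks are named after, not their exact code.

# Illustration: Kendall's tau between original and shuffled token positions.
import random
from scipy.stats import kendalltau

original_order = list(range(20))        # token positions of a canary
shuffled_order = original_order.copy()
random.shuffle(shuffled_order)

tau, _ = kendalltau(original_order, shuffled_order)
print(f"Kendall's tau after a full shuffle: {tau:.2f}")   # close to 0 for a random permutation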

3.3 Paraphrasing

The paraphrases have been generated using ./src/paraphrase.py and ./scripts/generate_paraphrases.sh. The injection of paraphrases and the membership inference is then executed following the same process as for the other experiments.

3.4 Ablations

All ablations reported in the paper (learning rate, model size, temperature to generate reference canaries) have been generated using the main functionality of the code as described above.

(4) Finding fuzzy duplicates in a real-world dataset

This repo provides the code for analyzing near-duplicates present in the SlimPajama dataset. It is a multi-step process, best done on a machine with at least 80 CPUs and enough memory to fit a significant portion of the dataset.

This code is derived from https://github.com/google-research/deduplicate-text-datasets

Step 1: Build the Rust code

Install Rust:

curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh

Build:

cargo build

Step 2: Download and process SlimPajama

The entire dataset is ~900GB in size, and the deduplication process requires several times the dataset size to fit into memory. We split the dataset into 20 subsets, run the deduplication on each subset individually, and then merge the highly repeated sequences.

First, you will need to download SlimPajama. While this can be done as part of the processing script, we find pre-downloading the data to be more stable.

git clone the repo into the directory of your choosing (it will download ~900GB of data):

cd /path/to/slimpajama && git clone https://huggingface.co/datasets/cerebras/SlimPajama-627B

Then, run the processing script, tokenizing the dataset and converting it into the format for deduplication:

python py_src/dedupe/load_slimpajama.py \
--slimpajama-path /path/to/slimpajama \
--save-dir /working/dir/tokenized \
--name slimpajama \
--tokenize \
--parts 20 \
--num-proc 80

This will take a while: Hugging Face will first cache the dataset for faster random access, and then the tokenization will be launched.

It should create 20 tokenized subsets in the /working/dir/tokenized directory, e.g. slimpajama_0_of_20.train, together with slimpajama_0_of_20.train.size containing the document sizes.

Step 3: Find (exact) duplicates

We'll now build suffix arrays for deduplication. Update /working/dir/ in scripts/all_suffix_arrays.sh and scripts/all_self_similar.sh with the actual directory containing the tokenized SlimPajama.

bash scripts/all_suffix_arrays.sh

bash scripts/all_self_similar.sh

This step is the longest (roughly 3-4 hours per subset, for 20 subsets) and the most resource-intensive.

As a result, you'll get a number of binary files in /working/dir/caches/cache* containing positions and counts of all exact duplicates in each subset.

Step 4: Prepare for near-duplicate scan.

Scanning for near-duplicates by directly computing Hamming/Levenshtein distances is extremely expensive and cannot be done at the scale of the full dataset. We make several assumptions that allow us to reduce the computational cost to a feasible level.
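
For intuition, a direct scan for near-duplicates of a single target sequence boils down to something like the toy sketch below (illustration only, with assumed sizes and threshold; the actual scan in py_src/near_duplicates/scan.py is far more optimized).

# Toy sliding-window Hamming-distance scan for near-duplicates of a target sequence.
import numpy as np

def near_duplicate_positions(tokens, target, max_mismatches=10):
    tokens, target = np.asarray(tokens), np.asarray(target)
    L = len(target)
    # all windows of length L as a (num_windows, L) view
    windows = np.lib.stride_tricks.sliding_window_view(tokens, L)
    mismatches = (windows != target).sum(axis=1)   # Hamming distance per window
    return np.flatnonzero(mismatches <= max_mismatches)

corpus = np.random.randint(0, 50257, size=10_000)   # fake tokenized chunk
target = corpus[1234:1334].copy()
target[::10] += 1                                   # perturb every 10th token
print(near_duplicate_positions(corpus, target))     # should include position 1234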

We select a small group of "target sequences" and scan the dataset only for near-duplicates of these sequences. We sample target sequences based on the number of times they are duplicated exactly in the dataset.

First, we sample a number of exact duplicates present in one of the chunks (chunk0) and, for each, find the number of times it is duplicated in the entire dataset, across all chunks.

mkdir -p /working/dir/queries/

# --target-counts 5 50 500 - roughly corresponding to 100, 1000, 10_000 final buckets (20x chunks)
# --length 100 - in tokens
# --n-per-bucket 10000 - this step is relatively cheap, so it's better to overshoot
# --bytes-per-record 5 - depending on the dataset size, byte-packing allocates a different number of bytes per index;
# for SlimPajama it's 5

python py_src/near_duplicates/build_query.py \
--output-dir /working/dir/queries/ \
--dups-dir /working/dir/caches/cache0/ \
--ds-path /working/dir/tokenized/slimpajama_0_of_20.train \
--target-counts 5 50 500 \
--length 100 \
--n-per-bucket 10000 \
--bytes-per-record 5

This creates two files in /working/dir/queries/: a sorted list of positions (positions.pkl) and a log file from calling count-occurences-multi, where one line corresponds to one input position.

We now sample the required number of target sequences, given the full-dataset duplication counts.

python py_src/near_duplicates/build_targets.py \
--output-path /working/dir/target_sequences.pkl \
--positions-path /working/dir/queries/positions.pkl \
--counts-dir /working/dir/queries/counts/ \
--ds-path /working/dir/tokenized/slimpajama_0_of_20.train \
--target-buckets 100 1000 10000 \
--target-bucket-tolerance 0.01 \
--length 100 \
--n-per-bucket 100 \
--uniq-token-min 50

Finally, we launch the scan. Again, for computational reasons we only scan one chunk (5% of the dataset) and extrapolate the results.

python py_src/near_duplicates/scan.py \
--ds-path /working/dir/tokenized/slimpajama_0_of_20.train \
--save-dir /working/dir/scan/ \
--target-sequences-path /working/dir/target_sequences.pkl

This produces a number of files in /working/dir/scan/, which is all that is needed to build the final graphs using plot_near_duplicates.py.
