This repository contains the code used for our project in the course 02456 Deep Learning at the Technical University of Denmark (DTU).
First, clone the repository:

```bash
# clone project
git clone https://github.com/A-Lohse/deeplearningproject
# enter the project directory
cd deeplearningproject
```
To generate and run most outputs and models, you will have to download the embedding tensors from sentence-BERT (including a finetuned version), as these are too big to store on GitHub (links below). Place the tensors in the directory `data/processed/`.
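Once downloaded, the tensors can be loaded before training or evaluation. A minimal sketch, assuming the tensors were serialized with `torch.save` (the filename is hypothetical):

```python
import torch

# Hypothetical filename -- use the actual names of the downloaded tensors.
# Assumes the tensors were serialized with torch.save.
embeddings = torch.load("data/processed/sbert_embeddings.pt")
print(embeddings.shape)  # e.g. (num_bills, embedding_dim)
```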
If you just want to replicate the plots and tables presented in the paper, then

```bash
# src folder
cd make_plot
```

and run `baseline_models_and_plots.py`, which loads the trained models from the directory `/trained_models`. All the models can be found on Google Drive (link). The script prints metrics to the console and creates plots and tables in `/plots_tables`.
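For orientation, the kind of metrics the script prints can be computed with scikit-learn. A minimal sketch with made-up numbers (the actual metric set is defined in `baseline_models_and_plots.py`):

```python
from sklearn.metrics import accuracy_score, roc_auc_score

# Hypothetical targets and predicted probabilities for illustration only.
y_true = [0, 1, 1, 0]
y_prob = [0.2, 0.7, 0.6, 0.4]
y_pred = [int(p >= 0.5) for p in y_prob]  # threshold at 0.5

print("accuracy:", accuracy_score(y_true, y_pred))
print("ROC AUC:", roc_auc_score(y_true, y_prob))
```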
If you instead want to train the models, you can run the following command:

```bash
# module folder
python3 -m src.train_sbert_downstream
```

where the flag `--finetuned_embeddings` indicates whether the finetuned embeddings should be used. The standard BERT model can be trained using the notebook `finetuning-BERT.ipynb`.
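To give an idea of what the downstream task looks like, here is a minimal sketch of a feed-forward classifier over precomputed embeddings. The class name, layer sizes, and dimensions are illustrative assumptions, not the project's actual architecture (see `src/train_sbert_downstream` for that):

```python
import torch
import torch.nn as nn

class DownstreamFNN(nn.Module):
    """Hypothetical feed-forward classifier over precomputed S-BERT embeddings."""

    def __init__(self, embedding_dim: int = 768, hidden_dim: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(embedding_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),  # binary output: enacted or not
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x).squeeze(-1)  # raw logits; apply sigmoid for probabilities

logits = DownstreamFNN()(torch.randn(4, 768))  # 4 hypothetical bill embeddings
```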
Several modules under `/src/prepare_data/` are used to prepare the data for our models. This includes data cleaning, finetuning both sentence-BERT and vanilla BERT, and extracting document embeddings. Below follows an overview of what they do.
1. Generating metadata
Apart from the bill text, we include the following metadata:

- `bill_status` (outcome variable): dummy of bill status (1 if enacted, 0 otherwise)
- `cosponsors`: integer count of cosponsors
- `majority`: dummy for whether the bill-proposing party is in the majority
- `party`: party dummy
- `gender`: dummy for whether the bill-proposing politician is male or female
The data comes from the Congressional Bills Project; the original data can be downloaded here and is prepared using the script `generate_metadata.py`.
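A minimal sketch of how such dummy variables could be derived, assuming pandas and hypothetical raw column names (the actual field names in the Congressional Bills Project data differ; see `generate_metadata.py` for the real preprocessing):

```python
import pandas as pd

# Hypothetical input file and column names, for illustration only.
bills = pd.read_csv("data/raw/bills.csv")

meta = pd.DataFrame({
    "bill_status": (bills["status"] == "enacted").astype(int),   # 1 if enacted
    "cosponsors": bills["num_cosponsors"].astype(int),           # integer count
    "majority": (bills["sponsor_party"] == bills["majority_party"]).astype(int),
    "party": (bills["sponsor_party"] == "D").astype(int),        # party dummy
    "gender": (bills["sponsor_gender"] == "F").astype(int),      # gender dummy
})
```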
2. Generating finetuning and embedding extraction data for BERT/S-BERT
The bill text data used to finetune BERT and extract bill embeddings comes from the BillSum project (Kornilova & Eidelman, 2019). Specifically, the two data files `data/raw/us_train_sent_scores.pkl` and `data/raw/us_train_sent_scores.pkl` are used. The module `generate_bert_finetuning_data.py` extracts the relevant text from the BillSum data and merges it with the bill metadata, including whether the bill was enacted, through the unique bill ID.
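A minimal sketch of that merge step, assuming pandas, that the pickled files deserialize to DataFrames, and that they share a `bill_id` column (the key name, metadata path, and output path are assumptions):

```python
import pandas as pd

# Assumes the pickled BillSum file holds a DataFrame; the metadata path
# and the "bill_id" key are hypothetical.
text_df = pd.read_pickle("data/raw/us_train_sent_scores.pkl")
meta_df = pd.read_pickle("data/processed/metadata.pkl")

# Join bill text with metadata (including bill_status) on the unique bill ID.
merged = text_df.merge(meta_df, on="bill_id", how="inner")
merged.to_pickle("data/processed/bert_finetuning_data.pkl")  # hypothetical output
```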
3. Finetuning sentence-BERT
A Python script has been prepared for finetuning sentence-BERT. It can be found in `/src/prepare_data/fine-tuning_SBERT.py`. When running the script, the fine-tuned model is stored locally and validation metrics are printed each epoch.
We have made our final fine-tuned model accessible through Google Drive. The zip file contains a README explaining how to use the model.
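For orientation, this is roughly what finetuning looks like with the `sentence-transformers` library. The base model, training pair, label, and loss below are placeholder assumptions; the project's actual setup lives in `fine-tuning_SBERT.py`:

```python
from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader

# Base model name, training pair, label, and loss are illustrative assumptions.
model = SentenceTransformer("all-MiniLM-L6-v2")

train_examples = [
    InputExample(texts=["sentence a", "sentence b"], label=0.8),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
train_loss = losses.CosineSimilarityLoss(model)

model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1)
model.save("models/finetuned-sbert")  # hypothetical output path
```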
4. Extracting bill embeddings
To extract the bill embeddings that we feed to the downstream tasks, we pass the data prepared in step 2 through sentence-BERT.
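A minimal sketch of the extraction, assuming the `sentence-transformers` API; the model path, input sentences, and output filename are hypothetical:

```python
from sentence_transformers import SentenceTransformer
import torch

# Hypothetical model path and inputs; the real pipeline uses the data from step 2.
model = SentenceTransformer("models/finetuned-sbert")

bill_sentences = ["Section 1. Short title.", "Section 2. Definitions."]
embeddings = model.encode(bill_sentences, convert_to_tensor=True)

torch.save(embeddings, "data/processed/bill_embeddings.pt")  # hypothetical filename
```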
Extra: getting results for plots and tables
If you wish to train new models and create new results, plots, and tables, prepare the data as described above, then:
5. Train models
Train the models and place them in `/trained_models`. Make sure they are named with "meta" and "CNN" or "FNN", as well as "avg" if you average the sentence embeddings in the FNN. This ensures that the models are loaded correctly in the next step.
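A minimal sketch of saving a checkpoint under this naming scheme; the model itself is just a placeholder:

```python
import torch
import torch.nn as nn

# Hypothetical model and filename. The filename encodes the model type via the
# substrings "meta", "CNN"/"FNN", and "avg", so the prediction step can load it.
model = nn.Linear(768, 1)  # placeholder for an actual trained model
torch.save(model.state_dict(), "trained_models/FNN_avg_meta.pt")
```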
6. Predict on data
Run `make_predictions.py` in `/src/prepare_data/`. This will create a `predictions.pkl` file in the `data/results` folder. The file contains a dictionary with all the model names as keys; for each model it stores the targets, predictions, probabilities, false/true positive rates, and the precision-recall curve. This file is used for plotting and creating tables.
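A minimal sketch of inspecting the output; the exact field names inside each entry are assumptions based on the description above:

```python
import pickle

# Load the predictions produced by make_predictions.py.
with open("data/results/predictions.pkl", "rb") as f:
    predictions = pickle.load(f)

# Keys are model names; each value holds targets, predictions, probabilities,
# rates, and curve data (exact field names are assumptions).
for model_name, results in predictions.items():
    print(model_name, list(results.keys()))
```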
References

Kornilova, A., & Eidelman, V. (2019). BillSum: A corpus for automatic summarization of US legislation. arXiv preprint arXiv:1910.00523.