This repository contains the code used for our project in the course 02456 Deep Learning at the Technical University of Denmark (DTU).
First, clone the repository:

```bash
# clone project
git clone https://github.com/A-Lohse/deeplearningproject
# enter the project directory
cd deeplearningproject
```
To generate and run most outputs and models, you will have to download the embedding tensors from sentence-BERT (including a finetuned version), as these are too big to store on GitHub (links below). Place the tensors in the directory `data/processed/`.
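Once downloaded, the tensors can be loaded before training or evaluation. A minimal sketch, assuming the tensors were serialized with `torch.save` (the filename is hypothetical):

```python
import torch

# Hypothetical filename -- use the actual names of the downloaded tensors.
# Assumes the tensors were serialized with torch.save.
embeddings = torch.load("data/processed/sbert_embeddings.pt")
print(embeddings.shape)  # e.g. (num_bills, embedding_dim)
```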
If you just want to replicate the plots and tables presented in the paper, then

```bash
# src folder
cd make_plot
```

and run `baseline_models_and_plots.py`, which loads the trained models from the directory `/trained_models`. All the models can be found on Google Drive (link). The script prints metrics to the console and creates plots and tables in `/plots_tables`.
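For orientation, the kind of metrics the script prints can be computed with scikit-learn. A minimal sketch with made-up numbers (the actual metric set is defined in `baseline_models_and_plots.py`):

```python
from sklearn.metrics import accuracy_score, roc_auc_score

# Hypothetical targets and predicted probabilities for illustration only.
y_true = [0, 1, 1, 0]
y_prob = [0.2, 0.7, 0.6, 0.4]
y_pred = [int(p >= 0.5) for p in y_prob]  # threshold at 0.5

print("accuracy:", accuracy_score(y_true, y_pred))
print("ROC AUC:", roc_auc_score(y_true, y_prob))
```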
If you instead want to train the models, you can run the following command:

```bash
# module folder
python3 -m src.train_sbert_downstream
```

where the flag `--finetuned_embeddings` indicates whether the finetuned embeddings should be used. The standard BERT model can be trained using the notebook `finetuning-BERT.ipynb`.
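To give an idea of what the downstream task looks like, here is a minimal sketch of a feed-forward classifier over precomputed embeddings. The class name, layer sizes, and dimensions are illustrative assumptions, not the project's actual architecture (see `src/train_sbert_downstream` for that):

```python
import torch
import torch.nn as nn

class DownstreamFNN(nn.Module):
    """Hypothetical feed-forward classifier over precomputed S-BERT embeddings."""

    def __init__(self, embedding_dim: int = 768, hidden_dim: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(embedding_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),  # binary output: enacted or not
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x).squeeze(-1)  # raw logits; apply sigmoid for probabilities

logits = DownstreamFNN()(torch.randn(4, 768))  # 4 hypothetical bill embeddings
```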
Several modules under `/src/prepare_data/` are used to prepare the data for our models. This includes data cleaning, finetuning both sentence-BERT and vanilla BERT, and extracting document embeddings. Below follows an overview of what they do.
1. Generating metadata
Apart from the bill text, we include the following metadata:

- `bill_status` (outcome variable): dummy of bill status (1 if enacted, 0 otherwise)
- `cosponsors`: integer count of cosponsors
- `majority`: dummy for whether the bill-proposing party is in the majority
- `party`: party dummy
- `gender`: dummy for whether the bill-proposing politician is male or female
The data comes from the Congressional Bills Project; the original data can be downloaded here and is prepared using the script `generate_metadata.py`.
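A minimal sketch of how such dummy variables could be derived, assuming pandas and hypothetical raw column names (the actual field names in the Congressional Bills Project data differ; see `generate_metadata.py` for the real preprocessing):

```python
import pandas as pd

# Hypothetical input file and column names, for illustration only.
bills = pd.read_csv("data/raw/bills.csv")

meta = pd.DataFrame({
    "bill_status": (bills["status"] == "enacted").astype(int),   # 1 if enacted
    "cosponsors": bills["num_cosponsors"].astype(int),           # integer count
    "majority": (bills["sponsor_party"] == bills["majority_party"]).astype(int),
    "party": (bills["sponsor_party"] == "D").astype(int),        # party dummy
    "gender": (bills["sponsor_gender"] == "F").astype(int),      # gender dummy
})
```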
2. Generating finetuning and embedding extraction data for BERT/S-BERT
The bill text data used to finetune BERT and extract bill embeddings comes from the BillSum project (Kornilova & Eidelman, 2019). Specifically, the two data files `data/raw/us_train_sent_scores.pkl` and `data/raw/us_train_sent_scores.pkl` are used. The module `generate_bert_finetuning_data.py` extracts the relevant text from the BillSum data and merges it with the bill metadata, including whether the bill was enacted, through the unique bill ID.
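A minimal sketch of that merge step, assuming pandas, that the pickled files deserialize to DataFrames, and that they share a `bill_id` column (the key name, metadata path, and output path are assumptions):

```python
import pandas as pd

# Assumes the pickled BillSum file holds a DataFrame; the metadata path
# and the "bill_id" key are hypothetical.
text_df = pd.read_pickle("data/raw/us_train_sent_scores.pkl")
meta_df = pd.read_pickle("data/processed/metadata.pkl")

# Join bill text with metadata (including bill_status) on the unique bill ID.
merged = text_df.merge(meta_df, on="bill_id", how="inner")
merged.to_pickle("data/processed/bert_finetuning_data.pkl")  # hypothetical output
```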
3. Finetuning sentence-BERT
A Python script has been prepared for finetuning sentence-BERT. It can be found in `/src/prepare_data/fine-tuning_SBERT.py`. When running the script, the fine-tuned model is stored locally and validation metrics are printed each epoch.
We have made our final fine-tuned model accessible through Google Drive. The zip file contains a README explaining how to use the model.
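For orientation, this is roughly what finetuning looks like with the `sentence-transformers` library. The base model, training pair, label, and loss below are placeholder assumptions; the project's actual setup lives in `fine-tuning_SBERT.py`:

```python
from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader

# Base model name, training pair, label, and loss are illustrative assumptions.
model = SentenceTransformer("all-MiniLM-L6-v2")

train_examples = [
    InputExample(texts=["sentence a", "sentence b"], label=0.8),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
train_loss = losses.CosineSimilarityLoss(model)

model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1)
model.save("models/finetuned-sbert")  # hypothetical output path
```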
4. Extracting bill embeddings
To extract the bill embeddings that we feed to the downstream tasks, we pass the data prepared in step 2 through sentence-BERT.
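A minimal sketch of the extraction, assuming the `sentence-transformers` API; the model path, input sentences, and output filename are hypothetical:

```python
from sentence_transformers import SentenceTransformer
import torch

# Hypothetical model path and inputs; the real pipeline uses the data from step 2.
model = SentenceTransformer("models/finetuned-sbert")

bill_sentences = ["Section 1. Short title.", "Section 2. Definitions."]
embeddings = model.encode(bill_sentences, convert_to_tensor=True)

torch.save(embeddings, "data/processed/bill_embeddings.pt")  # hypothetical filename
```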
Extra: getting results for plots and tables
If you wish to train new models and create new results, plots, and tables, prepare the data as described above, then:
5. Train models
Train the models and place them in `/trained_models`. Make sure they are named with "meta" and "CNN" or "FNN", as well as "avg" if you average the sentence embeddings in the FNN. This ensures that the models are loaded correctly in the next step.
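A minimal sketch of saving a checkpoint under this naming scheme; the model itself is just a placeholder:

```python
import torch
import torch.nn as nn

# Hypothetical model and filename. The filename encodes the model type via the
# substrings "meta", "CNN"/"FNN", and "avg", so the prediction step can load it.
model = nn.Linear(768, 1)  # placeholder for an actual trained model
torch.save(model.state_dict(), "trained_models/FNN_avg_meta.pt")
```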
6. Predict on data
Run `make_predictions.py` in `/src/prepare_data/`. This will create a `predictions.pkl` file in the `data/results` folder. The file contains a dictionary with all the model names as keys; for each model it stores the targets, predictions, probabilities, false/true positive rates, and the precision-recall curve. This file is used for plotting and creating tables.
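A minimal sketch of inspecting the output; the exact field names inside each entry are assumptions based on the description above:

```python
import pickle

# Load the predictions produced by make_predictions.py.
with open("data/results/predictions.pkl", "rb") as f:
    predictions = pickle.load(f)

# Keys are model names; each value holds targets, predictions, probabilities,
# rates, and curve data (exact field names are assumptions).
for model_name, results in predictions.items():
    print(model_name, list(results.keys()))
```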
References

Kornilova, A., & Eidelman, V. (2019). BillSum: A corpus for automatic summarization of US legislation. arXiv preprint arXiv:1910.00523.