In this project, we study the dataset provided by Quora in its Kaggle competition in order to detect duplicate questions. We fine-tune pretrained transformer models from the Hugging Face library, present results for several models (BERT, XLNet, DistilBERT, ...) and the hyperparameter combinations used with each, and finally explore what the resulting sentence embeddings capture.
The dataset is taken from the Quora Question Pairs competition on Kaggle:
https://www.kaggle.com/c/quora-question-pairs
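
As a minimal sketch of this fine-tuning setup (the model name, hyperparameters, and example pair below are illustrative assumptions, not the exact configuration used in the report), a question pair can be encoded as a single sequence and passed to a Hugging Face sequence-classification head:

```python
# Illustrative sketch only: fine-tuning a pretrained transformer on one
# question pair. Model name and hyperparameters are assumptions; see the
# report for the configurations actually evaluated.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2  # duplicate / not duplicate
)

# Encode both questions as one sequence (question1 [SEP] question2).
enc = tokenizer(
    "How can I learn Python quickly?",
    "What is the fastest way to learn Python?",
    truncation=True,
    padding="max_length",
    max_length=128,
    return_tensors="pt",
)

# One optimization step on this pair; in practice, batches come from train.csv.
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
labels = torch.tensor([1])  # 1 = duplicate, 0 = not duplicate
loss = model(**enc, labels=labels).loss
loss.backward()
optimizer.step()
```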
Repository structure:

- plots: Folder containing the plots generated by class_visualization.ipynb
- report: Deliverables
  - imgs: Images used in POE_Final_Project_Quora_CanillasRubies.pdf
  - Hyperparameters_Study.pdf: Table of the hyperparameter experiments
  - POE_Final_Project_Quora_CanillasRubies.pdf: Deliverable report
  - POE_Initial_Plan.pdf: First deliverable
  - Presentacio-XavierDanae.pdf: Intermediate project presentation
- src: Folder containing the script files
  - data: CSV files
    - train.csv: Raw data
    - sentences.csv: Table with the questions and their tokenizations (from BERT)
  - class-consistency.ipynb: Prediction consistency study
  - class_visualization.ipynb: Generates the plots
  - data_analysis.ipynb: Data inference
  - input_net.py: Generates the model input
  - main.py: Model training and validation
  - most_similar_sentence.ipynb: Most similar sentence search (see the embedding sketch below)
  - table_generation.ipynb: Generates sentences.csv
  - utils.py: Auxiliary functions
- .gitignore: Untracked files
- README.md: Project documentation
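
As an illustration of the kind of search performed in most_similar_sentence.ipynb (a hedged sketch under assumed choices of model and pooling, not the notebook's actual code), sentence embeddings can be obtained by mean-pooling a transformer's last hidden states and compared with cosine similarity:

```python
# Sketch of a most-similar-sentence search via sentence embeddings.
# Model name and mean pooling are assumptions, not necessarily what
# most_similar_sentence.ipynb does.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def embed(sentences):
    """Mean-pool the last hidden states into one vector per sentence."""
    enc = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state       # (batch, seq_len, dim)
    mask = enc["attention_mask"].unsqueeze(-1)        # ignore padding tokens
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)

corpus = [
    "How do I learn Python quickly?",
    "What is the capital of France?",
]
query = embed(["What is the fastest way to learn Python?"])
vectors = embed(corpus)

# Cosine similarity between the query and every corpus sentence.
scores = torch.nn.functional.cosine_similarity(query, vectors)
print(corpus[scores.argmax().item()])  # -> the Python-learning question
```

The same idea extends to the full set of Quora questions by precomputing the embedding of every question and ranking them against a query embedding.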