The code has been slightly updated. The previous version is available at https://github.com/sangje-lee/covid-19-data-mining-old
The dataset, “The Covid-19 Open Research Dataset”, contains research papers related to COVID-19, mostly published from 2020 to 2022, with some papers dated earlier. The final dataset is roughly 18GB and consists of four columns: “cord_uid” (a unique identifier), “title”, “published_time”, and “abstract”. The analysis primarily involves the ‘title’ column, searching for keywords with a focus on "Patient", "Disease", “Vaccine”, and "Infect". For this analysis, a modified dataset containing only the research papers dated between 2020 and 2022 is used.
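As a rough illustration of the filtering described above, the sketch below streams rows and keeps only the titles dated 2020 to 2022 that mention one of the focus keywords. It is not the notebooks' code. The comma delimiter, the `published_time` values starting with a four-digit year, the sample rows, and the helper name `filter_titles` are all assumptions; only the column names come from the dataset description.

```python
import csv
import io

KEYWORDS = ("patient", "disease", "vaccine", "infect")

def filter_titles(rows, keywords=KEYWORDS, first_year=2020, last_year=2022):
    """Yield (cord_uid, title) for rows in the year range whose title has a keyword.

    Assumes published_time starts with a four-digit year (e.g. "2021-03-15").
    """
    for row in rows:
        year_text = (row.get("published_time") or "")[:4]
        if not year_text.isdigit():
            continue  # skip rows with a missing or malformed date
        if not first_year <= int(year_text) <= last_year:
            continue
        title = (row.get("title") or "").lower()
        # Substring match, so "infect" also matches "infection" and "infected".
        if any(keyword in title for keyword in keywords):
            yield row["cord_uid"], row["title"]

# Tiny in-memory sample standing in for the real file (made-up rows).
SAMPLE = io.StringIO(
    "cord_uid,title,published_time\n"
    "a1,Vaccine uptake among health workers,2021-05-02\n"
    "b2,Coronavirus infection in bats,2015-08-11\n"
    "c3,Lockdown and mental health,2020-11-30\n"
    "d4,Patient outcomes after infection,2022-01-19\n"
)

matches = list(filter_titles(csv.DictReader(SAMPLE)))
print(matches)
```

For the real file, the same function could be fed a `csv.DictReader` opened on the dataset path, so rows are processed one at a time instead of being loaded into memory at once. The delimiter would need to be adjusted to match the actual file.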
-
Goal: To find keywords in the titles of the research papers in the dataset. The keyword search primarily involves the 'title' column.
-
Method: Classification, Association Rules, Time Series, Decision Tree/Random Forest/Ensemble (Bagging), and Text/Data Mining
-
Tools: Anaconda (Conda) virtual environment with Python 3.9.12, NumPy, pandas, mlxtend, matplotlib, seaborn, sklearn, NLTK
Python (Jupyter Notebook) Script
-
Jupiter_Notebook_Final_Initial_Code_Part_01.ipynb ==> Related to text mining (Part 1)
-
Jupiter_Notebook_Final_Initial_Code_Part_02.ipynb ==> Related to classification. (Part 2)
-
Jupiter_Notebook_Final_Initial_Code_Part_03.ipynb ==> Related to modifying the dataset and bagging/ensemble model. (Part 3)
-
"html" files are available (GitHub/Google Drive), as well as "pdf" files (Google Drive)
-
Dataset_Without_Abstract.txt ==> Dataset (modified dataset) used for this project
Link to the modified dataset used in this analysis. It is not included in the GitHub repository because of its size.
https://drive.google.com/file/d/1NNlfCUdVFTk1ADTSio2Th2Bh42rvrLZc/view?usp=drive_link
Alternative (Google Drive), https://drive.google.com/drive/folders/1vxsa1mW9UbiiEtPavJDZQinDGFNIvZo2?usp=drive_link
-> Contains all the datasets, including the original dataset, an alternative dataset with the abstract column, and the script that converts the original dataset into the modified dataset used in this project.
- What are the main keywords in the title column, aside from dataset-specific terms and grammatical (function) words? (Part 1)
- How frequently do the main keywords appear in the title column? (Part 2)
- What keyword patterns are associated with an initial keyword? (Ex. Vaccine → Test) (Part 3)
- What is the probability of the keyword ‘infect’ appearing in the title column, given the other keywords and the time period? (Part 4)
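The frequency and pattern questions above can be illustrated with a small standard-library sketch. It counts how many titles contain each keyword and computes the support and confidence of a rule such as Vaccine → Test. The sample titles, the substring matching, and the helper names are assumptions made for illustration. The project lists NLTK and mlxtend among its tools, and this sketch uses neither.

```python
from collections import Counter

KEYWORDS = ("patient", "disease", "vaccine", "infect", "test")

def keyword_sets(titles, keywords=KEYWORDS):
    """Map each title to the set of keywords it contains (substring match)."""
    return [{k for k in keywords if k in title.lower()} for title in titles]

def keyword_frequency(transactions):
    """Count the number of titles in which each keyword appears."""
    counts = Counter()
    for found in transactions:
        counts.update(found)
    return counts

def rule_metrics(transactions, antecedent, consequent):
    """Return (support, confidence) for the rule antecedent -> consequent."""
    total = len(transactions)
    with_antecedent = sum(1 for t in transactions if antecedent in t)
    with_both = sum(
        1 for t in transactions if antecedent in t and consequent in t
    )
    support = with_both / total if total else 0.0
    confidence = with_both / with_antecedent if with_antecedent else 0.0
    return support, confidence

# Made-up titles, used only to exercise the functions.
titles = [
    "Vaccine efficacy and antibody test results",
    "Patient outcomes after infection",
    "Rapid test accuracy in vaccine trials",
    "Vaccine hesitancy among patients",
]

transactions = keyword_sets(titles)
frequency = keyword_frequency(transactions)
support, confidence = rule_metrics(transactions, "vaccine", "test")
print(frequency.most_common())
print(f"vaccine -> test: support={support:.2f}, confidence={confidence:.2f}")
```

Plain substring matching is a simplification: a keyword like "test" would also match unrelated words such as "latest", so tokenizing or stemming the titles first would likely give cleaner counts.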
- Clone/Download the repository.
- Run the code with Python 3.9.12 (preferably in a conda/Anaconda virtual environment).
- Warnings may pop up in VS Code or Jupyter Notebook.
- Upgrading to a newer Python version may introduce new errors, and the code may not run correctly.
Package Includes
- matplotlib
- nltk
- numpy
- pandas
- scikit-learn
- seaborn
- sklearn (the import name for scikit-learn)
Alternative (Google Drive - Previous version), https://drive.google.com/drive/u/1/folders/1fNlXaACqC11my3Tswibd8S_Sh1fChSgL
-> Contains all the datasets, including the original dataset, an alternative dataset with the abstract column, and the script that converts the original dataset into the modified dataset used in this project.
Python (Jupyter Notebook) Script (Inside Google Drive)
- Jupiter_Notebook_Final_Initial_Code_Part_00a_Modify_Dataset.ipynb ==> Converts the original dataset into the dataset analyzed in this project.
- Jupiter_Notebook_Final_Initial_Code_Part_00b_Initial_stage.ipynb ==> Related to my initial analysis.
Miscellaneous Files
- Output_order_fdist.csv (Inside Google Drive) ==> List of the words extracted from the ‘title’ column, with their frequencies.
- Output_By_Data_Mining "Folder" ==> Lists of the words extracted by category.
-> Output_order_fdist_by_2020.csv ==> Words extracted based on year of 2020
-> Output_order_fdist_by_2021.csv ==> Words extracted based on year of 2021
-> Output_order_fdist_by_2022.csv ==> Words extracted based on year of 2022
-> Output_order_fdist_lockdown.csv ==> Words extracted based on keyword ‘lockdown’
-> Output_order_fdist_test.csv ==> Words extracted based on keyword ‘test’
-> Output_order_fdist_vaccine.csv ==> Words extracted based on keyword ‘vaccine’
-> Output_order-fdist_vaccine_lockdown.csv ==> Words extracted based on keyword pattern ‘vaccine -> lockdown’
-> Output_order_fdist_vaccine_test.csv ==> Words extracted based on keyword pattern ‘vaccine -> test’
-> Output_order_fdist_worker.csv ==> Words extracted based on keyword ‘worker’
Sangjin (Eric S) Lee