Bachelor Thesis Project - Document classification based on topic definitions

This repository holds the contents developed for my bachelor thesis, titled "Clasificación de documentos basada en definiciones de categorías" (document classification based on category definitions). It is structured into two main folders:
- lib: Contains the Python modules developed for the library: the logic and functions for text preprocessing, data extraction and models. It includes the modules doc_utils.py, arxiv_parser.py, wiki_parser.py and max_sim_classifier.py.
- notebooks: Contains .ipynb notebooks with practical examples of using the library, as well as some of the results obtained. For the easiest execution, run them inside Google Colaboratory and simply upload the modules contained in the "lib" folder.
How to view or run the sample notebooks in Google Colaboratory

- Download the desired notebook from notebooks\********.ipynb
- Go to Google Colaboratory
- Go to File > Open notebook > Upload
- Select the downloaded notebook and launch it
List of modules and contents
Text preprocessing and vectorization module: "doc_utils.py"

Contains the following utilities and auxiliary functions for the project (documented inside the code):

- Data cleaning / vectorization / BoW: `prepare_corpus()`, `prepare_train_articles()`, `cleanText()`, `vectSeq()`
- Data processing for classifier inputs: `processNeuralNetData()`, `processClassifierData()`
- Classification metrics / evaluation: `top2acc()`, `plotConfMatrix()`, `plotDefinitionsLength()`
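The exact signatures live in the module's docstrings; purely as an illustration, a cleaning and bag-of-words step of the kind `cleanText()` and `vectSeq()` cover might look like the sketch below. All names and parameters here are assumptions for illustration, not the library's API:

```python
import re
from collections import Counter

def clean_text(text):
    """Lowercase, strip non-letter characters and collapse whitespace."""
    text = re.sub(r"[^a-z\s]", " ", text.lower())
    return re.sub(r"\s+", " ", text).strip()

def build_vocab(corpus, max_words=1000):
    """Map the most frequent tokens to integer indices."""
    counts = Counter(tok for doc in corpus for tok in doc.split())
    return {w: i for i, (w, _) in enumerate(counts.most_common(max_words))}

def vectorize(doc, vocab):
    """Binary bag-of-words vector over the shared vocabulary."""
    vec = [0.0] * len(vocab)
    for tok in doc.split():
        if tok in vocab:
            vec[vocab[tok]] = 1.0
    return vec

docs = [clean_text(t) for t in ["Deep learning, for text!", "Text classification."]]
vocab = build_vocab(docs)
bow = [vectorize(d, vocab) for d in docs]
```

The resulting fixed-length vectors are the typical input format for the classifiers described below.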
Building the datasets: Wikipedia and arXiv crawlers

- Wikipedia module: "wiki_parser.py": `getWikiSummaries()`, `getWikiFullPage()`, `concurrentGetWikiFullPage()`, `getCatMembersList()`, `getCatMembersTexts()`, `getAllCatArticles()`, `concurrentGetAllCatArticles()`
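The `concurrent*` variants suggest the page fetches are parallelized; a minimal sketch of that pattern with `concurrent.futures`, using a stub in place of the real Wikipedia request (all names here are illustrative, not the module's API):

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_page(title):
    """Stand-in for a real Wikipedia request (network call omitted)."""
    return f"Full text of '{title}'"

def concurrent_fetch(titles, max_workers=8):
    """Fetch many pages in parallel; pool.map preserves input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fetch_page, titles))

pages = concurrent_fetch(["Machine learning", "Astrophysics"])
```

Threads are a reasonable choice here because the workload is I/O-bound (waiting on HTTP responses), so the GIL is not a bottleneck.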
- ArXiv module: "arxiv_parser.py": `init_parser()`, `arxiv_parser()`
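The arXiv export API (`http://export.arxiv.org/api/query`) returns Atom XML feeds, so a parser like `arxiv_parser()` presumably extracts fields such as titles and abstracts from that format. A sketch of that parsing step on an inline sample feed (the module's actual behavior and signatures may differ):

```python
import xml.etree.ElementTree as ET

ATOM_NS = "{http://www.w3.org/2005/Atom}"

# Inline sample mimicking an arXiv API response; in practice this
# string would come from an HTTP request to the export API.
SAMPLE_FEED = """<?xml version="1.0"?>
<feed xmlns="http://www.w3.org/2005/Atom">
  <entry>
    <title>A sample paper</title>
    <summary>Abstract text here.</summary>
  </entry>
</feed>"""

def parse_entries(feed_xml):
    """Return (title, abstract) pairs from an Atom feed string."""
    root = ET.fromstring(feed_xml)
    return [(e.find(ATOM_NS + "title").text, e.find(ATOM_NS + "summary").text)
            for e in root.iter(ATOM_NS + "entry")]

entries = parse_entries(SAMPLE_FEED)
```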
Maximum Similarity Classifier module: "max_sim_classifier.py"
- `MaxSimClassifier()`: classifier class compatible with scikit-learn, providing fitting, inference and label-propagation functionality through the following methods: `fit()`, `fit_articles()`, `predict()`, `score()`, `pseudo_label()`
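The class itself is documented in the module; the underlying idea of maximum-similarity classification is to assign each document to the category whose definition vector it is most similar to. A from-scratch sketch of that idea using cosine similarity over toy bag-of-words vectors (this illustrates the principle only, not the library's implementation):

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def max_sim_predict(doc_vecs, definition_vecs, labels):
    """Label each document with the most similar category definition."""
    preds = []
    for d in doc_vecs:
        sims = [cosine_sim(d, c) for c in definition_vecs]
        preds.append(labels[int(np.argmax(sims))])
    return preds

# Toy BoW vectors over a shared 3-word vocabulary.
definitions = np.array([[1.0, 1.0, 0.0],   # "physics" definition
                        [0.0, 1.0, 1.0]])  # "biology" definition
documents = np.array([[1.0, 0.0, 0.0],
                      [0.0, 0.0, 1.0]])
preds = max_sim_predict(documents, definitions, ["physics", "biology"])
```

Because classification only needs the category definitions, this approach requires no labeled training documents, which is what makes the `pseudo_label()`-style label propagation listed above possible.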