Project 3 for University of Oklahoma LIS4693
This project is an advanced information retrieval and text mining system designed for analyzing and exploring text data. It allows users to search, filter, cluster, and explore a corpus of documents interactively. The system provides tools for efficient text retrieval, clustering, classification, and document exploration.
Key features include:
- Search Functionality: Perform keyword or phrase searches across the corpus using TF-IDF for relevance scoring and fallback full-text search.
- K-Means Clustering: Group documents into clusters and visualize them using PCA scatter plots and category distributions.
- Naive Bayes Classification: Train a supervised learning model to predict document clusters and evaluate its performance.
- Corpus Exploration: Filter and explore documents interactively by clusters or categories.
- Export Capability: Save the corpus or search results to a file for offline use.
This project uses the Reuters dataset from the NLTK library. Follow these steps to set up the project locally.
Ensure you have Python 3.8+ installed.

1. Clone the repo:

   git clone https://github.com/codybennett/LIS4693-Project3.git

2. Install the required Python dependencies:

   pip install -r requirements.txt

3. Generate the corpus. Run the data_collection.py script to process the Reuters dataset and populate the SQLite database:

   python data_collection.py
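As a rough illustration of what the corpus-generation step produces, the sketch below inserts documents into a SQLite table. The table name, column names, and the two stand-in documents are assumptions for illustration; the real data_collection.py pulls fileids, raw text, and categories from nltk.corpus.reuters and writes a database file on disk.

```python
import sqlite3

# Hypothetical stand-in documents; data_collection.py reads these from
# nltk.corpus.reuters (fileids, raw text, categories) instead.
docs = [
    ("training/1", "grain prices rose sharply on export news", "grain"),
    ("training/2", "crude oil futures fell in late trading", "crude"),
]

conn = sqlite3.connect(":memory:")  # the real script writes a file-based database
conn.execute(
    "CREATE TABLE IF NOT EXISTS documents "
    "(doc_id TEXT PRIMARY KEY, text TEXT, categories TEXT)"
)
conn.executemany("INSERT OR REPLACE INTO documents VALUES (?, ?, ?)", docs)
conn.commit()

count = conn.execute("SELECT COUNT(*) FROM documents").fetchone()[0]
print(count)  # 2
```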
This utility analyzes a directory of text documents (corpus). Users can:
- Generate Corpus: Use the data_collection.py script to create the corpus from the Reuters dataset.
- Search: Enter a keyword or phrase in the search bar to find relevant documents.
- Cluster: Use K-Means clustering to group documents into clusters and visualize them.
- Filter: Use sidebar options to narrow down results by clusters or categories.
- Explore Results: Expand search results to view document snippets and full content.
- Export: Save the corpus or search results to a text file for offline analysis.
- Perform keyword or phrase searches using TF-IDF for relevance scoring.
- Fallback to full-text search if no TF-IDF matches are found.
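A minimal sketch of this two-stage search, assuming a scikit-learn TF-IDF pipeline (the toy corpus and the search function name are illustrative, not the app's actual code):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy corpus standing in for the Reuters documents stored in SQLite.
corpus = [
    "grain exports from the midwest rose sharply",
    "crude oil prices fell after opec talks",
    "wheat and grain futures were mixed",
]

vectorizer = TfidfVectorizer(stop_words="english")
matrix = vectorizer.fit_transform(corpus)

def search(query, top_k=3):
    """Rank documents by TF-IDF cosine similarity; fall back to substring match."""
    scores = cosine_similarity(vectorizer.transform([query]), matrix).ravel()
    hits = [(i, s) for i, s in enumerate(scores) if s > 0]
    if not hits:  # fallback full-text search when TF-IDF finds nothing
        hits = [(i, 1.0) for i, doc in enumerate(corpus) if query.lower() in doc.lower()]
    return [i for i, _ in sorted(hits, key=lambda h: -h[1])[:top_k]]

print(search("grain"))  # both grain documents rank above the oil document
```

Queries whose tokens are outside the TF-IDF vocabulary (e.g. a partial word) produce an all-zero score vector, which is what triggers the substring fallback.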
- Group documents into clusters based on their similarity.
- Visualize clusters using PCA scatter plots.
- Analyze category distributions within each cluster.
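The clustering and PCA projection can be sketched as follows, assuming TF-IDF features feed both steps (the six toy documents and the fixed random_state are illustrative choices, not the app's configuration):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

# Two obvious topics (grain vs. oil) so K-Means has something to find.
docs = [
    "grain wheat corn harvest exports",
    "wheat grain shipments harvest",
    "corn grain crop harvest",
    "crude oil barrel prices opec",
    "oil petroleum crude refinery",
    "opec crude oil output",
]

X = TfidfVectorizer().fit_transform(docs)
km = KMeans(n_clusters=2, n_init=10, random_state=42).fit(X)

# Project the high-dimensional TF-IDF vectors to 2-D for the scatter plot;
# km.labels_ would color the points.
coords = PCA(n_components=2).fit_transform(X.toarray())
print(coords.shape)  # (6, 2)
```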
- Train a Naive Bayes Classifier on K-Means clusters.
- Predict the cluster for new or unseen documents.
- Evaluate the classifier's performance using metrics like accuracy, confusion matrix, and classification report.
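The classification step can be sketched as training a Multinomial Naive Bayes model on the K-Means labels (a common pairing since TF-IDF features are non-negative; the toy data below is illustrative, and a real evaluation would hold out a test set rather than scoring on the training documents):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, confusion_matrix

docs = [
    "grain wheat corn harvest exports",
    "wheat grain shipments harvest",
    "corn grain crop harvest",
    "crude oil barrel prices opec",
    "oil petroleum crude refinery",
    "opec crude oil output",
]

X = TfidfVectorizer().fit_transform(docs)
labels = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(X)

# The cluster assignments become supervised training labels.
clf = MultinomialNB().fit(X, labels)
preds = clf.predict(X)

acc = accuracy_score(labels, preds)
cm = confusion_matrix(labels, preds)
print(acc)
print(cm)
```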
- Filter documents by clusters or categories.
- View document snippets and full content interactively.
- Save the corpus or search results to a file for offline use.
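One plausible shape for the export step, writing (doc_id, text) pairs to a plain-text file; the result pairs and output filename here are hypothetical:

```python
import tempfile
from pathlib import Path

# Hypothetical search hits: (doc_id, text) pairs pulled from the database.
results = [
    ("training/1", "grain prices rose sharply on export news"),
    ("training/2", "crude oil futures fell in late trading"),
]

out_path = Path(tempfile.mkdtemp()) / "search_results.txt"
out_path.write_text(
    "\n\n".join(f"{doc_id}\n{text}" for doc_id, text in results),
    encoding="utf-8",
)
print(out_path.name)  # search_results.txt
```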
To interact with the corpus and perform searches, you can run the Streamlit application locally or access it via Streamlit Cloud:
1. Ensure all dependencies are installed:

   pip install -r requirements.txt

2. Start the Streamlit application:

   streamlit run streamlit_app.py

3. Open the provided URL in your browser to access the application.
The application is also deployed on Streamlit Cloud, so you can access it directly without setting up a local environment.
Contributions are what make the open source community such an amazing place to learn, inspire, and create. Any contributions you make are greatly appreciated.
If you have a suggestion that would make this better, please fork the repo and create a pull request. You can also simply open an issue with the tag "enhancement". Don't forget to give the project a star! Thanks again!
1. Fork the Project
2. Create your Feature Branch (git checkout -b feature/AmazingFeature)
3. Commit your Changes (git commit -m 'Add some AmazingFeature')
4. Push to the Branch (git push origin feature/AmazingFeature)
5. Open a Pull Request
Distributed under the GNU GPL-3.0 License. See LICENSE.txt for more information.
- Cody Bennett - [email protected]
Project Link: https://github.com/codybennett/LIS4693-Project3
This project was developed as part of the LIS4693 course at the University of Oklahoma. Special thanks to the course instructors and teaching assistants for their guidance and support.