Project 3 for University of Oklahoma LIS4693
This project is an advanced information retrieval and text mining system designed for analyzing and exploring text data. It allows users to search, filter, cluster, and explore a corpus of documents interactively. The system provides tools for efficient text retrieval, clustering, classification, and document exploration.
Key features include:
- Search Functionality: Perform keyword or phrase searches across the corpus using TF-IDF for relevance scoring and fallback full-text search.
- K-Means Clustering: Group documents into clusters and visualize them using PCA scatter plots and category distributions.
- Naive Bayes Classification: Train a supervised learning model to predict document clusters and evaluate its performance.
- Corpus Exploration: Filter and explore documents interactively by clusters or categories.
- Export Capability: Save the corpus or search results to a file for offline use.
This project uses the Reuters dataset from the NLTK library. Follow these steps to set up the project locally.
Ensure you have Python 3.8+ installed.

1. Clone the repo:

   git clone https://github.com/codybennett/LIS4693-Project3.git

2. Install the required Python dependencies:

   pip install -r requirements.txt

3. Generate the corpus. Run the data_collection.py script to process the Reuters dataset and populate the SQLite database:

   python data_collection.py
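As a rough illustration of what the corpus-generation step produces, the sketch below inserts documents into a SQLite table. The table name, column names, and the two stand-in documents are assumptions for illustration; the real data_collection.py pulls fileids, raw text, and categories from nltk.corpus.reuters and writes a database file on disk.

```python
import sqlite3

# Hypothetical stand-in documents; data_collection.py reads these from
# nltk.corpus.reuters (fileids, raw text, categories) instead.
docs = [
    ("training/1", "grain prices rose sharply on export news", "grain"),
    ("training/2", "crude oil futures fell in late trading", "crude"),
]

conn = sqlite3.connect(":memory:")  # the real script writes a file-based database
conn.execute(
    "CREATE TABLE IF NOT EXISTS documents "
    "(doc_id TEXT PRIMARY KEY, text TEXT, categories TEXT)"
)
conn.executemany("INSERT OR REPLACE INTO documents VALUES (?, ?, ?)", docs)
conn.commit()

count = conn.execute("SELECT COUNT(*) FROM documents").fetchone()[0]
print(count)  # 2
```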
This utility analyzes a directory of text documents (corpus). Users can:
- Generate Corpus: Use the data_collection.py script to create the corpus from the Reuters dataset.
- Search: Enter a keyword or phrase in the search bar to find relevant documents.
- Cluster: Use K-Means clustering to group documents into clusters and visualize them.
- Filter: Use sidebar options to narrow down results by clusters or categories.
- Explore Results: Expand search results to view document snippets and full content.
- Export: Save the corpus or search results to a text file for offline analysis.
- Perform keyword or phrase searches using TF-IDF for relevance scoring.
- Fallback to full-text search if no TF-IDF matches are found.
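A minimal sketch of this two-stage search, assuming a scikit-learn TF-IDF pipeline (the toy corpus and the search function name are illustrative, not the app's actual code):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy corpus standing in for the Reuters documents stored in SQLite.
corpus = [
    "grain exports from the midwest rose sharply",
    "crude oil prices fell after opec talks",
    "wheat and grain futures were mixed",
]

vectorizer = TfidfVectorizer(stop_words="english")
matrix = vectorizer.fit_transform(corpus)

def search(query, top_k=3):
    """Rank documents by TF-IDF cosine similarity; fall back to substring match."""
    scores = cosine_similarity(vectorizer.transform([query]), matrix).ravel()
    hits = [(i, s) for i, s in enumerate(scores) if s > 0]
    if not hits:  # fallback full-text search when TF-IDF finds nothing
        hits = [(i, 1.0) for i, doc in enumerate(corpus) if query.lower() in doc.lower()]
    return [i for i, _ in sorted(hits, key=lambda h: -h[1])[:top_k]]

print(search("grain"))  # both grain documents rank above the oil document
```

Queries whose tokens are outside the TF-IDF vocabulary (e.g. a partial word) produce an all-zero score vector, which is what triggers the substring fallback.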
- Group documents into clusters based on their similarity.
- Visualize clusters using PCA scatter plots.
- Analyze category distributions within each cluster.
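The clustering and PCA projection can be sketched as follows, assuming TF-IDF features feed both steps (the six toy documents and the fixed random_state are illustrative choices, not the app's configuration):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

# Two obvious topics (grain vs. oil) so K-Means has something to find.
docs = [
    "grain wheat corn harvest exports",
    "wheat grain shipments harvest",
    "corn grain crop harvest",
    "crude oil barrel prices opec",
    "oil petroleum crude refinery",
    "opec crude oil output",
]

X = TfidfVectorizer().fit_transform(docs)
km = KMeans(n_clusters=2, n_init=10, random_state=42).fit(X)

# Project the high-dimensional TF-IDF vectors to 2-D for the scatter plot;
# km.labels_ would color the points.
coords = PCA(n_components=2).fit_transform(X.toarray())
print(coords.shape)  # (6, 2)
```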
- Train a Naive Bayes Classifier on K-Means clusters.
- Predict the cluster for new or unseen documents.
- Evaluate the classifier's performance using metrics like accuracy, confusion matrix, and classification report.
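The classification step can be sketched as training a Multinomial Naive Bayes model on the K-Means labels (a common pairing since TF-IDF features are non-negative; the toy data below is illustrative, and a real evaluation would hold out a test set rather than scoring on the training documents):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, confusion_matrix

docs = [
    "grain wheat corn harvest exports",
    "wheat grain shipments harvest",
    "corn grain crop harvest",
    "crude oil barrel prices opec",
    "oil petroleum crude refinery",
    "opec crude oil output",
]

X = TfidfVectorizer().fit_transform(docs)
labels = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(X)

# The cluster assignments become supervised training labels.
clf = MultinomialNB().fit(X, labels)
preds = clf.predict(X)

acc = accuracy_score(labels, preds)
cm = confusion_matrix(labels, preds)
print(acc)
print(cm)
```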
- Filter documents by clusters or categories.
- View document snippets and full content interactively.
- Save the corpus or search results to a file for offline use.
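One plausible shape for the export step, writing (doc_id, text) pairs to a plain-text file; the result pairs and output filename here are hypothetical:

```python
import tempfile
from pathlib import Path

# Hypothetical search hits: (doc_id, text) pairs pulled from the database.
results = [
    ("training/1", "grain prices rose sharply on export news"),
    ("training/2", "crude oil futures fell in late trading"),
]

out_path = Path(tempfile.mkdtemp()) / "search_results.txt"
out_path.write_text(
    "\n\n".join(f"{doc_id}\n{text}" for doc_id, text in results),
    encoding="utf-8",
)
print(out_path.name)  # search_results.txt
```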
To interact with the corpus and perform searches, you can run the Streamlit application locally or access it via Streamlit Cloud:
1. Ensure all dependencies are installed:

   pip install -r requirements.txt

2. Start the Streamlit application:

   streamlit run streamlit_app.py

3. Open the provided URL in your browser to access the application.
The application is also deployed on Streamlit Cloud, so you can access it directly without setting up a local environment.
Contributions are what make the open source community such an amazing place to learn, inspire, and create. Any contributions you make are greatly appreciated.
If you have a suggestion that would make this better, please fork the repo and create a pull request. You can also simply open an issue with the tag "enhancement". Don't forget to give the project a star! Thanks again!
1. Fork the Project
2. Create your Feature Branch (git checkout -b feature/AmazingFeature)
3. Commit your Changes (git commit -m 'Add some AmazingFeature')
4. Push to the Branch (git push origin feature/AmazingFeature)
5. Open a Pull Request
Distributed under the GNU GPL-3.0 License. See LICENSE.txt for more information.
- Cody Bennett - [email protected]
Project Link: https://github.com/codybennett/LIS4693-Project3
This project was developed as part of the LIS4693 course at the University of Oklahoma. Special thanks to the course instructors and teaching assistants for their guidance and support.