
LIS4693-Project3

Project 3 for University of Oklahoma LIS4693
Explore the docs »

View Demo · Report Bug · Request Feature

Table of Contents
  1. About The Project
  2. Getting Started
  3. Usage
  4. Features
  5. Running the Streamlit Application
  6. Roadmap
  7. Contributing
  8. License
  9. Contact
  10. Acknowledgments

About The Project

This project is an information retrieval and text mining system for analyzing and exploring text data. It lets users interactively search, cluster, classify, filter, and export a corpus of documents, with tools for efficient text retrieval and document exploration.

Key features include:

  • Search Functionality: Perform keyword or phrase searches across the corpus using TF-IDF for relevance scoring, with a fallback to full-text search when no TF-IDF matches are found.
  • K-Means Clustering: Group documents into clusters and visualize them using PCA scatter plots and category distributions.
  • Naive Bayes Classification: Train a supervised learning model to predict document clusters and evaluate its performance.
  • Corpus Exploration: Filter and explore documents interactively by clusters or categories.
  • Export Capability: Save the corpus or search results to a file for offline use.

(back to top)

Built With

  • Python
  • Streamlit
  • SQLite
  • NLTK
  • Scikit-learn

(back to top)

Getting Started

This project uses the Reuters dataset from the NLTK library. Follow these steps to set up the project locally.

Prerequisites

Ensure you have Python 3.8+ installed. Install the required Python packages:

pip install -r requirements.txt

Installation

  1. Clone the repo:

    git clone https://github.com/codybennett/LIS4693-Project3.git
  2. Install Python dependencies:

    pip install -r requirements.txt
  3. Generate the corpus:

    Run the data_collection.py script to process the Reuters dataset and populate the SQLite database:

    python data_collection.py
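
    For reference, the corpus-building step amounts to roughly the following (a minimal sketch, not the script's exact code; the corpus.db file name and the documents table schema are illustrative assumptions):

    # Sketch: build an SQLite corpus from the NLTK Reuters dataset.
    # File name (corpus.db) and table layout (documents) are illustrative, not the script's actual schema.
    import sqlite3
    import nltk
    from nltk.corpus import reuters

    nltk.download("reuters")

    conn = sqlite3.connect("corpus.db")
    conn.execute(
        "CREATE TABLE IF NOT EXISTS documents "
        "(doc_id TEXT PRIMARY KEY, categories TEXT, text TEXT)"
    )
    for doc_id in reuters.fileids():
        conn.execute(
            "INSERT OR REPLACE INTO documents VALUES (?, ?, ?)",
            (doc_id, ",".join(reuters.categories(doc_id)), reuters.raw(doc_id)),
        )
    conn.commit()
    conn.close()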

(back to top)

Usage

This utility analyzes a corpus of text documents built from the Reuters dataset. Users can:

  1. Generate Corpus: Use the data_collection.py script to create the corpus from the Reuters dataset.
  2. Search: Enter a keyword or phrase in the search bar to find relevant documents.
  3. Cluster: Use K-Means clustering to group documents into clusters and visualize them.
  4. Filter: Use sidebar options to narrow down results by clusters or categories.
  5. Explore Results: Expand search results to view document snippets and full content.
  6. Export: Save the corpus or search results to a text file for offline analysis.

(back to top)

Features

Search Functionality

  • Perform keyword or phrase searches using TF-IDF for relevance scoring.
  • Fallback to full-text search if no TF-IDF matches are found.
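
For illustration, this kind of search can be sketched with scikit-learn's TfidfVectorizer and cosine similarity (a minimal sketch; the search function and variable names are illustrative, not the app's internals):

    # Sketch: TF-IDF relevance ranking with a full-text (substring) fallback.
    # `docs` is a list of document strings; names are illustrative.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    def search(query, docs, top_k=10):
        vectorizer = TfidfVectorizer(stop_words="english")
        doc_matrix = vectorizer.fit_transform(docs)
        query_vec = vectorizer.transform([query])
        scores = cosine_similarity(query_vec, doc_matrix).ravel()
        ranked = [(i, s) for i, s in sorted(enumerate(scores), key=lambda p: -p[1]) if s > 0]
        if ranked:
            return ranked[:top_k]
        # Fallback: plain substring match when TF-IDF finds nothing.
        return [(i, 0.0) for i, d in enumerate(docs) if query.lower() in d.lower()][:top_k]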

K-Means Clustering

  • Group documents into clusters based on their similarity.
  • Visualize clusters using PCA scatter plots.
  • Analyze category distributions within each cluster.
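
A minimal sketch of this approach with scikit-learn (the cluster count and other parameters are illustrative; the corpus.db/documents layout is the assumption used in the installation sketch above):

    # Sketch: cluster TF-IDF vectors with K-Means and project them to 2-D with PCA.
    import sqlite3
    import matplotlib.pyplot as plt
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.cluster import KMeans
    from sklearn.decomposition import PCA

    docs = [row[0] for row in sqlite3.connect("corpus.db").execute("SELECT text FROM documents")]

    tfidf = TfidfVectorizer(stop_words="english", max_features=5000)
    X = tfidf.fit_transform(docs)

    kmeans = KMeans(n_clusters=5, random_state=42, n_init=10)  # illustrative cluster count
    labels = kmeans.fit_predict(X)

    # PCA needs a dense array; reduce to two components for a scatter plot.
    points = PCA(n_components=2).fit_transform(X.toarray())
    plt.scatter(points[:, 0], points[:, 1], c=labels, cmap="tab10", s=10)
    plt.title("K-Means clusters (PCA projection)")
    plt.show()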

Naive Bayes Classification

  • Train a Naive Bayes Classifier on K-Means clusters.
  • Predict the cluster for new or unseen documents.
  • Evaluate the classifier's performance using metrics like accuracy, confusion matrix, and classification report.
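
A minimal sketch of the training and evaluation steps, continuing from the clustering sketch above (the split ratio and example document are illustrative):

    # Sketch: use the K-Means cluster labels as targets for a supervised Naive Bayes model.
    from sklearn.model_selection import train_test_split
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

    # X, labels, and tfidf come from the clustering sketch above.
    X_train, X_test, y_train, y_test = train_test_split(X, labels, test_size=0.2, random_state=42)

    nb = MultinomialNB()
    nb.fit(X_train, y_train)
    y_pred = nb.predict(X_test)

    print("Accuracy:", accuracy_score(y_test, y_pred))
    print(confusion_matrix(y_test, y_pred))
    print(classification_report(y_test, y_pred))

    # Predict the cluster for a new, unseen document.
    new_doc = ["oil prices rose sharply after the announcement"]
    print(nb.predict(tfidf.transform(new_doc)))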

Corpus Exploration

  • Filter documents by clusters or categories.
  • View document snippets and full content interactively.
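
A minimal sketch of how such filtering might look in Streamlit (the widget layout is illustrative and assumes the docs and labels variables from the sketches above):

    # Sketch: sidebar cluster filter with expandable document snippets.
    import streamlit as st

    selected = st.sidebar.multiselect("Clusters", sorted(set(labels)))
    for text, label in zip(docs, labels):
        if not selected or label in selected:
            with st.expander(text[:80] + "..."):  # snippet as the expander title
                st.write(text)                    # full content inside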

Export Capability

  • Save the corpus or search results to a file for offline use.
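
A minimal sketch of a plain-text export (the file name and the shape of the results list are illustrative):

    # Sketch: write search results to a text file for offline use.
    # `results` is assumed to be a list of (doc_id, text) pairs.
    with open("search_results.txt", "w", encoding="utf-8") as fh:
        for doc_id, text in results:
            fh.write(f"=== {doc_id} ===\n{text}\n\n")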

(back to top)

Running the Streamlit Application

To interact with the corpus and perform searches, you can run the Streamlit application locally or access it via Streamlit Cloud:

Local Setup

  1. Ensure all dependencies are installed:

    pip install -r requirements.txt
  2. Start the Streamlit application:

    streamlit run streamlit_app.py
  3. Open the provided URL in your browser to access the application.

Streamlit Cloud

The application is also deployed on Streamlit Cloud. You can access it directly without setting up a local environment by visiting the following link:

Streamlit Cloud Deployment

(back to top)

Contributing

Contributions are what make the open source community such an amazing place to learn, inspire, and create. Any contributions you make are greatly appreciated.

If you have a suggestion that would make this better, please fork the repo and create a pull request. You can also simply open an issue with the tag "enhancement". Don't forget to give the project a star! Thanks again!

  1. Fork the Project
  2. Create your Feature Branch (git checkout -b feature/AmazingFeature)
  3. Commit your Changes (git commit -m 'Add some AmazingFeature')
  4. Push to the Branch (git push origin feature/AmazingFeature)
  5. Open a Pull Request

(back to top)

License

Distributed under the GNU GPLv3 license. See LICENSE.txt for more information.

(back to top)

Contact

Project Link: https://github.com/codybennett/LIS4693-Project3

(back to top)

Acknowledgments

This project was developed as part of the LIS4693 course at the University of Oklahoma. Special thanks to the course instructors and teaching assistants for their guidance and support.

(back to top)
