Swahili Spam Detection - Training Repository

The training component of the Swahili Spam Detection system, containing datasets, training notebooks, and model evaluation infrastructure. This repository handles the machine learning aspects of the main SSD project.

Project Structure

└── ssd-training/
    └── dataset/             # Training datasets
        └── combined_set.csv
        └── combined_set.xlsx
    └── Model_After_Training/
        └── model_after_tranining_28_jan_2025/
            └── swahiliSpamDetectionModel.pkl
        └── swahiliSpamDetectionModel.pkl
    └── stopwords/           # Swahili stopwords
        └── Common Swahili Stop-words.csv
    └── spamDetectionRef.ipynb   # Training notebook
    └── requirements.txt         # Dependencies

Model Performance

Performance Metrics

Model	Technique	Accuracy	Precision	Recall	F1-score	AUC-ROC
Logistic Regression	Count Vectorization	0.9924	0.9925	0.9924	0.9923	0.9996
Naive Bayes	Count Vectorization	0.9904	0.9905	0.9904	0.9905	0.9988
SVM	Count Vectorization	0.9933	0.9934	0.9933	0.9933	0.9995
Random Forest	Count Vectorization	0.9933	0.9933	0.9933	0.9933	0.9993
Logistic Regression	TF-IDF	0.9838	0.9842	0.9838	0.9837	0.9995
Naive Bayes	TF-IDF	0.9952	0.9952	0.9952	0.9952	0.9985
SVM	TF-IDF	0.9914	0.9915	0.9914	0.9914	0.9999
Random Forest	TF-IDF	0.9924	0.9925	0.9924	0.9923	0.9998

Getting Started

Prerequisites

Python 3.8+
pip package manager

Installation

Clone the repository

git clone https://github.com/patrick-paul/ssd-training.git
cd ssd-training

Create virtual environment

python -m venv .venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate

Install dependencies

pip install -r requirements.txt

Dataset Information

Contains over 6,000 labeled Swahili messages
Binary classification (spam/ham)
Available in CSV and Excel formats
Includes regional language variations

Model Details

Best Performing Model

Random Forest with Count Vectorization:

Accuracy: 0.9933
Precision: 0.9933
Recall: 0.9933
F1-score: 0.9933
AUC-ROC: 0.9993

Key Features

Robust against overfitting
Consistent performance across metrics
Excellent handling of text features
Strong performance with Count Vectorization

Dependencies

pandas
matplotlib
numpy
seaborn
nltk
scikit-learn
joblib

Usage

Open the Jupyter notebook:

jupyter notebook spamDetectionRef.ipynb

Follow the notebook sections:
- Data preprocessing
- Feature extraction
- Model training
- Performance evaluation
- Model export

Contributing

Fork the repository
Create your feature branch
Follow PEP-8 standards
Push your changes
Submit a pull request

Future Improvements

Expand dataset with more regional dialects
Implement deep learning models
Add automated model retraining pipeline
Create comprehensive model testing suite
Add model versioning system

License

MIT License - See LICENSE.md for details

Contact

Developer: [email protected]
GitHub: @patrick-paul
Project Link: https://github.com/patrick-paul/ssd-training

Related Projects

Main SSD Application - The web application using these trained models

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Swahili Spam Detection - Training Repository

Project Structure

Model Performance

Performance Metrics

Getting Started

Prerequisites

Installation

Dataset Information

Model Details

Best Performing Model

Key Features

Dependencies

Usage

Contributing

Future Improvements

License

Contact

Related Projects

About

Releases

Packages

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 4 Commits
Model_After_Training		Model_After_Training
dataset		dataset
stopwords		stopwords
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
requirements.txt		requirements.txt
spamDetectionRef.ipynb		spamDetectionRef.ipynb
ssd-modal-file-structure-1.md		ssd-modal-file-structure-1.md

License

patrick-paul/ssd-training

Folders and files

Latest commit

History

Repository files navigation

Swahili Spam Detection - Training Repository

Project Structure

Model Performance

Performance Metrics

Getting Started

Prerequisites

Installation

Dataset Information

Model Details

Best Performing Model

Key Features

Dependencies

Usage

Contributing

Future Improvements

License

Contact

Related Projects

About

Topics

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages