The training component of the Swahili Spam Detection system, containing datasets, training notebooks, and model evaluation infrastructure. This repository handles the machine learning aspects of the main SSD project.
ssd-training/
├── dataset/                                # Training datasets
│   ├── combined_set.csv
│   └── combined_set.xlsx
├── Model_After_Training/
│   └── model_after_tranining_28_jan_2025/
│       └── swahiliSpamDetectionModel.pkl
├── stopwords/                              # Swahili stopwords
│   └── Common Swahili Stop-words.csv
├── spamDetectionRef.ipynb                  # Training notebook
└── requirements.txt                        # Dependencies
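The stopword list can be applied during text preprocessing, before vectorization. The snippet below is a minimal sketch, assuming the CSV holds one stopword per row in its first column; the exact header and layout of the file are not documented here, so adjust the loading code to match.

```python
# Minimal sketch: load the Swahili stopword list and strip stopwords from a message.
# Assumption: one stopword per row in the first column; adjust header=/column
# selection to match the actual CSV layout.
import pandas as pd

stopwords_df = pd.read_csv("stopwords/Common Swahili Stop-words.csv")
swahili_stopwords = set(stopwords_df.iloc[:, 0].astype(str).str.strip().str.lower())

def remove_stopwords(text: str) -> str:
    """Lowercase the message and drop tokens found in the stopword list."""
    return " ".join(t for t in text.lower().split() if t not in swahili_stopwords)

print(remove_stopwords("Hongera! Umeshinda zawadi ya pesa leo"))
```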
| Model | Technique | Accuracy | Precision | Recall | F1-score | AUC-ROC |
|---|---|---|---|---|---|---|
| Logistic Regression | Count Vectorization | 0.9924 | 0.9925 | 0.9924 | 0.9923 | 0.9996 |
| Naive Bayes | Count Vectorization | 0.9904 | 0.9905 | 0.9904 | 0.9905 | 0.9988 |
| SVM | Count Vectorization | 0.9933 | 0.9934 | 0.9933 | 0.9933 | 0.9995 |
| Random Forest | Count Vectorization | 0.9933 | 0.9933 | 0.9933 | 0.9933 | 0.9993 |
| Logistic Regression | TF-IDF | 0.9838 | 0.9842 | 0.9838 | 0.9837 | 0.9995 |
| Naive Bayes | TF-IDF | 0.9952 | 0.9952 | 0.9952 | 0.9952 | 0.9985 |
| SVM | TF-IDF | 0.9914 | 0.9915 | 0.9914 | 0.9914 | 0.9999 |
| Random Forest | TF-IDF | 0.9924 | 0.9925 | 0.9924 | 0.9923 | 0.9998 |
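The sketch below shows one way such a comparison can be run; it is not the notebook's exact code. The dataset column names (`text`, `label`), the 80/20 split, and the use of `LinearSVC` as the SVM stand-in are assumptions, and only accuracy is printed here (the other metrics are computed the same way via `sklearn.metrics`).

```python
# Sketch of the model/vectorizer comparison; column names and the split ratio
# are assumptions, not taken from the notebook.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

df = pd.read_csv("dataset/combined_set.csv")
X_train, X_test, y_train, y_test = train_test_split(
    df["text"], df["label"], test_size=0.2, random_state=42, stratify=df["label"]
)

vectorizers = {"Count Vectorization": CountVectorizer(), "TF-IDF": TfidfVectorizer()}
models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Naive Bayes": MultinomialNB(),
    "SVM": LinearSVC(),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
}

for vec_name, vectorizer in vectorizers.items():
    Xtr = vectorizer.fit_transform(X_train)
    Xte = vectorizer.transform(X_test)
    for model_name, model in models.items():
        model.fit(Xtr, y_train)
        acc = accuracy_score(y_test, model.predict(Xte))
        print(f"{model_name} | {vec_name} | accuracy={acc:.4f}")
```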
- Python 3.8+
- pip package manager
- Clone the repository
git clone https://github.com/patrick-paul/ssd-training.git
cd ssd-training
- Create a virtual environment
python -m venv .venv
source .venv/bin/activate # On Windows: .venv\Scripts\activate
- Install dependencies
pip install -r requirements.txt
- Contains over 6,000 labeled Swahili messages
- Binary classification (spam/ham)
- Available in CSV and Excel formats
- Includes regional language variations
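For a quick sanity check of the dataset, something like the following works; the column names `text` and `label` are assumptions and should be checked against the actual CSV header.

```python
# Quick look at the combined dataset; "text" and "label" column names are
# assumptions — adjust them if the CSV header differs.
import pandas as pd

df = pd.read_csv("dataset/combined_set.csv")
print(df.shape)                       # roughly 6,000+ rows expected
print(df["label"].value_counts())     # spam vs. ham distribution
print(df.sample(3, random_state=0))   # a few example messages
```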
Random Forest with Count Vectorization:

- Accuracy: 0.9933
- Precision: 0.9933
- Recall: 0.9933
- F1-score: 0.9933
- AUC-ROC: 0.9993

Strengths:

- Robust against overfitting
- Consistent performance across metrics
- Excellent handling of text features
- Strong performance with Count Vectorization
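A minimal training-and-export sketch for this configuration is shown below. Bundling the vectorizer and classifier into a single scikit-learn `Pipeline` and saving it with `joblib` is an assumption for illustration; the notebook may store the artifacts differently.

```python
# Sketch: train Random Forest on count features and export with joblib.
# Assumptions: dataset columns "text"/"label", and that vectorizer + classifier
# are saved together as one Pipeline (the notebook may save them separately).
import pandas as pd
import joblib
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.ensemble import RandomForestClassifier

df = pd.read_csv("dataset/combined_set.csv")

pipeline = Pipeline([
    ("vectorizer", CountVectorizer()),
    ("classifier", RandomForestClassifier(n_estimators=100, random_state=42)),
])
pipeline.fit(df["text"], df["label"])

joblib.dump(pipeline, "swahiliSpamDetectionModel.pkl")
```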
pandas
matplotlib
numpy
seaborn
nltk
scikit-learn
joblib
- Open the Jupyter notebook:
  jupyter notebook spamDetectionRef.ipynb
- Follow the notebook sections:
  - Data preprocessing
  - Feature extraction
  - Model training
  - Performance evaluation
  - Model export (see the inference sketch below)
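Once a model has been exported, it can be loaded for inference along these lines; this assumes the pickle holds an object (such as a `Pipeline`) whose `predict()` accepts raw text. If the vectorizer was saved separately, transform the message first.

```python
# Sketch: load the exported model and classify a new message.
# Assumption: the pickle holds a Pipeline (or similar) that accepts raw strings.
import joblib

model = joblib.load(
    "Model_After_Training/model_after_tranining_28_jan_2025/swahiliSpamDetectionModel.pkl"
)

message = "Hongera! Umeshinda milioni tano, tuma taarifa zako sasa"
print(model.predict([message])[0])   # expected output: a spam/ham label
```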
- Fork the repository
- Create your feature branch
- Follow PEP 8 style guidelines
- Push your changes
- Submit a pull request
- Expand dataset with more regional dialects
- Implement deep learning models
- Add automated model retraining pipeline
- Create comprehensive model testing suite
- Add model versioning system
MIT License - See LICENSE.md for details
- Developer: [email protected]
- GitHub: @patrick-paul
- Project Link: https://github.com/patrick-paul/ssd-training
- Main SSD Application - The web application using these trained models