A machine learning project to detect cyberbullying across multiple languages using mBERT (Multilingual BERT) and sentiment analysis. This project focuses on providing a context-aware and language-agnostic solution for detecting harmful online behavior in Hindi, Marathi, and Malayalam.
- Introduction
- Screenshots
- Features
- Project Architecture
- Datasets
- Running
- Preprocessing
- Model
- Evaluation Metrics
- Installation
- Usage
- Future Work
- Authors
With the rise of social media, cyberbullying has become a critical issue affecting individuals across diverse backgrounds and age groups. This project aims to create a robust solution capable of identifying cyberbullying in various languages, providing insights into harmful interactions and promoting safer online spaces.
- Multilingual Support: Supports cyberbullying detection in Hindi, Marathi, and Malayalam.
- Sentiment Analysis Integration: Analyzes the emotional tone of input text to improve accuracy.
- Real-Time Classification: Classifies input text as cyberbullying or non-cyberbullying in real time.
- Scalable Architecture: Ready for deployment on platforms like Heroku or AWS.
The model architecture consists of several stages (a code sketch of how they fit together appears after this list):
- Input Layer: Receives text data in Hindi, Marathi, or Malayalam.
- Preprocessing: Cleans and normalizes the text by removing special characters and stop words and performing lemmatization.
- Feature Extraction:
- TF-IDF: Extracts term frequency-inverse document frequency values.
- mBERT Embeddings: Generates multilingual embeddings to capture language context.
- Logistic Regression Classifier: Combines TF-IDF and mBERT features for binary classification.
- Output Layer: Outputs binary labels for cyberbullying or non-cyberbullying.
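The sketch below shows one way these stages could fit together: mean-pooled mBERT embeddings are concatenated with TF-IDF features and fed to a logistic regression classifier. It is an illustrative sketch, not the repository's actual code; the helper names, batch size, and `max_features` value are assumptions.

```python
# Illustrative sketch of the feature-extraction and classification stages.
# Requires: transformers, torch, scikit-learn, scipy, numpy.
import numpy as np
import torch
from scipy.sparse import hstack, csr_matrix
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
bert = AutoModel.from_pretrained("bert-base-multilingual-cased")

def mbert_embeddings(texts, batch_size=16):
    """Mean-pooled mBERT embeddings: one 768-dimensional vector per text."""
    vectors = []
    bert.eval()
    with torch.no_grad():
        for i in range(0, len(texts), batch_size):
            batch = texts[i:i + batch_size]
            enc = tokenizer(batch, padding=True, truncation=True,
                            max_length=128, return_tensors="pt")
            out = bert(**enc).last_hidden_state           # (B, T, 768)
            mask = enc["attention_mask"].unsqueeze(-1)    # (B, T, 1)
            pooled = (out * mask).sum(1) / mask.sum(1)    # mean over real tokens
            vectors.append(pooled.cpu().numpy())
    return np.vstack(vectors)

# Toy training data; in practice this comes from the preprocessed dataset.
texts = ["उदाहरण वाक्य एक", "उदाहरण वाक्य दो"]
labels = [0, 1]

tfidf = TfidfVectorizer(max_features=5000)
X_tfidf = tfidf.fit_transform(texts)                      # sparse TF-IDF matrix
X_bert = csr_matrix(mbert_embeddings(texts))              # dense mBERT features
X = hstack([X_tfidf, X_bert])                             # combined feature space

clf = LogisticRegression(max_iter=1000)
clf.fit(X, labels)                                        # binary cyberbullying classifier
```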
- Source: Data collected from social media platforms, with samples labeled as cyberbullying or non-cyberbullying.
- Languages: Hindi, Marathi, and Malayalam.
- Size: Over 30,000 labeled text samples to ensure diverse representation and robust model training.
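As a rough illustration, the labeled data could be inspected as below; the file path and the `text`, `label`, and `language` column names are hypothetical and may not match the repository's actual files.

```python
import pandas as pd

# Hypothetical dataset layout: one row per sample with text, binary label, and language code.
df = pd.read_csv("data/cyberbullying_samples.csv")   # assumed path and columns
print(df["language"].value_counts())                 # e.g. Hindi / Marathi / Malayalam split
print(df["label"].value_counts())                    # cyberbullying vs. non-cyberbullying
```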
The preprocessing pipeline includes the following steps (a short code sketch appears after this list):
- Text Cleaning: Removing special characters, URLs, hashtags, and mentions.
- Lowercasing: Converting text to lowercase.
- Stop Words Removal: Eliminating common, irrelevant words.
- Lemmatization: Reducing words to their base forms.
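A minimal sketch of these steps, assuming regex-based cleaning and placeholder language resources; the real `preprocess.py` presumably ships its own stop-word lists and Indic-language lemmatizers.

```python
import re

# Placeholder stop-word set; the actual pipeline would use per-language lists
# for Hindi, Marathi, and Malayalam.
STOP_WORDS = {"और", "के", "आणि", "ആണ്"}

def clean_text(text: str) -> str:
    text = re.sub(r"https?://\S+", " ", text)     # remove URLs
    text = re.sub(r"[@#]\w+", " ", text)          # remove mentions and hashtags
    text = re.sub(r"[^\w\s]", " ", text)          # remove special characters
    text = text.lower()                           # lowercase (affects Latin-script tokens)
    tokens = [t for t in text.split() if t not in STOP_WORDS]
    # Lemmatization would go here; Indic-language lemmatizers are assumed to be
    # provided by the project's own tooling.
    return " ".join(tokens)

print(clean_text("Check this out! https://example.com @user #tag"))
```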
The project uses:
- mBERT (Multilingual BERT): A transformer model that provides robust, multilingual embeddings.
- Logistic Regression: A classifier trained on combined TF-IDF and mBERT embeddings to distinguish cyberbullying from non-cyberbullying effectively.
The model is evaluated on:
- Accuracy
- Precision
- Recall
- F1-Score
These metrics ensure balanced performance across classes and highlight the model’s ability to minimize false positives and negatives.
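With scikit-learn, these metrics can be computed from the test-set predictions, for example:

```python
from sklearn.metrics import accuracy_score, classification_report

# y_true / y_pred would come from the held-out test split (toy values here).
y_true = [0, 1, 1, 0, 1]
y_pred = [0, 1, 0, 0, 1]

print("Accuracy:", accuracy_score(y_true, y_pred))
print(classification_report(y_true, y_pred,
                            target_names=["non-cyberbullying", "cyberbullying"]))
```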
- Clone the repository

  ```bash
  git clone https://github.com/yourusername/Multilingual-Cyberbullying-Detection.git
  cd Multilingual-Cyberbullying-Detection
  ```
- Install dependencies

  ```bash
  pip install -r requirements.txt
  ```
- Download the mBERT model (if not included in the repository)

  ```python
  from transformers import AutoTokenizer, AutoModel

  tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
  model = AutoModel.from_pretrained("bert-base-multilingual-cased")
  ```
- Run the preprocessing script

  ```bash
  python preprocess.py
  ```
- Train the model

  ```bash
  python train.py
  ```
- Test the model: input text samples to see classification results for cyberbullying detection, for example with a small inference helper like the sketch below.
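This is a hypothetical inference sketch: the artifact file names and the reuse of the `clean_text` and `mbert_embeddings` helpers from the earlier sketches are assumptions, not the project's actual interface.

```python
import joblib
from scipy.sparse import hstack, csr_matrix

# Hypothetical artifact names that train.py might have saved.
tfidf = joblib.load("artifacts/tfidf.joblib")
clf = joblib.load("artifacts/classifier.joblib")

def predict(text: str) -> str:
    cleaned = clean_text(text)  # preprocessing sketch shown earlier
    features = hstack([tfidf.transform([cleaned]),
                       csr_matrix(mbert_embeddings([cleaned]))])  # mBERT sketch shown earlier
    return "cyberbullying" if clf.predict(features)[0] == 1 else "non-cyberbullying"

print(predict("यह एक उदाहरण वाक्य है"))
```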
- Deep Learning Integration: Experiment with LSTM, CNN, or hybrid models for enhanced accuracy.
- Real-Time Monitoring: Develop capabilities for live monitoring of social media content.
- Expanded Language Support: Extend the model to detect cyberbullying in additional languages.