A machine learning project to detect cyberbullying across multiple languages using mBERT (Multilingual BERT) and sentiment analysis. This project focuses on providing a context-aware and language-agnostic solution for detecting harmful online behavior in Hindi, Marathi, and Malayalam.
- Introduction
- Screenshots
- Features
- Project Architecture
- Datasets
- Running
- Preprocessing
- Model
- Evaluation Metrics
- Installation
- Usage
- Future Work
- Authors
With the rise of social media, cyberbullying has become a critical issue affecting individuals across diverse backgrounds and age groups. This project aims to create a robust solution capable of identifying cyberbullying in various languages, providing insights into harmful interactions and promoting safer online spaces.
- Multilingual Support: Supports cyberbullying detection in Hindi, Marathi, and Malayalam.
- Sentiment Analysis Integration: Analyzes the emotional tone of input text to improve accuracy.
- Real-Time Classification: Classifies input text as cyberbullying or non-cyberbullying in real time.
- Scalable Architecture: Ready for deployment on platforms like Heroku or AWS.
The model architecture consists of several stages (a code sketch of how they fit together appears after this list):
- Input Layer: Receives text data in Hindi, Marathi, or Malayalam.
- Preprocessing: Cleans and normalizes the text by removing special characters and stop words and performing lemmatization.
- Feature Extraction:
- TF-IDF: Extracts term frequency-inverse document frequency values.
- mBERT Embeddings: Generates multilingual embeddings to capture language context.
- Logistic Regression Classifier: Combines TF-IDF and mBERT features for binary classification.
- Output Layer: Outputs binary labels for cyberbullying or non-cyberbullying.
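The sketch below shows one way these stages could fit together: mean-pooled mBERT embeddings are concatenated with TF-IDF features and fed to a logistic regression classifier. It is an illustrative sketch, not the repository's actual code; the helper names, batch size, and `max_features` value are assumptions.

```python
# Illustrative sketch of the feature-extraction and classification stages.
# Requires: transformers, torch, scikit-learn, scipy, numpy.
import numpy as np
import torch
from scipy.sparse import hstack, csr_matrix
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
bert = AutoModel.from_pretrained("bert-base-multilingual-cased")

def mbert_embeddings(texts, batch_size=16):
    """Mean-pooled mBERT embeddings: one 768-dimensional vector per text."""
    vectors = []
    bert.eval()
    with torch.no_grad():
        for i in range(0, len(texts), batch_size):
            batch = texts[i:i + batch_size]
            enc = tokenizer(batch, padding=True, truncation=True,
                            max_length=128, return_tensors="pt")
            out = bert(**enc).last_hidden_state           # (B, T, 768)
            mask = enc["attention_mask"].unsqueeze(-1)    # (B, T, 1)
            pooled = (out * mask).sum(1) / mask.sum(1)    # mean over real tokens
            vectors.append(pooled.cpu().numpy())
    return np.vstack(vectors)

# Toy training data; in practice this comes from the preprocessed dataset.
texts = ["उदाहरण वाक्य एक", "उदाहरण वाक्य दो"]
labels = [0, 1]

tfidf = TfidfVectorizer(max_features=5000)
X_tfidf = tfidf.fit_transform(texts)                      # sparse TF-IDF matrix
X_bert = csr_matrix(mbert_embeddings(texts))              # dense mBERT features
X = hstack([X_tfidf, X_bert])                             # combined feature space

clf = LogisticRegression(max_iter=1000)
clf.fit(X, labels)                                        # binary cyberbullying classifier
```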
- Source: Data collected from social media platforms, with samples labeled as cyberbullying or non-cyberbullying.
- Languages: Hindi, Marathi, and Malayalam.
- Size: Over 30,000 labeled text samples to ensure diverse representation and robust model training.
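As a rough illustration, the labeled data could be inspected as below; the file path and the `text`, `label`, and `language` column names are hypothetical and may not match the repository's actual files.

```python
import pandas as pd

# Hypothetical dataset layout: one row per sample with text, binary label, and language code.
df = pd.read_csv("data/cyberbullying_samples.csv")   # assumed path and columns
print(df["language"].value_counts())                 # e.g. Hindi / Marathi / Malayalam split
print(df["label"].value_counts())                    # cyberbullying vs. non-cyberbullying
```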
The preprocessing pipeline includes the following steps (a short code sketch appears after this list):
- Text Cleaning: Removing special characters, URLs, hashtags, and mentions.
- Lowercasing: Converting text to lowercase.
- Stop Words Removal: Eliminating common, irrelevant words.
- Lemmatization: Reducing words to their base forms.
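A minimal sketch of these steps, assuming regex-based cleaning and placeholder language resources; the real `preprocess.py` presumably ships its own stop-word lists and Indic-language lemmatizers.

```python
import re

# Placeholder stop-word set; the actual pipeline would use per-language lists
# for Hindi, Marathi, and Malayalam.
STOP_WORDS = {"और", "के", "आणि", "ആണ്"}

def clean_text(text: str) -> str:
    text = re.sub(r"https?://\S+", " ", text)     # remove URLs
    text = re.sub(r"[@#]\w+", " ", text)          # remove mentions and hashtags
    text = re.sub(r"[^\w\s]", " ", text)          # remove special characters
    text = text.lower()                           # lowercase (affects Latin-script tokens)
    tokens = [t for t in text.split() if t not in STOP_WORDS]
    # Lemmatization would go here; Indic-language lemmatizers are assumed to be
    # provided by the project's own tooling.
    return " ".join(tokens)

print(clean_text("Check this out! https://example.com @user #tag"))
```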
The project uses:
- mBERT (Multilingual BERT): A transformer model that provides robust, multilingual embeddings.
- Logistic Regression: A classifier trained on combined TF-IDF and mBERT embeddings to distinguish cyberbullying from non-cyberbullying effectively.
The model is evaluated on:
- Accuracy
- Precision
- Recall
- F1-Score
These metrics ensure balanced performance across classes and highlight the model’s ability to minimize false positives and negatives.
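With scikit-learn, these metrics can be computed from the test-set predictions, for example:

```python
from sklearn.metrics import accuracy_score, classification_report

# y_true / y_pred would come from the held-out test split (toy values here).
y_true = [0, 1, 1, 0, 1]
y_pred = [0, 1, 0, 0, 1]

print("Accuracy:", accuracy_score(y_true, y_pred))
print(classification_report(y_true, y_pred,
                            target_names=["non-cyberbullying", "cyberbullying"]))
```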
- Clone the repository

  ```bash
  git clone https://github.com/yourusername/Multilingual-Cyberbullying-Detection.git
  cd Multilingual-Cyberbullying-Detection
  ```
- Install dependencies

  ```bash
  pip install -r requirements.txt
  ```
- Download the mBERT model (if not included in the repository)

  ```python
  from transformers import AutoTokenizer, AutoModel

  tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
  model = AutoModel.from_pretrained("bert-base-multilingual-cased")
  ```
- Run the preprocessing script

  ```bash
  python preprocess.py
  ```
- Train the model

  ```bash
  python train.py
  ```
- Test the model: input text samples to see classification results for cyberbullying detection, for example with a small inference helper like the sketch below.
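This is a hypothetical inference sketch: the artifact file names and the reuse of the `clean_text` and `mbert_embeddings` helpers from the earlier sketches are assumptions, not the project's actual interface.

```python
import joblib
from scipy.sparse import hstack, csr_matrix

# Hypothetical artifact names that train.py might have saved.
tfidf = joblib.load("artifacts/tfidf.joblib")
clf = joblib.load("artifacts/classifier.joblib")

def predict(text: str) -> str:
    cleaned = clean_text(text)  # preprocessing sketch shown earlier
    features = hstack([tfidf.transform([cleaned]),
                       csr_matrix(mbert_embeddings([cleaned]))])  # mBERT sketch shown earlier
    return "cyberbullying" if clf.predict(features)[0] == 1 else "non-cyberbullying"

print(predict("यह एक उदाहरण वाक्य है"))
```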
- Deep Learning Integration: Experiment with LSTM, CNN, or hybrid models for enhanced accuracy.
- Real-Time Monitoring: Develop capabilities for live monitoring of social media content.
- Expanded Language Support: Extend the model to detect cyberbullying in additional languages.