NLP with Disaster Tweets 🚨🐦

A natural language processing project that classifies tweets as disaster-related or not, leveraging DistilBERT and multi-GPU training.

🎯 Project Overview

This project tackles the Kaggle NLP Getting Started competition, which challenges participants to build a model that can identify whether a tweet is about a real disaster or not.

The Challenge: Tweets are notoriously difficult to parse. They are short, full of slang, grammatically inconsistent, and often sarcastic. A tweet saying "California is on fire!" could be literal (a wildfire) or figurative (praise for great weather). This project explores how transformer-based models handle this ambiguity.

📚 Learning Journey

This project was undertaken while working through the Python Natural Language Processing Cookbook, Second Edition. My goals were to:

Master NLP fundamentals: Deep dive into tokenization, embeddings, and modern NLP architectures
Work with real Kaggle data: Experience the messiness of real-world text classification
Explore transformers: Implement and fine-tune pre-trained models like DistilBERT
Optimize training: Learn to parallelize training across multiple GPUs for faster experimentation

✨ Key Features

DistilBERT Architecture: Leverages a distilled version of BERT for efficient text classification
Multi-GPU Training: Parallelized across 2 T4 GPUs (courtesy of Kaggle) for faster iteration
Hyperparameter Tuning: 5-fold cross-validation to select the best-performing model configuration
Competition Performance: Achieved F1-score of 0.82 on the competition leaderboard
End-to-End Pipeline: From data preprocessing through model training to submission generation (deployment is listed under Future Improvements)

🔬 Technical Approach

1. Data Preprocessing

The dataset consists of tweets labeled as disaster (1) or not disaster (0). Preprocessing steps include:

Text cleaning: Handling URLs, mentions, hashtags, and special characters
Tokenization: Using DistilBERT's tokenizer to convert text into model-ready format
Handling imbalance: Analyzing class distribution and applying appropriate techniques
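
The text-cleaning step above could look roughly like the sketch below. It is a minimal illustration, not the notebook's actual code: the function name `clean_tweet`, the regular expressions, and the choice to drop URLs and mentions while keeping the hashtag word are all assumptions, and the sample tweet is made up.

```python
import re

# Hypothetical patterns; the notebook may clean tweets differently.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")
HASHTAG_RE = re.compile(r"#(\w+)")
WHITESPACE_RE = re.compile(r"\s+")


def clean_tweet(text: str) -> str:
    """Remove URLs and @mentions, keep the word inside each hashtag."""
    text = URL_RE.sub(" ", text)        # drop links
    text = MENTION_RE.sub(" ", text)    # drop user mentions
    text = HASHTAG_RE.sub(r"\1", text)  # "#wildfire" -> "wildfire"
    text = WHITESPACE_RE.sub(" ", text) # collapse runs of whitespace
    return text.strip()


sample = "Forest fire near the lake http://t.co/abc @user #wildfire"
print(clean_tweet(sample))  # Forest fire near the lake wildfire
```

Keeping the hashtag word rather than deleting it is a design choice: hashtags such as "wildfire" or "earthquake" often carry the strongest disaster signal in a tweet.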

2. Model Architecture

DistilBERT was chosen for several reasons:

40% smaller than BERT while retaining about 97% of its language-understanding performance, as reported by the DistilBERT authors
Faster inference than BERT, which matters for real-time disaster detection systems
Pre-trained on a massive corpus, providing strong language understanding
Easy to fine-tune on domain-specific data

Model Configuration:

Base Model: DistilBERT (distilbert-base-uncased)
Classification Head: Linear layer for binary classification
Max Sequence Length: 128 tokens
Dropout: 0.1 for regularization
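
One way to express this configuration with the Hugging Face Transformers library is sketched below. It is an assumption about how the pieces fit together, not the notebook's code: the helper names `load_model_and_tokenizer` and `encode` are hypothetical, and the classification head that `AutoModelForSequenceClassification` attaches may differ from the single linear layer described above. Dropout is left at the library default for `distilbert-base-uncased`, which is 0.1 for the transformer layers.

```python
MODEL_NAME = "distilbert-base-uncased"
MAX_LENGTH = 128   # max sequence length from the configuration above
NUM_LABELS = 2     # disaster (1) vs. not disaster (0)


def load_model_and_tokenizer():
    """Load the pre-trained tokenizer and a sequence-classification model."""
    # Imported lazily so the constants above are usable without transformers.
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
    model = AutoModelForSequenceClassification.from_pretrained(
        MODEL_NAME, num_labels=NUM_LABELS
    )
    return tokenizer, model


def encode(tokenizer, texts):
    """Tokenize a batch of tweets, padding or truncating to MAX_LENGTH tokens."""
    return tokenizer(
        list(texts),
        truncation=True,
        padding="max_length",
        max_length=MAX_LENGTH,
        return_tensors="pt",
    )
```

Calling `load_model_and_tokenizer()` downloads the weights on first use, so it needs network access or a local cache.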

3. Training Strategy

Hyperparameter Tuning with 5-Fold Cross-Validation:

Learning rates: [2e-5, 3e-5, 5e-5]
Batch sizes: [16, 32]
Epochs: [3, 4, 5]
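
To give a sense of the search size, the sketch below enumerates every combination of the values listed above. It assumes an exhaustive grid search, which the write-up does not state explicitly; the function name `build_grid` and the dictionary keys are illustrative.

```python
from itertools import product

LEARNING_RATES = [2e-5, 3e-5, 5e-5]
BATCH_SIZES = [16, 32]
EPOCHS = [3, 4, 5]
N_FOLDS = 5


def build_grid():
    """Return one dict per hyperparameter combination (full Cartesian product)."""
    return [
        {"learning_rate": lr, "batch_size": bs, "epochs": ep}
        for lr, bs, ep in product(LEARNING_RATES, BATCH_SIZES, EPOCHS)
    ]


grid = build_grid()
total_runs = len(grid) * N_FOLDS  # each combination is trained once per fold
print(len(grid), total_runs)  # 18 90
```

Under that assumption, 18 combinations across 5 folds means 90 fine-tuning runs, which is why spreading the work over both GPUs matters.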

Multi-GPU Parallelization: Leveraged Kaggle's 2 T4 GPUs to:

Run multiple hyperparameter combinations simultaneously
Reduce total tuning time by roughly 50% compared with running on a single GPU
Enable more extensive experimentation within time constraints
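
A simple way to run combinations simultaneously is to split the list of configurations into one queue per GPU and let a separate worker process consume each queue. The sketch below shows only the splitting step. The round-robin scheme and the name `assign_round_robin` are assumptions; the notebook's actual scheduling mechanism is not described here.

```python
def assign_round_robin(configs, n_devices=2):
    """Split configs into one work queue per GPU, alternating between devices."""
    queues = {device_id: [] for device_id in range(n_devices)}
    for index, config in enumerate(configs):
        queues[index % n_devices].append(config)
    return queues


# Placeholder configs: integers standing in for 18 hyperparameter combinations.
configs = list(range(18))
queues = assign_round_robin(configs, n_devices=2)
for device_id, queue in queues.items():
    # A worker for this queue would train on the device named "cuda:<device_id>".
    print(f"cuda:{device_id} -> {len(queue)} runs")
```

With two equally long queues and runs of similar cost, wall-clock time is roughly half of what a single GPU would need, which is consistent with the ~50% figure above.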

Optimization:

Optimizer: AdamW with weight decay
Learning rate scheduler: Linear warmup followed by decay
Loss function: Binary cross-entropy
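
The learning rate schedule can be illustrated with a small pure-Python function. It assumes a linear decay to zero after warmup, similar in shape to `get_linear_schedule_with_warmup` in Hugging Face Transformers. The step counts (100 warmup steps, 1000 total steps) are made up for illustration and are not taken from the notebook.

```python
def linear_warmup_decay(step: int, warmup_steps: int, total_steps: int) -> float:
    """Return the multiplier applied to the base learning rate at a given step."""
    if step < warmup_steps:
        # Warmup: ramp linearly from 0 up to 1.
        return step / max(1, warmup_steps)
    # Decay: fall linearly from 1 at the end of warmup to 0 at total_steps.
    remaining = total_steps - step
    return max(0.0, remaining / max(1, total_steps - warmup_steps))


base_lr = 2e-5  # one of the learning rates from the search space above
for step in (0, 50, 100, 550, 1000):
    factor = linear_warmup_decay(step, warmup_steps=100, total_steps=1000)
    print(step, factor, base_lr * factor)
```

The multiplier rises to 1.0 at step 100, is back at 0.5 by step 550, and reaches 0.0 at step 1000.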

4. Evaluation

The competition evaluates submissions using the F1-score, which balances precision and recall—critical for disaster detection where both false positives and false negatives have consequences.
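
For reference, the F1-score is the harmonic mean of precision and recall. The sketch below computes it by hand for the positive (disaster) class. The labels are toy values, not competition data, and in practice `sklearn.metrics.f1_score` gives the same result.

```python
def f1_score_binary(y_true, y_pred):
    """F1 for the positive class (label 1) from two equal-length label lists."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both 0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy example: 2 true positives, 1 false positive, 1 false negative.
y_true = [1, 0, 1, 1, 0]
y_pred = [1, 1, 1, 0, 0]
print(round(f1_score_binary(y_true, y_pred), 3))  # 0.667
```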

Final Results:

F1-Score: 0.82
Cross-validation performance: Consistent across all folds, indicating a stable model

🛠️ Technologies Used

PyTorch: Deep learning framework
Transformers (Hugging Face): Pre-trained DistilBERT model
Scikit-learn: Cross-validation and metrics
Pandas: Data manipulation
NumPy: Numerical operations
Kaggle API: Dataset access and submission
CUDA: GPU acceleration

🔮 Future Improvements

Ensemble methods: Combine DistilBERT with other transformer models (RoBERTa, ALBERT)
Error analysis: Deep dive into misclassified tweets to identify patterns
Deployment: Create a REST API for real-time disaster tweet detection

