A natural language processing project that classifies tweets as disaster-related or not, leveraging DistilBERT and multi-GPU training.
This project tackles the Kaggle NLP Getting Started competition, which challenges participants to build a model that can identify whether a tweet is about a real disaster or not.
The Challenge: Tweets are notoriously difficult to parse: they are short, full of slang, grammatically inconsistent, and often sarcastic. A tweet saying "California is on fire!" could be literal (a wildfire) or figurative (great weather). This project explores how transformer-based models handle this ambiguity.
This project was undertaken while working through the Python Natural Language Processing Cookbook, Second Edition. My goals were to:
- Master NLP fundamentals: Deep dive into tokenization, embeddings, and modern NLP architectures
- Work with real Kaggle data: Experience the messiness of real-world text classification
- Explore transformers: Implement and fine-tune pre-trained models like DistilBERT
- Optimize training: Learn to parallelize training across multiple GPUs for faster experimentation
Key Features:
- DistilBERT Architecture: Leverages a distilled version of BERT for efficient text classification
- Multi-GPU Training: Parallelized across 2 T4 GPUs (courtesy of Kaggle) for faster iteration
- Hyperparameter Tuning: 5-fold cross-validation to find the best-performing model configuration
- Competition Performance: Achieved an F1-score of 0.82 on the competition leaderboard
- End-to-End Pipeline: From data preprocessing through model training to submission generation
The dataset consists of tweets labeled as disaster (1) or not disaster (0). Preprocessing steps include:
- Text cleaning: Handling URLs, mentions, hashtags, and special characters
- Tokenization: Using DistilBERT's tokenizer to convert text into model-ready format
- Handling imbalance: Analyzing class distribution and applying appropriate techniques
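The text-cleaning step above could look roughly like the sketch below. The specific rules (drop URLs and mentions, keep the hashtag word, lowercase, strip unusual symbols) and the name `clean_tweet` are assumptions for illustration; the project does not spell out its exact cleaning rules.

```python
import html
import re

# Patterns are illustrative assumptions, not the project's actual rules.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")
HASHTAG_RE = re.compile(r"#(\w+)")
NON_TEXT_RE = re.compile(r"[^a-z0-9\s.,!?']")
WHITESPACE_RE = re.compile(r"\s+")


def clean_tweet(text: str) -> str:
    """Normalize a raw tweet before tokenization (hypothetical helper)."""
    text = html.unescape(text)          # "&amp;" -> "&"
    text = URL_RE.sub(" ", text)        # drop links
    text = MENTION_RE.sub(" ", text)    # drop @user mentions
    text = HASHTAG_RE.sub(r"\1", text)  # keep the hashtag word, drop "#"
    text = text.lower()                 # matches the uncased model
    text = NON_TEXT_RE.sub(" ", text)   # remove remaining special characters
    return WHITESPACE_RE.sub(" ", text).strip()


example = clean_tweet("California is on fire! #wildfire http://t.co/xyz @CalFire")
print(example)  # california is on fire! wildfire
```

Keeping the hashtag word rather than deleting it is a deliberate choice here, since hashtags such as `#wildfire` often carry the disaster signal.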
DistilBERT was chosen for several reasons:
- 40% smaller than BERT while retaining 97% of its performance
- Faster inference time—crucial for real-time disaster detection systems
- Pre-trained on a massive corpus, providing strong language understanding
- Easy to fine-tune on domain-specific data
Model Configuration:
- Base Model: DistilBERT (distilbert-base-uncased)
- Classification Head: Linear layer for binary classification
- Max Sequence Length: 128 tokens
- Dropout: 0.1 for regularization
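One way to express this configuration with the Hugging Face Transformers library is sketched below. It makes several assumptions: that the model is loaded through `AutoModelForSequenceClassification` with a two-logit head, that the 0.1 dropout refers to the classifier dropout (`seq_classif_dropout`), and that inputs are padded to the maximum length. The helper names are hypothetical, and the heavy imports are deferred so the constants can be reused without loading the library.

```python
# Values taken from the configuration listed above.
MODEL_NAME = "distilbert-base-uncased"
MAX_LENGTH = 128
DROPOUT = 0.1
NUM_LABELS = 2  # assumption: two logits (not disaster / disaster)


def build_tokenizer_and_model():
    """Load the pre-trained tokenizer and a classification model (downloads weights)."""
    # Imported lazily: requires the `transformers` and `torch` packages.
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
    model = AutoModelForSequenceClassification.from_pretrained(
        MODEL_NAME,
        num_labels=NUM_LABELS,
        seq_classif_dropout=DROPOUT,  # assumption: dropout applied before the classifier
    )
    return tokenizer, model


def encode(tokenizer, texts):
    """Convert cleaned tweets to fixed-length PyTorch tensors."""
    return tokenizer(
        list(texts),
        padding="max_length",
        truncation=True,
        max_length=MAX_LENGTH,
        return_tensors="pt",
    )
```

A typical call would be `tokenizer, model = build_tokenizer_and_model()` followed by `batch = encode(tokenizer, ["forest fire near la ronge"])` and `model(**batch)`.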
Hyperparameter Tuning with 5-Fold Cross-Validation:
- Learning rates: [2e-5, 3e-5, 5e-5]
- Batch sizes: [16, 32]
- Epochs: [3, 4, 5]
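The search space above expands to 3 x 2 x 3 = 18 configurations, each evaluated over 5 folds. The sketch below shows that bookkeeping using only the standard library. The plain shuffled split is a simplification (scikit-learn's `StratifiedKFold` would be the likelier choice given the tech stack), and `train_and_score` is a hypothetical callback standing in for the real fine-tuning run.

```python
import random
from itertools import product
from statistics import mean

SEARCH_SPACE = {
    "learning_rate": [2e-5, 3e-5, 5e-5],
    "batch_size": [16, 32],
    "epochs": [3, 4, 5],
}


def expand_grid(space):
    """Return every combination of hyperparameter values as a list of dicts."""
    keys = list(space)
    return [dict(zip(keys, values)) for values in product(*(space[k] for k in keys))]


def kfold_indices(n_samples, n_splits=5, seed=42):
    """Yield (train_idx, val_idx) pairs for a shuffled, non-stratified k-fold split."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::n_splits] for i in range(n_splits)]
    for i, val_idx in enumerate(folds):
        train_idx = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train_idx, val_idx


def cross_validate(config, n_samples, train_and_score, n_splits=5):
    """Average the validation score of one configuration across all folds."""
    scores = [
        train_and_score(config, train_idx, val_idx)
        for train_idx, val_idx in kfold_indices(n_samples, n_splits)
    ]
    return mean(scores)


GRID = expand_grid(SEARCH_SPACE)
print(len(GRID))  # 18
```

The configuration with the highest mean score from `cross_validate` would then be retrained on the full training set before generating a submission.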
Multi-GPU Parallelization: Leveraged Kaggle's 2 T4 GPUs to:
- Run multiple hyperparameter combinations simultaneously
- Reduce total training time by ~50%
- Enable more extensive experimentation within time constraints
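Running configurations simultaneously on two GPUs amounts to splitting the work list between devices and pinning each worker process to one GPU. The sketch below assumes a round-robin split and pinning through the `CUDA_VISIBLE_DEVICES` environment variable; the function names are hypothetical, and the project's actual scheduling may differ.

```python
from itertools import product


def assign_round_robin(configs, n_gpus=2):
    """Distribute configurations evenly across GPU ids 0..n_gpus-1."""
    buckets = {gpu: [] for gpu in range(n_gpus)}
    for i, cfg in enumerate(configs):
        buckets[i % n_gpus].append(cfg)
    return buckets


def worker_env(gpu_id, base_env=None):
    """Environment for a worker process that should only see one GPU."""
    env = dict(base_env or {})
    env["CUDA_VISIBLE_DEVICES"] = str(gpu_id)
    return env


# Same search space as listed in the tuning section: (learning rate, batch size, epochs).
configs = list(product([2e-5, 3e-5, 5e-5], [16, 32], [3, 4, 5]))
PLAN = assign_round_robin(configs, n_gpus=2)
print({gpu: len(jobs) for gpu, jobs in PLAN.items()})  # {0: 9, 1: 9}
```

Each bucket would be handed to its own process started with `worker_env(gpu_id)`. With 9 runs per GPU instead of 18 on one device, wall-clock time drops by roughly half, assuming the runs cost about the same.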
Optimization:
- Optimizer: AdamW with weight decay
- Learning rate scheduler: Linear warmup followed by decay
- Loss function: Binary cross-entropy
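To make the scheduler and loss concrete, here is a small pure-Python sketch of a linear warmup-then-decay multiplier and of binary cross-entropy. The step counts and base learning rate are illustrative assumptions; in practice a library scheduler such as Transformers' `get_linear_schedule_with_warmup` has a similar shape and would be used instead.

```python
import math


def linear_warmup_decay(step, warmup_steps, total_steps):
    """Multiplier in [0, 1]: rises linearly during warmup, then decays linearly to 0."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))


def lr_at(step, base_lr=2e-5, warmup_steps=100, total_steps=1000):
    """Learning rate at a given optimizer step (default values are illustrative)."""
    return base_lr * linear_warmup_decay(step, warmup_steps, total_steps)


def binary_cross_entropy(y_true, p_pred, eps=1e-12):
    """Mean of -[y*log(p) + (1-y)*log(1-p)] over all examples."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)


print(linear_warmup_decay(50, 100, 1000))   # 0.5 (halfway through warmup)
print(linear_warmup_decay(550, 100, 1000))  # 0.5 (halfway through decay)
```

Warmup keeps early updates small while the freshly initialized classification head settles, and the decay shrinks the step size as fine-tuning converges.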
The competition evaluates submissions using the F1-score, which balances precision and recall—critical for disaster detection where both false positives and false negatives have consequences.
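As a worked illustration of the metric, the sketch below computes precision, recall, and F1 for the positive (disaster) class from raw predictions. The labels are made-up toy values, and `precision_recall_f1` is a hypothetical helper; scikit-learn's `f1_score` would normally be used.

```python
def precision_recall_f1(y_true, y_pred):
    """Return (precision, recall, f1) for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy example: 2 true positives, 1 false positive, 1 false negative.
y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0]
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
```

Here precision and recall are both 2/3, so F1 is also 2/3. Because F1 is the harmonic mean of the two, a model cannot score well by favoring one at the expense of the other.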
Final Results:
- F1-Score: 0.82
- Cross-validation performance: Consistent across all folds, indicating a stable model
Technologies Used:
- PyTorch: Deep learning framework
- Transformers (Hugging Face): Pre-trained DistilBERT model
- Scikit-learn: Cross-validation and metrics
- Pandas: Data manipulation
- NumPy: Numerical operations
- Kaggle API: Dataset access and submission
- CUDA: GPU acceleration
Future Improvements:
- Ensemble methods: Combine DistilBERT with other transformer models (RoBERTa, ALBERT)
- Error analysis: Deep dive into misclassified tweets to identify patterns
- Deployment: Create a REST API for real-time disaster tweet detection