Master's Thesis Project
Carlos Leon | Hochschule Luzern - Wirtschaft | December 2025
Supervised by: Oliver Staubli (Revolytics)
Client: On AG
This repository contains the implementation of a machine learning-based system for automated classification of products into Harmonized System (HS) codes and the assignment of customs duties. The project addresses the challenge of manual, rule-based customs classification by using transformer-based neural networks to predict HS codes from natural-language product descriptions.
Research question: How can machine learning and natural language processing methods be applied to automate the classification of product descriptions into HS codes and support the analysis of customs compliance in international trade?
- Dual-Model Architecture: Baseline retrieval model + hierarchical neural classifier
- Hierarchical Classification: Predicts at three levels (Chapter → Heading → HS6)
- Tariff Integration: Automatic duty rate lookup for Canada, EU, and Switzerland
- Web Interface: Interactive Streamlit application for real-time classification
- High Accuracy: 97.19% validation accuracy on HS6 classification
Multinational companies face significant complexity in customs classification:
- 5,612 unique HS6 codes to navigate globally
- Manual classification is error-prone and resource-intensive
- Static tariff tables don't adapt to regulatory changes
- Misclassification leads to financial penalties and shipment delays
This system automates HS code prediction using state-of-the-art NLP techniques, reducing classification time and improving accuracy.
Baseline Model (Retrieval):
- Architecture: Semantic similarity retrieval
- Base Model: `sentence-transformers/all-MiniLM-L6-v2` (384-dim embeddings)
- Method: Cosine similarity search over encoded WCO descriptions
- Advantages: Fast inference (<50ms), no training required, interpretable
Hierarchical Classifier:
- Architecture: Multi-output neural network with three parallel classification heads
- Base Model: `distilbert-base-uncased` (6 layers, 768-dim hidden size)
- Training Data: 179,184 augmented examples covering 5,612 HS6 codes
- Loss Function: Weighted multi-level cross-entropy (0.2 Chapter + 0.3 Heading + 0.5 HS6); see the sketch after this list
- Performance:
- Chapter (2-digit): 99.77% accuracy
- Heading (4-digit): 98.79% accuracy
- HS6 (6-digit): 97.19% accuracy
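A minimal sketch of the three-head setup described above, assuming PyTorch and Hugging Face Transformers (the class and head names are illustrative, not the repository's actual code):

```python
import torch.nn as nn
from transformers import DistilBertModel

class HierarchicalHSClassifier(nn.Module):
    """DistilBERT encoder with three parallel classification heads."""
    def __init__(self, n_chapters=96, n_headings=1228, n_hs6=5612):
        super().__init__()
        self.encoder = DistilBertModel.from_pretrained("distilbert-base-uncased")
        self.dropout = nn.Dropout(p=0.1)
        hidden = self.encoder.config.dim  # 768
        self.chapter_head = nn.Linear(hidden, n_chapters)
        self.heading_head = nn.Linear(hidden, n_headings)
        self.hs6_head = nn.Linear(hidden, n_hs6)

    def forward(self, input_ids, attention_mask):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        cls = self.dropout(out.last_hidden_state[:, 0])  # CLS pooling
        return self.chapter_head(cls), self.heading_head(cls), self.hs6_head(cls)

def multi_level_loss(logits, labels, weights=(0.2, 0.3, 0.5)):
    """Weighted sum of cross-entropy losses at chapter, heading, and HS6 level."""
    ce = nn.CrossEntropyLoss()
    return sum(w * ce(lg, y) for w, lg, y in zip(weights, logits, labels))
```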
- Source: World Customs Organization official descriptions
- Coverage: 5,612 HS6 codes across 96 chapters and 1,228 headings
- Augmentation: ~32 synthetic variations per code using category-specific rules
- Total Examples: 179,184 training samples
- Source: WTO Analytical Database
- Countries: Canada, European Union, Switzerland
- Year: 2024 (HS 2022 nomenclature)
- Coverage: 16,797 MFN tariff rates
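As an illustration, a duty-rate lookup against this table could look like the following (the file path comes from the project structure below; the column names `hs6`, `country`, and `mfn_rate` are assumptions, not verified against the actual CSV):

```python
from typing import Optional
import pandas as pd

# Hypothetical schema: one MFN rate per (country, HS6 code) pair
tariffs = pd.read_csv("data/processed/wto_model_can_eu_che.csv", dtype={"hs6": str})

def duty_rate(hs6_code: str, country: str) -> Optional[float]:
    """Return the 2024 MFN rate for an HS6 code, or None if not covered."""
    match = tariffs[(tariffs["hs6"] == hs6_code) & (tariffs["country"] == country)]
    return None if match.empty else float(match["mfn_rate"].iloc[0])

print(duty_rate("010121", "CHE"))  # e.g., live purebred breeding horses into Switzerland
```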
Context-aware rules for generating paraphrases:
- Synonym substitution (horses → equines, cattle → bovine)
- Prefix/suffix additions ("imported", "for commercial use")
- Simplification (removing parenthetical clauses)
- Domain-specific templates (Live Animals, Food, Textiles, Machinery, Chemicals)
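A minimal sketch of how such rules can generate variations (the synonym pairs come from the list above; the rule set and probabilities are illustrative, not the repository's `augment_data.py`):

```python
import random
import re

SYNONYMS = {"horses": "equines", "cattle": "bovine"}
PREFIXES = ["", "imported "]
SUFFIXES = ["", " for commercial use"]

def paraphrase(description: str) -> str:
    """Generate one synthetic variation of a WCO-style description."""
    text = description.lower()
    for word, synonym in SYNONYMS.items():
        if random.random() < 0.5:
            text = text.replace(word, synonym)
    text = re.sub(r"\s*\([^)]*\)", "", text)  # simplification: drop parenthetical clauses
    return random.choice(PREFIXES) + text.strip() + random.choice(SUFFIXES)

print(paraphrase("Horses (live), purebred breeding animals"))
```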
Prerequisites:
- Python 3.8+
- CUDA-capable GPU (recommended for training)

Setup:
- Clone the repository

```bash
git clone https://github.com/yourusername/thesis-clean.git
cd thesis-clean
```

- Install dependencies

```bash
pip install -r requirements.txt
```

- Download WCO data

```bash
python src/data/download_wco.py
```

- Generate augmented training data

```bash
python src/data/augment_data.py
```

Train Baseline Model:

```bash
python src/models/baseline.py
```

Train Hierarchical Model:

```bash
python src/models/train_hierarchical.py
```

- Training time: 2-4 hours on GPU
- Model size: ~840 MB
- Best model saved at epoch with highest validation HS6 accuracy

Run the Web Interface:

```bash
cd src/app
streamlit run streamlit_app.py
```

Access at: http://localhost:8501
```
thesis-clean/
├── src/
│   ├── data/
│   │   ├── download_wco.py           # WCO data scraper
│   │   └── augment_data.py           # Data augmentation script
│   ├── models/
│   │   ├── baseline.py               # Sentence-BERT retrieval model
│   │   └── train_hierarchical.py     # Training script
│   └── app/
│       └── streamlit_app.py          # Web interface
├── data/
│   └── processed/
│       ├── wco_hs_descriptions.csv   # Official HS descriptions
│       └── wto_model_can_eu_che.csv  # Tariff data
├── models/
│   ├── baseline/
│   └── hierarchical/
│       ├── best_model.pt             # Trained model weights (not included due to file size constraints)
│       ├── label_mappings.json       # Class mappings
│       └── training_log.txt          # Training metrics
├── notebooks/
│   └── 02_model_eval.ipynb           # Model evaluation & analysis
├── requirements.txt
└── README.md
```
| Level | Classes | Accuracy |
|---|---|---|
| Chapter (2-digit) | 96 | 99.77% |
| Heading (4-digit) | 1,228 | 98.79% |
| HS6 (6-digit) | 5,612 | 97.19% |
- Convergence: Best model at epoch 4 of 30
- Training Loss: 0.1522 → 0.0537 (epochs 4 → 30)
- Validation Loss: 0.0939 → 0.1510 (mild overfitting after epoch 4)
- Strong on WCO-style descriptions: Both models near-perfect
- Moderate on natural language: Performance degrades on simplified user queries
- Low confidence on ambiguous inputs: Model recognizes distribution shift
- Automated Product Classification: Replace manual HS code lookup
- Customs Compliance: Reduce misclassification penalties
- Duty Estimation: Instant tariff rate lookup for multiple countries
- Supply Chain Optimization: Faster customs clearance through accurate pre-classification
- Trade Analytics: Analyze product portfolios by HS classification
- Framework: sentence-transformers
- Encoding: 384-dimensional dense vectors
- Index: FAISS for efficient similarity search
- Inference: <50ms per query on CPU
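A minimal sketch of this retrieval pipeline (cosine similarity implemented as inner product over normalized embeddings; the two descriptions stand in for the 5,612 WCO entries):

```python
import faiss
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

# Placeholder for the official WCO descriptions
descriptions = ["Live horses, purebred breeding animals", "Live bovine animals"]
embeddings = model.encode(descriptions, normalize_embeddings=True)

index = faiss.IndexFlatIP(embeddings.shape[1])  # inner product == cosine on unit vectors
index.add(embeddings)

query = model.encode(["imported equines for breeding"], normalize_embeddings=True)
scores, ids = index.search(query, 1)
print(descriptions[ids[0][0]], scores[0][0])
```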
- Framework: PyTorch 2.x + Hugging Face Transformers
- Optimizer: AdamW (lr=2e-5, batch_size=32)
- Schedule: Linear decay with 500 warmup steps
- Regularization: Dropout (p=0.1) after CLS pooling
- Hardware: CUDA-enabled GPU with 8GB+ VRAM
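The corresponding optimizer and schedule setup, as a sketch (`model` and `train_loader` are assumed to exist; the warmup helper is from Hugging Face Transformers):

```python
from torch.optim import AdamW
from transformers import get_linear_schedule_with_warmup

EPOCHS, BATCH_SIZE = 30, 32
optimizer = AdamW(model.parameters(), lr=2e-5)
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=500,
    num_training_steps=EPOCHS * len(train_loader),
)

# Per training step: loss.backward(); optimizer.step(); scheduler.step(); optimizer.zero_grad()
```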
- Data Leakage: Augmentation applied before train/validation split may inflate validation metrics
- Distribution Shift: Performance degrades on natural language queries vs. technical WCO descriptions
- No Ablation Studies: Optimal loss weights and architecture variants not tested
- Limited Countries: Tariff data only available for Canada, EU, and Switzerland
- No NLP Integration: Regulatory text analysis not implemented (planned future work)
- Implement proper train/test splitting before augmentation
- Create curated test set of real user queries
- Add precision, recall, F1 metrics for per-class analysis
- Expand tariff coverage to more countries
- Implement hierarchical inference (use chapter to constrain HS6 predictions; see the sketch after this list)
- Fine-tune with domain-specific BERT (LEGAL-BERT)
- Integrate regulatory text analysis (NLP for compliance documents)
- Add multi-language support
- Implement active learning for continuous improvement
- Deploy as production API with authentication
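For the hierarchical-inference item above, one possible masking scheme (assuming a hypothetical `hs6_to_chapter` tensor that maps each HS6 class index to its chapter index):

```python
import torch

def constrained_hs6(chapter_logits, hs6_logits, hs6_to_chapter):
    """Restrict HS6 predictions to codes within the predicted chapter."""
    chapter = chapter_logits.argmax(dim=-1)                        # (batch,)
    allowed = hs6_to_chapter.unsqueeze(0) == chapter.unsqueeze(1)  # (batch, n_hs6)
    masked = hs6_logits.masked_fill(~allowed, float("-inf"))
    return masked.argmax(dim=-1)
```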
- World Customs Organization: HS Nomenclature
- WTO Tariff Data: Analytical Database
- Hugging Face Transformers: Documentation
- DistilBERT Paper: Sanh et al. (2019), "DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter"
This project is part of a Master's thesis at Hochschule Luzern.
For academic or commercial use, please contact the author.
Carlos Leon
Master of Science in Applied Information and Data Science
Hochschule Luzern - Wirtschaft
📧 [email protected]
🔗 LinkedIn | GitHub
- Supervisor: Oliver Staubli (Revolytics)
- Client Partner: Sofia Viale (On AG)
- Institution: Hochschule Luzern - Wirtschaft
Special thanks to the World Customs Organization for making HS nomenclature data publicly available, and to the Hugging Face team for their excellent open-source tools.
Last updated: December 2025