This project is for the Kaggle MITSUI&CO. Commodity Prediction Challenge. The goal is to develop a robust model for accurate and stable prediction of commodity prices using historical data from LME, JPX, US Stock, and Forex markets.
├── data/ # Raw and processed data files
├── notebooks/ # Jupyter notebooks for EDA, modeling, and experiments
│ ├── eda.ipynb # Exploratory Data Analysis
│ ├── baseline_model.ipynb # Baseline model implementation
│ └── ... # Other modeling notebooks
├── src/ # Source code for data processing, feature engineering, and modeling
│ ├── data_processing.py # Data loading and preprocessing functions
│ ├── models.py # Model wrappers and factory functions
│ ├── ensemble.py # Ensemble methods (averaging, stacking, etc.)
│ ├── feature_selection.py # Feature importance and selection methods
│ ├── cv.py # Time-series cross-validation
│ └── tuning.py # Hyperparameter optimization
├── models/ # Saved model files
├── outputs/ # Submission files and results
├── requirements.txt # Python dependencies
├── setup_project.py # Project setup and dependency checker
├── run_baseline.py # Baseline model runner
└── README.md # This file
# Install dependencies
pip install -r requirements.txt
# Run setup script to check environment and generate sample data
python setup_project.py# Run full competition pipeline with sample data
python run_competition.py
# Or run baseline model
python run_baseline.py# Start Jupyter notebook
jupyter notebook notebooks/eda.ipynb- Python 3.7+
- pip package manager
-
Clone the repository
git clone <repository-url> cd "MITSUI&CO. Commodity Prediction Challenge"
-
Install dependencies
pip install -r requirements.txt
-
Run setup script
python setup_project.py
This will:
- Check Python version and dependencies
- Create necessary directories
- Generate sample data for testing
- Verify custom modules can be imported
-
Download competition data (optional)
- Download from Kaggle Competition
- Place files in
data/directory:train.csvtest.csvtrain_labels.csvtarget_pairs.csv
python run_competition.pyThis will:
- Load and preprocess competition data
- Parse 424 target configurations
- Create target-specific features
- Train multi-target models (LightGBM, XGBoost)
- Generate predictions for all targets
- Create competition submission format
- Save results to
outputs/directory
python run_baseline.pyThis will:
- Load and preprocess data
- Perform feature engineering
- Train baseline models (LightGBM, XGBoost, Ridge)
- Generate ensemble predictions
- Save results to
outputs/directory
jupyter notebook notebooks/eda.ipynbThe EDA notebook provides:
- Data overview and quality analysis
- Time series analysis
- Feature correlation analysis
- Target variable analysis
import sys
sys.path.append('src')
from data_processing import load_data, add_lag_features
from models import get_model
from ensemble import simple_average
from cv import time_series_cv_split- Data loading: Load competition data files
- Missing value handling: Forward fill, backward fill, mean imputation
- Feature engineering:
- Lag features (1, 2, 3 periods)
- Rolling statistics (mean, std, min, max)
- Exponentially weighted moving averages
- Calendar features (day of week, month)
- Cross-asset features (spreads, ratios, correlations)
- Extended rolling features (skew, kurtosis, volatility, momentum)
- Interaction features (pairwise products)
- Target parsing: Parse 424 target configurations from target_pairs.csv
- Target-specific features: Create features for single assets and asset pairs
- Multi-target models: Handle all 424 targets simultaneously
- Competition evaluation: Sharpe ratio variant metric implementation
- Submission format: Generate competition-ready submission files
- LightGBM: Gradient boosting with LightGBM
- XGBoost: Gradient boosting with XGBoost
- CatBoost: Gradient boosting with CatBoost
- Random Forest: Ensemble of decision trees
- Ridge Regression: Linear regression with L2 regularization
- Simple averaging: Equal weights for all models
- Weighted averaging: Custom weights for models
- Stacking: Meta-model on base model predictions
- Out-of-fold stacking: Stacking with cross-validation
- Weight optimization: Optimize weights using scipy
- SHAP importance: SHAP-based feature importance
- Permutation importance: Permutation-based importance
- Recursive feature elimination: RFE for feature selection
- Correlation filtering: Remove highly correlated features
- Time-series CV: Expanding and rolling window splits
- Configurable parameters: Number of splits, test size
- Optuna integration: Bayesian optimization
- LightGBM tuning: Comprehensive parameter search
- Spearman correlation: Competition metric optimization
Predict 424 different targets consisting of:
- Single asset returns: Log returns of individual financial instruments
- Asset pair differences: Price differences between pairs of assets
- Multi-market data: LME (metals), JPX (Japanese), US (stocks), FX (forex)
- 424 targets: Each with different lags (1-4 days) and asset combinations
- Price-difference series: Many targets are differences between asset pairs
- Multi-asset relationships: Cross-market correlations and spreads
- Stability focus: Consistent performance across different market conditions
- Sharpe Ratio Variant: Mean Spearman correlation / Standard deviation
- Stability emphasis: Lower variance in predictions is rewarded
- Multi-target: Performance across all 424 targets
- 1977 columns: Massive feature space from multiple markets
- Time series: Historical data with UTC timestamps
- Market-specific: Different trading days and time zones
- Target pairs: Detailed configuration for each of 424 targets
- Download real competition data and replace sample files
- Run EDA notebook to understand data characteristics
- Execute baseline model to establish performance benchmark
- Analyze results and identify improvement opportunities
- Feature engineering: Implement domain-specific features
- Hyperparameter tuning: Optimize model parameters
- Ensemble methods: Combine multiple models effectively
- Cross-validation: Ensure robust evaluation
- Feature selection: Identify most important features
- Multi-target modeling: Predict multiple commodities simultaneously
- Time series specific models: LSTM, GRU, Transformer models
- Advanced ensembles: Stacking, blending, weighted combinations
- Domain knowledge: Incorporate commodity market insights
- Import errors: Ensure all dependencies are installed
- Data not found: Run
setup_project.pyto generate sample data - Memory issues: Reduce feature set or use data sampling
- Model convergence: Adjust learning rates or regularization
- Check the setup script output for specific error messages
- Review the competition documentation on Kaggle
- Examine the sample code in notebooks and scripts
- Competition Link
- Kaggle Documentation
- LightGBM Documentation
- XGBoost Documentation
- Optuna Documentation
This project is for educational and competition purposes. Please refer to the Kaggle competition terms and conditions.