A comprehensive machine learning experiment framework for forecasting with five different models (LSTM, Transformer, Prophet, ARIMA, XGBoost) including compliance checks, uncertainty estimation, and cost-optimized deployment on Google Cloud Platform.
This framework implements a reproducible A/B/C/D/E experiment where:
- Group A: LSTM (1.00% of dataset)
- Group B: Lightweight Transformer (1.25% of dataset)
- Group C: Prophet (1.25% of dataset)
- Group D: ARIMA (1.25% of dataset)
- Group E: XGBoost (1.25% of dataset)
Each model includes uncertainty estimation, explainability features, and compliance/governance checks for financial applications.
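The group split described above can be made reproducible by hashing a stable row key together with the seed. The following is a minimal sketch, not the logic in bigquery_sampling.sql (which is not shown here); the helper name `assign_group`, the SHA-256 bucketing, and the string row key are assumptions for illustration.

```python
import hashlib
from typing import Optional

# Group shares from the experiment description, in basis points of the
# dataset (1.00% = 100 bp, 1.25% = 125 bp).
GROUP_SHARES_BP = [("A", 100), ("B", 125), ("C", 125), ("D", 125), ("E", 125)]


def assign_group(row_key: str, seed: int = 42) -> Optional[str]:
    """Map a row key to a group, or None if the row is not sampled.

    Hypothetical helper: hashes seed + key into one of 10,000 buckets so the
    same key always lands in the same group for a given seed.
    """
    digest = hashlib.sha256(f"{seed}:{row_key}".encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 10_000
    upper = 0
    for group, share in GROUP_SHARES_BP:
        upper += share
        if bucket < upper:
            return group
    return None  # the remaining 94% of rows stay out of the experiment
```

Because the assignment depends only on the key and the seed, rerunning with the same `ABTEST_SEED` reproduces the same groups, and changing the seed draws a different sample.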
├── model_lstm.py           # LSTM model implementation
├── model_transformer.py    # Transformer model implementation
├── model_prophet.py        # Prophet model implementation
├── model_arima.py          # ARIMA model implementation
├── model_xgboost.py        # XGBoost model implementation
├── run_abtest.py           # Experiment orchestrator
├── app.py                  # FastAPI service with compliance checks
├── bigquery_sampling.sql   # SQL for creating sample groups
├── requirements.txt        # Python dependencies
├── Dockerfile              # Multi-stage container build
├── gce_startup.sh          # GCE startup script for preemptible instances
├── tests/
│   └── test_models.py      # Unit tests for all models
└── README.md               # This file
# Clone or download the project
git clone <repository-url>
cd ml-abtest
# Create virtual environment
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
# Install dependencies
pip install -r requirements.txt

Set the following environment variables:
export PROJECT_ID="your-gcp-project-id"
export BQ_DATASET="your_dataset"
export FEATURE_TABLE="your_table"
export GCS_BUCKET="your-bucket"
export ABTEST_SEED="42"
export SAMPLE_LIMIT_PER_GROUP="2000"
export AUDIT_SIGNING_KEY_SECRET="audit-signing-key"
export BUDGET_DOLLARS="10"
export PREEMPTIBLE="true"
export FULL_TRAIN="false"

Run the sampling SQL to create experiment groups:
# Replace variables in the SQL file
sed -i "s/{PROJECT_ID}/$PROJECT_ID/g" bigquery_sampling.sql
sed -i "s/{BQ_DATASET}/$BQ_DATASET/g" bigquery_sampling.sql
sed -i "s/{FEATURE_TABLE}/$FEATURE_TABLE/g" bigquery_sampling.sql
# Execute the SQL
bq query --use_legacy_sql=false < bigquery_sampling.sql

# Run unit tests
python -m pytest tests/ -v
# Run individual model test
python model_lstm.py
# Run full experiment locally (with synthetic data)
python run_abtest.py

The framework is designed for tight budgets, with several cost-saving features:
- Preemptible Instances: Use preemptible GCE instances for training
- Sample Limiting: Default 2K samples per group (configurable)
- Checkpointing: Frequent saves to GCS to handle preemption
- Resource Limits: CPU-only training, minimal memory usage
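Checkpointing to survive preemption can be sketched as follows. This is an illustration only: it writes to a local path instead of GCS so that it stays self-contained, and the names `save_checkpoint`, `load_checkpoint`, and `train`, as well as the JSON state format, are assumptions rather than the project's actual code.

```python
import json
import os
import tempfile
from typing import Optional


def save_checkpoint(path: str, state: dict) -> None:
    """Write the state atomically so an interruption mid-write cannot corrupt it."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        json.dump(state, handle)
    os.replace(tmp_path, path)  # atomic rename on the same filesystem


def load_checkpoint(path: str) -> Optional[dict]:
    """Return the last saved state, or None when starting fresh."""
    if not os.path.exists(path):
        return None
    with open(path) as handle:
        return json.load(handle)


def train(path: str, total_epochs: int, stop_after: Optional[int] = None) -> int:
    """Toy training loop that resumes from the last completed epoch.

    `stop_after` simulates a preemption after that many epochs in this run.
    """
    state = load_checkpoint(path) or {"epoch": 0}
    done_this_run = 0
    while state["epoch"] < total_epochs:
        if stop_after is not None and done_this_run >= stop_after:
            break  # simulated preemption
        state["epoch"] += 1  # stand-in for one epoch of real training
        done_this_run += 1
        save_checkpoint(path, state)
    return state["epoch"]
```

Saving after every epoch trades some I/O for a smaller loss of work when the instance is reclaimed; a real run would upload the same file to the GCS bucket.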
# Create preemptible instance with startup script
gcloud compute instances create ml-abtest-instance \
--zone=us-central1-a \
--machine-type=e2-medium \
--preemptible \
--image-family=ubuntu-2004-lts \
--image-project=ubuntu-os-cloud \
--scopes=https://www.googleapis.com/auth/cloud-platform \
--metadata-from-file startup-script=gce_startup.sh \
--metadata PROJECT_ID=$PROJECT_ID,BQ_DATASET=$BQ_DATASET,GCS_BUCKET=$GCS_BUCKET

# Build and deploy API service
docker build --target api-minimal -t gcr.io/$PROJECT_ID/ml-abtest-api .
docker push gcr.io/$PROJECT_ID/ml-abtest-api
gcloud run deploy ml-abtest-api \
--image gcr.io/$PROJECT_ID/ml-abtest-api \
--platform managed \
--region us-central1 \
--memory 2Gi \
--cpu 2 \
--max-instances 10 \
--allow-unauthenticated

# Local execution with synthetic data
python run_abtest.py
# With BigQuery data (requires authentication)
export USE_BIGQUERY=true
python run_abtest.py
# Custom configuration
export SAMPLE_LIMIT_PER_GROUP=1000
export ABTEST_SEED=123
python run_abtest.py

# Start API service
python app.py
# Test scoring endpoint
curl -X POST "http://localhost:8080/score" \
-H "Content-Type: application/json" \
-H "Authorization: Bearer your-token" \
-d '{
"to_address": "1A1zP1eP5QGefi2DMPTfTL5SLmv7DivfNa",
"amount_btc": 0.001,
"metadata": {
"sender_country": "US",
"receiver_country": "CA",
"sender_vasp_id": "VASP123",
"receiver_vasp_id": "VASP456"
}
}'

# Check service health
curl http://localhost:8080/health

LSTM:
- Architecture: 2-layer LSTM with dropout
- Features: Time series sequences with trend/seasonality
- Uncertainty: Monte Carlo Dropout
- Use Case: Complex temporal patterns
Transformer:
- Architecture: Lightweight transformer with positional encoding
- Features: Multi-head attention with global pooling
- Uncertainty: Monte Carlo Dropout
- Use Case: Long-range dependencies
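Monte Carlo Dropout, listed as the uncertainty method for the LSTM and Transformer models, keeps dropout active at prediction time and summarizes many stochastic forward passes. The sketch below shows the idea on a tiny hand-written network in pure Python; it is not the PyTorch code in model_lstm.py or model_transformer.py, and every name and weight in it is made up for illustration.

```python
import random
import statistics
from typing import List, Tuple


def forward_with_dropout(x: List[float], weights: List[List[float]],
                         out_weights: List[float], p_drop: float,
                         rng: random.Random) -> float:
    """One stochastic forward pass: ReLU hidden layer with dropout kept on."""
    keep = 1.0 - p_drop
    output = 0.0
    for row, w_out in zip(weights, out_weights):
        hidden = max(0.0, sum(w * xi for w, xi in zip(row, x)))
        if rng.random() < keep:
            # Inverted dropout: rescale surviving units by 1 / keep.
            output += w_out * hidden / keep
    return output


def mc_dropout_predict(x: List[float], weights: List[List[float]],
                       out_weights: List[float], p_drop: float = 0.2,
                       n_samples: int = 200, seed: int = 0) -> Tuple[float, float]:
    """Return (mean prediction, standard deviation) over n_samples passes."""
    rng = random.Random(seed)
    samples = [forward_with_dropout(x, weights, out_weights, p_drop, rng)
               for _ in range(n_samples)]
    return statistics.fmean(samples), statistics.stdev(samples)
```

The standard deviation across passes serves as the uncertainty estimate; with `p_drop=0` every pass is identical and the spread collapses to zero.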
Prophet:
- Architecture: Facebook Prophet with seasonal decomposition
- Features: Trend, seasonality, holidays
- Uncertainty: Built-in confidence intervals
- Use Case: Business time series with seasonality
ARIMA:
- Architecture: Autoregressive integrated moving average
- Features: Automatic order selection
- Uncertainty: Prediction intervals
- Use Case: Stationary time series
XGBoost:
- Architecture: Gradient boosting with feature engineering
- Features: Lag features, rolling statistics, derived features
- Uncertainty: Feature-based variance estimation
- Use Case: Tabular data with complex interactions
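The lag features and rolling statistics mentioned for XGBoost turn a time series into tabular rows. The following is a rough sketch of that transformation in plain Python; the function name `make_lag_features`, the chosen lags, and the window size are assumptions, and model_xgboost.py may build its features differently.

```python
from statistics import fmean, pstdev
from typing import Dict, List, Sequence


def make_lag_features(series: List[float], lags: Sequence[int] = (1, 2),
                      window: int = 3) -> List[Dict[str, float]]:
    """Build one feature row per time step that has enough history."""
    # Need at least two past points for the first-difference feature.
    start = max(max(lags), window, 2)
    rows = []
    for t in range(start, len(series)):
        history = series[t - window:t]  # strictly past values, no target leakage
        row = {f"lag_{k}": series[t - k] for k in lags}
        row["roll_mean"] = fmean(history)
        row["roll_std"] = pstdev(history)
        row["diff_1"] = series[t - 1] - series[t - 2]  # derived feature
        row["target"] = series[t]
        rows.append(row)
    return rows
```

Each row uses only values observed before time `t`, so the resulting table can be passed to any tabular regressor without leaking the target.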
Sanctions screening:
- Local SQLite database with OFAC/SDN data
- Exact and fuzzy matching
- Configurable risk weights
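Exact and fuzzy matching against a local SQLite table could look roughly like this. The table layout, the `screen` function, the 0.85 similarity threshold, and the use of `difflib` for fuzzy matching are all assumptions for illustration; the entries are invented and the real schema in app.py is not shown here.

```python
import sqlite3
from difflib import SequenceMatcher


def build_sanctions_db(entries):
    """Create an in-memory table of (name, address) pairs."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sdn (name TEXT, address TEXT)")
    conn.executemany("INSERT INTO sdn VALUES (?, ?)", entries)
    conn.commit()
    return conn


def screen(conn, address: str, name: str = "", fuzzy_threshold: float = 0.85) -> dict:
    """Exact match on address first, then fuzzy match on name."""
    hit = conn.execute(
        "SELECT name FROM sdn WHERE address = ?", (address,)
    ).fetchone()
    if hit:
        return {"match": "exact", "entry": hit[0], "score": 1.0}

    best_name, best_score = None, 0.0
    if name:
        for (listed,) in conn.execute("SELECT name FROM sdn"):
            score = SequenceMatcher(None, name.lower(), listed.lower()).ratio()
            if score > best_score:
                best_name, best_score = listed, score
    if best_score >= fuzzy_threshold:
        return {"match": "fuzzy", "entry": best_name, "score": best_score}
    return {"match": "none", "entry": None, "score": best_score}
```

The exact lookup is a parameterized query, which avoids SQL injection through the address field; the fuzzy pass scans every name and would need an index or pre-filter for a large list.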
Travel rule checks:
- Cross-border transaction detection
- VASP metadata validation
- Missing field identification
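The three checks above can be sketched with the metadata fields used in the scoring request example (`sender_country`, `receiver_country`, `sender_vasp_id`, `receiver_vasp_id`). The function name and the rule that any missing field flags the request are assumptions; the actual policy in app.py may differ.

```python
from typing import Dict, List

REQUIRED_FIELDS = (
    "sender_country",
    "receiver_country",
    "sender_vasp_id",
    "receiver_vasp_id",
)


def check_travel_rule(metadata: Dict[str, str]) -> Dict[str, object]:
    """Detect cross-border transfers and report missing VASP metadata."""
    missing: List[str] = [f for f in REQUIRED_FIELDS if not metadata.get(f)]
    sender = metadata.get("sender_country")
    receiver = metadata.get("receiver_country")
    cross_border = bool(sender and receiver and sender != receiver)
    return {
        "cross_border": cross_border,
        "missing_fields": missing,
        # Assumed policy: incomplete metadata always flags the request.
        "flagged": bool(missing),
    }
```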
Risk scoring:
- Ensemble risk scoring:
  risk = w1*|z| + w2*sanctions*100 + w3*travel_rule*10 + w4*ml_risk
- Configurable decision thresholds
- Audit logging with HMAC signatures
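The ensemble formula can be written out directly. In this sketch the default weights follow the RISK_WEIGHT_1 to RISK_WEIGHT_4 values in this README; treating `sanctions` and `travel_rule` as 0/1 flags, and the `review_at` and `block_at` thresholds, are placeholder assumptions, since the document only says thresholds are configurable.

```python
from typing import Sequence


def risk_score(z: float, sanctions_hit: bool, travel_rule_flag: bool,
               ml_risk: float,
               weights: Sequence[float] = (0.4, 0.3, 0.2, 0.1)) -> float:
    """risk = w1*|z| + w2*sanctions*100 + w3*travel_rule*10 + w4*ml_risk."""
    w1, w2, w3, w4 = weights
    return (
        w1 * abs(z)
        + w2 * float(sanctions_hit) * 100
        + w3 * float(travel_rule_flag) * 10
        + w4 * ml_risk
    )


def decide(score: float, review_at: float = 5.0, block_at: float = 25.0) -> str:
    """Map a score to a decision; both thresholds are illustrative placeholders."""
    if score >= block_at:
        return "block"
    if score >= review_at:
        return "review"
    return "allow"
```

With these weights a sanctions hit alone contributes 30 points, so it dominates the other terms; that is a property of the formula's constants, whatever thresholds are chosen.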
export RISK_WEIGHT_1="0.4" # Z-score weight
export RISK_WEIGHT_2="0.3" # Sanctions weight
export RISK_WEIGHT_3="0.2" # Travel rule weight
export RISK_WEIGHT_4="0.1" # ML risk weight

All decisions are logged to BigQuery with:
- Timestamp, request ID, address, amount
- Model predictions and uncertainty
- Policy decision and reasoning
- HMAC-signed decision blob
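An HMAC-signed decision blob can be produced with the standard library alone. The sketch below assumes HMAC-SHA256 over a canonical JSON encoding; the record fields, the function names, and the choice of hash are illustrative, and the signing key would come from Secret Manager rather than a literal.

```python
import hashlib
import hmac
import json


def sign_decision(record: dict, key: bytes) -> dict:
    """Serialize the record canonically and attach an HMAC-SHA256 signature."""
    payload = json.dumps(record, sort_keys=True, separators=(",", ":"))
    signature = hmac.new(key, payload.encode("utf-8"), hashlib.sha256).hexdigest()
    return {"payload": payload, "signature": signature}


def verify_decision(blob: dict, key: bytes) -> bool:
    """Recompute the signature and compare in constant time."""
    expected = hmac.new(
        key, blob["payload"].encode("utf-8"), hashlib.sha256
    ).hexdigest()
    return hmac.compare_digest(expected, blob["signature"])
```

Sorting keys before signing matters: two dictionaries with the same content but different key order would otherwise serialize differently and fail verification.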
- Loss curves saved to GCS
- Model artifacts with timestamps
- Performance metrics (MAE, RMSE, F1, etc.)
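For reference, the metrics named above can be computed as follows. This is a plain-Python sketch with hypothetical function names; the project itself lists scikit-learn as a dependency and may use its implementations instead.

```python
import math
from typing import Sequence


def mae(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Root mean squared error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def f1(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """F1 score for binary labels (1 = positive); 0.0 when undefined."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```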
- API health checks
- Model availability status
- Compliance system status
# Run all tests
python -m pytest tests/ -v
# Run specific model test
python -m pytest tests/test_models.py::TestLSTMModel -v
# Run with coverage
python -m pytest tests/ --cov=. --cov-report=html

# Test all models with same data
python -m pytest tests/test_models.py::TestModelIntegration -v
# Test API endpoints
python -m pytest tests/ -k "api" -v

- Set up GCP project and authentication:
  gcloud auth login
  gcloud config set project $PROJECT_ID
- Create BigQuery dataset and tables:
  bq mk $BQ_DATASET
  bq query --use_legacy_sql=false < bigquery_sampling.sql
- Create GCS bucket:
  gsutil mb gs://$GCS_BUCKET
- Set up Secret Manager:
  echo "your-audit-signing-key" | gcloud secrets create audit-signing-key --data-file=-
- Build and push Docker image:
  docker build --target training -t gcr.io/$PROJECT_ID/ml-abtest .
  docker push gcr.io/$PROJECT_ID/ml-abtest
- Create preemptible GCE instance:
  gcloud compute instances create ml-abtest --zone=us-central1-a \
    --machine-type=e2-medium --preemptible \
    --metadata-from-file startup-script=gce_startup.sh
- Monitor training progress:
  gcloud compute ssh ml-abtest --zone=us-central1-a --command="tail -f /var/log/ml-abtest.log"
- Deploy API service:
  docker build --target api-minimal -t gcr.io/$PROJECT_ID/ml-abtest-api .
  docker push gcr.io/$PROJECT_ID/ml-abtest-api
  gcloud run deploy ml-abtest-api --image gcr.io/$PROJECT_ID/ml-abtest-api
- Test API endpoint:
  curl -X POST "https://ml-abtest-api-url/score" \
    -H "Content-Type: application/json" \
    -d '{"to_address": "test", "amount_btc": 0.001}'
- Check results in GCS:
  gsutil ls gs://$GCS_BUCKET/results/
  gsutil ls gs://$GCS_BUCKET/models/
- Modify SAMPLE_LIMIT_PER_GROUP to change dataset size per group
- Adjust seeds in bigquery_sampling.sql for different random samples
- Change sampling percentages in SQL (currently 1.00% for A, 1.25% for B-E)
- LSTM: Adjust sequence_length, hidden_size, num_layers
- Transformer: Modify d_model, nhead, num_layers
- Prophet: Change seasonality settings and priors
- ARIMA: Enable/disable auto_arima or set custom orders
- XGBoost: Adjust n_estimators, max_depth, feature engineering
- Set BUDGET_DOLLARS to automatically adjust resource limits
- Use PREEMPTIBLE=true for cost savings
- Set FULL_TRAIN=false for smaller training runs
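A small loader shows how these settings might be read in one place. The defaults mirror the example values in this README (2000 samples, seed 42, a $10 budget); the `ExperimentConfig` class and `load_config` function are hypothetical names, not taken from run_abtest.py.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


def _as_bool(value: str) -> bool:
    """Interpret common truthy strings; anything else is False."""
    return value.strip().lower() in {"1", "true", "yes"}


@dataclass(frozen=True)
class ExperimentConfig:
    sample_limit_per_group: int
    seed: int
    budget_dollars: float
    preemptible: bool
    full_train: bool


def load_config(env: Optional[Mapping[str, str]] = None) -> ExperimentConfig:
    """Read settings from the environment, falling back to README defaults."""
    env = os.environ if env is None else env
    return ExperimentConfig(
        sample_limit_per_group=int(env.get("SAMPLE_LIMIT_PER_GROUP", "2000")),
        seed=int(env.get("ABTEST_SEED", "42")),
        budget_dollars=float(env.get("BUDGET_DOLLARS", "10")),
        preemptible=_as_bool(env.get("PREEMPTIBLE", "true")),
        full_train=_as_bool(env.get("FULL_TRAIN", "false")),
    )
```

Passing a plain dictionary as `env` makes the loader easy to exercise in tests without touching the process environment.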
- BigQuery Permission Errors:
  gcloud auth application-default login
  export GOOGLE_APPLICATION_CREDENTIALS="path/to/service-account.json"
- Memory Issues:
  - Reduce SAMPLE_LIMIT_PER_GROUP
  - Use smaller model architectures
  - Enable gradient checkpointing
- Preemption Handling:
  - Check GCS for partial results
  - Restart training from checkpoints
  - Use smaller batch sizes
- Model Loading Errors:
  - Verify model files exist in GCS
  - Check file permissions
  - Use correct model paths
# Check training logs
gsutil cat gs://$GCS_BUCKET/logs/abtest_*.log
# Check instance logs
gcloud compute instances get-serial-port-output ml-abtest --zone=us-central1-a
# Check API logs
gcloud run services logs read ml-abtest-api --region=us-central1

- Python 3.9+
- PyTorch 2.0+ (CPU or GPU)
- scikit-learn, pandas, numpy
- Prophet, statsmodels, XGBoost
- FastAPI, uvicorn
- Google Cloud libraries
- GPU support for faster training
- pmdarima for advanced ARIMA
- tsfresh for feature extraction
- arch for econometric models
- Fork the repository
- Create a feature branch
- Add tests for new functionality
- Ensure all tests pass
- Submit a pull request
This project is licensed under the MIT License - see the LICENSE file for details.
For issues and questions:
- Check the troubleshooting section
- Review logs in GCS
- Run unit tests to verify setup
- Create an issue with detailed error information
Cost Note: This framework is optimized for a $10 budget. Actual costs may vary based on data size, training duration, and GCP pricing. Monitor costs using GCP billing alerts.