Cyber Risk ML Training is a production-ready machine learning system that scores CVE (Common Vulnerabilities and Exposures) risk using multi-tier data enrichment and XGBoost models. It processes vulnerability data from 8+ public sources, enriches each CVE with 28 contextual features, and serves real-time risk predictions through a REST API. The system can run standalone to query one or more CVEs, but it delivers the most value as a middle (application) layer in a larger risk assessment platform.
| Feature | Details |
|---|---|
| 28-Feature Enrichment | TIER 1-3 data integration (CISA, NVD, GitHub, OTX, Metasploit) |
| Dual ML Models | XGBoost Regressor (risk scoring) + Classifier (severity classification) |
| Production API | FastAPI with Swagger UI, health checks, batch predictions |
| High Accuracy | Test MAE=0.0058, R²=0.9806, Accuracy=100% |
| Normalized Scoring | 0-1 probability scale with confidence metrics |
| 500 Enriched CVEs | Complete dataset for training and testing |
| Transparent Data Lineage | Track which enrichment tier provided each feature |
┌───────────────────────────────────────────────────────────────────────────┐
│                           CYBER RISK ML SYSTEM                            │
└───────────────────────────────────────────────────────────────────────────┘
┌───────────────────────────────────────────────────────────────────────────┐
│                               DATA SOURCES                                │
├────────────────────────┬─────────────────────────┬────────────────────────┤
│ TIER 1 (Public APIs)   │ TIER 2 (Enhanced)       │ TIER 3 (Intel)         │
│ • CISA KEV             │ • NVD CPE Data          │ • Metasploit           │
│ • Exploit-DB           │ • GitHub Advisories     │ • Censys               │
│ • OSV Database         │ • AlienVault OTX        │ • CVSS Severity        │
│ (8 features)           │ (10 features)           │ (9 features)           │
└────────────┬───────────┴────────────┬────────────┴───────────┬────────────┘
             │                        │                        │
             └────────────────────────┼────────────────────────┘
                                      │
                                      ▼
┌───────────────────────────────────────────────────────────────────────────┐
│                            ENRICHMENT PIPELINE                            │
├───────────────────────────────────────────────────────────────────────────┤
│ enhance_cves_tier1.py ──► enhance_cves_tier2.py ──► enhance_cves_tier3.py │
│    (500 × 14 cols)           (500 × 24 cols)           (500 × 33 cols)    │
└─────────────────────────────────────┬─────────────────────────────────────┘
                                      │
                                      ▼
┌───────────────────────────────────────────────────────────────────────────┐
│                            FEATURE ENGINEERING                            │
├───────────────────────────────────────────────────────────────────────────┤
│ • Data cleaning & normalization                                           │
│ • Categorical encoding (attack_vector, ecosystem, rank)                   │
│ • Feature selection (28 most predictive features)                         │
│ • Train/Test split (80/20)                                                │
└─────────────────────────────────────┬─────────────────────────────────────┘
                                      │
                                      ▼
┌───────────────────────────────────────────────────────────────────────────┐
│                            MODEL TRAINING (v3)                            │
├───────────────────────────────────────────────────────────────────────────┤
│ • XGBRegressor: Risk Score (0-1 probability)                              │
│   └─ Test MAE=0.0058, R²=0.9806                                           │
│ • XGBClassifier: Severity (0-3: Low/Med/High/Critical)                    │
│   └─ Test Accuracy=100%, F1=1.0                                           │
└─────────────────────────────────────┬─────────────────────────────────────┘
                                      │
                                      ▼
┌───────────────────────────────────────────────────────────────────────────┐
│                           PRODUCTION DEPLOYMENT                           │
├───────────────────────────────────────────────────────────────────────────┤
│ FastAPI Server (deploy_model_v3.py)                                       │
│ • POST /predict → CVE risk scoring                                        │
│ • GET  /health  → Model health check                                      │
│ • GET  /docs    → Swagger UI                                              │
│ • GET  /        → API info                                                │
└─────────────────────────────────────┬─────────────────────────────────────┘
                                      │
                                      ▼
┌───────────────────────────────────────────────────────────────────────────┐
│                               API CONSUMERS                               │
├───────────────────────────────────────────────────────────────────────────┤
│ • Security Tools         • SOAR Platforms         • Compliance            │
│ • Ticket Systems         • Dashboards             • Automations           │
└───────────────────────────────────────────────────────────────────────────┘
START
  │
  ├─► User/System provides CVE ID (e.g., CVE-2026-20127)
  │
  ├─► API checks: Is this CVE in enriched dataset?
  │
  ├─YES──────────────────────────────────┐
  │   • Load 28 enriched features        │
  │   • Pass to XGBoost models           │
  │   • Return full prediction           │
  │     data_source: "enriched_dataset"  │
  │     features_available: 28           │
  │                                      │
  ├─NO───────────────────────────────────┤
  │   • Fetch from NVD + EPSS APIs       │
  │   • Extract 3 basic features         │
  │   • Pass to XGBoost models           │
  │   • Return live prediction           │
  │     data_source: "nvd_live"          │
  │     features_available: 3            │
  │                                      │
  └──────────────────┬───────────────────┘
                     │
                     ▼
          ┌──────────────────────┐
          │ Prediction Response  │
          ├──────────────────────┤
          │ • Risk Score (0-1)   │
          │ • Severity Label     │
          │ • Confidence Score   │
          │ • Data Source        │
          │ • Features Used      │
          └──────────┬───────────┘
                     │
                     ▼
             Return to Consumer
                     │
                     ├─► Security Dashboard
                     ├─► Alert System
                     ├─► Remediation Tool
                     └─► Compliance Report
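The branch in the flow above can be sketched in a few lines of Python. This is a minimal illustration, not the code in `deploy_model_v3.py`: the `ENRICHED` dictionary, the `fetch_live_features` placeholder, and the names of the three live features (taken from the fields shown in the sample responses) are all assumptions.

```python
from typing import Callable, Dict

# Hypothetical stand-in for the enriched dataset: CVE ID -> 28 feature values.
ENRICHED: Dict[str, Dict[str, float]] = {
    "CVE-2025-12604": {f"feature_{i}": 0.0 for i in range(28)},
}


def fetch_live_features(cve_id: str) -> Dict[str, float]:
    """Placeholder for the NVD + EPSS lookup (assumed to yield 3 basic features)."""
    return {"cvss_score": 0.0, "epss_score": 0.0, "days_since_published": 0.0}


def select_features(
    cve_id: str,
    enriched: Dict[str, Dict[str, float]] = ENRICHED,
    live_fetcher: Callable[[str], Dict[str, float]] = fetch_live_features,
) -> dict:
    """Prefer enriched features; fall back to a live lookup for unknown CVEs."""
    if cve_id in enriched:
        features, source = enriched[cve_id], "enriched_dataset"
    else:
        features, source = live_fetcher(cve_id), "nvd_live"
    return {
        "cve_id": cve_id,
        "data_source": source,
        "features_available": len(features),
        "features": features,
    }
```

Passing the fetcher as a parameter keeps the fallback testable without network access.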
Sources (TIER 1):
├─ CISA KEV → Known exploited vulnerabilities
├─ Exploit-DB → Public exploit availability & difficulty
└─ OSV Database → Ecosystem-specific vulnerability tracking
Features:
├─ in_cisa_kev (bool)
├─ has_public_poc (bool)
├─ poc_count (int)
├─ affected_packages_count (int)
├─ primary_ecosystem (categorical)
├─ has_fixed_version (bool)
├─ min_exploit_difficulty (categorical)
└─ cisa_exploitation_deadline (date)
Sources (TIER 2):
├─ NVD CPE Matching → Attack vectors, complexity, privileges
├─ GitHub Advisories → Open source impact tracking
└─ AlienVault OTX → Threat actor activity & malware
Features:
├─ attack_vector (categorical: network/local/physical)
├─ requires_authentication (bool)
├─ requires_user_interaction (bool)
├─ scope_changed (bool)
├─ in_github_advisories (bool)
├─ github_affected_count (int)
├─ patch_available (bool)
├─ otx_threat_score (0-100)
├─ malware_associated (bool)
└─ active_exploits (int)
Sources (TIER 3):
├─ Metasploit Modules → Weaponized exploit availability
├─ Censys → Internet exposure metrics (optional)
└─ CVSS Severity → Attack impact categorization
Features:
├─ metasploit_modules (int, count)
├─ has_metasploit_module (bool)
├─ module_rank (categorical: critical/good/normal/low/unranked)
├─ module_type (categorical: exploit/reliabilty/denial_of_service)
├─ censys_exposed_count (int)
├─ has_censys_data (bool)
├─ cvss_severity_band (categorical)
├─ is_critical_cvss (bool, CVSS >= 9.0)
└─ is_high_cvss (bool, CVSS >= 7.0)
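The three CVSS-derived features at the end of this list can be computed directly from the base score. The sketch below uses the thresholds stated above for the two boolean flags; the band labels follow the CVSS v3 qualitative rating scale and are an assumption about how `cvss_severity_band` is encoded, as is the function name.

```python
def cvss_features(cvss_score: float) -> dict:
    """Derive the CVSS-based TIER 3 features from a base score in [0, 10]."""
    if not 0.0 <= cvss_score <= 10.0:
        raise ValueError(f"CVSS score out of range: {cvss_score}")

    # Band labels assume the CVSS v3 qualitative severity ratings.
    if cvss_score >= 9.0:
        band = "critical"
    elif cvss_score >= 7.0:
        band = "high"
    elif cvss_score >= 4.0:
        band = "medium"
    elif cvss_score > 0.0:
        band = "low"
    else:
        band = "none"

    return {
        "cvss_severity_band": band,
        "is_critical_cvss": cvss_score >= 9.0,
        # As documented above, this flag is also true for critical scores.
        "is_high_cvss": cvss_score >= 7.0,
    }
```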
| Aspect | Model v1 | Model v3 |
|---|---|---|
| Features | 4-6 basic | 28 enriched |
| Data Sources | NVD + EPSS only | TIER 1-3 (8+ sources) |
| Risk Scale | 0-65 (linear) | 0-1 (normalized) |
| Test MAE | 0.0247 | 0.0058 (4x better) |
| Test R² | 0.8934 | 0.9806 (9.8% better) |
| Confidence Score | ❌ None | ✅ 0-100% |
| Data Source Tracking | Generic | Explicit (3/28 features) |
| Feature Transparency | Limited | Full audit trail |
| Production Ready | | ✅ Current |
Recommendation: Use Model v3 exclusively for new deployments.
# Clone repository
git clone https://github.com/KulbirJ/Cyber-Risk-ML-Training.git
cd Cyber-Risk-ML-Training
# Create virtual environment
python -m venv venv
.\venv\Scripts\Activate.ps1 # Windows
# or
source venv/bin/activate # Linux/Mac
# Install dependencies
pip install -r requirements.txt
# Create .env file with your NVD API key
echo "NVD_API_KEY=your_key_here" > .env
# Start the API server
python deploy_model_v3.py
Output:
✓ Loaded 500 enriched CVEs from cves_enhanced_tier3.csv
╔════════════════════════════════════════════════════════════╗
║        CYBER RISK MODEL v3 - PRODUCTION DEPLOYMENT         ║
╠════════════════════════════════════════════════════════════╣
║      Starting FastAPI server on http://localhost:8000      ║
╚════════════════════════════════════════════════════════════╝
INFO: Uvicorn running on http://0.0.0.0:8000
- Swagger UI: http://localhost:8000/docs
- Health Check: http://localhost:8000/health
- API Info: http://localhost:8000/
Endpoint: POST /predict

Request:
{
  "cve_id": "CVE-2025-12604",
  "use_enriched_data": true
}

Response (Enriched Dataset):
{
  "cve_id": "CVE-2025-12604",
  "model_version": "v3",
  "cvss_score": 7.3,
  "epss_score": 0.0004,
  "days_since_published": 119,
  "predicted_risk_score": 0.2924,
  "severity_label": "High",
  "severity_numeric": 2,
  "confidence": 0.9942,
  "data_source": "enriched_dataset",
  "timestamp": "2026-03-03T23:04:36.039054+00:00",
  "features_available": 28
}

Response (Live NVD):
{
  "cve_id": "CVE-2026-20127",
  "model_version": "v3",
  "cvss_score": 10.0,
  "epss_score": 0.02604,
  "days_since_published": 6,
  "predicted_risk_score": 0.5190,
  "severity_label": "Critical",
  "severity_numeric": 3,
  "confidence": 0.9628,
  "data_source": "nvd_live",
  "timestamp": "2026-03-03T23:05:44.425784+00:00",
  "features_available": 3
}

python test_model_v3.py

Tests Included:
- ✅ API Health Check
- ✅ Single CVE Prediction (Enriched)
- ✅ Batch Predictions (3 CVEs)
- ✅ Live NVD API Prediction
- ✅ API Documentation & Swagger
Output:
Tests Passed: 5/5
Success Rate: 100.0%
✅ All tests passed! Model v3 is ready for production.
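The contents of `test_model_v3.py` are not shown here. As an illustration of the kind of check a prediction test might make, the sketch below validates a `/predict` response against the fields and ranges seen in the sample responses. The function name, the field list, and the label mapping (taken from the classifier output description) are assumptions, not the repository's actual test code.

```python
# Fields and types assumed from the sample /predict responses.
REQUIRED_FIELDS = {
    "cve_id": str,
    "model_version": str,
    "predicted_risk_score": (int, float),
    "severity_label": str,
    "severity_numeric": int,
    "confidence": (int, float),
    "data_source": str,
    "features_available": int,
}
SEVERITY_LABELS = {0: "Low", 1: "Medium", 2: "High", 3: "Critical"}


def validate_prediction(payload: dict) -> list:
    """Return a list of problems found in a prediction payload (empty if valid)."""
    problems = []
    for name, expected in REQUIRED_FIELDS.items():
        if name not in payload:
            problems.append(f"missing field: {name}")
        elif not isinstance(payload[name], expected):
            problems.append(f"wrong type for {name}")
    if problems:
        return problems  # range checks below need every field present

    if not 0.0 <= payload["predicted_risk_score"] <= 1.0:
        problems.append("predicted_risk_score outside 0-1")
    if not 0.0 <= payload["confidence"] <= 1.0:
        problems.append("confidence outside 0-1")
    if SEVERITY_LABELS.get(payload["severity_numeric"]) != payload["severity_label"]:
        problems.append("severity_label does not match severity_numeric")
    return problems
```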
cyber-risk-ml-training/
├── deploy_model_v3.py                 # FastAPI production server
├── test_model_v3.py                   # Comprehensive test suite
├── train_risk_model_v3.py             # Model retraining script
│
├── enhance_cves_tier1.py              # TIER 1 enrichment (CISA, Exploit-DB, OSV)
├── enhance_cves_tier2.py              # TIER 2 enrichment (NVD, GitHub, OTX)
├── enhance_cves_tier3.py              # TIER 3 enrichment (Metasploit, Censys, CVSS)
│
├── cyber_risk_model_v3.json           # Trained regressor (28 features)
├── cyber_risk_severity_model_v3.json  # Trained classifier (28 features)
├── cyber_risk_model_v3_metadata.json  # Model metrics & performance
│
├── cves_enhanced_tier3.csv            # 500 enriched CVEs (33 columns)
├── cves_clean.csv                     # Original CVE dataset (6 columns)
│
├── requirements.txt                   # Python dependencies
├── .env                               # Environment variables (NVD_API_KEY)
├── .gitignore                         # Security (prevents .env commit)
│
├── README.md                          # This file
├── README_API.md                      # API documentation
└── DEPLOYMENT_GUIDE_V3.md             # Deployment instructions
Original CVEs (6 cols)
  │
  ├─► enhance_cves_tier1.py
  │     ├─ Fetch CISA KEV data
  │     ├─ Fetch Exploit-DB exploits
  │     ├─ Fetch OSV database
  │     └─► cves_enhanced_tier1.csv (14 cols)
  │
  ├─► enhance_cves_tier2.py
  │     ├─ Fetch NVD CPE data
  │     ├─ Fetch GitHub Advisories
  │     ├─ Fetch AlienVault OTX
  │     └─► cves_enhanced_tier2.csv (24 cols)
  │
  ├─► enhance_cves_tier3.py
  │     ├─ Fetch Metasploit modules
  │     ├─ Fetch Censys exposure (optional)
  │     ├─ Calculate CVSS severity bands
  │     └─► cves_enhanced_tier3.csv (33 cols)
  │
  ├─► train_risk_model_v3.py
  │     ├─ Feature engineering (28 selected features)
  │     ├─ Train XGBRegressor (risk score)
  │     └─ Train XGBClassifier (severity)
  │
  └─► Models deployed to production
        │
        ├─► cyber_risk_model_v3.json
        ├─► cyber_risk_severity_model_v3.json
        └─► deploy_model_v3.py (API server)
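The column counts in this pipeline follow from the per-tier feature counts (8, 10, and 9) added to the 6 original columns. A small sketch such as the one below could be used as a sanity check after each enrichment step; the helper names are hypothetical and nothing here is taken from the enrichment scripts themselves.

```python
import csv

BASE_COLUMNS = 6  # columns in the original CVE dataset
TIER_FEATURES = [("tier1", 8), ("tier2", 10), ("tier3", 9)]  # features added per tier


def expected_columns(base: int = BASE_COLUMNS, tiers=TIER_FEATURES) -> dict:
    """Return the cumulative column count expected after each tier."""
    counts, total = {}, base
    for name, added in tiers:
        total += added
        counts[name] = total
    return counts


def count_columns(lines) -> int:
    """Count the columns in the header row of CSV text (any iterable of lines)."""
    return len(next(csv.reader(lines)))


print(expected_columns())  # {'tier1': 14, 'tier2': 24, 'tier3': 33}
```

In practice, `count_columns` would be given an open file handle for, say, `cves_enhanced_tier2.csv` and compared against `expected_columns()["tier2"]`.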
- Algorithm: XGBRegressor
- Input: 28 features
- Output: 0-1 risk probability
- Test MAE: 0.0058 ✅
- Test R²: 0.9806 (98% variance explained)
- Train/Test Split: 400/100 (80/20)
- Algorithm: XGBClassifier
- Input: 28 features
- Output: 0=Low, 1=Medium, 2=High, 3=Critical
- Test Accuracy: 100% ✅
- Test F1-Score: 1.0 ✅
- Class Distribution: 30 Low, 261 Medium, 209 High, 57 Critical
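For readers unfamiliar with the two regression metrics quoted above, the following sketch shows how MAE and R² are defined. It is a plain-Python illustration written for this document, not the evaluation code in `train_risk_model_v3.py`; the function names are local helpers and the sample values are made up.

```python
def mean_absolute_error(y_true, y_pred) -> float:
    """Average absolute difference between true and predicted scores."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r2_score(y_true, y_pred) -> float:
    """Fraction of variance in y_true explained by the predictions."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # residual sum of squares
    ss_tot = sum((t - mean) ** 2 for t in y_true)               # total sum of squares
    return 1.0 - ss_res / ss_tot


# Made-up risk scores on the 0-1 scale, for illustration only.
actual = [0.1, 0.5, 0.9]
predicted = [0.1, 0.4, 1.0]
print(round(mean_absolute_error(actual, predicted), 4))  # 0.0667
print(round(r2_score(actual, predicted), 4))             # 0.9375
```

On the 0-1 risk scale, an MAE of 0.0058 means predictions are off by a little over half a percentage point on average.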
# Never commit .env file
.env # ✓ In .gitignore

from dotenv import load_dotenv
import os
load_dotenv()
NVD_API_KEY = os.getenv("NVD_API_KEY")

- ✅ All API keys in .env
- ✅ .gitignore prevents accidental commits
- ✅ Production uses CI/CD secrets
- ✅ No credentials in code
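Because `os.getenv` returns `None` when a variable is unset, a missing key can go unnoticed until the first NVD request fails. One way to fail fast at startup is sketched below. The `require_env` helper is hypothetical, and treating the `your_key_here` placeholder from the setup instructions as invalid is an assumption.

```python
import os


def require_env(name: str, environ=None) -> str:
    """Return a required environment variable or raise a clear error."""
    env = os.environ if environ is None else environ
    value = (env.get(name) or "").strip()
    # Reject empty values and the placeholder from the setup instructions.
    if not value or value == "your_key_here":
        raise RuntimeError(f"{name} is not set; add it to .env or the environment")
    return value
```

Calling `require_env("NVD_API_KEY")` after `load_dotenv()` would surface a misconfiguration before the server starts accepting requests.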
curl -X POST "http://localhost:8000/predict" \
  -H "Content-Type: application/json" \
  -d '{"cve_id": "CVE-2025-12604", "use_enriched_data": true}'

import requests

cve_ids = ["CVE-2025-12604", "CVE-2025-12605", "CVE-2025-12606"]
results = []
for cve_id in cve_ids:
    response = requests.post(
        "http://localhost:8000/predict",
        json={"cve_id": cve_id, "use_enriched_data": True},
    )
    response.raise_for_status()  # stop on HTTP errors instead of parsing an error body
    results.append(response.json())

# Sort by risk score (highest first)
results.sort(key=lambda x: x["predicted_risk_score"], reverse=True)
for cve in results:
    print(f"{cve['cve_id']}: {cve['severity_label']} ({cve['predicted_risk_score']:.2%})")

$health = Invoke-WebRequest -Uri "http://localhost:8000/health" | ConvertFrom-Json
Write-Host "Model: $($health.model_version)"
Write-Host "Enriched CVEs: $($health.enriched_cves)"
Write-Host "Status: $($health.status)"

| Document | Purpose |
|---|---|
| README.md | System overview (this file) |
| README_API.md | Detailed API reference |
| DEPLOYMENT_GUIDE_V3.md | Step-by-step deployment |
- Create enhance_cves_tierX.py
- Implement API integration
- Test with sample CVEs
- Update feature list
- Retrain models

- Add new features (maintain backward compatibility)
- Rerun enrichment pipeline
- Execute train_risk_model_v3.py
- Validate test metrics
- Deploy new model version
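The "validate test metrics" step can be automated with a simple gate that compares a retrained model against the current v3 figures. The sketch below is one possible shape for such a check. The dictionary keys `test_mae` and `test_r2` are hypothetical and are not taken from `cyber_risk_model_v3_metadata.json`, whose schema is not shown here.

```python
# Current v3 test metrics, as reported in this README.
BASELINE = {"test_mae": 0.0058, "test_r2": 0.9806}


def passes_gate(candidate: dict, baseline: dict = BASELINE, tolerance: float = 0.0) -> bool:
    """Accept a retrained model only if it is no worse than the baseline.

    `tolerance` allows a small regression on either metric.
    """
    mae_ok = candidate["test_mae"] <= baseline["test_mae"] + tolerance
    r2_ok = candidate["test_r2"] >= baseline["test_r2"] - tolerance
    return mae_ok and r2_ok
```

A deployment script could refuse to replace the model files whenever `passes_gate` returns `False`.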
[Add your license information here]
Kulbir J - Cyber Risk ML Training System
| Issue | Resolution |
|---|---|
| Port 8000 in use | Get-NetTCPConnection -LocalPort 8000 then kill process |
| API key issues | Check .env file and NVD_API_KEY environment variable |
| Model load errors | Verify model files (.json) exist in working directory |
| Slow predictions | Check network connection (NVD API calls) |
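The first row of the table uses a Windows-only command. As a cross-platform alternative for the "port in use" case, the sketch below checks whether something is already listening on a port before the server is started. It uses only the Python standard library; the function name is an assumption, and it does not identify or stop the process holding the port.

```python
import socket


def port_in_use(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(0.5)
        # connect_ex returns 0 on success and an error code otherwise.
        return sock.connect_ex((host, port)) == 0


if __name__ == "__main__":
    print("Port 8000 in use:", port_in_use(8000))
```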
- ✅ Phase 1-4: Core system (complete)
- Phase 5: Real-time threat intel integration
- Phase 6: AutoML hyperparameter tuning
- Phase 7: Multi-model ensemble
- Phase 8: Horizontal scaling (Kubernetes)
Last Updated: March 4, 2026
Model Version: v3
Status: Production Ready ✅