KulbirJ/cyber-risk-ml-training


🛡️ Cyber Risk ML Training System

Overview

Cyber Risk ML Training is a production-ready machine learning system that scores the risk of CVEs (Common Vulnerabilities and Exposures) using multi-tier data enrichment and XGBoost models. It pulls vulnerability data from 8+ public sources, enriches each CVE with 28 contextual features, and serves real-time risk predictions through a REST API. The system can run standalone to score a single CVE or a batch of CVEs, but it delivers the most value as a middle (application) layer in a risk assessment platform.


🎯 Key Features

| Feature | Details |
|---|---|
| 28-Feature Enrichment | TIER 1-3 data integration (CISA, NVD, GitHub, OTX, Metasploit) |
| Dual ML Models | XGBoost Regressor (risk scoring) + Classifier (severity classification) |
| Production API | FastAPI with Swagger UI, health checks, batch predictions |
| High Accuracy | Test MAE=0.0058, R²=0.9806, Accuracy=100% |
| Normalized Scoring | 0-1 probability scale with confidence metrics |
| 500 Enriched CVEs | Complete dataset for training and testing |
| Transparent Data Lineage | Track which enrichment tier provided each feature |

🏗️ System Architecture

┌──────────────────────────────────────────────────────────────────┐
│                     CYBER RISK ML SYSTEM                         │
└──────────────────────────────────────────────────────────────────┘

┌──────────────────────────────────────────────────────────────────┐
│                      DATA SOURCES                                │
├──────────────────────────────────────────────────────────────────┤
│ TIER 1 (Public APIs)   │ TIER 2 (Enhanced)      │ TIER 3 (Intel) │
│ • CISA KEV             │ • NVD CPE Data         │ • Metasploit   │
│ • Exploit-DB           │ • GitHub Advisories    │ • Censys       │
│ • OSV Database         │ • AlienVault OTX       │ • CVSS Severity│
│ (8 features)           │ (10 features)          │ (9 features)   │
└──────────┬─────────────┬────────────────────────┬────────────────┘
           │             │                        │
           └─────────────┴────────────────────────┘
                         │
                         ▼
┌──────────────────────────────────────────────────────────────────┐
│                   ENRICHMENT PIPELINE                            │
├──────────────────────────────────────────────────────────────────┤
│ enhance_cves_tier1.py ──► enhance_cves_tier2.py ──► enhance_cves_tier3.py │
│ (500 × 14 cols)           (500 × 24 cols)           (500 × 33 cols)       │
└──────────────────────────────┬───────────────────────────────────┘
                               │
                               ▼
┌──────────────────────────────────────────────────────────────────┐
│                    FEATURE ENGINEERING                           │
├──────────────────────────────────────────────────────────────────┤
│ • Data cleaning & normalization                                  │
│ • Categorical encoding (attack_vector, ecosystem, rank)          │
│ • Feature selection (28 most predictive features)                │
│ • Train/Test split (80/20)                                       │
└──────────────────┬───────────────────────────────────────────────┘
                   │
                   ▼
┌──────────────────────────────────────────────────────────────────┐
│                  MODEL TRAINING (v3)                             │
├──────────────────────────────────────────────────────────────────┤
│ • XGBRegressor: Risk Score (0-1 probability)                     │
│   └─ Test MAE=0.0058, R²=0.9806                                  │
│ • XGBClassifier: Severity (0-3: Low/Med/High/Critical)           │
│   └─ Test Accuracy=100%, F1=1.0                                  │
└──────────────┬───────────────────────────────────────────────────┘
               │
               ▼
┌──────────────────────────────────────────────────────────────────┐
│                  PRODUCTION DEPLOYMENT                           │
├──────────────────────────────────────────────────────────────────┤
│ FastAPI Server (deploy_model_v3.py)                              │
│ • POST /predict       → CVE risk scoring                         │
│ • GET  /health        → Model health check                       │
│ • GET  /docs          → Swagger UI                               │
│ • GET  /              → API info                                 │
└──────────────┬───────────────────────────────────────────────────┘
               │
               ▼
┌──────────────────────────────────────────────────────────────────┐
│                   API CONSUMERS                                  │
├──────────────────────────────────────────────────────────────────┤
│ • Security Tools       • SOAR Platforms        • Compliance      │
│ • Ticket Systems       • Dashboards            • Automations     │
└──────────────────────────────────────────────────────────────────┘
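
The Feature Engineering stage in the diagram names two mechanical steps: categorical encoding and an 80/20 train/test split. The sketch below illustrates both using only the standard library. The alphabetical ordinal mapping, the fixed seed, and the function names are assumptions made for illustration; train_risk_model_v3.py may encode and split differently.

```python
import random
from typing import Dict, Hashable, List, Sequence, Tuple


def ordinal_encode(values: Sequence[str]) -> Tuple[List[int], Dict[str, int]]:
    """Map each category to an integer code (alphabetical order, an assumption)."""
    mapping = {category: code for code, category in enumerate(sorted(set(values)))}
    return [mapping[v] for v in values], mapping


def split_80_20(items: Sequence[Hashable], seed: int = 42) -> Tuple[list, list]:
    """Shuffle deterministically, then hold out 20% of the items for testing."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * 0.2)
    return shuffled[n_test:], shuffled[:n_test]


# attack_vector values taken from the TIER 2 feature list.
codes, mapping = ordinal_encode(["network", "local", "network", "physical"])

# 500 row indices stand in for the 500 enriched CVEs.
train_ids, test_ids = split_80_20(range(500))
print(codes, mapping)
print(len(train_ids), len(test_ids))
```

With 500 rows this yields the 400/100 split reported in the performance metrics.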

👤 User Flow Diagram

START
  │
  ├─► User/System provides CVE ID (e.g., CVE-2026-20127)
  │
  ├─► API checks: Is this CVE in enriched dataset?
  │
  ├─YES──────────────────────────────────┐
  │ • Load 28 enriched features          │
  │ • Pass to XGBoost models             │
  │ • Return full prediction             │
  │                    data_source: "enriched_dataset"
  │                    features_available: 28
  │
  ├─NO───────────────────────────────────┐
  │ • Fetch from NVD + EPSS APIs         │
  │ • Extract 3 basic features           │
  │ • Pass to XGBoost models             │
  │ • Return live prediction             │
  │                    data_source: "nvd_live"
  │                    features_available: 3
  │
  └─────────────────┬────────────────────┘
                    │
                    ▼
         ┌──────────────────────┐
         │  Prediction Response │
         ├──────────────────────┤
         │ • Risk Score (0-1)   │
         │ • Severity Label     │
         │ • Confidence Score   │
         │ • Data Source        │
         │ • Features Used      │
         └──────────┬───────────┘
                    │
                    ▼
              Return to Consumer
                    │
                    ├─► Security Dashboard
                    ├─► Alert System
                    ├─► Remediation Tool
                    └─► Compliance Report
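
The YES/NO branch in the flow can be sketched as a lookup with a live fallback. This is a minimal illustration, not the code in deploy_model_v3.py: `resolve_features`, `fake_fetch_live`, and the in-memory `ENRICHED` dict are hypothetical names, and the enriched row is truncated to three columns for brevity.

```python
from typing import Callable, Dict, Tuple

Features = Dict[str, float]

# Hypothetical stand-in for cves_enhanced_tier3.csv, keyed by CVE ID.
# A real enriched row would carry all 28 features.
ENRICHED: Dict[str, Features] = {
    "CVE-2025-12604": {
        "cvss_score": 7.3,
        "epss_score": 0.0004,
        "days_since_published": 119,
    },
}


def resolve_features(
    cve_id: str,
    enriched: Dict[str, Features],
    fetch_live: Callable[[str], Features],
) -> Tuple[Features, str]:
    """Return (features, data_source), preferring the enriched dataset."""
    row = enriched.get(cve_id)
    if row is not None:
        return row, "enriched_dataset"
    # Not in the dataset: fall back to the live NVD + EPSS lookup.
    return fetch_live(cve_id), "nvd_live"


def fake_fetch_live(cve_id: str) -> Features:
    """Placeholder for the NVD + EPSS calls; values copied from the sample response."""
    return {"cvss_score": 10.0, "epss_score": 0.02604, "days_since_published": 6}


known = resolve_features("CVE-2025-12604", ENRICHED, fake_fetch_live)
unknown = resolve_features("CVE-2026-20127", ENRICHED, fake_fetch_live)
print(known[1], unknown[1])
```

Passing the live fetcher in as a parameter keeps the branch testable without network access.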

📊 Data Enrichment Tiers

TIER 1: Public Vulnerability Sources (8 features)

Sources:
├─ CISA KEV         → Known exploited vulnerabilities
├─ Exploit-DB       → Public exploit availability & difficulty
└─ OSV Database     → Ecosystem-specific vulnerability tracking

Features:
├─ in_cisa_kev                (bool)
├─ has_public_poc             (bool)
├─ poc_count                  (int)
├─ affected_packages_count    (int)
├─ primary_ecosystem          (categorical)
├─ has_fixed_version          (bool)
├─ min_exploit_difficulty     (categorical)
└─ cisa_exploitation_deadline (date)

TIER 2: Enhanced Threat Intelligence (10 features)

Sources:
├─ NVD CPE Matching     → Attack vectors, complexity, privileges
├─ GitHub Advisories    → Open source impact tracking
└─ AlienVault OTX       → Threat actor activity & malware

Features:
├─ attack_vector             (categorical: network/local/physical)
├─ requires_authentication   (bool)
├─ requires_user_interaction (bool)
├─ scope_changed             (bool)
├─ in_github_advisories      (bool)
├─ github_affected_count     (int)
├─ patch_available           (bool)
├─ otx_threat_score          (0-100)
├─ malware_associated        (bool)
└─ active_exploits           (int)

TIER 3: Advanced Threat Intel (9 features)

Sources:
├─ Metasploit Modules   → Weaponized exploit availability
├─ Censys               → Internet exposure metrics (optional)
└─ CVSS Severity        → Attack impact categorization

Features:
├─ metasploit_modules      (int, count)
├─ has_metasploit_module   (bool)
├─ module_rank             (categorical: critical/good/normal/low/unranked)
├─ module_type             (categorical: exploit/reliability/denial_of_service)
├─ censys_exposed_count    (int)
├─ has_censys_data         (bool)
├─ cvss_severity_band      (categorical)
├─ is_critical_cvss        (bool, CVSS >= 9.0)
└─ is_high_cvss            (bool, CVSS >= 7.0)
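
The three CVSS-derived features can be computed from the base score alone. The sketch below uses the thresholds stated above for the two boolean flags. The band labels follow the CVSS v3.x qualitative severity scale, which is an assumption here because the README does not list the band values; `cvss_features` is a hypothetical helper name.

```python
from typing import Dict, Union


def cvss_features(cvss_score: float) -> Dict[str, Union[str, bool]]:
    """Derive the TIER 3 CVSS features from a base score in [0.0, 10.0]."""
    if not 0.0 <= cvss_score <= 10.0:
        raise ValueError(f"CVSS score out of range: {cvss_score}")

    # Band boundaries per the CVSS v3.x qualitative scale (assumed labels).
    if cvss_score == 0.0:
        band = "none"
    elif cvss_score < 4.0:
        band = "low"
    elif cvss_score < 7.0:
        band = "medium"
    elif cvss_score < 9.0:
        band = "high"
    else:
        band = "critical"

    return {
        "cvss_severity_band": band,
        "is_critical_cvss": cvss_score >= 9.0,
        # Also true for critical scores, since the threshold is >= 7.0.
        "is_high_cvss": cvss_score >= 7.0,
    }


print(cvss_features(7.3))
print(cvss_features(10.0))
```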

🤖 Model Comparison: v1 vs v3

| Aspect | Model v1 | Model v3 |
|---|---|---|
| Features | 4-6 basic | 28 enriched |
| Data Sources | NVD + EPSS only | TIER 1-3 (8+ sources) |
| Risk Scale | 0-65 (linear) | 0-1 (normalized) |
| Test MAE | 0.0247 | 0.0058 (4x better) |
| Test R² | 0.8934 | 0.9806 (9.8% better) |
| Confidence Score | ❌ None | ✅ 0-100% |
| Data Source Tracking | Generic | Explicit (3/28 features) |
| Feature Transparency | Limited | Full audit trail |
| Production Ready | ⚠️ Legacy | ✅ Current |

Recommendation: Use Model v3 exclusively for new deployments.
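
The "4x better" and "9.8% better" figures in the comparison can be reproduced from the reported metrics. The short calculation below assumes both MAE values are reported on the same scale, and reads the R² improvement as a relative gain over v1.

```python
# Reported test metrics from the comparison table.
mae_v1, mae_v3 = 0.0247, 0.0058
r2_v1, r2_v3 = 0.8934, 0.9806

# How many times smaller the v3 error is than the v1 error.
mae_ratio = mae_v1 / mae_v3

# Relative R² gain over v1, in percent.
r2_relative_gain = (r2_v3 - r2_v1) / r2_v1 * 100

print(f"MAE ratio: {mae_ratio:.2f}x")
print(f"Relative R² gain: {r2_relative_gain:.1f}%")
```

The ratio comes out at roughly 4.26, which the table rounds down to 4x.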


🚀 Quick Start

1. Installation

# Clone repository
git clone https://github.com/KulbirJ/Cyber-Risk-ML-Training.git
cd Cyber-Risk-ML-Training

# Create virtual environment
python -m venv venv
.\venv\Scripts\Activate.ps1  # Windows
# or
source venv/bin/activate    # Linux/Mac

# Install dependencies
pip install -r requirements.txt

2. Setup Environment

# Create .env file with your NVD API key
echo "NVD_API_KEY=your_key_here" > .env

3. Run the API

python deploy_model_v3.py

Output:

✓ Loaded 500 enriched CVEs from cves_enhanced_tier3.csv

╔════════════════════════════════════════════════════════════╗
║   CYBER RISK MODEL v3 - PRODUCTION DEPLOYMENT              ║
╠════════════════════════════════════════════════════════════╣
║  Starting FastAPI server on http://localhost:8000          ║
╚════════════════════════════════════════════════════════════╝

INFO:     Uvicorn running on http://0.0.0.0:8000

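4. Access the API

  • Swagger UI: http://localhost:8000/docs
  • Health check: http://localhost:8000/health
  • API info: http://localhost:8000/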


📡 API Usage

Predict CVE Risk

Endpoint: POST /predict

Request:

{
  "cve_id": "CVE-2025-12604",
  "use_enriched_data": true
}

Response (Enriched Dataset):

{
  "cve_id": "CVE-2025-12604",
  "model_version": "v3",
  "cvss_score": 7.3,
  "epss_score": 0.0004,
  "days_since_published": 119,
  "predicted_risk_score": 0.2924,
  "severity_label": "High",
  "severity_numeric": 2,
  "confidence": 0.9942,
  "data_source": "enriched_dataset",
  "timestamp": "2026-03-03T23:04:36.039054+00:00",
  "features_available": 28
}

Response (Live NVD):

{
  "cve_id": "CVE-2026-20127",
  "model_version": "v3",
  "cvss_score": 10.0,
  "epss_score": 0.02604,
  "days_since_published": 6,
  "predicted_risk_score": 0.5190,
  "severity_label": "Critical",
  "severity_numeric": 3,
  "confidence": 0.9628,
  "data_source": "nvd_live",
  "timestamp": "2026-03-03T23:05:44.425784+00:00",
  "features_available": 3
}
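
On the consumer side, the response can be parsed into a typed object so that missing or renamed fields fail loudly. This is a sketch using only the standard library. The field names come from the sample responses above, while the `Prediction` class and `parse_prediction` helper are hypothetical and not part of the repository.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class Prediction:
    """Typed view of a /predict response (fields from the sample payloads)."""
    cve_id: str
    model_version: str
    cvss_score: float
    epss_score: float
    days_since_published: int
    predicted_risk_score: float
    severity_label: str
    severity_numeric: int
    confidence: float
    data_source: str
    timestamp: str
    features_available: int

    @property
    def is_live_fallback(self) -> bool:
        # Live predictions use only 3 basic features instead of 28.
        return self.data_source == "nvd_live"


def parse_prediction(body: str) -> Prediction:
    """Raises TypeError if the payload has missing or unexpected keys."""
    return Prediction(**json.loads(body))


sample = """{
  "cve_id": "CVE-2026-20127", "model_version": "v3", "cvss_score": 10.0,
  "epss_score": 0.02604, "days_since_published": 6,
  "predicted_risk_score": 0.5190, "severity_label": "Critical",
  "severity_numeric": 3, "confidence": 0.9628, "data_source": "nvd_live",
  "timestamp": "2026-03-03T23:05:44.425784+00:00", "features_available": 3
}"""

pred = parse_prediction(sample)
print(pred.severity_label, pred.is_live_fallback)
```

A consumer might use `is_live_fallback` to flag predictions that were made without the full enriched feature set.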

🧪 Testing

Run Comprehensive Test Suite

python test_model_v3.py

Tests Included:

  1. ✅ API Health Check
  2. ✅ Single CVE Prediction (Enriched)
  3. ✅ Batch Predictions (3 CVEs)
  4. ✅ Live NVD API Prediction
  5. ✅ API Documentation & Swagger

Output:

Tests Passed: 5/5
Success Rate: 100.0%

✓ All tests passed! Model v3 is ready for production.

📚 Project Structure

cyber-risk-ml-training/
├── deploy_model_v3.py           # FastAPI production server
├── test_model_v3.py             # Comprehensive test suite
├── train_risk_model_v3.py       # Model retraining script
│
├── enhance_cves_tier1.py        # TIER 1 enrichment (CISA, Exploit-DB, OSV)
├── enhance_cves_tier2.py        # TIER 2 enrichment (NVD, GitHub, OTX)
├── enhance_cves_tier3.py        # TIER 3 enrichment (Metasploit, Censys, CVSS)
│
├── cyber_risk_model_v3.json              # Trained regressor (28 features)
├── cyber_risk_severity_model_v3.json     # Trained classifier (28 features)
├── cyber_risk_model_v3_metadata.json     # Model metrics & performance
│
├── cves_enhanced_tier3.csv      # 500 enriched CVEs (33 columns)
├── cves_clean.csv               # Original CVE dataset (6 columns)
│
├── requirements.txt             # Python dependencies
├── .env                         # Environment variables (NVD_API_KEY)
├── .gitignore                   # Security (prevents .env commit)
│
├── README.md                    # This file
├── README_API.md                # API documentation
└── DEPLOYMENT_GUIDE_V3.md       # Deployment instructions

🔄 Data Processing Pipeline

Original CVEs (6 cols)
    │
    ├─► enhance_cves_tier1.py
    │   ├─ Fetch CISA KEV data
    │   ├─ Fetch Exploit-DB exploits
    │   └─ Fetch OSV database
    └─► cves_enhanced_tier1.csv (14 cols)
        │
        ├─► enhance_cves_tier2.py
        │   ├─ Fetch NVD CPE data
        │   ├─ Fetch GitHub Advisories
        │   └─ Fetch AlienVault OTX
        └─► cves_enhanced_tier2.csv (24 cols)
            │
            ├─► enhance_cves_tier3.py
            │   ├─ Fetch Metasploit modules
            │   ├─ Fetch Censys exposure (optional)
            │   └─ Calculate CVSS severity bands
            └─► cves_enhanced_tier3.csv (33 cols)
                │
                ├─► train_risk_model_v3.py
                │   ├─ Feature engineering (28 selected features)
                │   ├─ Train XGBRegressor (risk score)
                │   └─ Train XGBClassifier (severity)
                └─► Models deployed to production
                    │
                    ├─► cyber_risk_model_v3.json
                    ├─► cyber_risk_severity_model_v3.json
                    └─► deploy_model_v3.py (API server)
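
The column counts in the pipeline follow directly from the per-tier feature counts: each stage appends its features to the previous stage's output. The check below makes that arithmetic explicit; `cumulative_columns` is a hypothetical helper written for this illustration.

```python
from typing import Dict

BASE_COLUMNS = 6  # cves_clean.csv
TIER_FEATURES = {"tier1": 8, "tier2": 10, "tier3": 9}  # features added per tier


def cumulative_columns(base: int, added: Dict[str, int]) -> Dict[str, int]:
    """Running column total after each enrichment stage, in insertion order."""
    totals = {}
    running = base
    for stage, n_features in added.items():
        running += n_features
        totals[stage] = running
    return totals


columns = cumulative_columns(BASE_COLUMNS, TIER_FEATURES)
print(columns)
```

The result matches the 14, 24, and 33 columns shown for the three CSV outputs; the 28 model inputs are then selected from the final 33 columns.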

📊 Model Performance Metrics

Regressor (Risk Score Prediction)

  • Algorithm: XGBRegressor
  • Input: 28 features
  • Output: 0-1 risk probability
  • Test MAE: 0.0058 ✅
  • Test R²: 0.9806 (98% variance explained)
  • Train/Test Split: 400/100 (80/20)

Classifier (Severity Classification)

  • Algorithm: XGBClassifier
  • Input: 28 features
  • Output: 0=Low, 1=Medium, 2=High, 3=Critical
  • Test Accuracy: 100% ✅
  • Test F1-Score: 1.0 ✅
  • Class Distribution: 30 Low, 261 Medium, 209 High, 57 Critical
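
For readers unfamiliar with the regression metrics above, the sketch below shows how MAE and R² are defined, using plain Python. The four data points are made up for illustration and are not taken from the project's test set.

```python
from typing import Sequence


def mae(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Mean absolute error: average size of the prediction error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r_squared(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Coefficient of determination: 1 - (residual SS / total SS)."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Toy risk scores on the 0-1 scale (illustrative values only).
y_true = [0.10, 0.30, 0.50, 0.90]
y_pred = [0.12, 0.28, 0.50, 0.88]

toy_mae = mae(y_true, y_pred)
toy_r2 = r_squared(y_true, y_pred)
print(f"MAE={toy_mae:.4f}, R²={toy_r2:.4f}")
```

An R² of 0.9806 therefore means the residual sum of squares is about 2% of the total variance in the test targets.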

🔐 Security

API Key Protection

# Never commit .env file
.env                    # ← In .gitignore

Environment Configuration

from dotenv import load_dotenv
import os

load_dotenv()
NVD_API_KEY = os.getenv("NVD_API_KEY")

Secret Management

  • βœ… All API keys in .env
  • βœ… .gitignore prevents accidental commits
  • βœ… Production uses CI/CD secrets
  • βœ… No credentials in code

🌐 Integration Examples

cURL: Single Prediction

curl -X POST "http://localhost:8000/predict" \
  -H "Content-Type: application/json" \
  -d '{"cve_id": "CVE-2025-12604", "use_enriched_data": true}'

Python: Batch Prediction

import requests

cve_ids = ["CVE-2025-12604", "CVE-2025-12605", "CVE-2025-12606"]
results = []

for cve_id in cve_ids:
    response = requests.post(
        "http://localhost:8000/predict",
        json={"cve_id": cve_id, "use_enriched_data": True},
        timeout=30,  # avoid hanging if a live NVD lookup stalls
    )
    response.raise_for_status()  # fail fast instead of sorting an error payload
    results.append(response.json())

# Sort by risk score (highest first)
results.sort(key=lambda x: x["predicted_risk_score"], reverse=True)
for cve in results:
    print(f"{cve['cve_id']}: {cve['severity_label']} ({cve['predicted_risk_score']:.2%})")

PowerShell: Health Check

$health = Invoke-WebRequest -Uri "http://localhost:8000/health" | ConvertFrom-Json
Write-Host "Model: $($health.model_version)"
Write-Host "Enriched CVEs: $($health.enriched_cves)"
Write-Host "Status: $($health.status)"

📖 Documentation

| Document | Purpose |
|---|---|
| README.md | System overview (this file) |
| README_API.md | Detailed API reference |
| DEPLOYMENT_GUIDE_V3.md | Step-by-step deployment |

🤝 Contributing

Adding New Enrichment Sources

  1. Create enhance_cves_tierX.py
  2. Implement API integration
  3. Test with sample CVEs
  4. Update feature list
  5. Retrain models
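
The core of a new enrichment script (steps 1-2) is usually a per-CVE lookup merged into the existing rows. The sketch below shows one possible shape. The `vendor_advisory_count` and `has_vendor_advisory` columns, the `fake_lookup` source, and the `enrich_rows` helper are all hypothetical and stand in for a real API integration.

```python
from typing import Callable, Dict, List, Optional

Row = Dict[str, object]


def enrich_rows(
    rows: List[Row],
    lookup: Callable[[str], Optional[Row]],
    defaults: Row,
) -> List[Row]:
    """Add the new tier's columns to every row, using defaults on a miss."""
    enriched = []
    for row in rows:
        extra = lookup(str(row["cve_id"])) or {}
        merged = dict(row)  # copy so the input rows are left untouched
        for column, default in defaults.items():
            merged[column] = extra.get(column, default)
        enriched.append(merged)
    return enriched


def fake_lookup(cve_id: str) -> Optional[Row]:
    """Placeholder for a real API call; only knows one CVE."""
    known = {"CVE-2025-12604": {"vendor_advisory_count": 2, "has_vendor_advisory": True}}
    return known.get(cve_id)


rows = [{"cve_id": "CVE-2025-12604"}, {"cve_id": "CVE-2025-12605"}]
defaults = {"vendor_advisory_count": 0, "has_vendor_advisory": False}
result = enrich_rows(rows, fake_lookup, defaults)
print(result)
```

Filling defaults for CVEs the source does not know keeps every row at the same width, which the models need when the feature list is updated and they are retrained.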

Improving Model Accuracy

  1. Add new features (maintain backward compatibility)
  2. Rerun enrichment pipeline
  3. Execute train_risk_model_v3.py
  4. Validate test metrics
  5. Deploy new model version

📝 License

[Add your license information here]


👨‍💻 Authors

Kulbir J - Cyber Risk ML Training System


📞 Support

| Issue | Resolution |
|---|---|
| Port 8000 in use | Run Get-NetTCPConnection -LocalPort 8000 to find the owning process, then stop it |
| API key issues | Check the .env file and the NVD_API_KEY environment variable |
| Model load errors | Verify that the model files (.json) exist in the working directory |
| Slow predictions | Check the network connection (live predictions call the NVD API) |
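
Get-NetTCPConnection is specific to PowerShell. As a cross-platform way to diagnose the "port in use" case before starting the server, a short Python check can test whether anything is already listening. `port_in_use` is a hypothetical helper, and the 0.5 second timeout is an arbitrary choice.

```python
import socket


def port_in_use(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(0.5)
        # connect_ex returns 0 on success and an error code otherwise.
        return sock.connect_ex((host, port)) == 0
```

For example, `port_in_use(8000)` returning True before launching deploy_model_v3.py suggests another process already holds the port.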

🎯 Roadmap

  • βœ… Phase 1-4: Core system (complete)
  • πŸ”„ Phase 5: Real-time threat intel integration
  • πŸ”„ Phase 6: AutoML hyperparameter tuning
  • πŸ”„ Phase 7: Multi-model ensemble
  • πŸ”„ Phase 8: Horizontal scaling (Kubernetes)

Last Updated: March 4, 2026
Model Version: v3
Status: Production Ready ✅
