Your model has 92% accuracy. It's still not safe for deployment.
Accuracy measures what went right. TrustLens measures what can go wrong — in production, on subgroups, and at high confidence.
Standard evaluation stops at accuracy. Silent failures happen when:
- A model is overconfident — "90% sure" but right only 60% of the time
- Performance collapses on subgroups — gender, age, or region hidden inside a good aggregate score
- The model is confidently wrong — high-confidence errors that indicate systemic risk
- Latent representations overlap — classes bleed together where the model can't tell them apart
TrustLens surfaces all four with a single audit, and outputs a machine-readable deployment verdict.
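To make the first failure mode concrete, here is a minimal sketch of expected calibration error (ECE), the metric named in the Calibration module. It is a generic, textbook-style implementation in plain Python, not TrustLens's own code; the function name `expected_calibration_error` and the equal-width binning are assumptions for illustration.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average gap between mean confidence and accuracy per bin.

    Illustrative helper (hypothetical, not part of the TrustLens API).
    """
    bins = [[] for _ in range(n_bins)]
    for conf, hit in zip(confidences, correct):
        # Equal-width bins over [0, 1]; a confidence of 1.0 goes in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, hit))

    total = len(confidences)
    ece = 0.0
    for items in bins:
        if not items:
            continue
        mean_conf = sum(c for c, _ in items) / len(items)
        accuracy = sum(h for _, h in items) / len(items)
        ece += (len(items) / total) * abs(mean_conf - accuracy)
    return ece


# Ten predictions, all made at 90% confidence, only six of them correct.
confidences = [0.9] * 10
correct = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
ece = expected_calibration_error(confidences, correct)
print(f"ECE: {ece:.2f}")
```

In this toy case every prediction lands in one bin, so the ECE is simply the gap between stated confidence (0.9) and observed accuracy (0.6).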
TrustLens uses a Prediction Resolver Architecture to automatically handle different ML frameworks:
- scikit-learn: Full support for all ClassifierMixin estimators.
- XGBoost: Native support for XGBClassifier and raw Booster objects.
- LightGBM: Native support for LGBMClassifier and raw Booster objects.
- CatBoost: Native support for CatBoostClassifier.
- Planned: PyTorch, TensorFlow/Keras.
TrustLens automatically detects your model's framework. You don't need to change your code when switching from sklearn to XGBoost.
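One common way to implement this kind of detection is to inspect the module in which the model's class is defined. The sketch below illustrates the idea only; it is an assumption about how such a resolver could work, not TrustLens's actual implementation, and `detect_framework` and `KNOWN_FRAMEWORKS` are hypothetical names.

```python
# Hypothetical helper, not part of the TrustLens API.
KNOWN_FRAMEWORKS = {"sklearn", "xgboost", "lightgbm", "catboost"}


def detect_framework(model):
    """Guess the framework from the top-level package of the model's class.

    Walks the method resolution order from most to least specific, so an
    XGBoost wrapper that also inherits from scikit-learn mixins would be
    reported as "xgboost" rather than "sklearn".
    """
    for cls in type(model).__mro__:
        root_package = cls.__module__.split(".")[0]
        if root_package in KNOWN_FRAMEWORKS:
            return root_package
    return "unknown"


# Stand-in class so the sketch runs without any ML library installed.
class FakeBooster:
    pass


FakeBooster.__module__ = "xgboost.core"

print(detect_framework(FakeBooster()))  # resolves via the class's module name
print(detect_framework(object()))       # builtins are not a known framework
```

Resolving on the class's module rather than on method names keeps the check cheap and avoids importing frameworks that are not installed.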
Explore the full TrustLens documentation:
- 🚀 Getting Started
- 🏛️ Architecture Guide
- 📖 API Reference
- 🧪 Real-world Use Cases
- ⚖️ Trust Score Explained
pip install trustlens
# Extended visualization support
pip install trustlens[full]

Run a one-line audit to see why 94% accuracy isn't the full story:
from trustlens import quick_analyze
quick_analyze(dataset="breast_cancer")

TRUST SCORE: 68/100 [D]
Assessment : Low Trust — Blocked by high diagnostic risk
Base Score : 76
Penalties Applied : -7.7 (Failure Risk)
Final Score : 68
→ Model shows high failure risk and is NOT ready for deployment.
TrustLens runs four diagnostic modules and combines them into a single Trust Score (0–100) with a CI/CD-ready deployment verdict.
| Module | What It Catches |
|---|---|
| Calibration | Confidence vs. correctness mismatch, overconfidence, ECE |
| Fairness | Subgroup performance gaps, equalized-odds violations |
| Representation | Latent space health, class separation, overlap detection |
| Decision Engine | Composite Trust Score + Ready / Blocked verdict |
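As an illustration of what the Fairness row means by a subgroup performance gap, the following sketch computes per-group accuracy on toy data and reports the spread. It is a generic example in plain Python, not the metric TrustLens computes internally; `subgroup_accuracy` and the sample labels are made up for the illustration.

```python
from collections import defaultdict


def subgroup_accuracy(y_true, y_pred, groups):
    """Return accuracy per subgroup (hypothetical helper, not TrustLens API)."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for truth, pred, group in zip(y_true, y_pred, groups):
        totals[group] += 1
        hits[group] += int(truth == pred)
    return {group: hits[group] / totals[group] for group in totals}


# Toy data: the aggregate looks acceptable, one subgroup does not.
y_true = [1, 0, 1, 1, 0, 1, 0, 1, 1, 0]
y_pred = [1, 0, 1, 1, 0, 1, 0, 0, 0, 1]
gender = ["m", "m", "m", "m", "m", "m", "f", "f", "f", "f"]

overall = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
per_group = subgroup_accuracy(y_true, y_pred, gender)
gap = max(per_group.values()) - min(per_group.values())

print(f"overall={overall:.2f} per_group={per_group} gap={gap:.2f}")
```

Here the overall accuracy is 0.70, but it hides a 0.75 gap between a group scored perfectly and a group scored at 0.25, which is the kind of pattern an aggregate score conceals.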
TrustLens is more than a visualization tool: it is a statistically grounded diagnostic framework. Its behavior has been systematically validated across six model architectures and multiple data corruption scenarios (noise, imbalance, bias).
Key Finding: TrustLens empirically decouples Accuracy from Trust, flagging high-accuracy models that exhibit high reliability risks (the "Overconfidence Zone").
from trustlens import analyze

# Works the same way for XGBClassifier, LGBMClassifier, or CatBoostClassifier
from xgboost import XGBClassifier
# from lightgbm import LGBMClassifier
# from catboost import CatBoostClassifier

model = XGBClassifier().fit(X_train, y_train)

# TrustLens automatically detects the framework and resolves predictions
report = analyze(
    model=model,
    X=X_test,
    y_true=y_test,
    sensitive_features={"gender": gender_test},
)
report.show()

For external inference systems or unsupported frameworks, you can pass predictions directly:
report = analyze(
    model=None,  # optional when passing y_pred/y_prob
    X=X_test,
    y_true=y_test,
    y_pred=external_preds,
    y_prob=external_probs,
)

Every report tracks its own backend provenance for auditability:
print(report.metadata["framework"])  # "xgboost" | "lightgbm" | "catboost" | "sklearn"
print(report.metadata["backend"])    # {'resolver': 'xgboost', 'framework_version': '2.0.3', ...}

# Save as a unified JSON artifact (best for experiment trackers)
report.save("report.json")
# Save as a full directory bundle (best for human review)
report.save("trust_report/")

trust_report/
├── trust_score.json ← deployment verdict & composite score
├── report.json ← raw diagnostic metrics
├── metadata.json ← environment, version, backend provenance
├── report.txt ← human-readable summary
└── visuals/ ← per-module diagnostic plots (PNG)
Gate model promotion on trust_score.json — no custom scripting needed:
{
"score": 68,
"grade": "D",
"verdict": "Low Trust — Blocked by high failure risk",
"is_blocked": true
}
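For pipelines that prefer an explicit gate step, a minimal sketch of reading this file in CI might look like the following. It relies only on the fields shown above (`score`, `grade`, `verdict`, `is_blocked`); the function name `gate` and the file location are assumptions, not something TrustLens ships.

```python
import json
from pathlib import Path


def gate(path):
    """Return a process exit code: 1 if the model is blocked, 0 otherwise.

    Hypothetical CI helper; reads the trust_score.json fields shown above.
    """
    result = json.loads(Path(path).read_text(encoding="utf-8"))
    summary = f"score {result['score']} (grade {result['grade']}): {result['verdict']}"
    if result["is_blocked"]:
        print(f"BLOCKED: {summary}")
        return 1
    print(f"READY: {summary}")
    return 0


# In a CI job, the return value would typically become the exit status, e.g.:
#   import sys
#   sys.exit(gate("trust_report/trust_score.json"))
```

A non-zero exit status fails the job in most CI systems, so model promotion stops whenever `is_blocked` is true.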
| Diagnostic | Question It Answers |
|---|---|
| Calibration | Does confidence align with correctness? |
| Fairness & Bias | Are subgroups treated equally? |
| Latent Space Health | Is class separation clean? |
| Deployment Verdict | Is this model safe to ship? |
15-minute walkthrough: diagnostics, trust scoring, fairness analysis, and visual dashboards.
Want a deeper look at the architecture and design decisions? → Interactive Project Showcase
python demo.py

Generates multi-model comparisons, fairness deep-dives, latent space projections, JSON audits, and visual dashboards across all modules.
All contributions welcome — new metrics, diagnostic plugins, and visualizations.
→ Contributing Guide · Open an Issue · Docs
@software{trustlens2026,
author = {Shahid Ul Islam},
title = {TrustLens: Audit ML models beyond accuracy},
year = {2026},
url = {https://github.com/Khanz9664/TrustLens}
}




