Khanz9664/TrustLens

TrustLens


Audit ML models beyond accuracy — calibration, fairness, latent health, and deployment verdicts.


PyPI · Downloads · CI · Coverage · License: MIT · Tests


Quickstart · How It Works · Demo Video · Docs · Project Showcase


Your model has 92% accuracy. It's still not safe for deployment.

Accuracy measures what went right. TrustLens measures what can go wrong — in production, on subgroups, and at high confidence.


Why TrustLens

Standard evaluation stops at accuracy. Silent failures happen when:

  • A model is overconfident — "90% sure" but right only 60% of the time
  • Performance collapses on subgroups — gender, age, or region hidden inside a good aggregate score
  • The model is confidently wrong — high-confidence errors that indicate systemic risk
  • Latent representations overlap — classes bleed together where the model can't tell them apart

TrustLens surfaces all four with a single audit, and outputs a machine-readable deployment verdict.
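To make the subgroup failure mode concrete, here is a minimal sketch in plain Python showing how a reasonable aggregate accuracy can hide a complete collapse on a small group. The function name `accuracy_by_group` and the toy data are illustrative assumptions, not part of the TrustLens API.

```python
from collections import defaultdict


def accuracy_by_group(y_true, y_pred, groups):
    """Return (overall_accuracy, {group: accuracy}) for parallel sequences."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for truth, pred, group in zip(y_true, y_pred, groups):
        totals[group] += 1
        hits[group] += int(truth == pred)
    overall = sum(hits.values()) / sum(totals.values())
    per_group = {g: hits[g] / totals[g] for g in totals}
    return overall, per_group


# Toy data (hypothetical): 8 samples in group "a", 2 in group "b".
y_true = [1, 0, 1, 1, 0, 1, 0, 1, 1, 0]
y_pred = [1, 0, 1, 1, 0, 1, 0, 1, 0, 1]
groups = ["a"] * 8 + ["b"] * 2

overall, per_group = accuracy_by_group(y_true, y_pred, groups)
print(overall)     # 0.8
print(per_group)   # {'a': 1.0, 'b': 0.0}
```

The aggregate score of 0.8 looks acceptable, yet every prediction for group "b" is wrong.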


Supported Frameworks

TrustLens uses a Prediction Resolver Architecture to automatically handle different ML frameworks:

  • scikit-learn — Full support for all ClassifierMixin estimators.
  • XGBoost — Native support for XGBClassifier and raw Booster objects.
  • LightGBM — Native support for LGBMClassifier and raw Booster objects.
  • CatBoost — Native support for CatBoostClassifier.
  • Planned — PyTorch, TensorFlow/Keras.

TrustLens automatically detects your model's framework. You don't need to change your code when switching from sklearn to XGBoost.
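One way such detection could work is to inspect the module that defines the model's class and then look for a `predict_proba` method. The sketch below is an assumption about the general technique, not the actual TrustLens resolver. `resolve_framework`, `resolve_probabilities`, and `ToyModel` are hypothetical names, and raw Booster objects (which lack `predict_proba`) would need extra handling that is not shown.

```python
def resolve_framework(model):
    """Guess the framework from the top-level module of the model's class."""
    root = type(model).__module__.split(".")[0]
    known = {"sklearn", "xgboost", "lightgbm", "catboost"}
    return root if root in known else "unknown"


def resolve_probabilities(model, X):
    """Return class probabilities via predict_proba, or raise if unavailable."""
    predict_proba = getattr(model, "predict_proba", None)
    if callable(predict_proba):
        return predict_proba(X)
    raise TypeError(
        f"{type(model).__name__} has no predict_proba; pass y_prob explicitly"
    )


class ToyModel:
    """Hypothetical stand-in classifier, always 80% confident in class 1."""

    def predict_proba(self, X):
        return [[0.2, 0.8] for _ in X]


model = ToyModel()
print(resolve_framework(model))              # unknown (not a listed framework)
print(resolve_probabilities(model, [0, 1]))  # [[0.2, 0.8], [0.2, 0.8]]
```

Because the lookup relies on duck typing, any object exposing `predict_proba` is handled the same way. That is consistent with the claim that calling code need not change when switching frameworks.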


Documentation

The full TrustLens documentation is available through the Docs link above.


Quickstart

pip install trustlens
# Extended visualization support (quoted so shells such as zsh do not expand the brackets)
pip install "trustlens[full]"

Run a one-line audit to see why 94% accuracy isn't the full story:

from trustlens import quick_analyze

quick_analyze(dataset="breast_cancer")

Example output:

TRUST SCORE: 68/100 [D]
Assessment : Low Trust — Blocked by high diagnostic risk

  Base Score        : 76
  Penalties Applied : -7.7 (Failure Risk)
  Final Score       : 68

→ Model shows high failure risk and is NOT ready for deployment.
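The numbers in this output are consistent with a simple subtraction: 76 minus 7.7 is 68.3, which rounds to 68. The sketch below reproduces that arithmetic. It assumes penalties are summed, subtracted from the base score, clamped to 0-100, and rounded. The real scoring formula is not documented in this section, and `apply_penalties` is a hypothetical name.

```python
def apply_penalties(base_score, penalties):
    """Subtract penalties from a base score, clamp to 0-100, round to an int."""
    raw = base_score - sum(penalties.values())
    return round(max(0.0, min(100.0, raw)))


# Values taken from the example output above.
final = apply_penalties(76, {"Failure Risk": 7.7})
print(final)  # 68
```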

How It Works

TrustLens runs four diagnostic modules and combines them into a single Trust Score (0–100) with a CI/CD-ready deployment verdict.

Module           What It Catches
Calibration      Confidence vs. correctness mismatch, overconfidence, ECE
Fairness         Subgroup performance gaps, equalized-odds violations
Representation   Latent space health, class separation, overlap detection
Decision Engine  Composite Trust Score + Ready / Blocked verdict
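ECE (expected calibration error) is the weighted average gap between stated confidence and observed accuracy across confidence bins. The sketch below uses the common equal-width binning definition in plain Python. It is an illustration of the metric, not the TrustLens implementation, and `expected_calibration_error` is a hypothetical name.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted mean |accuracy - confidence| over equal-width confidence bins."""
    bins = [[] for _ in range(n_bins)]
    for conf, hit in zip(confidences, correct):
        # Map conf in [0, 1] to a bin index; conf == 1.0 goes in the last bin.
        index = min(int(conf * n_bins), n_bins - 1)
        bins[index].append((conf, hit))

    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(h for _, h in members) / len(members)
        ece += (len(members) / total) * abs(accuracy - avg_conf)
    return ece


# Ten predictions made at 90% confidence, only six of them correct.
confidences = [0.9] * 10
correct = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
ece = expected_calibration_error(confidences, correct)
print(round(ece, 3))  # 0.3
```

The toy data matches the overconfidence case described earlier: the model claims 90% confidence but is right only 60% of the time, so the gap is 0.3.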

Scientific Validation

TrustLens is more than a visualization tool—it is a statistically grounded diagnostic framework. We have systematically validated its behavior across 6 model architectures and multiple data corruption scenarios (noise, imbalance, bias).

Key Finding: TrustLens empirically decouples Accuracy from Trust, flagging high-accuracy models that exhibit high reliability risks (the "Overconfidence Zone").

View the Model Zoo Benchmark


Full Audit

Automatic Detection (scikit-learn / XGBoost / LightGBM / CatBoost)

from trustlens import analyze
from xgboost import XGBClassifier
# from lightgbm import LGBMClassifier
# from catboost import CatBoostClassifier

# Assumes X_train, y_train, X_test, y_test, and gender_test (one value per
# test row) come from your own train/test split.
# LGBMClassifier and CatBoostClassifier work the same way as XGBClassifier.
model = XGBClassifier().fit(X_train, y_train)

# TrustLens automatically detects the framework and resolves predictions
report = analyze(
    model=model,
    X=X_test,
    y_true=y_test,
    sensitive_features={"gender": gender_test}
)

report.show()

Manual Prediction Override

For external inference systems or unsupported frameworks, you can pass predictions directly:

report = analyze(
    model=None, # optional when passing y_pred/y_prob
    X=X_test,
    y_true=y_test,
    y_pred=external_preds,
    y_prob=external_probs
)
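If the external system only returns class probabilities, hard labels can be derived before calling `analyze`. The helper below is a plain-Python sketch. `labels_from_probabilities` is a hypothetical name, and the assumed input layout (one row of per-class probabilities per sample) may differ from what `analyze` expects for `y_prob`, which this section does not specify.

```python
def labels_from_probabilities(prob_rows, classes=None):
    """Pick the most probable class for each row of class probabilities."""
    labels = []
    for row in prob_rows:
        # Index of the largest probability; ties resolve to the lowest index.
        best = max(range(len(row)), key=row.__getitem__)
        labels.append(classes[best] if classes is not None else best)
    return labels


# Hypothetical probabilities returned by an external inference service.
external_probs = [[0.1, 0.9], [0.7, 0.3], [0.5, 0.5]]
external_preds = labels_from_probabilities(external_probs)
print(external_preds)  # [1, 0, 0]
```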

Audit Metadata & Provenance

Every report tracks its own backend provenance for auditability:

print(report.metadata["framework"])  # "xgboost" | "lightgbm" | "catboost" | "sklearn"
print(report.metadata["backend"])    # {'resolver': 'xgboost', 'framework_version': '2.0.3', ...}

Save & Export

# Save as a unified JSON artifact (best for experiment trackers)
report.save("report.json")

# Save as a full directory bundle (best for human review)
report.save("trust_report/")

Output artifacts (Directory Bundle)

trust_report/
├── trust_score.json    ← deployment verdict & composite score
├── report.json         ← raw diagnostic metrics
├── metadata.json       ← environment, version, backend provenance
├── report.txt          ← human-readable summary
└── visuals/            ← per-module diagnostic plots (PNG)

CI/CD gating

Gate model promotion on trust_score.json — no custom scripting needed:

{
  "score": 68,
  "grade": "D",
  "verdict": "Low Trust — Blocked by high failure risk",
  "is_blocked": true
}
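Where a pipeline does need an explicit check, a few lines of Python are enough to turn the verdict file into an exit code. This sketch relies only on the `is_blocked` field shown above. `gate_exit_code` is a hypothetical helper, not a TrustLens command, and the demo writes a trimmed copy of the example verdict to a temporary file.

```python
import json
import tempfile
from pathlib import Path


def gate_exit_code(score_path):
    """Return 0 if the model may be promoted, 1 if it is blocked."""
    verdict = json.loads(Path(score_path).read_text(encoding="utf-8"))
    # Treat a missing flag as blocked so a malformed file fails closed.
    blocked = verdict.get("is_blocked", True)
    return 1 if blocked else 0


# Demo: trimmed copy of the example verdict (verdict text omitted).
example = {"score": 68, "grade": "D", "is_blocked": True}
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "trust_score.json"
    path.write_text(json.dumps(example), encoding="utf-8")
    exit_code = gate_exit_code(path)
print(exit_code)  # 1
```

In a CI step, passing the result to `sys.exit` (for example with the path `trust_report/trust_score.json` from the directory bundle) would fail the job whenever the model is blocked.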

Diagnostics in Practice

Calibration

Does confidence align with correctness?

Fairness & Bias

Are subgroups treated equally?

Latent Space Health

Is class separation clean?

Deployment Verdict

Is this model safe to ship?

Demo

Watch the demo

15-minute walkthrough: diagnostics, trust scoring, fairness analysis, and visual dashboards.

Want a deeper look at the architecture and design decisions? → Interactive Project Showcase


Run the Full Demo

python demo.py

Generates multi-model comparisons, fairness deep-dives, latent space projections, JSON audits, and visual dashboards across all modules.


Contributing

All contributions welcome — new metrics, diagnostic plugins, and visualizations.

Contributing Guide · Open an Issue · Docs



Citation

@software{trustlens2026,
  author = {Shahid Ul Islam},
  title  = {TrustLens: Audit ML models beyond accuracy},
  year   = {2026},
  url    = {https://github.com/Khanz9664/TrustLens}
}

Built by Shahid Ul Islam  ·  Portfolio  ·  LinkedIn
