Observations of PFEM repo vs. Scanner #69

@lightrock

I’ll inspect the Bawbel scanner repo and look specifically for PFEM-style gaps: evidence boundaries, contracts/schemas, tests, runtime-vs-static separation, confidence vs severity, audit/trace artifacts, and whether findings are structurally testable. Then I’ll draft GitHub issue text you can paste.

I inspected the Bawbel repo docs/source through GitHub. I did not run it locally. My read: this project is already more disciplined than most security scanners, but there is a very good PFEM-style issue to open: make the evidence/confidence layer explicit in machine-readable output instead of letting severity/risk carry too much meaning.

Why this issue is fair: Bawbel’s README describes a pipeline with multiple engines, deduplication, toxic flow analysis, findings[], suppressed_findings[], accepted_findings[], toxic_flows[], risk_score, and AIVSS score. It also documents a false-positive pipeline that includes confidence scoring. But the Finding model is centered on rule identity, AVE ID, title, severity, AIVSS score, engine, location, match, mappings, and suppression flags; it does not define first-class confidence/evidence provenance fields. The scanner assigns f.confidence dynamically during FP scoring, but ScanResult.to_dict() serializes findings without confidence, evidence stage, suppression reason for active findings, or a confidence rationale. Toxic flows are correctly modeled as derived artifacts rather than raw detections, which is excellent PFEM discipline, but the serialized toxic flow still has severity/AIVSS/remediation without confidence/evidence basis for the chain.
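
To illustrate how a dynamically assigned confidence value can fall out of serialized output, here is a minimal sketch. It is an assumption-laden stand-in, not Bawbel's actual code: the trimmed `Finding` class and the use of `dataclasses.asdict` are hypothetical, and I did not verify how the real `to_dict()` is written.

```python
from dataclasses import dataclass, asdict


@dataclass
class Finding:
    # Hypothetical, trimmed-down stand-in for the real model.
    rule_id: str
    severity: str
    aivss_score: float


finding = Finding(rule_id="example-rule", severity="HIGH", aivss_score=8.1)

# A later pipeline stage attaches confidence as an undeclared attribute.
finding.confidence = 0.42

# A serializer that walks declared fields never sees the extra attribute.
serialized = asdict(finding)
print(serialized)
print("confidence" in serialized)  # False
```

The value exists on the object at runtime, but any consumer of the JSON never sees it. Declaring the field on the model is what makes it part of the contract.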

Here is the issue I would open.

Title: Add first-class evidence/confidence metadata to Finding and ToxicFlow JSON output

First off: I really like the direction of Bawbel. The architecture already separates several things that many scanners collapse together: AVE findings, conformance scoring, toxic flows, suppressions, accepted findings, and AIVSS severity.

I am opening this as an architecture suggestion, not a bug report.

Problem

Right now the output appears to make severity/risk very visible, but confidence/evidence quality less visible.

For MCP security scanning, I think those need to be separate fields.

A finding can be:

  • high severity, high confidence
  • high severity, low confidence
  • low severity, high confidence
  • statically inferred
  • runtime observed
  • inferred from a tool description
  • derived from multiple findings
  • suppressed automatically
  • accepted as risk
  • expired and resurfaced

Those are different evidence states.

AIVSS/risk answers: “how bad would this be?”

Confidence/evidence metadata answers: “how sure are we, and what kind of evidence supports it?”

Those should both be available to downstream tools.

Current shape I noticed

From the docs and code, Bawbel already has the pieces:

  • multiple engines: pattern, YARA, Semgrep, LLM, Magika, sandbox
  • false-positive confidence scoring
  • suppressions and justified accepted risks
  • toxic flows as derived artifacts
  • conformance reports as a separate report type
  • JSON/SARIF outputs

That is strong.

But in the public data model/output, confidence and evidence basis do not seem to be first-class stable fields on findings or toxic flows.

Proposal

Add explicit evidence/confidence metadata to Finding.to_dict() / ScanResult.to_dict() and toxic flow output.

Possible fields:

{
  "confidence": 0.92,
  "confidence_band": "high",
  "evidence_stage": "static_scan",
  "evidence_kind": "tool_description_pattern",
  "evidence_basis": [
    "pattern",
    "semgrep"
  ],
  "confidence_reason": "two engines agreed on same AVE ID and file profile was skill",
  "source_claim": "tool description declares behavior",
  "observed_behavior": null,
  "derived": false
}
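
A minimal sketch of how these fields might be declared so they always reach the serialized output. The class name `EvidenceMetadata`, the helper `confidence_band`, and the 0.8 / 0.5 thresholds are all hypothetical choices for illustration, not part of Bawbel.

```python
from dataclasses import dataclass, field, asdict
from typing import List, Optional


def confidence_band(confidence: float) -> str:
    """Map a 0..1 score to a coarse band. Thresholds are illustrative."""
    if confidence >= 0.8:
        return "high"
    if confidence >= 0.5:
        return "medium"
    return "low"


@dataclass
class EvidenceMetadata:
    # Declared fields, so a field-walking serializer always includes them.
    confidence: float
    evidence_stage: str
    evidence_kind: str
    evidence_basis: List[str] = field(default_factory=list)
    confidence_reason: Optional[str] = None
    derived: bool = False

    def to_dict(self) -> dict:
        data = asdict(self)
        # The band is computed from the score so the two cannot drift apart.
        data["confidence_band"] = confidence_band(self.confidence)
        return data


meta = EvidenceMetadata(
    confidence=0.92,
    evidence_stage="static_scan",
    evidence_kind="tool_description_pattern",
    evidence_basis=["pattern", "semgrep"],
    confidence_reason="two engines agreed on same AVE ID",
)
print(meta.to_dict())
```

Deriving `confidence_band` from `confidence` at serialization time, rather than storing both, is one way to keep the two fields consistent.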

For toxic flows:

{
  "flow_id": "credential-exfiltration",
  "severity": "CRITICAL",
  "aivss_score": 9.8,
  "confidence": 0.78,
  "confidence_band": "medium",
  "derived": true,
  "derived_from_findings": [
    {
      "ave_id": "AVE-2026-00003",
      "rule_id": "...",
      "confidence": 0.91,
      "engine": "pattern"
    },
    {
      "ave_id": "AVE-2026-00026",
      "rule_id": "...",
      "confidence": 0.72,
      "engine": "semgrep"
    }
  ],
  "chain_confidence_reason": "all required capabilities present, but one leg is statically inferred"
}
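
One possible way to derive a chain confidence from its legs, sketched under a stated assumption: the chain is treated as only as certain as its weakest constituent finding. That "weakest leg" rule and the `chain_confidence` helper are hypothetical; the 0.78 in the example above is illustrative and is not produced by this rule.

```python
from typing import Dict, List


def chain_confidence(legs: List[Dict]) -> float:
    """Conservative rule: a chain is only as certain as its weakest leg."""
    if not legs:
        raise ValueError("a toxic flow needs at least one constituent finding")
    return min(leg["confidence"] for leg in legs)


legs = [
    {"ave_id": "AVE-2026-00003", "engine": "pattern", "confidence": 0.91},
    {"ave_id": "AVE-2026-00026", "engine": "semgrep", "confidence": 0.72},
]

flow = {
    "flow_id": "credential-exfiltration",
    "derived": True,
    "derived_from_findings": legs,
    "confidence": chain_confidence(legs),
}
print(flow["confidence"])  # 0.72
```

Other aggregation rules (a product of leg confidences, or a rule that weights runtime-observed legs above statically inferred ones) are equally plausible; the point is that whichever rule is used should be documented alongside the field.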

Why this matters

This would make triage much cleaner:

high severity + high confidence = urgent
high severity + low confidence = verify
medium severity + high confidence = backlog with confidence
derived toxic flow + mixed confidence = inspect chain legs
accepted risk + expiry = review cycle
runtime-observed drift = stronger evidence than static description match

It would also help prevent users from treating AIVSS as if it were confidence. AIVSS is severity/risk scoring. It is not the same as certainty that the finding or chain is exploitable in this environment.
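
A short sketch of how a downstream tool could act on the two fields once they are separate. The function name `triage`, the treatment of CRITICAL as high severity, and the fallback label `"review"` are assumptions for illustration; only the first three outcomes come from the table above.

```python
def triage(severity: str, confidence_band: str) -> str:
    """Combine impact (severity) and certainty (confidence) into an action."""
    high_severity = severity.upper() in {"CRITICAL", "HIGH"}
    if high_severity and confidence_band == "high":
        return "urgent"
    if high_severity:
        # Serious if real, but the evidence is weak: confirm before escalating.
        return "verify"
    if confidence_band == "high":
        return "backlog"
    # Hypothetical default for lower-severity, lower-confidence findings.
    return "review"


print(triage("CRITICAL", "high"))  # urgent
print(triage("HIGH", "low"))       # verify
print(triage("MEDIUM", "high"))    # backlog
```

With a single blended score, the first two cases would be indistinguishable to the consumer.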

PFEM-style framing

The mental model I am using is “polycentric federated evidence mesh”: keep evidence, interpretation, finding, confidence, risk, exception, and report boundaries separate.

Applied here:

tool description claim is not runtime behavior
static conformance is not runtime confidence
finding is not toxic flow
toxic flow is derived, not raw evidence
severity is not certainty
suppression is not deletion
accepted risk is not disappearance of risk
expired exception should resurface as evidence

Bawbel already has many of these boundaries. This issue is about making those boundaries more explicit and stable in the output contract.
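
As a rough sketch of "suppression is not deletion" and "expired exception should resurface", the following partitions findings into the three output lists without dropping any. The bucket names match the ones in the README; the input keys `suppressed` and `accepted_until`, the `acceptance_expired` marker, and the `partition` function itself are hypothetical.

```python
from datetime import date
from typing import Dict, List


def partition(findings: List[Dict], today: date) -> Dict[str, List[Dict]]:
    """Sort findings into output buckets without ever dropping one."""
    buckets: Dict[str, List[Dict]] = {
        "findings": [],
        "suppressed_findings": [],
        "accepted_findings": [],
    }
    for finding in findings:
        accepted_until = finding.get("accepted_until")  # hypothetical field
        if finding.get("suppressed"):  # hypothetical field
            buckets["suppressed_findings"].append(finding)
        elif accepted_until is not None and accepted_until >= today:
            buckets["accepted_findings"].append(finding)
        elif accepted_until is not None:
            # Acceptance has lapsed: the finding becomes active again and
            # carries a marker so the expiry itself is visible as evidence.
            buckets["findings"].append({**finding, "acceptance_expired": True})
        else:
            buckets["findings"].append(finding)
    return buckets


today = date(2026, 1, 15)
result = partition(
    [
        {"rule_id": "r1"},
        {"rule_id": "r2", "suppressed": True},
        {"rule_id": "r3", "accepted_until": date(2026, 6, 1)},
        {"rule_id": "r4", "accepted_until": date(2025, 12, 31)},
    ],
    today,
)
print({name: [f["rule_id"] for f in items] for name, items in result.items()})
```

Every input finding ends up in exactly one bucket, so the total count is preserved and an expired acceptance is distinguishable from a finding that was never accepted.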

Possible implementation path

1. Add optional stable fields to Finding:
    • confidence
    • confidence_band
    • evidence_stage
    • evidence_kind
    • confidence_reason
    • derived
2. Include those fields in JSON output.
3. Add analogous confidence/evidence fields to ToxicFlow.
4. Add golden JSON fixtures/tests so future changes do not accidentally drop evidence metadata.
5. Document that:
    • AIVSS/severity is risk impact.
    • Confidence is evidence certainty.
    • ToxicFlow confidence is derived from constituent finding confidence plus chain logic.

This would make Bawbel’s output much easier for CI systems, dashboards, runtime hooks, and human reviewers to triage without collapsing everything into one score.
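
A minimal sketch of what a golden-fixture check could look like, using only the standard library. The fixture contents, the `serialize_finding` stand-in, and the `check_against_golden` helper are hypothetical; a real test would load the fixture from a file and call the scanner's actual serializer.

```python
import json

# Hypothetical golden fixture; in a real suite this would live in a file.
GOLDEN_FINDING = """
{
  "aivss_score": 8.1,
  "confidence": 0.92,
  "confidence_band": "high",
  "evidence_stage": "static_scan",
  "severity": "HIGH"
}
"""


def serialize_finding() -> dict:
    """Stand-in for the real serializer under test (hypothetical)."""
    return {
        "severity": "HIGH",
        "aivss_score": 8.1,
        "confidence": 0.92,
        "confidence_band": "high",
        "evidence_stage": "static_scan",
    }


def check_against_golden(actual: dict, golden_text: str) -> list:
    """Return human-readable contract violations (empty list means OK)."""
    golden = json.loads(golden_text)
    problems = []
    for key in sorted(golden.keys() - actual.keys()):
        problems.append(f"missing field: {key}")
    for key in sorted(actual.keys() - golden.keys()):
        problems.append(f"unexpected field: {key}")
    for key in sorted(golden.keys() & actual.keys()):
        if golden[key] != actual[key]:
            problems.append(f"changed value: {key}")
    return problems


print(check_against_golden(serialize_finding(), GOLDEN_FINDING))  # []
```

If a refactor silently dropped `confidence` from the output, the check would report `missing field: confidence` instead of passing.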


A second, smaller issue could be even more concrete: **“Add golden JSON schema/fixtures for public output contracts.”** That one is also very DrBones/PFEM: Bawbel has stable public models, and the output contract that exposes them deserves exhaustive fixture tests.

```markdown
Title: Add golden JSON fixtures/schema tests for ScanResult, Finding, ToxicFlow, ConformanceReport, and suppression output

Bawbel exposes several important machine-readable outputs: findings, toxic flows, accepted findings, conformance reports, and scan results.

Because these are likely to be consumed by CI, SARIF uploaders, dashboards, and future runtime hooks, it would be useful to lock the public output contract with golden fixtures and/or JSON schemas.

Suggested fixture set:

- clean scan
- one active finding
- one low-confidence suppressed finding
- one justified false positive
- one accepted risk with expiry
- one expired accepted risk resurfacing
- one toxic flow
- one conformance A+/pass
- one conformance fail with required checks failing
- one scan error result

The goal is not just test coverage. It is contract discipline: downstream tooling should be able to trust that important fields such as severity, AIVSS, confidence/evidence metadata, suppression reason, accepted risk metadata, and toxic-flow derivation do not disappear or silently change shape.
```

My recommendation: open the first issue. It is bigger, more interesting, and directly builds on the conversation you already had with them. The second issue is a good follow-up once they engage.

Did you take my comment to him as guidance about the confidence % thing, or was your evaluation a fresh read against PFEM?

Both — but not in a lazy way.

Your comment gave me the diagnostic question:

Where is confidence separate from risk/severity?

That came directly from what you asked him: Bawbel has risk/AIVSS/static scoring, but you didn’t see a separate confidence % attached to how sure the scanner is about a finding or chain.

Then I did a fresh PFEM-style read of the repo and docs to see whether that concern was actually supported.

And it was.

The fresh evidence was:

Bawbel has severity / AIVSS as stable fields on Finding.
Bawbel has confidence scoring internally in the FP pipeline.
But the public Finding JSON output does not expose confidence as a stable first-class field.

That is visible in the code: Finding has severity, aivss_score, engine, line, match, suppression state, etc., but no declared confidence field in the dataclass. The scanner dynamically sets f.confidence during false-positive scoring, but ScanResult.to_dict() does not serialize that confidence field into the public JSON finding output.

PFEM then sharpened the issue from:

“Add confidence %?”

into:

“Make evidence/confidence metadata first-class, stable, and machine-readable, separate from severity/risk.”

That is the more architectural version.

The repo already has good PFEM instincts. For example, ToxicFlow is explicitly described as a derived artifact, not a raw detection, computed from findings. That is excellent evidence-boundary discipline. But the toxic-flow output still exposes severity/AIVSS without a chain-confidence/evidence-basis field. So PFEM says: good boundary, now make the confidence/provenance of that derived boundary visible too.

So: your comment was the spark. The issue draft was not merely parroting it. The fresh PFEM evaluation confirmed and expanded it.
