C1.6: add RLHF and fine-tuning feedback data integrity section #218
Conversation
RLHF preference pairs are a distinct training data class not covered by the existing general training data controls (C1.1–C1.5). Adversarial annotators or tampered preference records can steer policy model behavior in ways invisible to standard dataset validation.

Cross-checked against all existing requirements — no duplicates:

- **1.6.1 annotator identity binding**: distinct from 1.3.1 (platform-level access); binds identity per submitted preference pair, not per login.
- **1.6.2 preference pair integrity**: distinct from 1.3.2 (general labeling artifact signing); specifies exact RLHF record fields at submission time.
- **1.6.3 statistical anomaly detection**: distinct from 1.4.2 (general poisoning detection on feature distributions); targets comparative judgment structure — agreement rates, directional bias, velocity.
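To make the 1.6.3 distinction concrete, here is a minimal sketch of anomaly detection over comparative judgment structure rather than feature distributions. It computes per-annotator agreement rate and directional bias; velocity checks would need timestamps and are omitted. The record layout and field names are assumptions for illustration, not part of the proposed requirement text.

```python
from collections import defaultdict

# Each record: (annotator, pair_id, chosen), where chosen is "a" or "b".
# This record shape is illustrative, not taken from the spec.
def annotator_stats(records):
    """Compute structural metrics per annotator: agreement with the
    per-pair majority, and directional bias toward response "a"."""
    votes = defaultdict(dict)  # pair_id -> {annotator: chosen}
    for annotator, pair_id, chosen in records:
        votes[pair_id][annotator] = chosen

    agree = defaultdict(lambda: [0, 0])  # annotator -> [agreements, overlapping pairs]
    bias = defaultdict(lambda: [0, 0])   # annotator -> [times chose "a", total votes]
    for by_annotator in votes.values():
        counts = {"a": 0, "b": 0}
        for c in by_annotator.values():
            counts[c] += 1
        majority = max(counts, key=counts.get)
        for annotator, c in by_annotator.items():
            if len(by_annotator) > 1:  # agreement only defined with overlap
                agree[annotator][1] += 1
                agree[annotator][0] += (c == majority)
            bias[annotator][1] += 1
            bias[annotator][0] += (c == "a")

    stats = {}
    for annotator in bias:
        hits, total = agree[annotator]
        stats[annotator] = {
            "agreement_rate": hits / total if total else None,
            "directional_bias": bias[annotator][0] / bias[annotator][1],
        }
    return stats
```

An annotator with a near-zero agreement rate or an extreme directional bias would be flagged for review; neither signal is visible to feature-distribution checks like 1.4.2, because the individual responses can be entirely ordinary.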
ottosulin
left a comment
Thanks Rico, this does bring up overall better points of view on applicability to RLHF scenarios.
While my comments mention overlap, I think there clearly is missing explicit mention of RLHF specific use cases to make it clear for users that the requirements also apply there. For now I would prefer not to introduce a new RLHF section, but rather include the concerns in the existing sections. What do you think?
| # | Description | Level | Role |
|:---:|---|:---:|:---:|
| **1.6.1** | **Verify that** each annotator identity is authenticated via a mechanism that binds individual identity to submitted preference pairs, so that every judgment record in the preference dataset can be attributed to a specific, verified human annotator. | 2 | D/V |
I think this overlaps 1.3.1 (access controls + audit logs on labeling platforms). In practice, AFAIK, platforms like Scale AI already bind annotator identity per annotation through their audit logging.
If this stays, focus on what 1.3.1 genuinely misses: requiring annotator identity metadata to be retained alongside the preference dataset after export from the labeling platform. Alternatively, we could add a mention of this to 1.3.1.
| # | Description | Level | Role |
|:---:|---|:---:|:---:|
| **1.6.1** | **Verify that** each annotator identity is authenticated via a mechanism that binds individual identity to submitted preference pairs, so that every judgment record in the preference dataset can be attributed to a specific, verified human annotator. | 2 | D/V |
| **1.6.2** | **Verify that** preference pair records (prompt, chosen response, rejected response, annotator identity, and timestamp) are integrity-protected using cryptographic hashes or signatures at the time of submission, and that any post-submission modification is detectable and logged. | 2 | D/V |
Recommend removal. This restates 1.3.2 ("cryptographic hashes or digital signatures applied to labeling artifacts and annotation data") for a specific data type. Specifying which RLHF fields to hash is an implementation detail, not a distinct security control. An auditor verifying 1.3.2 on an RLHF pipeline would already cover this.
Maybe add a short mention of this to 1.3.2 instead?
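For context, the kind of submission-time integrity protection that both 1.3.2 and the proposed 1.6.2 describe could be sketched as below. The record fields follow the proposed 1.6.2 text; the canonicalization scheme and function names are assumptions for illustration.

```python
import hashlib
import json

def seal_preference_pair(prompt, chosen, rejected, annotator_id, timestamp):
    """Build a preference pair record with a content digest computed at
    submission time. Canonical JSON (sorted keys, fixed separators) keeps
    the digest stable across serializations. Names are illustrative."""
    record = {
        "prompt": prompt,
        "chosen": chosen,
        "rejected": rejected,
        "annotator_id": annotator_id,
        "timestamp": timestamp,
    }
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"))
    record["sha256"] = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    return record

def verify_preference_pair(record):
    """Return True iff the record still matches its stored digest,
    i.e. no post-submission modification of any sealed field."""
    body = {k: v for k, v in record.items() if k != "sha256"}
    canonical = json.dumps(body, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest() == record["sha256"]
```

Note that a bare digest only detects accidental or naive modification; an attacker who can rewrite the record can also recompute the hash, so a keyed signature (HMAC or asymmetric) is what 1.3.2's "digital signatures" wording actually buys in the adversarial case.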
- **1.6.1**: refocused from in-platform annotator binding (already covered by 1.3.1) to the gap otto identified — annotator identity metadata must be retained alongside the preference dataset after export from the labeling platform, so attribution holds throughout the full training pipeline.
- **1.6.2 (old)**: removed per otto's recommendation — restated 1.3.2 ("cryptographic hashes applied to labeling artifacts and annotation data") for a specific data type without adding a distinct security control. Instead, 1.3.2 is updated to explicitly include fine-tuning feedback records and RLHF preference pairs in its scope.
- **1.6.3 → 1.6.2**: renumbered following removal of old 1.6.2.
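A rough sketch of what the refocused 1.6.1 asks for, carrying verified annotator identity metadata through dataset export so attribution survives outside the labeling platform. The export format (JSONL) and all field names are assumptions for illustration, not from the requirement text.

```python
import json

def export_with_annotator_metadata(pairs, annotators):
    """Export preference pairs as JSONL, embedding annotator identity
    metadata in each record so attribution holds after the dataset leaves
    the labeling platform.

    pairs: list of dicts, each with at least "annotator_id".
    annotators: mapping annotator_id -> metadata dict with a "verified" flag.
    All names here are illustrative.
    """
    lines = []
    for pair in pairs:
        meta = annotators.get(pair["annotator_id"])
        if meta is None or not meta.get("verified"):
            # Refuse to export records that cannot be attributed to a
            # verified human annotator.
            raise ValueError(f"unattributable pair: {pair['annotator_id']!r}")
        record = dict(pair)
        record["annotator"] = meta  # retained alongside the exported dataset
        lines.append(json.dumps(record, sort_keys=True))
    return "\n".join(lines)
```

The point of the refocus: the labeling platform's own audit log (1.3.1) stops mattering once the dataset is exported, so the attribution metadata has to travel with the records themselves.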
… 1.3.2 Per ottosulin's review comments: both 1.6.1 and 1.6.2 overlapped with existing C1.3 controls. Otto's suggestion in both cases was to fold the unique value into the existing controls rather than maintain a separate section. - 1.3.1: extended to include annotator identity metadata retention after export from the labeling platform (the gap 1.6.1 identified that 1.3.1 genuinely missed) - 1.3.2: extended to explicitly cover fine-tuning feedback records and RLHF preference pairs (absorbing 1.6.2) - C1.6 section removed entirely
ottosulin
left a comment
Just noting now, in hindsight to my previous review, that this new part
and that annotator identity metadata is exported and retained alongside the dataset so that every annotation or preference pair can be attributed to a specific, verified human annotator throughout the training pipeline, not only within the labeling platform.
... maybe should be a separate L2 requirement, but I'd say we just merge it now. There is still similar work to be done in another round of level refinement before v1 release.