
C1.6: add RLHF and fine-tuning feedback data integrity section #218

Merged
ottosulin merged 3 commits into OWASP:main from RicoKomenda:C01-rlhf-preference-data-integrity
Mar 23, 2026
Conversation

@RicoKomenda (Contributor)

RLHF preference pairs are a distinct training data class not covered by existing general training data controls (C1.1-C1.5). Adversarial annotators or tampered preference records can steer policy model behavior in ways invisible to standard dataset validation.

Cross-checked against all existing requirements, no duplicates:

  • 1.6.1 annotator identity binding: distinct from 1.3.1 (platform-level access); binds identity per submitted preference pair, not per login.
  • 1.6.2 preference pair integrity: distinct from 1.3.2 (general labeling artifact signing); specifies exact RLHF record fields at submission time.
  • 1.6.3 statistical anomaly detection: distinct from 1.4.2 (general poisoning detection on feature distributions); targets comparative judgment structure - agreement rates, directional bias, velocity.

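A minimal sketch of what a 1.6.3-style check could look like, under a hypothetical record schema (`annotator`, `prompt_id`, `chosen` as the position "A" or "B" of the preferred response). It covers agreement rate and directional (positional) bias only, not submission velocity:

```python
from collections import defaultdict

def flag_anomalous_annotators(records, min_pairs=20, agreement_floor=0.55, bias_ceiling=0.8):
    """Flag annotators whose comparative judgments look anomalous.

    records: dicts with keys 'annotator', 'prompt_id', 'chosen' (hypothetical
    schema); 'chosen' is "A" or "B", the position of the preferred response.
    """
    by_prompt = defaultdict(list)
    for r in records:
        by_prompt[r["prompt_id"]].append(r)

    stats = defaultdict(lambda: {"n": 0, "agree": 0, "picked_a": 0})
    for judgments in by_prompt.values():
        # Majority preference for this prompt across all annotators.
        votes = defaultdict(int)
        for j in judgments:
            votes[j["chosen"]] += 1
        majority = max(votes, key=votes.get)
        for j in judgments:
            s = stats[j["annotator"]]
            s["n"] += 1
            s["agree"] += int(j["chosen"] == majority)
            s["picked_a"] += int(j["chosen"] == "A")  # positional (directional) bias

    flagged = {}
    for annotator, s in stats.items():
        if s["n"] < min_pairs:
            continue  # too few judgments to score reliably
        agreement = s["agree"] / s["n"]
        bias = s["picked_a"] / s["n"]
        if agreement < agreement_floor or bias > bias_ceiling:
            flagged[annotator] = {"agreement": agreement, "position_bias": bias}
    return flagged
```

The thresholds are illustrative; in practice they would be calibrated against the honest-annotator baseline for the task.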
ottosulin self-requested a review March 22, 2026 19:59
ottosulin assigned ottosulin and unassigned ottosulin Mar 22, 2026
ottosulin (Collaborator) left a comment:


Thanks Rico, this does bring up better points of view on applicability to RLHF scenarios.

While my comments mention overlap, I think the existing sections are clearly missing an explicit mention of RLHF-specific use cases to make it clear for users that the requirements also apply there. For now I would prefer not to introduce a new RLHF section, but rather to include these concerns in the existing sections. What do you think?


| # | Description | Level | Role |
|:--------:|---------------------------------------------------------------------------------------------------------------------|:---:|:---:|
| **1.6.1** | **Verify that** each annotator identity is authenticated via a mechanism that binds individual identity to submitted preference pairs, so that every judgment record in the preference dataset can be attributed to a specific, verified human annotator. | 2 | D/V |
ottosulin (Collaborator) commented:

I think this overlaps with 1.3.1 (access controls + audit logs on labeling platforms). In practice, as far as I know, platforms like Scale AI already bind annotator identity per annotation through their audit logging.

If this stays, it should focus on what 1.3.1 genuinely misses: requiring annotator identity metadata to be retained alongside the preference dataset after export from the labeling platform. Alternatively, we could add a mention of this to 1.3.1.

| # | Description | Level | Role |
|:--------:|---------------------------------------------------------------------------------------------------------------------|:---:|:---:|
| **1.6.2** | **Verify that** preference pair records (prompt, chosen response, rejected response, annotator identity, and timestamp) are integrity-protected using cryptographic hashes or signatures at the time of submission, and that any post-submission modification is detectable and logged. | 2 | D/V |
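As a concrete illustration of what the proposed 1.6.2 asks for, here is a hedged sketch: HMAC-SHA256 over a canonical JSON serialization stands in for whatever hash or signature scheme a given pipeline actually uses, and the field names are assumed from the requirement text:

```python
import hashlib
import hmac
import json

def seal_preference_record(record, key):
    """Attach an HMAC over the canonical serialization of a preference pair.

    record: dict with the fields named in 1.6.2 (prompt, chosen, rejected,
    annotator, timestamp). Canonical JSON (sorted keys, fixed separators)
    makes the digest deterministic across producers and verifiers.
    """
    payload = json.dumps(record, sort_keys=True, separators=(",", ":")).encode()
    sealed = dict(record)
    sealed["hmac"] = hmac.new(key, payload, hashlib.sha256).hexdigest()
    return sealed

def verify_preference_record(sealed, key):
    """Return True iff the record is unmodified since submission."""
    body = {k: v for k, v in sealed.items() if k != "hmac"}
    payload = json.dumps(body, sort_keys=True, separators=(",", ":")).encode()
    expected = hmac.new(key, payload, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, sealed.get("hmac", ""))
```

A detected mismatch would then be logged as a post-submission modification, per the requirement.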
ottosulin (Collaborator) commented:

Recommend removal. This restates 1.3.2 ("cryptographic hashes or digital signatures applied to labeling artifacts and annotation data") for a specific data type. Specifying which RLHF fields to hash is an implementation detail, not a distinct security control. An auditor verifying 1.3.2 on an RLHF pipeline would already cover this.

Maybe add a short mention of this to 1.3.2 instead?

@RicoKomenda replied:

1.6.1: refocused from in-platform annotator binding (already covered by
1.3.1) to the gap otto identified — annotator identity metadata must be
retained alongside the preference dataset after export from the labeling
platform, so attribution holds throughout the full training pipeline.

1.6.2 (old): removed per otto's recommendation — restated 1.3.2
("cryptographic hashes applied to labeling artifacts and annotation data")
for a specific data type without adding a distinct security control.
Instead, 1.3.2 is updated to explicitly include fine-tuning feedback
records and RLHF preference pairs in its scope.

1.6.3 → 1.6.2: renumbered following removal of old 1.6.2.
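One way the refocused 1.6.1 could look in practice. This is a sketch under an assumed export format (a single JSON document pairing preference records with a verified-annotator table); real labeling platforms will have their own export schemas:

```python
import json

def export_with_annotator_metadata(pairs, annotators, path):
    """Write preference pairs together with the annotator identity metadata
    needed for attribution after the dataset leaves the labeling platform.

    pairs: list of dicts, each carrying an 'annotator_id' field (assumed name).
    annotators: dict mapping annotator_id -> verified identity record.
    Raises ValueError if any pair references an annotator with no retained
    identity record, so attribution cannot silently break downstream.
    """
    missing = {p["annotator_id"] for p in pairs} - set(annotators)
    if missing:
        raise ValueError(f"pairs reference unknown annotators: {sorted(missing)}")
    with open(path, "w") as f:
        json.dump({"preference_pairs": pairs, "annotators": annotators}, f)
```

The point of the hard failure is the requirement's intent: attribution must hold throughout the full training pipeline, not only within the labeling platform.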
RicoKomenda requested a review from ottosulin March 23, 2026 06:44
… 1.3.2

Per ottosulin's review comments: both 1.6.1 and 1.6.2 overlapped with
existing C1.3 controls. Otto's suggestion in both cases was to fold the
unique value into the existing controls rather than maintain a separate
section.

- 1.3.1: extended to include annotator identity metadata retention after
  export from the labeling platform (the gap 1.6.1 identified that 1.3.1
  genuinely missed)
- 1.3.2: extended to explicitly cover fine-tuning feedback records and
  RLHF preference pairs (absorbing 1.6.2)
- C1.6 section removed entirely
ottosulin (Collaborator) left a comment:

Just noting now, in hindsight to my previous review, that this new part

and that annotator identity metadata is exported and retained alongside the dataset so that every annotation or preference pair can be attributed to a specific, verified human annotator throughout the training pipeline, not only within the labeling platform.

... maybe should be a separate L2 requirement, but I'd say we just merge it now. There is still similar work to be done in another round of level refinement before v1 release.

ottosulin merged commit e56beaf into OWASP:main Mar 23, 2026
2 checks passed