C1.6: add RLHF and fine-tuning feedback data integrity section #218
Conversation
RLHF preference pairs are a distinct training data class not covered by the existing general training data controls (C1.1–C1.5). Adversarial annotators or tampered preference records can steer policy model behavior in ways invisible to standard dataset validation.

Cross-checked against all existing requirements — no duplicates:

- **1.6.1 annotator identity binding**: distinct from 1.3.1 (platform-level access); binds identity per submitted preference pair, not per login.
- **1.6.2 preference pair integrity**: distinct from 1.3.2 (general labeling artifact signing); specifies exact RLHF record fields at submission time.
- **1.6.3 statistical anomaly detection**: distinct from 1.4.2 (general poisoning detection on feature distributions); targets comparative judgment structure — agreement rates, directional bias, velocity.
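To make the 1.6.3 distinction concrete, here is a minimal sketch of anomaly detection over comparative judgment structure rather than feature distributions. It computes per-annotator agreement rate and directional bias; velocity checks would need timestamps and are omitted. The record layout and field names are assumptions for illustration, not part of the proposed requirement text.

```python
from collections import defaultdict

# Each record: (annotator, pair_id, chosen), where chosen is "a" or "b".
# This record shape is illustrative, not taken from the spec.
def annotator_stats(records):
    """Compute structural metrics per annotator: agreement with the
    per-pair majority, and directional bias toward response "a"."""
    votes = defaultdict(dict)  # pair_id -> {annotator: chosen}
    for annotator, pair_id, chosen in records:
        votes[pair_id][annotator] = chosen

    agree = defaultdict(lambda: [0, 0])  # annotator -> [agreements, overlapping pairs]
    bias = defaultdict(lambda: [0, 0])   # annotator -> [times chose "a", total votes]
    for by_annotator in votes.values():
        counts = {"a": 0, "b": 0}
        for c in by_annotator.values():
            counts[c] += 1
        majority = max(counts, key=counts.get)
        for annotator, c in by_annotator.items():
            if len(by_annotator) > 1:  # agreement only defined with overlap
                agree[annotator][1] += 1
                agree[annotator][0] += (c == majority)
            bias[annotator][1] += 1
            bias[annotator][0] += (c == "a")

    stats = {}
    for annotator in bias:
        hits, total = agree[annotator]
        stats[annotator] = {
            "agreement_rate": hits / total if total else None,
            "directional_bias": bias[annotator][0] / bias[annotator][1],
        }
    return stats
```

An annotator with a near-zero agreement rate or an extreme directional bias would be flagged for review; neither signal is visible to feature-distribution checks like 1.4.2, because the individual responses can be entirely ordinary.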
ottosulin
left a comment
Thanks Rico, this does bring up overall better points of view on applicability to RLHF scenarios.
While my comments mention overlap, I think there clearly is missing explicit mention of RLHF specific use cases to make it clear for users that the requirements also apply there. For now I would prefer not to introduce a new RLHF section, but rather include the concerns in the existing sections. What do you think?
| # | Description | Level | Role |
|:---:|---|:---:|:---:|
| **1.6.1** | **Verify that** each annotator identity is authenticated via a mechanism that binds individual identity to submitted preference pairs, so that every judgment record in the preference dataset can be attributed to a specific, verified human annotator. | 2 | D/V |
I think this overlaps 1.3.1 (access controls + audit logs on labeling platforms). In practice, AFAIK, platforms like Scale AI already bind annotator identity per annotation through their audit logging.
If this stays, focus on what 1.3.1 genuinely misses: requiring annotator identity metadata to be retained alongside the preference dataset after export from the labeling platform. Alternatively, we could add a mention of this to 1.3.1.
| # | Description | Level | Role |
|:---:|---|:---:|:---:|
| **1.6.1** | **Verify that** each annotator identity is authenticated via a mechanism that binds individual identity to submitted preference pairs, so that every judgment record in the preference dataset can be attributed to a specific, verified human annotator. | 2 | D/V |
| **1.6.2** | **Verify that** preference pair records (prompt, chosen response, rejected response, annotator identity, and timestamp) are integrity-protected using cryptographic hashes or signatures at the time of submission, and that any post-submission modification is detectable and logged. | 2 | D/V |
Recommend removal. This restates 1.3.2 ("cryptographic hashes or digital signatures applied to labeling artifacts and annotation data") for a specific data type. Specifying which RLHF fields to hash is an implementation detail, not a distinct security control. An auditor verifying 1.3.2 on an RLHF pipeline would already cover this.
Maybe add a short mention of this to 1.3.2 instead?
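For context, the kind of submission-time integrity protection that both 1.3.2 and the proposed 1.6.2 describe could be sketched as below. The record fields follow the proposed 1.6.2 text; the canonicalization scheme and function names are assumptions for illustration.

```python
import hashlib
import json

def seal_preference_pair(prompt, chosen, rejected, annotator_id, timestamp):
    """Build a preference pair record with a content digest computed at
    submission time. Canonical JSON (sorted keys, fixed separators) keeps
    the digest stable across serializations. Names are illustrative."""
    record = {
        "prompt": prompt,
        "chosen": chosen,
        "rejected": rejected,
        "annotator_id": annotator_id,
        "timestamp": timestamp,
    }
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"))
    record["sha256"] = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    return record

def verify_preference_pair(record):
    """Return True iff the record still matches its stored digest,
    i.e. no post-submission modification of any sealed field."""
    body = {k: v for k, v in record.items() if k != "sha256"}
    canonical = json.dumps(body, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest() == record["sha256"]
```

Note that a bare digest only detects accidental or naive modification; an attacker who can rewrite the record can also recompute the hash, so a keyed signature (HMAC or asymmetric) is what 1.3.2's "digital signatures" wording actually buys in the adversarial case.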
- **1.6.1**: refocused from in-platform annotator binding (already covered by 1.3.1) to the gap otto identified — annotator identity metadata must be retained alongside the preference dataset after export from the labeling platform, so attribution holds throughout the full training pipeline.
- **1.6.2 (old)**: removed per otto's recommendation — restated 1.3.2 ("cryptographic hashes applied to labeling artifacts and annotation data") for a specific data type without adding a distinct security control. Instead, 1.3.2 is updated to explicitly include fine-tuning feedback records and RLHF preference pairs in its scope.
- **1.6.3 → 1.6.2**: renumbered following removal of old 1.6.2.
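A rough sketch of what the refocused 1.6.1 asks for, carrying verified annotator identity metadata through dataset export so attribution survives outside the labeling platform. The export format (JSONL) and all field names are assumptions for illustration, not from the requirement text.

```python
import json

def export_with_annotator_metadata(pairs, annotators):
    """Export preference pairs as JSONL, embedding annotator identity
    metadata in each record so attribution holds after the dataset leaves
    the labeling platform.

    pairs: list of dicts, each with at least "annotator_id".
    annotators: mapping annotator_id -> metadata dict with a "verified" flag.
    All names here are illustrative.
    """
    lines = []
    for pair in pairs:
        meta = annotators.get(pair["annotator_id"])
        if meta is None or not meta.get("verified"):
            # Refuse to export records that cannot be attributed to a
            # verified human annotator.
            raise ValueError(f"unattributable pair: {pair['annotator_id']!r}")
        record = dict(pair)
        record["annotator"] = meta  # retained alongside the exported dataset
        lines.append(json.dumps(record, sort_keys=True))
    return "\n".join(lines)
```

The point of the refocus: the labeling platform's own audit log (1.3.1) stops mattering once the dataset is exported, so the attribution metadata has to travel with the records themselves.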
… 1.3.2 Per ottosulin's review comments: both 1.6.1 and 1.6.2 overlapped with existing C1.3 controls. Otto's suggestion in both cases was to fold the unique value into the existing controls rather than maintain a separate section. - 1.3.1: extended to include annotator identity metadata retention after export from the labeling platform (the gap 1.6.1 identified that 1.3.1 genuinely missed) - 1.3.2: extended to explicitly cover fine-tuning feedback records and RLHF preference pairs (absorbing 1.6.2) - C1.6 section removed entirely
ottosulin
left a comment
Just noting now, in hindsight to my previous review, that this new part
and that annotator identity metadata is exported and retained alongside the dataset so that every annotation or preference pair can be attributed to a specific, verified human annotator throughout the training pipeline, not only within the labeling platform.
... maybe should be a separate L2 requirement, but I'd say we just merge it now. There is still similar work to be done in another round of level refinement before v1 release.