From 1bc5a62d474d6ff3fe77cd0b27a6c47c7a2159bb Mon Sep 17 00:00:00 2001 From: Rico Komenda Date: Fri, 20 Mar 2026 18:52:59 +0100 Subject: [PATCH 1/3] C1.6: add RLHF and fine-tuning feedback data integrity section MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit RLHF preference pairs are a distinct training data class not covered by existing general training data controls (C1.1–C1.5). Adversarial annotators or tampered preference records can steer policy model behavior in ways invisible to standard dataset validation. Cross-checked against all existing requirements — no duplicates: - 1.6.1 annotator identity binding: distinct from 1.3.1 (platform-level access); binds identity per submitted preference pair, not per login. - 1.6.2 preference pair integrity: distinct from 1.3.2 (general labeling artifact signing); specifies exact RLHF record fields at submission time. - 1.6.3 statistical anomaly detection: distinct from 1.4.2 (general poisoning detection on feature distributions); targets comparative judgment structure — agreement rates, directional bias, velocity. --- ...0-C01-Training-Data-Integrity-and-Traceability.md | 12 ++++++++++++ 1 file changed, 12 insertions(+) diff --git a/1.0/en/0x10-C01-Training-Data-Integrity-and-Traceability.md b/1.0/en/0x10-C01-Training-Data-Integrity-and-Traceability.md index a5646a6f..b873038f 100644 --- a/1.0/en/0x10-C01-Training-Data-Integrity-and-Traceability.md +++ b/1.0/en/0x10-C01-Training-Data-Integrity-and-Traceability.md @@ -77,6 +77,18 @@ Track the full journey of each dataset from source to model input for auditabili --- +## C1.6 RLHF & Fine-Tuning Feedback Data Integrity + +Reinforcement Learning from Human Feedback introduces preference pairs as a distinct training data class whose integrity cannot be guaranteed by general training data controls alone. 
Adversarial annotators or tampered preference records can steer policy model behavior in ways invisible to standard dataset validation. + +| # | Description | Level | Role | +|:--------:|---------------------------------------------------------------------------------------------------------------------|:---:|:---:| +| **1.6.1** | **Verify that** each annotator identity is authenticated via a mechanism that binds individual identity to submitted preference pairs, so that every judgment record in the preference dataset can be attributed to a specific, verified human annotator. | 2 | D/V | +| **1.6.2** | **Verify that** preference pair records (prompt, chosen response, rejected response, annotator identity, and timestamp) are integrity-protected using cryptographic hashes or signatures at the time of submission, and that any post-submission modification is detectable and logged. | 2 | D/V | +| **1.6.3** | **Verify that** statistical anomaly detection is applied to preference datasets prior to reward model training to identify patterns consistent with coordinated label manipulation, such as implausibly uniform annotator agreement, systematic bias toward specific response attributes, or submission velocity outliers. | 3 | D/V | + +--- + ## References * [NIST AI Risk Management Framework](https://www.nist.gov/itl/ai-risk-management-framework) From e18906a94561085ed83e97b14fe5635736ce0850 Mon Sep 17 00:00:00 2001 From: Rico Komenda Date: Mon, 23 Mar 2026 07:43:34 +0100 Subject: [PATCH 2/3] fix: address ottosulin's review comments on C1.6 RLHF controls MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit 1.6.1: refocused from in-platform annotator binding (already covered by 1.3.1) to the gap otto identified — annotator identity metadata must be retained alongside the preference dataset after export from the labeling platform, so attribution holds throughout the full training pipeline. 
1.6.2 (old): removed per otto's recommendation — restated 1.3.2 ("cryptographic hashes applied to labeling artifacts and annotation data") for a specific data type without adding a distinct security control. Instead, 1.3.2 is updated to explicitly include fine-tuning feedback records and RLHF preference pairs in its scope. 1.6.3 → 1.6.2: renumbered following removal of old 1.6.2. --- .../0x10-C01-Training-Data-Integrity-and-Traceability.md | 7 +++---- 1 file changed, 3 insertions(+), 4 deletions(-) diff --git a/1.0/en/0x10-C01-Training-Data-Integrity-and-Traceability.md b/1.0/en/0x10-C01-Training-Data-Integrity-and-Traceability.md index b873038f..7a84f14b 100644 --- a/1.0/en/0x10-C01-Training-Data-Integrity-and-Traceability.md +++ b/1.0/en/0x10-C01-Training-Data-Integrity-and-Traceability.md @@ -44,7 +44,7 @@ Ensure labeling and annotation processes are access-controlled, auditable, and p | # | Description | Level | Role | |:--------:|---------------------------------------------------------------------------------------------------------------------|:---:|:---:| | **1.3.1** | **Verify that** labeling interfaces and platforms enforce access controls and maintain audit logs of all labeling activities. | 1 | D/V | -| **1.3.2** | **Verify that** cryptographic hashes or digital signatures are applied to labeling artifacts and annotation data to ensure their integrity and authenticity. | 2 | D/V | +| **1.3.2** | **Verify that** cryptographic hashes or digital signatures are applied to labeling artifacts, annotation data, and fine-tuning feedback records (including RLHF preference pairs) to ensure their integrity and authenticity. | 2 | D/V | | **1.3.3** | **Verify that** labeling audit logs are tamper-evident and that labeling platforms protect against unauthorized modifications. | 2 | D/V | | **1.3.4** | **Verify that** sensitive information in labels is redacted, anonymized, or encrypted using appropriate granularity at rest and in transit. 
| 2 | D/V | @@ -83,9 +83,8 @@ Reinforcement Learning from Human Feedback introduces preference pairs as a dist | # | Description | Level | Role | |:--------:|---------------------------------------------------------------------------------------------------------------------|:---:|:---:| -| **1.6.1** | **Verify that** each annotator identity is authenticated via a mechanism that binds individual identity to submitted preference pairs, so that every judgment record in the preference dataset can be attributed to a specific, verified human annotator. | 2 | D/V | -| **1.6.2** | **Verify that** preference pair records (prompt, chosen response, rejected response, annotator identity, and timestamp) are integrity-protected using cryptographic hashes or signatures at the time of submission, and that any post-submission modification is detectable and logged. | 2 | D/V | -| **1.6.3** | **Verify that** statistical anomaly detection is applied to preference datasets prior to reward model training to identify patterns consistent with coordinated label manipulation, such as implausibly uniform annotator agreement, systematic bias toward specific response attributes, or submission velocity outliers. | 3 | D/V | +| **1.6.1** | **Verify that** annotator identity metadata is exported and retained alongside the preference dataset so that every preference pair can be attributed to a specific, verified human annotator throughout the training pipeline, not only within the labeling platform. | 2 | D/V | +| **1.6.2** | **Verify that** statistical anomaly detection is applied to preference datasets prior to reward model training to identify patterns consistent with coordinated label manipulation, such as implausibly uniform annotator agreement, systematic bias toward specific response attributes, or submission velocity outliers. 
| 3 | D/V | --- From 3fd9518090a734a710f7aa4e85a44b479a7fae85 Mon Sep 17 00:00:00 2001 From: Rico Komenda Date: Mon, 23 Mar 2026 07:46:40 +0100 Subject: [PATCH 3/3] fix: remove C1.6 section and fold RLHF-specific points into 1.3.1 and 1.3.2 Per ottosulin's review comments: both 1.6.1 and 1.6.2 overlapped with existing C1.3 controls. Otto's suggestion in both cases was to fold the unique value into the existing controls rather than maintain a separate section. - 1.3.1: extended to include annotator identity metadata retention after export from the labeling platform (the gap 1.6.1 identified that 1.3.1 genuinely missed) - 1.3.2: extended to explicitly cover fine-tuning feedback records and RLHF preference pairs (absorbing 1.6.2) - C1.6 section removed entirely --- ...-C01-Training-Data-Integrity-and-Traceability.md | 13 +------------ 1 file changed, 1 insertion(+), 12 deletions(-) diff --git a/1.0/en/0x10-C01-Training-Data-Integrity-and-Traceability.md b/1.0/en/0x10-C01-Training-Data-Integrity-and-Traceability.md index 7a84f14b..16ed45e8 100644 --- a/1.0/en/0x10-C01-Training-Data-Integrity-and-Traceability.md +++ b/1.0/en/0x10-C01-Training-Data-Integrity-and-Traceability.md @@ -43,7 +43,7 @@ Ensure labeling and annotation processes are access-controlled, auditable, and p | # | Description | Level | Role | |:--------:|---------------------------------------------------------------------------------------------------------------------|:---:|:---:| -| **1.3.1** | **Verify that** labeling interfaces and platforms enforce access controls and maintain audit logs of all labeling activities. 
| 1 | D/V | +| **1.3.1** | **Verify that** labeling interfaces and platforms enforce access controls and maintain audit logs of all labeling activities, and that annotator identity metadata is exported and retained alongside the dataset so that every annotation or preference pair can be attributed to a specific, verified human annotator throughout the training pipeline, not only within the labeling platform. | 1 | D/V | | **1.3.2** | **Verify that** cryptographic hashes or digital signatures are applied to labeling artifacts, annotation data, and fine-tuning feedback records (including RLHF preference pairs) to ensure their integrity and authenticity. | 2 | D/V | | **1.3.3** | **Verify that** labeling audit logs are tamper-evident and that labeling platforms protect against unauthorized modifications. | 2 | D/V | | **1.3.4** | **Verify that** sensitive information in labels is redacted, anonymized, or encrypted using appropriate granularity at rest and in transit. | 2 | D/V | @@ -77,17 +77,6 @@ Track the full journey of each dataset from source to model input for auditabili --- -## C1.6 RLHF & Fine-Tuning Feedback Data Integrity - -Reinforcement Learning from Human Feedback introduces preference pairs as a distinct training data class whose integrity cannot be guaranteed by general training data controls alone. Adversarial annotators or tampered preference records can steer policy model behavior in ways invisible to standard dataset validation. - -| # | Description | Level | Role | -|:--------:|---------------------------------------------------------------------------------------------------------------------|:---:|:---:| -| **1.6.1** | **Verify that** annotator identity metadata is exported and retained alongside the preference dataset so that every preference pair can be attributed to a specific, verified human annotator throughout the training pipeline, not only within the labeling platform. 
| 2 | D/V | -| **1.6.2** | **Verify that** statistical anomaly detection is applied to preference datasets prior to reward model training to identify patterns consistent with coordinated label manipulation, such as implausibly uniform annotator agreement, systematic bias toward specific response attributes, or submission velocity outliers. | 3 | D/V | - ---- - ## References * [NIST AI Risk Management Framework](https://www.nist.gov/itl/ai-risk-management-framework)
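The hash-or-signature requirement that 1.3.2 now extends to RLHF preference pairs can be sketched as follows — a minimal illustration, assuming a JSON record with the fields the removed 1.6.2 text enumerated; the keyed HMAC stands in for a digital signature, and the helper names are not part of the control:

```python
import hashlib
import hmac
import json

def canonical(record: dict) -> bytes:
    # Canonical serialization: stable key order so the digest is reproducible.
    return json.dumps(record, sort_keys=True, separators=(",", ":")).encode()

def seal(record: dict, key: bytes) -> str:
    # Keyed HMAC-SHA256 computed at submission time; an asymmetric signature
    # (e.g. Ed25519) would fill the same role with stronger non-repudiation.
    return hmac.new(key, canonical(record), hashlib.sha256).hexdigest()

def verify(record: dict, tag: str, key: bytes) -> bool:
    # Any post-submission modification changes the digest and is detectable.
    return hmac.compare_digest(seal(record, key), tag)
```

A tampered `chosen` or `rejected` field fails `verify`, which is the detectability property the control asks for.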
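The statistical screen described in 1.6.3 (later renumbered 1.6.2) could look like the following rough sketch — the thresholds, field names, and the majority-agreement heuristic are all assumptions, not requirements:

```python
from statistics import mean, pstdev

def flag_annotators(pairs, z_threshold=3.0, agreement_ceiling=0.98):
    """pairs: list of dicts with 'annotator_id' and 'agrees_with_majority'
    (bool). Returns annotator ids whose judgments warrant manual review."""
    per_annotator = {}
    for p in pairs:
        per_annotator.setdefault(p["annotator_id"], []).append(
            p["agrees_with_majority"])
    agreement = {a: mean(v) for a, v in per_annotator.items()}
    counts = {a: len(v) for a, v in per_annotator.items()}
    mu, sigma = mean(counts.values()), pstdev(counts.values())
    flagged = set()
    for a in per_annotator:
        if agreement[a] >= agreement_ceiling:
            flagged.add(a)  # implausibly uniform agreement with the majority
        if sigma and abs(counts[a] - mu) / sigma > z_threshold:
            flagged.add(a)  # submission velocity outlier
    return flagged
```

Running this before reward model training catches the coordinated-manipulation patterns the control names: uniform agreement, directional bias (via the majority comparison), and velocity outliers.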