Merged

3.6.0 #345

48 commits
5eb3bea
fixed the unit testing for the cli-uploader
wfy1997 Dec 16, 2025
f6a869d
fixed unit testing
wfy1997 Dec 16, 2025
c5fe2bb
Merge branch 'unit_test_fix' of https://github.com/CBIIT/crdc-datahub…
wfy1997 Dec 16, 2025
1a2dd3d
update importing path of the testing scripts
wfy1997 Dec 16, 2025
5a769ea
Create pytest.ini
wfy1997 Dec 16, 2025
cb07203
Merge pull request #329 from CBIIT/unit_test_fix
Dec 16, 2025
91edbf8
Merge pull request #331 from CBIIT/master
Dec 22, 2025
563052b
Merge pull request #332 from CBIIT/3.5.1
Jan 21, 2026
880cfef
Add sts v2 api support and property PV handling
wfy1997 Jan 29, 2026
b501bf1
integrate the v2 puller into the existing pv_puller.py, as additiona…
wfy1997 Jan 29, 2026
e4af4e8
Delete pv_term_reader_v2.py
wfy1997 Jan 29, 2026
6851e34
Refactor CDE terminology to property in PV puller
wfy1997 Jan 29, 2026
b4cc797
Coveralls integration
AustinSMueller Jan 30, 2026
4e06595
Add insert_concept_codes_v2 and update concept code handling to use m…
wfy1997 Feb 2, 2026
fb04dd6
Refactor property PV upsert and error handling
wfy1997 Feb 2, 2026
d4769f4
Merge pull request #333 from CBIIT/CRDCDH-3279
Feb 2, 2026
c501821
Add STS v2 property PV support and retrieval to the metadata validator
wfy1997 Feb 9, 2026
8fb715b
Cache property and concept code DB lookups
wfy1997 Feb 9, 2026
b35d703
Replace CDE PV constants with Property PV constants, delete pv_puller…
wfy1997 Feb 10, 2026
6165e50
Refine PROPERTY_NOT_FOUND message
wfy1997 Feb 10, 2026
4c4a508
Merge pull request #334 from CBIIT/CRDCDH-3280
Feb 10, 2026
b867ba1
Add comprehensive tests for pv_puller_v2
wfy1997 Feb 23, 2026
dee6b4c
3482: Support batched validation
AustinSMueller Feb 13, 2026
61546df
3482: Support batched metadata validation
AustinSMueller Feb 13, 2026
374ff53
Merge origin/3483 into 3482
AustinSMueller Feb 24, 2026
61260bb
3482: maintain all batched validation info
AustinSMueller Feb 24, 2026
dc4f2bd
Update based on cursor review
wfy1997 Feb 25, 2026
b1b0405
3482: Implement review feedback
AustinSMueller Mar 5, 2026
c583267
Merge pull request #337 from CBIIT/3482-batched-validation
n2iw Mar 10, 2026
b85cbe2
Merge pull request #335 from CBIIT/CRDCDH-3280
n2iw Mar 10, 2026
65c5976
3576: Option to save data files when deleting
AustinSMueller Mar 16, 2026
9c8d2e8
3576: orphaned file cleanup update
AustinSMueller Mar 19, 2026
c627ffc
3576: address comments from Copilot code review
AustinSMueller Mar 19, 2026
9fa27ce
Merge pull request #338 from CBIIT/3576-update-deleteDataRecords
n2iw Mar 24, 2026
5620de8
updated the pv puller to upload empty list to the DB if empty list is…
wfy1997 Mar 30, 2026
5ad67e7
Update pv_puller_v2.py
wfy1997 Mar 30, 2026
48228fc
Merge pull request #339 from CBIIT/CRDCDH-3642
n2iw Mar 30, 2026
6df7812
Cache synonyms
Apr 9, 2026
d1d2641
Always use lower case synonym terms as keys to optimize search. Also, ca
Apr 9, 2026
f4801b8
Merge pull request #340 from CBIIT/optimize-metadata-validation
AustinSMueller Apr 16, 2026
40d1af1
Fixed deleting one parent will delete all parents of the same type is…
Apr 28, 2026
5e6e07c
Merge pull request #341 from CBIIT/3509
AustinSMueller Apr 28, 2026
dfdca4c
Removed outdated unit tests.
Apr 28, 2026
55456c8
Merge pull request #342 from CBIIT/fix-unittest
AustinSMueller Apr 28, 2026
afd0560
Update Python base image to 3.14.4-alpine3.23
AustinSMueller May 6, 2026
43744c6
Merge pull request #343 from CBIIT/base-image-update
n2iw May 7, 2026
373baec
Fixed all fixable CVES
May 12, 2026
0425f17
Merge pull request #344 from CBIIT/CVEs
AustinSMueller May 12, 2026
34 changes: 34 additions & 0 deletions .github/workflows/coveralls.yml
@@ -0,0 +1,34 @@
name: Coveralls Test

on:
  workflow_dispatch:
  push:
    branches:
      - "*.*.*"
  pull_request:
    branches:
      - "*.*.*"

jobs:
  coverage:
    name: Run tests and upload coverage
    runs-on: ubuntu-latest
    steps:
      - name: Checkout repository
        uses: actions/checkout@v4
        with:
          submodules: true
      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.12'
      - name: Install dependencies
        run: |
          python -m pip install --upgrade pip
          pip install -r requirements.txt
          pip install pytest-cov coveralls
      - name: Run tests with coverage
        run: |
          pytest --cov=src --cov-report=xml --cov-report=term-missing --ignore=src/bento
      - name: Coveralls GitHub Action
        uses: coverallsapp/github-action@v2

Check warning

Code scanning / CodeQL

Workflow does not contain permissions Medium

Actions job or workflow does not limit the permissions of the GITHUB_TOKEN. Consider setting an explicit permissions block, using the following as a minimal starting point: {contents: read}
Comment on lines +14 to +34
218 changes: 218 additions & 0 deletions docs/BATCHED_METADATA_VALIDATION_INTERFACE.md
@@ -0,0 +1,218 @@
# Batched Metadata Validation Interface

Interface contract between the backend (SQS producer) and the validation service (SQS consumer) for batched metadata validation.

---

## SQS Messages

### Queue

FIFO queue configured via the `METADATA_QUEUE` environment variable. `MessageGroupId` is the `submissionID`; `MessageDeduplicationId` is a unique UUID (`v4()`) generated per message inside `aws-request.js`, regardless of the value passed by the caller.
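
A minimal Python sketch of the send parameters described above. The backend itself is JavaScript (`aws-request.js`), so this is only an illustration: `build_send_params` and the queue URL are hypothetical, and the keyword names mirror those accepted by boto3's SQS `send_message` call.

```python
import json
import uuid


def build_send_params(queue_url: str, message: dict) -> dict:
    """Build keyword arguments for an SQS FIFO send (hypothetical helper)."""
    return {
        "QueueUrl": queue_url,
        "MessageBody": json.dumps(message),
        # Every message of one submission shares a group, so FIFO ordering
        # applies per submission.
        "MessageGroupId": message["submissionID"],
        # A fresh UUID per message, regardless of anything the caller passes.
        "MessageDeduplicationId": str(uuid.uuid4()),
    }


params = build_send_params(
    "https://example.invalid/metadata-queue.fifo",  # placeholder URL
    {"type": "Validate Metadata Batch", "submissionID": "sub-1"},
)
```

Because the deduplication ID is always new, resending an identical batch message is treated by SQS as a distinct message rather than being dropped as a duplicate.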

### Message Types

| Type | Constant | Description |
|------|----------|-------------|
| `"Validate Metadata Batch"` | `TYPE_METADATA_VALIDATE_BATCH` (validator) / `VALIDATION.BATCH_MESSAGE_TYPE` (backend) | Batched flow (one message per chunk of records) |
| `"Validate Metadata"` | `TYPE_METADATA_VALIDATE` | Legacy single-message flow (backward-compatible) |

### Batch Message Fields

```json
{
"type": "Validate Metadata Batch",
"validationID": "<string UUID>",
"submissionID": "<string UUID>",
"scope": "new" | "all",
"dataRecordIds": ["<string UUID>", ...],
"totalBatches": 3,
"batchIndex": 0
}
```

| Field | Type | Required | Notes |
|-------|------|----------|-------|
| `type` | string | yes | Must be `"Validate Metadata Batch"` |
| `validationID` | string (UUID) | yes | Must match `_id` of an existing validation document. Message is rejected if missing; validation will appear stuck. |
| `submissionID` | string (UUID) | yes | Message is silently skipped if missing. |
| `scope` | string | yes | `"new"` or `"all"` (case-insensitive). Must be truthy. |
| `dataRecordIds` | string[] | yes | Array of `dataRecords._id` values. Must be non-empty. |
| `totalBatches` | int | yes | Total number of batch messages for this validation run. Must be >= 1. All messages in a run must carry the same value. |
| `batchIndex` | int | yes | Zero-based index of this batch within the run. |
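
The field rules in this table, combined with the behaviors listed under Error Handling, could be checked roughly as follows. This is a sketch under assumptions: `Outcome` and `classify_batch_message` are hypothetical names, and the order of the checks is a guess rather than the validator's actual order.

```python
from enum import Enum


class Outcome(Enum):
    REJECT = "reject"      # no DB update; validation appears stuck
    SKIP = "skip"          # message silently skipped
    FAIL_BATCH = "fail"    # batch marked failed, counter still incremented
    PROCESS = "process"    # message is well-formed


def classify_batch_message(msg: dict) -> Outcome:
    """Classify a batch message (hypothetical helper, assumed check order)."""
    if not msg.get("validationID"):
        return Outcome.REJECT
    total = msg.get("totalBatches")
    if not isinstance(total, int) or total < 1:
        return Outcome.REJECT
    if not msg.get("submissionID"):
        return Outcome.SKIP
    # scope must be truthy and dataRecordIds must be non-empty
    if not msg.get("scope") or not msg.get("dataRecordIds"):
        return Outcome.FAIL_BATCH
    return Outcome.PROCESS
```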

### Legacy (Non-Batch) Message Fields

```json
{
"type": "Validate Metadata",
"validationID": "<string UUID>",
"submissionID": "<string UUID>",
"scope": "New" | "All"
}
```

No `dataRecordIds`, `totalBatches`, or `batchIndex`. The validator fetches records internally.

---

## Record Selection and Batching (Backend)

### Scope-Based Record Query

The backend selects `dataRecordIds` from the `dataRecords` collection based on `scope`:

| Scope | Query | Records returned |
|-------|-------|-----------------|
| `"new"` | `{ submissionID, status: "New" }` | Only records with status `"New"` |
| `"all"` | `{ submissionID }` | All records for the submission |

Only the `_id` field is projected; these IDs become `dataRecordIds` in the batch messages.
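
As an illustration, the scope-to-query mapping in the table can be written as a small function. The backend is JavaScript, so this Python version is a sketch only; `build_record_filter` and `ID_PROJECTION` are hypothetical names, and raising on an unknown scope is an assumption.

```python
ID_PROJECTION = {"_id": 1}  # only the _id field is projected


def build_record_filter(submission_id: str, scope: str) -> dict:
    """Return the dataRecords filter for a scope (hypothetical helper)."""
    normalized = scope.lower()
    if normalized == "new":
        return {"submissionID": submission_id, "status": "New"}
    if normalized == "all":
        return {"submissionID": submission_id}
    # Assumption: an unknown scope is treated as a caller error.
    raise ValueError(f"unsupported scope: {scope!r}")
```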

### Batch Size

Records are chunked into batches. Size is configurable via a `configuration` collection entry with `type: "METADATA_VALIDATION_BATCH_SIZE"` and a `size` field (integer).

| Parameter | Value |
|-----------|-------|
| Config type | `"METADATA_VALIDATION_BATCH_SIZE"` |
| Config field | `size` (Int) |
| Default | **1000** (used when config is missing or `size` is falsy) |
| Minimum | **100** (clamped with logged error if configured below) |
| Maximum | **5000** (based on SQS 256KB message limit with ~27% headroom) |

If the configured `size` exceeds 5000, the backend logs an error and clamps to 5000. If it is below 100, the backend logs an error and clamps to 100.
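
The clamping and chunking rules can be sketched in Python as below. The function names are hypothetical, the real backend is JavaScript, and the error logging that accompanies clamping is omitted here.

```python
DEFAULT_SIZE = 1000
MIN_SIZE = 100
MAX_SIZE = 5000


def resolve_batch_size(configured) -> int:
    """Apply the default and the 100..5000 clamp (hypothetical helper)."""
    if not configured:  # missing config or falsy size
        return DEFAULT_SIZE
    return max(MIN_SIZE, min(MAX_SIZE, int(configured)))


def build_batch_messages(validation_id, submission_id, scope, record_ids, size):
    """Split record IDs into batch messages that share one totalBatches value."""
    chunks = [record_ids[i:i + size] for i in range(0, len(record_ids), size)]
    return [
        {
            "type": "Validate Metadata Batch",
            "validationID": validation_id,
            "submissionID": submission_id,
            "scope": scope,
            "dataRecordIds": chunk,
            "totalBatches": len(chunks),
            "batchIndex": index,  # zero-based
        }
        for index, chunk in enumerate(chunks)
    ]
```

For example, 2500 record IDs with the default size of 1000 yield three messages with `batchIndex` 0, 1, and 2, the last carrying 500 IDs.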

---

## Database Updates

### Pre-conditions (Backend Responsibility)

Before sending any SQS messages, the backend must:

1. **Set submission status** on the `submissions` document:

| Field | Value |
|-------|-------|
| `metadataValidationStatus` | `"Validating"` |
| `fileValidationStatus` | `"Validating"` (if file validation also requested) |
| `crossSubmissionStatus` | `"Validating"` (if cross-submission also requested) |

2. **Create the validation document** in the `validation` collection with at least:

| Field | Value |
|-------|-------|
| `_id` | string UUID |
| `submissionID` | string UUID |
| `type` | `["metadata"]`, `["metadata", "file"]`, etc. (always an array) |
| `scope` | `"new"` or `"all"` |
| `started` | `Date` |
| `status` | `"Validating"` |

3. **Send all batch messages** to SQS.

4. **Update the validation document** with `totalBatches` (and optionally `status`/`statusDetail` on failure). This happens **after** all SQS messages are sent, so batches may begin processing before `totalBatches` is written. The validator uses `totalBatches` from the **message**, not the document, so this is safe.

5. **Record validation metadata** on the `submissions` document:

| Field | Value |
|-------|-------|
| `validationStarted` | `Date` |
| `validationEnded` | `null` |
| `validationType` | `["metadata"]`, `["file"]`, etc. (lowercased) |
| `validationScope` | `"new"` or `"all"` (lowercased) |

### Validator Updates Per Batch

On each batch message, the validator atomically updates the **validation document** via `find_one_and_update`:

| Operation | Field | Description |
|-----------|-------|-------------|
| `$inc` | `completedBatches` | +1 per batch |
| `$inc` | `failedBatches` | +1 if the batch failed |
| `$max` | `worstBatchStatus` | Numeric precedence: Passed=0, Warning=1, Error=2 |
| `$push` | `batchStatusDetails` | Failure message string (only for failed batches) |

Completion is detected when `completedBatches >= totalBatches` (from the message, not the document).
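
A sketch of the update document implied by the table above, written as plain Python so it runs without a database. `build_batch_update` and `is_run_complete` are hypothetical names; the assumption is that such a dictionary would be passed as the update argument of PyMongo's `find_one_and_update`, with the post-update document returned so the counter can be compared against the message's `totalBatches`.

```python
from typing import Optional

# Numeric precedence from the table: Passed=0, Warning=1, Error=2
STATUS_RANK = {"Passed": 0, "Warning": 1, "Error": 2}


def build_batch_update(batch_status: str, failure_message: Optional[str] = None) -> dict:
    """Build the per-batch atomic update (hypothetical helper)."""
    update = {
        "$inc": {"completedBatches": 1},
        "$max": {"worstBatchStatus": STATUS_RANK[batch_status]},
    }
    if failure_message is not None:
        # Failed batches also bump failedBatches and record the message.
        update["$inc"]["failedBatches"] = 1
        update["$push"] = {"batchStatusDetails": failure_message}
    return update


def is_run_complete(updated_doc: Optional[dict], total_batches: int) -> bool:
    """True when the returned document shows that every batch has reported."""
    if updated_doc is None:  # validation document not found
        return False
    return updated_doc.get("completedBatches", 0) >= total_batches
```

Using `$inc` and `$max` in a single update keeps the counters correct when several batches finish concurrently, since no read-modify-write cycle happens in application code.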

### Validator Updates on Final Batch

When the last batch completes, the validator updates both collections in sequence:

**Validation document** (`$set`):

| Operation | Fields |
|-----------|--------|
| `$set` | `metadataStatus`, `metadataEnded`, `status` (if sole type), `ended`, `statusDetail` |

Batch-tracking fields (`completedBatches`, `failedBatches`, `batchStatusDetails`, `worstBatchStatus`) are **retained** on the validation document after completion. Since validation documents are never reused, these fields serve as a historical record of how the batched run progressed.

If the validation document's `type` array has more than one entry, overall `status` and `ended` are deferred until both metadata and file validation have finished. The worst of the two determines the overall status.
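
The deferral rule might look like the following sketch. It assumes the same Passed < Warning < Error precedence used for `worstBatchStatus`, and it treats a `None` status as "not finished yet"; `overall_status` is a hypothetical name.

```python
from typing import List, Optional

RANK = {"Passed": 0, "Warning": 1, "Error": 2}


def overall_status(types: List[str],
                   metadata_status: Optional[str],
                   file_status: Optional[str]) -> Optional[str]:
    """Return the overall status, or None while it is still deferred."""
    needed = []
    if "metadata" in types:
        needed.append(metadata_status)
    if "file" in types:
        needed.append(file_status)
    if not needed or any(status is None for status in needed):
        return None  # at least one requested validation has not finished
    # The worst of the finished statuses wins.
    return max(needed, key=lambda status: RANK[status])
```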

**Submission document** (`$set`):

| Field | Value |
|-------|-------|
| `metadataValidationStatus` | `"Passed"`, `"Warning"`, or `"Error"` (re-derived from data record statuses in the DB) |
| `validationEnded` | timestamp |
| `statusDetail` | `[string]` of failure messages, or `null` if all batches passed |
| `updatedAt` | timestamp |

### `statusDetail` Format

- **Batch runs:** `[string]` -- one entry per failed batch describing the failure. `null` when all batches pass.
- **Non-batch runs:** `null` (not set).
- Written to both the `validation` and `submissions` documents under the key `"statusDetail"`.

### Terminal Status Values

| Status | Meaning |
|--------|---------|
| `"Passed"` | All records valid |
| `"Warning"` | Warnings found, no errors |
| `"Error"` | Errors found, or bad input (missing submission, scope, records, model, etc.) |

---

## Data Flow

```
Backend                              SQS (FIFO)                   Validator
  |                                                                   |
  |-- set submission "Validating" -> submissions collection           |
  |-- create validation doc -------> validation collection            |
  |-- send batch msg 0 ------------> queue -------------------------> |
  |-- send batch msg 1 ------------> queue -------------------------> |-- validate records
  |-- send batch msg N ------------> queue -------------------------> |-- $inc completedBatches
  |-- update totalBatches ---------> validation collection            |
  |-- record validation metadata --> submissions collection           |
  |                                                                   |
  |                                                     (last batch)  |-- $set final status
  |                                                                   |-- update submission status
```

## Error Handling

| Scenario | Behavior |
|----------|----------|
| Missing `validationID` | Message rejected, no DB update. Validation appears stuck. |
| `totalBatches` < 1 | Message rejected, no DB update. Validation appears stuck. |
| Missing `scope` or empty `dataRecordIds` | Batch marked as failed, counter still incremented. |
| Submission/model/study not found | Batch marked as failed, counter still incremented. |
| `validate_nodes` exception | Batch marked as failed, counter still incremented. |
| Partial `dataRecordIds` match | Warning logged, continues with found records. |
| Validation document not found | `increment_completed_batches` returns `None`; finalization skipped. |
| Backend partial send failure | Backend sets `status: "Error"` and `statusDetail: ["Failed to enqueue {N} of {total} batch messages"]` on validation doc. Validator processes arrived batches but never reaches `completedBatches >= totalBatches`, so it does not finalize. |
| Backend total send failure | Backend rolls back submission validation statuses to their previous values and sets `status: "Error"` / `ended` on the validation doc. |
| Zero total records | Backend does not send messages. Rolls back `metadataValidationStatus` to `null` (no metadata to validate). |
| Zero new records (scope `"new"`) | Backend does not send messages. Preserves the previous `metadataValidationStatus` (nothing new to validate; existing status is still valid). |

---

## Schema Gap: Prisma Validation Model

The validator writes batch-tracking fields (`completedBatches`, `failedBatches`, `batchStatusDetails`, `worstBatchStatus`) directly to MongoDB. These fields are exposed in the **GraphQL schema** (`Validation` type) for frontend progress tracking, but are **missing from the Prisma `Validation` model**. This means Prisma-based queries will not return them. Since validation documents are never reused, these fields persist as a historical record after completion. If the frontend or reporting tools need to access them, either:

- Add these fields to the Prisma schema as optional, or
- Use a raw MongoDB query for validation reads.
4 changes: 3 additions & 1 deletion essentialvalidation.dockerfile
@@ -1,4 +1,6 @@
-FROM python:3.14.2-alpine3.22 AS fnl_base_image
+FROM python:3.14.4-alpine3.23 AS fnl_base_image
+
+RUN apk upgrade --no-cache

WORKDIR /usr/validator
COPY . .
4 changes: 3 additions & 1 deletion export.dockerfile
@@ -1,4 +1,6 @@
-FROM python:3.14.2-alpine3.22 AS fnl_base_image
+FROM python:3.14.4-alpine3.23 AS fnl_base_image
+
+RUN apk upgrade --no-cache

WORKDIR /usr/validator
COPY . .
4 changes: 3 additions & 1 deletion filevalidation.dockerfile
@@ -1,4 +1,6 @@
-FROM python:3.14.2-alpine3.22 AS fnl_base_image
+FROM python:3.14.4-alpine3.23 AS fnl_base_image
+
+RUN apk upgrade --no-cache

WORKDIR /usr/validator
COPY . .
4 changes: 3 additions & 1 deletion metadatavalidation.dockerfile
@@ -1,4 +1,6 @@
-FROM python:3.14.2-alpine3.22 AS fnl_base_image
+FROM python:3.14.4-alpine3.23 AS fnl_base_image
+
+RUN apk upgrade --no-cache

WORKDIR /usr/validator
COPY . .
4 changes: 3 additions & 1 deletion mongo_db_script/add_sts_config.js
@@ -6,7 +6,9 @@ db.configuration.insertOne({
"sts_data_resource": "sts_api",
"sts-dump-file-url": "https://raw.githubusercontent.com/CBIIT/crdc-datahub-terms/{}/mdb_pvs_synonyms.json",
"sts_api_all_url": "https://sts-dev.cancer.gov/all-pvs?format=json",
-"sts_api_one_url": "https://sts-dev.cancer.gov/cde-pvs/{cde_code}/{cde_version}?format=json"
+"sts_api_one_url": "https://sts-dev.cancer.gov/cde-pvs/{cde_code}/{cde_version}?format=json",
+"sts_api_all_url_v2": "https://sts-dev.cancer.gov/v2/terms/model-pvs",
+"sts_api_one_url_v2": "https://sts-dev.cancer.gov/v2/terms/model-pvs/{model}/{property}?version={version}"


}
4 changes: 3 additions & 1 deletion pv_puller.dockerfile
@@ -1,4 +1,6 @@
-FROM python:3.14.2-alpine3.22 AS fnl_base_image
+FROM python:3.14.4-alpine3.23 AS fnl_base_image
+
+RUN apk upgrade --no-cache

WORKDIR /usr/validator
COPY . .
3 changes: 3 additions & 0 deletions pytest.ini
@@ -0,0 +1,3 @@
[pytest]
testpaths = src/test
addopts = --ignore=src/bento
3 changes: 2 additions & 1 deletion requirements.txt
@@ -6,4 +6,5 @@ requests_aws4auth
pymongo
python-dateutil
pandas
-pytest==7.4.3
+pytest>=9.0.3
+pip>=26.1.1
34 changes: 34 additions & 0 deletions src/common/api_client.py
@@ -107,6 +107,40 @@ def get_all_data_elements(self, api_uri):
            self.log.debug(e)
            self.log.exception(f'Retrieve data element by cde code failed - internal error. Please try again and contact the helpdesk if this error persists.')
            return None

    def get_all_data_elements_v2(self, api_uri_list):
        """
        Retrieve data elements from each URI in the list
        :param api_uri_list: list of api uris
        :return: list of data elements (None for a URI that returned a non-200 status), or None on error
        """
        headers = {
            "accept": "application/json"
        }
        try:
            # response = requests.get(api_uri, headers=headers)
            results_list = []
            for api_uri in api_uri_list:
                response = requests.get(api_uri, headers=headers, verify=False)
                status = response.status_code
                # self.log.info(f"get_data_element_by_cde_code response status code: {status}.")
                if status == 200:
                    results = response.json()
                    if isinstance(results, dict) and "errors" in results:
                        self.log.error(f'Retrieve data element by cde code failed - {results.get("errors")[0].get("message")}.')
                        return None
                    else:
                        results_list.append(results)
                else:
                    self.log.error(f'Retrieve data element by cde code failed (code: {status}) - internal error. Please try again and contact the helpdesk if this error persists.')
                    #return None
                    results_list.append(None)
            return results_list

        except Exception as e:
            self.log.debug(e)
            self.log.exception(f'Retrieve data element by cde code failed - internal error. Please try again and contact the helpdesk if this error persists.')
            return None

    def list_github_files(self, url, branch, token=None):
        headers = {}