diff --git a/.github/workflows/coveralls.yml b/.github/workflows/coveralls.yml new file mode 100644 index 00000000..5291db1e --- /dev/null +++ b/.github/workflows/coveralls.yml @@ -0,0 +1,34 @@ +name: Coveralls Test + +on: + workflow_dispatch: + push: + branches: + - "*.*.*" + pull_request: + branches: + - "*.*.*" + +jobs: + coverage: + name: Run tests and upload coverage + runs-on: ubuntu-latest + steps: + - name: Checkout repository + uses: actions/checkout@v4 + with: + submodules: true + - name: Set up Python + uses: actions/setup-python@v5 + with: + python-version: '3.12' + - name: Install dependencies + run: | + python -m pip install --upgrade pip + pip install -r requirements.txt + pip install pytest-cov coveralls + - name: Run tests with coverage + run: | + pytest --cov=src --cov-report=xml --cov-report=term-missing --ignore=src/bento + - name: Coveralls GitHub Action + uses: coverallsapp/github-action@v2 diff --git a/docs/BATCHED_METADATA_VALIDATION_INTERFACE.md b/docs/BATCHED_METADATA_VALIDATION_INTERFACE.md new file mode 100644 index 00000000..4bc2d4b9 --- /dev/null +++ b/docs/BATCHED_METADATA_VALIDATION_INTERFACE.md @@ -0,0 +1,218 @@ +# Batched Metadata Validation Interface + +Interface contract between the backend (SQS producer) and the validation service (SQS consumer) for batched metadata validation. + +--- + +## SQS Messages + +### Queue + +FIFO queue configured via the `METADATA_QUEUE` environment variable. `MessageGroupId` is the `submissionID`; `MessageDeduplicationId` is a unique UUID (`v4()`) generated per message inside `aws-request.js`, regardless of the value passed by the caller. 
+ +### Message Types + +| Type | Constant | Description | +|------|----------|-------------| +| `"Validate Metadata Batch"` | `TYPE_METADATA_VALIDATE_BATCH` (validator) / `VALIDATION.BATCH_MESSAGE_TYPE` (backend) | Batched flow (one message per chunk of records) | +| `"Validate Metadata"` | `TYPE_METADATA_VALIDATE` | Legacy single-message flow (backward-compatible) | + +### Batch Message Fields + +```json +{ + "type": "Validate Metadata Batch", + "validationID": "", + "submissionID": "", + "scope": "new" | "all", + "dataRecordIds": ["", ...], + "totalBatches": 3, + "batchIndex": 0 +} +``` + +| Field | Type | Required | Notes | +|-------|------|----------|-------| +| `type` | string | yes | Must be `"Validate Metadata Batch"` | +| `validationID` | string (UUID) | yes | Must match `_id` of an existing validation document. Message is rejected if missing; validation will appear stuck. | +| `submissionID` | string (UUID) | yes | Message is silently skipped if missing. | +| `scope` | string | yes | `"new"` or `"all"` (case-insensitive). Must be truthy. | +| `dataRecordIds` | string[] | yes | Array of `dataRecords._id` values. Must be non-empty. | +| `totalBatches` | int | yes | Total number of batch messages for this validation run. Must be >= 1. All messages in a run must carry the same value. | +| `batchIndex` | int | yes | Zero-based index of this batch within the run. | + +### Legacy (Non-Batch) Message Fields + +```json +{ + "type": "Validate Metadata", + "validationID": "", + "submissionID": "", + "scope": "New" | "All" +} +``` + +No `dataRecordIds`, `totalBatches`, or `batchIndex`. The validator fetches records internally. 
+ +--- + +## Record Selection and Batching (Backend) + +### Scope-Based Record Query + +The backend selects `dataRecordIds` from the `dataRecords` collection based on `scope`: + +| Scope | Query | Records returned | +|-------|-------|-----------------| +| `"new"` | `{ submissionID, status: "New" }` | Only records with status `"New"` | +| `"all"` | `{ submissionID }` | All records for the submission | + +Only the `_id` field is projected; these IDs become `dataRecordIds` in the batch messages. + +### Batch Size + +Records are chunked into batches. Size is configurable via a `configuration` collection entry with `type: "METADATA_VALIDATION_BATCH_SIZE"` and a `size` field (integer). + +| Parameter | Value | +|-----------|-------| +| Config type | `"METADATA_VALIDATION_BATCH_SIZE"` | +| Config field | `size` (Int) | +| Default | **1000** (used when config is missing or `size` is falsy) | +| Minimum | **100** (clamped with logged error if configured below) | +| Maximum | **5000** (based on SQS 256KB message limit with ~27% headroom) | + +If the configured `size` exceeds 5000, the backend logs an error and clamps to 5000. If it is below 100, the backend logs an error and clamps to 100. + +--- + +## Database Updates + +### Pre-conditions (Backend Responsibility) + +Before sending any SQS messages, the backend must: + +1. **Set submission status** on the `submissions` document: + + | Field | Value | + |-------|-------| + | `metadataValidationStatus` | `"Validating"` | + | `fileValidationStatus` | `"Validating"` (if file validation also requested) | + | `crossSubmissionStatus` | `"Validating"` (if cross-submission also requested) | + +2. **Create the validation document** in the `validation` collection with at least: + + | Field | Value | + |-------|-------| + | `_id` | string UUID | + | `submissionID` | string UUID | + | `type` | `["metadata"]`, `["metadata", "file"]`, etc. 
(always an array) | + | `scope` | `"new"` or `"all"` | + | `started` | `Date` | + | `status` | `"Validating"` | + +3. **Send all batch messages** to SQS. + +4. **Update the validation document** with `totalBatches` (and optionally `status`/`statusDetail` on failure). This happens **after** all SQS messages are sent, so batches may begin processing before `totalBatches` is written. The validator uses `totalBatches` from the **message**, not the document, so this is safe. + +5. **Record validation metadata** on the `submissions` document: + + | Field | Value | + |-------|-------| + | `validationStarted` | `Date` | + | `validationEnded` | `null` | + | `validationType` | `["metadata"]`, `["file"]`, etc. (lowercased) | + | `validationScope` | `"new"` or `"all"` (lowercased) | + +### Validator Updates Per Batch + +On each batch message, the validator atomically updates the **validation document** via `find_one_and_update`: + +| Operation | Field | Description | +|-----------|-------|-------------| +| `$inc` | `completedBatches` | +1 per batch | +| `$inc` | `failedBatches` | +1 if the batch failed | +| `$max` | `worstBatchStatus` | Numeric precedence: Passed=0, Warning=1, Error=2 | +| `$push` | `batchStatusDetails` | Failure message string (only for failed batches) | + +Completion is detected when `completedBatches >= totalBatches` (from the message, not the document). + +### Validator Updates on Final Batch + +When the last batch completes, the validator updates both collections in sequence: + +**Validation document** (`$set`): + +| Operation | Fields | +|-----------|--------| +| `$set` | `metadataStatus`, `metadataEnded`, `status` (if sole type), `ended`, `statusDetail` | + +Batch-tracking fields (`completedBatches`, `failedBatches`, `batchStatusDetails`, `worstBatchStatus`) are **retained** on the validation document after completion. Since validation documents are never reused, these fields serve as a historical record of how the batched run progressed. 
+ +If the validation document's `type` array has more than one entry, overall `status` and `ended` are deferred until both metadata and file validation have finished. The worst of the two determines the overall status. + +**Submission document** (`$set`): + +| Field | Value | +|-------|-------| +| `metadataValidationStatus` | `"Passed"`, `"Warning"`, or `"Error"` (re-derived from data record statuses in the DB) | +| `validationEnded` | timestamp | +| `statusDetail` | `[string]` of failure messages, or `null` if all batches passed | +| `updatedAt` | timestamp | + +### `statusDetail` Format + +- **Batch runs:** `[string]` -- one entry per failed batch describing the failure. `null` when all batches pass. +- **Non-batch runs:** `null` (not set). +- Written to both the `validation` and `submissions` documents under the key `"statusDetail"`. + +### Terminal Status Values + +| Status | Meaning | +|--------|---------| +| `"Passed"` | All records valid | +| `"Warning"` | Warnings found, no errors | +| `"Error"` | Errors found, or bad input (missing submission, scope, records, model, etc.) | + +--- + +## Data Flow + +``` +Backend SQS (FIFO) Validator + | | + |-- set submission "Validating" -> submissions collection | + |-- create validation doc -----> validation collection | + |-- send batch msg 0 ----------> queue -----------------------> | + |-- send batch msg 1 ----------> queue -----------------------> |-- validate records + |-- send batch msg N ----------> queue -----------------------> |-- $inc completedBatches + |-- update totalBatches -------> validation collection | + |-- record validation metadata -> submissions collection | + | | + | (last batch) |-- $set final status + | |-- update submission status +``` + +## Error Handling + +| Scenario | Behavior | +|----------|----------| +| Missing `validationID` | Message rejected, no DB update. Validation appears stuck. | +| `totalBatches` < 1 | Message rejected, no DB update. Validation appears stuck. 
| +| Missing `scope` or empty `dataRecordIds` | Batch marked as failed, counter still incremented. | +| Submission/model/study not found | Batch marked as failed, counter still incremented. | +| `validate_nodes` exception | Batch marked as failed, counter still incremented. | +| Partial `dataRecordIds` match | Warning logged, continues with found records. | +| Validation document not found | `increment_completed_batches` returns `None`; finalization skipped. | +| Backend partial send failure | Backend sets `status: "Error"` and `statusDetail: ["Failed to enqueue {N} of {total} batch messages"]` on validation doc. Validator processes arrived batches but never reaches `completedBatches >= totalBatches`, so it does not finalize. | +| Backend total send failure | Backend rolls back submission validation statuses to their previous values and sets `status: "Error"` / `ended` on the validation doc. | +| Zero total records | Backend does not send messages. Rolls back `metadataValidationStatus` to `null` (no metadata to validate). | +| Zero new records (scope `"new"`) | Backend does not send messages. Preserves the previous `metadataValidationStatus` (nothing new to validate; existing status is still valid). | + +--- + +## Schema Gap: Prisma Validation Model + +The validator writes batch-tracking fields (`completedBatches`, `failedBatches`, `batchStatusDetails`, `worstBatchStatus`) directly to MongoDB. These fields are exposed in the **GraphQL schema** (`Validation` type) for frontend progress tracking, but are **missing from the Prisma `Validation` model**. This means Prisma-based queries will not return them. Since validation documents are never reused, these fields persist as a historical record after completion. If the frontend or reporting tools need to access them, either: + +- Add these fields to the Prisma schema as optional, or +- Use a raw MongoDB query for validation reads. 
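Review note on `docs/BATCHED_METADATA_VALIDATION_INTERFACE.md`: the per-batch update and completion check described under "Validator Updates Per Batch" can be sketched as below. This is a minimal illustration under stated assumptions, not the validator's actual code: `build_batch_update` and `is_final_batch` are hypothetical helper names, and the precedence map mirrors `STATUS_PRECEDENCE` from `src/common/constants.py` in this diff (which adds `Failed=3` to the three values listed in the doc table).

```python
from typing import Any, Optional

# Mirrors STATUS_PRECEDENCE in src/common/constants.py from this diff.
STATUS_PRECEDENCE = {"Passed": 0, "Warning": 1, "Error": 2, "Failed": 3}


def build_batch_update(batch_status: str,
                       failure_message: Optional[str] = None) -> dict[str, Any]:
    """Build the MongoDB update document for one processed batch.

    Hypothetical helper: every batch increments completedBatches and
    raises worstBatchStatus via $max; a failed batch (one that carries a
    failure message) also increments failedBatches and records the message.
    """
    update: dict[str, Any] = {
        "$inc": {"completedBatches": 1},
        "$max": {"worstBatchStatus": STATUS_PRECEDENCE[batch_status]},
    }
    if failure_message is not None:
        update["$inc"]["failedBatches"] = 1
        update["$push"] = {"batchStatusDetails": failure_message}
    return update


def is_final_batch(updated_doc: Optional[dict[str, Any]], total_batches: int) -> bool:
    """Return True when the run is complete.

    total_batches comes from the SQS message, not the document, as the
    interface doc requires. A missing validation document (None) never
    finalizes, matching the "Validation document not found" row.
    """
    if updated_doc is None:
        return False
    return updated_doc.get("completedBatches", 0) >= total_batches
```

With pymongo, the resulting dictionary would typically be passed to `find_one_and_update` with `return_document=ReturnDocument.AFTER`, so that the completion check sees the post-increment counter. `ReturnDocument` is imported in `mongo_dao.py` in this diff, but whether `increment_completed_batches` is written this way is an assumption, since its body is not shown here.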
diff --git a/essentialvalidation.dockerfile b/essentialvalidation.dockerfile index 0d6ebc72..3a6bd391 100644 --- a/essentialvalidation.dockerfile +++ b/essentialvalidation.dockerfile @@ -1,4 +1,6 @@ -FROM python:3.14.2-alpine3.22 AS fnl_base_image +FROM python:3.14.4-alpine3.23 AS fnl_base_image + +RUN apk upgrade --no-cache WORKDIR /usr/validator COPY . . diff --git a/export.dockerfile b/export.dockerfile index b34330d0..c2a97241 100644 --- a/export.dockerfile +++ b/export.dockerfile @@ -1,4 +1,6 @@ -FROM python:3.14.2-alpine3.22 AS fnl_base_image +FROM python:3.14.4-alpine3.23 AS fnl_base_image + +RUN apk upgrade --no-cache WORKDIR /usr/validator COPY . . diff --git a/filevalidation.dockerfile b/filevalidation.dockerfile index 79d2e534..353a3d95 100644 --- a/filevalidation.dockerfile +++ b/filevalidation.dockerfile @@ -1,4 +1,6 @@ -FROM python:3.14.2-alpine3.22 AS fnl_base_image +FROM python:3.14.4-alpine3.23 AS fnl_base_image + +RUN apk upgrade --no-cache WORKDIR /usr/validator COPY . . diff --git a/metadatavalidation.dockerfile b/metadatavalidation.dockerfile index 05153744..4eb800b1 100644 --- a/metadatavalidation.dockerfile +++ b/metadatavalidation.dockerfile @@ -1,4 +1,6 @@ -FROM python:3.14.2-alpine3.22 AS fnl_base_image +FROM python:3.14.4-alpine3.23 AS fnl_base_image + +RUN apk upgrade --no-cache WORKDIR /usr/validator COPY . . 
diff --git a/mongo_db_script/add_sts_config.js b/mongo_db_script/add_sts_config.js index 9acb3745..105b43e8 100644 --- a/mongo_db_script/add_sts_config.js +++ b/mongo_db_script/add_sts_config.js @@ -6,7 +6,9 @@ db.configuration.insertOne({ "sts_data_resource": "sts_api", "sts-dump-file-url": "https://raw.githubusercontent.com/CBIIT/crdc-datahub-terms/{}/mdb_pvs_synonyms.json", "sts_api_all_url": "https://sts-dev.cancer.gov/all-pvs?format=json", - "sts_api_one_url": "https://sts-dev.cancer.gov/cde-pvs/{cde_code}/{cde_version}?format=json" + "sts_api_one_url": "https://sts-dev.cancer.gov/cde-pvs/{cde_code}/{cde_version}?format=json", + "sts_api_all_url_v2": "https://sts-dev.cancer.gov/v2/terms/model-pvs", + "sts_api_one_url_v2": "https://sts-dev.cancer.gov/v2/terms/model-pvs/{model}/{property}?version={version}" } diff --git a/pv_puller.dockerfile b/pv_puller.dockerfile index 47298ae7..145d4ab1 100644 --- a/pv_puller.dockerfile +++ b/pv_puller.dockerfile @@ -1,4 +1,6 @@ -FROM python:3.14.2-alpine3.22 AS fnl_base_image +FROM python:3.14.4-alpine3.23 AS fnl_base_image + +RUN apk upgrade --no-cache WORKDIR /usr/validator COPY . . diff --git a/pytest.ini b/pytest.ini new file mode 100644 index 00000000..afa2442b --- /dev/null +++ b/pytest.ini @@ -0,0 +1,3 @@ +[pytest] +testpaths = src/test +addopts = --ignore=src/bento \ No newline at end of file diff --git a/requirements.txt b/requirements.txt index ad0a399b..752db241 100644 --- a/requirements.txt +++ b/requirements.txt @@ -6,4 +6,5 @@ requests_aws4auth pymongo python-dateutil pandas -pytest==7.4.3 \ No newline at end of file +pytest>=9.0.3 +pip>=26.1.1 diff --git a/src/common/api_client.py b/src/common/api_client.py index 9c7f69d2..173f947e 100644 --- a/src/common/api_client.py +++ b/src/common/api_client.py @@ -107,6 +107,40 @@ def get_all_data_elements(self, api_uri): self.log.debug(e) self.log.exception(f'Retrieve data element by cde code failed - internal error. 
Please try again and contact the helpdesk if this error persists.') return None + + def get_all_data_elements_v2(self, api_uri_list): + """ + Retrieve data elements + :param api_uri: api uri + :return: data elements + """ + headers = { + "accept": "application/json" + } + try: + # response = requests.get(api_uri, headers=headers) + results_list = [] + for api_uri in api_uri_list: + response = requests.get(api_uri, headers=headers, verify=False) + status = response.status_code + # self.log.info(f"get_data_element_by_cde_code response status code: {status}.") + if status == 200: + results = response.json() + if isinstance(results, dict) and "errors" in results: + self.log.error(f'Retrieve data element by cde code failed - {results.get("errors")[0].get("message")}.') + return None + else: + results_list.append(results) + else: + self.log.error(f'Retrieve data element by cde code failed (code: {status}) - internal error. Please try again and contact the helpdesk if this error persists.') + #return None + results_list.append(None) + return results_list + + except Exception as e: + self.log.debug(e) + self.log.exception(f'Retrieve data element by cde code failed - internal error. Please try again and contact the helpdesk if this error persists.') + return None def list_github_files(self, url, branch, token=None): headers = {} diff --git a/src/common/constants.py b/src/common/constants.py index 1c31fb62..bd3684cd 100644 --- a/src/common/constants.py +++ b/src/common/constants.py @@ -59,12 +59,15 @@ ID = "_id" SIZE ="size" MD5 = "md5" -DATA_COLlECTION = "dataRecords" +DATA_COLLECTION = "dataRecords" FILE_NAME = "fileName" STATUS_ERROR = "Error" STATUS_WARNING = "Warning" STATUS_PASSED = "Passed" STATUS_NEW = "New" +FAILED = "Failed" +# For batch metadata validation, statusDetail may be a list of failure message strings. 
+STATUS_DETAIL = "statusDetail" SUBMISSION_ID = "submissionID" NODE_ID = "nodeID" NODE_TYPE = "nodeType" @@ -79,6 +82,9 @@ NODE_IDS = "nodeIDs" DELETE_ALL = "deleteAll" EXCLUSIVE_IDS = "exclusiveIDs" +DELETE_ORPHANED_DATA_FILES = "deleteOrphanedDataFiles" +DATA_FILE_TYPE = "data file" +S3_LIST_ORPHANS_PAGE_SIZE = 1000 FILE_NAME_FIELD = "name-field" FILE_SIZE_FIELD = "size-field" @@ -137,9 +143,25 @@ TYPE_EXPORT_METADATA = "Export Metadata" TYPE_COMPLETE_SUB = "Complete Submission" TYPE_CROSS_SUBMISSION = "Validate Cross-submission" +TYPE_METADATA_VALIDATE_BATCH = "Validate Metadata Batch" DATA_COMMONS = "dataCommons" RESTORE_DELETED_DATA_FILES = "Restore Deleted Data Files" -FAILED = "Failed" +# Batch message fields +DATA_RECORD_IDS = "dataRecordIds" +TOTAL_BATCHES = "totalBatches" +BATCH_INDEX = "batchIndex" +# Validation document fields for tracking batch progress +COMPLETED_BATCHES = "completedBatches" +FAILED_BATCHES = "failedBatches" +BATCH_STATUS_DETAILS = "batchStatusDetails" +WORST_BATCH_STATUS = "worstBatchStatus" +STATUS_PRECEDENCE = { + "Passed": 0, + "Warning": 1, + "Error": 2, + "Failed": 3, +} +PRECEDENCE_TO_STATUS = {v: k for k, v in STATUS_PRECEDENCE.items()} ADDITION_ERRORS = "additionalErrors" STUDY_ABBREVIATION = "studyAbbreviation" LIST_DELIMITER_PROP = "list-delimiter" @@ -251,8 +273,15 @@ STS_API_ALL_URL = "sts_api_all_url" STS_API_ONE_URL = "sts_api_one_url" +STS_API_ONE_URL_V2 = "sts_api_one_url_v2" STS_RESOURCE_CONFIG_TYPE = "STS_RESOURCE" STS_DATA_RESOURCE_CONFIG = "sts_data_resource" STS_DATA_RESOURCE_API = "sts_api" STS_DATA_RESOURCE_FILE = "sts_file" -STS_DUMP_CONFIG = "sts-dump-file-url" \ No newline at end of file +STS_DUMP_CONFIG = "sts-dump-file-url" +STS_API_ALL_URL_V2 = "sts_api_all_url_v2" + +PROPERTY = "property" +MODEL = "model" +PROPERTY_PERMISSIBLE_VALUES = "PermissibleValues" +PROPERTY_TERM = "Term" \ No newline at end of file diff --git a/src/common/model.py b/src/common/model.py index a6a4ea83..d7a43d03 100644 --- 
a/src/common/model.py +++ b/src/common/model.py @@ -105,6 +105,12 @@ def get_omit_dcf_prefix(self): def get_composition_key(self, node): return self.model[NODES_LABEL][node].get(COMPOSITION_KEY, None) + def get_model_version(self): + return self.model.get("version", None) + + def get_data_commons(self): + return self.model.get("data_commons", None) + diff --git a/src/common/mongo_dao.py b/src/common/mongo_dao.py index 761940e3..31749b63 100644 --- a/src/common/mongo_dao.py +++ b/src/common/mongo_dao.py @@ -1,18 +1,19 @@ -from pymongo import MongoClient, errors, ReplaceOne, UpdateOne, DeleteOne, DESCENDING, InsertOne -import re +from pymongo import MongoClient, errors, ReplaceOne, UpdateOne, DeleteOne, DESCENDING, InsertOne, ReturnDocument from bento.common.utils import get_logger -from common.constants import BATCH_COLLECTION, SUBMISSION_COLLECTION, DATA_COLlECTION, ID, UPDATED_AT, \ +from common.constants import BATCH_COLLECTION, SUBMISSION_COLLECTION, DATA_COLLECTION, ID, UPDATED_AT, \ SUBMISSION_ID, NODE_ID, NODE_TYPE, S3_FILE_INFO, STATUS, FILE_ERRORS, STATUS_NEW, \ PARENT_TYPE, PARENT_ID_VAL, PARENTS, FILE_VALIDATION_STATUS, METADATA_VALIDATION_STATUS, TYPE, \ FILE_MD5_COLLECTION, FILE_NAME, CRDC_ID, RELEASE_COLLECTION, DATA_COMMON_NAME, KEY, \ - VALUE_PROP, VALIDATED_AT, STATUS_ERROR, STATUS_WARNING, STATUS_PASSED, PARENT_ID_NAME, \ + VALUE_PROP, VALIDATED_AT, STATUS_ERROR, STATUS_WARNING, STATUS_PASSED, FAILED, PARENT_ID_NAME, \ SUBMISSION_REL_STATUS, SUBMISSION_REL_STATUS_DELETED, STUDY_ABBREVIATION, SUBMISSION_STATUS, STUDY_ID, \ CROSS_SUBMISSION_VALIDATION_STATUS, ADDITION_ERRORS, VALIDATION_COLLECTION, VALIDATION_ENDED, CONFIG_COLLECTION, \ BATCH_BUCKET, CDE_COLLECTION, CDE_CODE, CDE_VERSION, ENTITY_TYPE, QC_COLLECTION, QC_RESULT_ID, CONFIG_TYPE, \ - SYNONYM_COLLECTION, PV_TERM, SYNONYM_TERM, CDE_FULL_NAME, CDE_PERMISSIVE_VALUES, CREATED_AT, PROPERTIES,\ - STUDY_COLLECTION, ORGANIZATION_COLLECTION, USER_COLLECTION, PV_CONCEPT_CODE_COLLECTION, 
CONCEPT_CODE, PERMISSIBLE_VALUE,\ - GENERATED_PROPS, FILE_ENDED, METADATA_ENDED, METADATA_STATUS, FILE_STATUS, FILE_VALIDATION, METADATA_VALIDATION,\ - CONSENT_CODE, RELEASE + SYNONYM_COLLECTION, PV_TERM, SYNONYM_TERM, CDE_FULL_NAME, CDE_PERMISSIVE_VALUES, PROPERTY_PERMISSIBLE_VALUES, CREATED_AT, PROPERTIES, \ + STUDY_COLLECTION, ORGANIZATION_COLLECTION, USER_COLLECTION, PV_CONCEPT_CODE_COLLECTION, CONCEPT_CODE, PERMISSIBLE_VALUE, \ + GENERATED_PROPS, FILE_ENDED, METADATA_ENDED, METADATA_STATUS, FILE_STATUS, FILE_VALIDATION, METADATA_VALIDATION, \ + CONSENT_CODE, RELEASE, VERSION, PROPERTY, MODEL, \ + COMPLETED_BATCHES, FAILED_BATCHES, BATCH_STATUS_DETAILS, WORST_BATCH_STATUS, STATUS_DETAIL, \ + STATUS_PRECEDENCE, PRECEDENCE_TO_STATUS from common.utils import get_exception_msg, current_datetime, get_uuid_str from common.s3_utils import S3Service @@ -24,6 +25,10 @@ def __init__(self, connectionStr, db_name): self.client = MongoClient(connectionStr) self.db_name = db_name self.s3_service = S3Service() + self.props = {} + self.concept_codes = {} + self._pvs_by_synonym_cache = {} + """ get batch by id """ @@ -40,7 +45,7 @@ def get_batch(self, batchId): self.log.exception(e) self.log.exception(f"Failed to find batch, {batchId}: {get_exception_msg()}") return None - + """ find batch for uploaded data file """ @@ -64,6 +69,7 @@ def find_batch_by_file_name(self, submissionID, batch_type, file_name): self.log.exception(e) self.log.exception(f"Failed to find batch by data file name, {submissionID}/{batch_type}/{file_name}: {get_exception_msg()}") return None + """ get submission by id """ @@ -86,7 +92,7 @@ def get_submission(self, submissionId): """ def search_nodes_by_type_and_value(self, nodes): db = self.client[self.db_name] - data_collection = db[DATA_COLlECTION] + data_collection = db[DATA_COLLECTION] node_set, query = set(), [] for node in nodes: node_type, node_key, node_value = node.get(TYPE), node.get(KEY), node.get(VALUE_PROP) @@ -104,13 +110,13 @@ def 
search_nodes_by_type_and_value(self, nodes): self.log.exception(e) self.log.exception(f"Failed to search nodes: {get_exception_msg()}") return None - + ''' search nodes by node type and submission id ''' def search_nodes_by_type_and_submission(self, node_type, submission_id, exclusive_ids = []): db = self.client[self.db_name] - data_collection = db[DATA_COLlECTION] + data_collection = db[DATA_COLLECTION] try: node_ids = data_collection.find({NODE_TYPE: node_type, SUBMISSION_ID: submission_id, NODE_ID: {"$nin": exclusive_ids}}).distinct(NODE_ID) return node_ids @@ -124,13 +130,12 @@ def search_nodes_by_type_and_submission(self, node_type, submission_id, exclusiv self.log.exception(f"{submission_id}: Failed to search nodes: {get_exception_msg()}") return None - """ check node exists by node name and its value """ def search_nodes_by_index(self, nodes, submission_id): db = self.client[self.db_name] - data_collection = db[DATA_COLlECTION] + data_collection = db[DATA_COLLECTION] query = [] for node in nodes: node_type, node_key, node_value = node.get(TYPE), node.get(KEY), node.get(VALUE_PROP) @@ -146,13 +151,13 @@ def search_nodes_by_index(self, nodes, submission_id): self.log.exception(e) self.log.exception(f"{submission_id}: Failed to search nodes: {get_exception_msg()}") return None - + """ check node exists by dataCommons, nodeType and nodeID """ def search_node_by_index_crdc(self, data_commons, node_type, node_id, excluded_submission_ids): db = self.client[self.db_name] - data_collection = db[DATA_COLlECTION] + data_collection = db[DATA_COLLECTION] try: result = data_collection.find_one({DATA_COMMON_NAME: data_commons, NODE_TYPE: node_type, NODE_ID: node_id, SUBMISSION_ID: {"$nin": excluded_submission_ids}}) @@ -166,16 +171,12 @@ def search_node_by_index_crdc(self, data_commons, node_type, node_id, excluded_s self.log.exception(f"Failed to search node for crdc_id {get_exception_msg()}") return None - """ - check node exists by dataCommons, nodeType and nodeID - """ 
- """ get file in dataRecord collection by fileId """ def get_file(self, fileId): db = self.client[self.db_name] - file_collection = db[DATA_COLlECTION] + file_collection = db[DATA_COLLECTION] try: return file_collection.find_one({ID: fileId}) except errors.PyMongoError as pe: @@ -186,12 +187,13 @@ def get_file(self, fileId): self.log.exception(e) self.log.exception(f"Failed to find data file, {fileId}: {get_exception_msg()}") return None + """ get file in dataRecord collection by fileName """ def get_file_by_name(self, fileName): db = self.client[self.db_name] - file_collection = db[DATA_COLlECTION] + file_collection = db[DATA_COLLECTION] try: return file_collection.find_one({"S3FileInfo.fileName": fileName}) except errors.PyMongoError as pe: @@ -201,13 +203,14 @@ def get_file_by_name(self, fileName): except Exception as e: self.log.exception(e) self.log.exception(f"Failed to find data file, {fileName}: {get_exception_msg()}") - return None + return None + """ get file records in dataRecords collection by submissionID """ def get_files_by_submission(self, submission_id): db = self.client[self.db_name] - file_collection = db[DATA_COLlECTION] + file_collection = db[DATA_COLLECTION] try: return list(file_collection.find({SUBMISSION_ID: submission_id, S3_FILE_INFO: {"$nin": [None, ""]}})) except errors.PyMongoError as pe: @@ -236,13 +239,14 @@ def update_batch(self, batch): self.log.exception(e) self.log.exception(f"Failed to update batch, {batch[ID]}: {get_exception_msg()}") return False + """ check if not duplications exist in dataRecords collection """ def check_metadata_ids(self, nodeType, ids, submission_id): #1. 
check if collection exist db = self.client[self.db_name] - collection = db[DATA_COLlECTION] + collection = db[DATA_COLLECTION] try: #2 check if keys existing in the collection result = list(collection.find({NODE_ID: {'$in': ids}, SUBMISSION_ID: submission_id, NODE_TYPE: nodeType})) @@ -254,13 +258,13 @@ def check_metadata_ids(self, nodeType, ids, submission_id): self.log.exception(e) self.log.exception(f"{submission_id}: Failed to query DB, {nodeType}: {get_exception_msg()}!") return True - + """ update a file record in dataRecords collection """ def update_file (self, file_record): db = self.client[self.db_name] - file_collection = db[DATA_COLlECTION] + file_collection = db[DATA_COLLECTION] try: result = file_collection.replace_one({ID : file_record[ID]}, file_record, False) return result.matched_count > 0 @@ -271,14 +275,14 @@ def update_file (self, file_record): except Exception as e: self.log.exception(e) self.log.exception(f"Failed to update data file, {file_record[ID]}: {get_exception_msg()}") - return False - + return False + """ update a s3 file info in dataRecords collection """ def update_file_info(self, file_record): db = self.client[self.db_name] - file_collection = db[DATA_COLlECTION] + file_collection = db[DATA_COLLECTION] try: result = file_collection.update_one({ID : file_record[ID]}, {"$set": {S3_FILE_INFO: file_record[S3_FILE_INFO]}}) return result.modified_count > 0 @@ -289,39 +293,59 @@ def update_file_info(self, file_record): except Exception as e: self.log.exception(e) self.log.exception(f"Failed to update data file, {file_record[ID]}: {get_exception_msg()}") - return False - """ - update errors in submissions collection - """ - def set_submission_validation_status(self, submission, file_status, metadata_status, cross_submission_status, fileErrors, is_delete = False): + return False + + def set_submission_validation_status(self, submission, file_status, metadata_status, cross_submission_status, fileErrors, is_delete = False, 
status_detail=None, scope=None): + """Update validation/errors in submissions collection (incl. batch status_detail). + + FAILED is only for the validation record; it is not a valid submission status. + When metadata_status is FAILED, the submission's metadata status is not updated. + + When scope is 'new' (case-insensitive), submission metadata status is only updated + if the new result is worse than or equal to the existing (Error > Warning > Passed). + """ + if metadata_status == FAILED: + metadata_status = None updated_submission = {UPDATED_AT: current_datetime()} + if status_detail is not None: + updated_submission[STATUS_DETAIL] = status_detail db = self.client[self.db_name] file_collection = db[SUBMISSION_COLLECTION] overall_metadata_status = None try: if file_status: updated_submission[FILE_VALIDATION_STATUS] = file_status if file_status != "None" else None - if fileErrors and len(fileErrors) > 0: - updated_submission[FILE_ERRORS] = fileErrors + updated_submission[VALIDATION_ENDED] = submission[VALIDATION_ENDED] + if fileErrors is not None: + updated_submission[FILE_ERRORS] = fileErrors if fileErrors and len(fileErrors) > 0 else [] else: updated_submission[FILE_ERRORS] = [] - updated_submission[VALIDATION_ENDED] = submission[VALIDATION_ENDED] + elif fileErrors is not None: + updated_submission[FILE_ERRORS] = fileErrors if fileErrors and len(fileErrors) > 0 else [] if metadata_status: - if not ((is_delete and self.count_docs(DATA_COLlECTION, {SUBMISSION_ID: submission[ID]}) == 0)): - if metadata_status == STATUS_ERROR or metadata_status == STATUS_NEW: + if not ((is_delete and self.count_docs(DATA_COLLECTION, {SUBMISSION_ID: submission[ID]}) == 0)): + if metadata_status in (STATUS_ERROR, STATUS_NEW): overall_metadata_status = metadata_status else: - error_nodes = self.count_docs(DATA_COLlECTION, {SUBMISSION_ID: submission[ID], STATUS: STATUS_ERROR}) + error_nodes = self.count_docs(DATA_COLLECTION, {SUBMISSION_ID: submission[ID], STATUS: STATUS_ERROR}) if 
error_nodes > 0: overall_metadata_status = STATUS_ERROR else: - warning_nodes = self.count_docs(DATA_COLlECTION, {SUBMISSION_ID: submission[ID], STATUS: STATUS_WARNING}) - if warning_nodes > 0: + warning_nodes = self.count_docs(DATA_COLLECTION, {SUBMISSION_ID: submission[ID], STATUS: STATUS_WARNING}) + if warning_nodes > 0: overall_metadata_status = STATUS_WARNING else: overall_metadata_status = metadata_status + # When scope is "new", only update submission metadata status if new result is worse than or equal to existing + if scope and str(scope).lower() == "new" and overall_metadata_status is not None: + current_status = submission.get(METADATA_VALIDATION_STATUS) + # Treat missing/None current status as Passed (precedence 0) so we only update when new result is worse or equal + new_prec = STATUS_PRECEDENCE.get(overall_metadata_status, 0) + current_prec = STATUS_PRECEDENCE.get(current_status, 0) + if new_prec < current_prec: + overall_metadata_status = current_status # check if all file nodes are deleted - if is_delete and (self.count_docs(DATA_COLlECTION, {SUBMISSION_ID: submission[ID], S3_FILE_INFO: {"$exists": True}}) == 0): + if is_delete and (self.count_docs(DATA_COLLECTION, {SUBMISSION_ID: submission[ID], S3_FILE_INFO: {"$exists": True}}) == 0): # if file nodes are all deleted, update file validation status to new if there are still data files in the bucket otherwise set to None updated_submission[FILE_VALIDATION_STATUS] = STATUS_NEW if self.s3_service.submissionHasDataFile(submission) else None if is_delete: @@ -341,13 +365,13 @@ def set_submission_validation_status(self, submission, file_status, metadata_sta self.log.exception(e) self.log.exception(f"Failed to update submission, {submission[ID]}: {get_exception_msg()}") return False - + """ update data records based on node ID in dataRecords """ def update_data_records(self, data_records): db = self.client[self.db_name] - file_collection = db[DATA_COLlECTION] + file_collection = db[DATA_COLLECTION] try: 
result = file_collection.bulk_write([ ReplaceOne( {ID: m[ID]}, remove_id(m), upsert=True) @@ -364,14 +388,14 @@ def update_data_records(self, data_records): self.log.exception(e) msg = f"Failed to update metadata, {get_exception_msg()}" self.log.exception(msg) - return False, msg - + return False, msg + """ update record's status, errors and warnings based on node ID in dataRecords """ def update_data_records_status(self, data_records): db = self.client[self.db_name] - file_collection = db[DATA_COLlECTION] + file_collection = db[DATA_COLLECTION] try: result = file_collection.bulk_write([ UpdateOne( {ID: m[ID]}, @@ -389,13 +413,14 @@ def update_data_records_status(self, data_records): self.log.exception(e) msg = f"Failed to update metadata, {get_exception_msg()}" self.log.exception(msg) - return False, msg + return False, msg + """ update record's status, errors by additional error """ def update_data_records_addition_error(self, data_records): db = self.client[self.db_name] - file_collection = db[DATA_COLlECTION] + file_collection = db[DATA_COLLECTION] try: result = file_collection.bulk_write([ UpdateOne( {ID: m[ID]}, @@ -413,13 +438,14 @@ def update_data_records_addition_error(self, data_records): self.log.exception(e) msg = f"Failed to update metadata, {get_exception_msg()}" self.log.exception(msg) - return False, msg + return False, msg + """ delete dataRecords by nodeIDs """ def delete_data_records(self, nodes): db = self.client[self.db_name] - file_collection = db[DATA_COLlECTION] + file_collection = db[DATA_COLLECTION] try: result = file_collection.bulk_write([ DeleteOne( { SUBMISSION_ID: m[SUBMISSION_ID], NODE_ID: m[NODE_ID], NODE_TYPE: m[NODE_TYPE] }) @@ -442,12 +468,13 @@ def delete_data_records(self, nodes): msg = f"Failed to delete metadata, {get_exception_msg()}" self.log.exception(msg) return False, msg + """ insert batch dataRecords """ def insert_data_records (self, file_records): db = self.client[self.db_name] - file_collection = db[DATA_COLlECTION] 
+ file_collection = db[DATA_COLLECTION] try: result = file_collection.insert_many(file_records) count = len(result.inserted_ids) @@ -463,15 +490,16 @@ def insert_data_records (self, file_records): msg = f"Failed to insert data records, {get_exception_msg()}" self.log.exception(msg) return False, msg + """ retrieve dataRecords by submissionID and scope either New dataRecords or All """ def get_dataRecords(self, submission_id, scope): db = self.client[self.db_name] - file_collection = db[DATA_COLlECTION] + file_collection = db[DATA_COLLECTION] try: query = {'submissionID': {'$eq': submission_id}} - if scope == STATUS_NEW: + if scope and scope.lower() == STATUS_NEW.lower(): query[STATUS] = STATUS_NEW result = list(file_collection.find(query)) count = len(result) @@ -484,17 +512,17 @@ def get_dataRecords(self, submission_id, scope): except Exception as e: self.log.exception(e) self.log.exception(f"{submission_id}: Failed to retrieve data records, {get_exception_msg()}") - return None + return None """ retrieve dataRecord by submissionID and scope either New dataRecords or All in batch """ def get_dataRecords_chunk(self, submission_id, scope, start, size): db = self.client[self.db_name] - file_collection = db[DATA_COLlECTION] + file_collection = db[DATA_COLLECTION] try: query = {SUBMISSION_ID: {'$eq': submission_id}} - if scope == STATUS_NEW: + if scope and scope.lower() == STATUS_NEW.lower(): query[STATUS] = STATUS_NEW result = list(file_collection.find(query).sort({SUBMISSION_ID: 1, "nodeType": 1, "nodeID": 1}).limit(size)) else: @@ -508,13 +536,38 @@ def get_dataRecords_chunk(self, submission_id, scope, start, size): self.log.exception(e) self.log.exception(f"{submission_id}: Failed to retrieve data records, {get_exception_msg()}") return None - + + def get_dataRecords_by_ids(self, data_record_ids): + """Fetch data records by their _id values. + + Used for batched validation where backend specifies exact record IDs. 
+ + An empty data_record_ids list is allowed; the query returns []. + """ + db = self.client[self.db_name] + file_collection = db[DATA_COLLECTION] + try: + query = {ID: {'$in': data_record_ids}} + result = list(file_collection.find(query)) + self.log.info(f'Found {len(result)} data records for {len(data_record_ids)} requested IDs') + if len(result) < len(data_record_ids): + self.log.warning(f'Partial match: found {len(result)} of {len(data_record_ids)} requested data records') + return result + except errors.PyMongoError as pe: + self.log.exception(pe) + self.log.exception(f"Failed to fetch data records by IDs: {get_exception_msg()}") + return None + except Exception as e: + self.log.exception(e) + self.log.exception(f"Failed to fetch data records by IDs: {get_exception_msg()}") + return None + """ retrieve dataRecord by submissionID and nodeType """ def get_dataRecords_chunk_by_nodeType(self, submission_id, node_type, start, size): db = self.client[self.db_name] - file_collection = db[DATA_COLlECTION] + file_collection = db[DATA_COLLECTION] try: query = {SUBMISSION_ID: {'$eq': submission_id}, NODE_TYPE: {'$eq': node_type}} result = list(file_collection.find(query).sort({SUBMISSION_ID: 1, "nodeType": 1, "nodeID": 1}).skip(start).limit(size)) @@ -526,14 +579,14 @@ def get_dataRecords_chunk_by_nodeType(self, submission_id, node_type, start, siz except Exception as e: self.log.exception(e) self.log.exception(f"{submission_id}: Failed to retrieve data records, {get_exception_msg()}") - return None + return None """ retrieve dataRecord by nodeID """ def get_dataRecord_by_node(self, nodeID, nodeType, submission_id): db = self.client[self.db_name] - file_collection = db[DATA_COLlECTION] + file_collection = db[DATA_COLLECTION] try: result = file_collection.find_one({SUBMISSION_ID: submission_id, NODE_ID: nodeID, NODE_TYPE: nodeType}) return result @@ -544,14 +597,14 @@ def get_dataRecord_by_node(self, nodeID, nodeType, submission_id): except Exception as e: 
self.log.exception(e) self.log.exception(f"{submission_id}: Failed to retrieve data record, {get_exception_msg()}") - return None + return None """ find child node by type and id """ def get_nodes_by_parents(self, parent_ids, submission_id): db = self.client[self.db_name] - data_collection = db[DATA_COLlECTION] + data_collection = db[DATA_COLLECTION] query = [] for id in parent_ids: node_type, node_id = id.get(NODE_TYPE), id.get(NODE_ID) @@ -567,13 +620,13 @@ def get_nodes_by_parents(self, parent_ids, submission_id): self.log.exception(e) self.log.exception(f"{submission_id}: Failed to retrieve child nodes: {get_exception_msg()}") return False, None - + """ find child nodes by nodeType, parentType and parentIDProperty and parentID """ def get_nodes_by_parent_prop(self, node_type, parent_prop, submission_id): db = self.client[self.db_name] - data_collection = db[DATA_COLlECTION] + data_collection = db[DATA_COLLECTION] query = {SUBMISSION_ID: submission_id, NODE_TYPE: node_type, PARENTS: {"$elemMatch": {PARENT_TYPE: parent_prop[PARENT_TYPE], PARENT_ID_NAME: parent_prop[PARENT_ID_NAME], PARENT_ID_VAL: parent_prop[PARENT_ID_VAL]}}} try: @@ -586,13 +639,13 @@ def get_nodes_by_parent_prop(self, node_type, parent_prop, submission_id): self.log.exception(e) self.log.exception(f"{submission_id}: Failed to retrieve child nodes: {get_exception_msg()}") return None - + """ find node in other submission with the same study """ def find_node_in_other_submissions_in_status(self, submission_id, studyID, data_common, node_type, nodeId, status_list): db = self.client[self.db_name] - data_collection = db[DATA_COLlECTION] + data_collection = db[DATA_COLLECTION] try: submissions = None # Query submissions by both studyID and dataCommons for proper scoping @@ -620,6 +673,7 @@ def find_node_in_other_submissions_in_status(self, submission_id, studyID, data_ self.log.exception(e) self.log.exception(f"{submission_id}: Failed to retrieve child nodes: {get_exception_msg()}") return False, 
None + """ find submission by query """ @@ -635,13 +689,14 @@ def find_submissions(self, query): except Exception as e: self.log.exception(e) self.log.exception(f"Failed to retrieve submissions:: {get_exception_msg()}") - return False, None + return False, None + """ set dataRecords search index, 'submissionID_nodeType_nodeID' """ def set_search_index_dataRecords(self, submission_index, crdc_index, study_entity_type_index): db = self.client[self.db_name] - data_collection = db[DATA_COLlECTION] + data_collection = db[DATA_COLLECTION] try: index_dict = data_collection.index_information() if not index_dict.get(submission_index): @@ -662,7 +717,7 @@ def set_search_index_dataRecords(self, submission_index, crdc_index, study_entit self.log.exception(e) self.log.exception(f"Failed to set search index: {get_exception_msg()}") return False - + """ set release search index, 'dataCommons_nodeType_nodeID' """ @@ -686,6 +741,7 @@ def set_search_release_index(self, dataCommon_index, crdcID_index): self.log.exception(e) self.log.exception(f"Failed to set search index in release collection: {get_exception_msg()}") return False + """ set synonym search index, 'synonym_term' """ @@ -706,7 +762,7 @@ def set_search_synonym_index(self, synonym_index): self.log.exception(e) self.log.exception(f"Failed to set search index in synonym collection: {get_exception_msg()}") return False - + """ find cached file md5 by submissionID and fileName """ @@ -724,7 +780,7 @@ def get_file_md5(self, submission_id, file_name): self.log.exception(e) self.log.exception(f"{submission_id}: Failed to retrieve data file md5: {get_exception_msg()}") return None - + """ save file md5 info to fileMD5 collection """ @@ -742,7 +798,7 @@ def save_file_md5(self, md5_info): self.log.exception(e) self.log.exception(f"{md5_info[SUBMISSION_ID]}: Failed to save data file md5: {get_exception_msg()}") return False - + """ get release by CRDC_ID """ @@ -760,7 +816,7 @@ def get_release(self, crdc_id): self.log.exception(e) 
self.log.exception(f"Failed to find release record for {crdc_id}: {get_exception_msg()}") return False - + """ get release by dataCommon, nodeType and nodeId """ @@ -778,8 +834,7 @@ def search_release(self, dataCommon, node_type, node_id): self.log.exception(e) self.log.exception(f"Failed to find release record for {dataCommon}/{node_type}/{node_id}: {get_exception_msg()}") return False - - + """ insert release """ @@ -797,6 +852,7 @@ def insert_release(self, release): self.log.exception(e) self.log.exception(f"Failed to insert crdcID record: {get_exception_msg()}") return False + """ update release """ @@ -855,7 +911,7 @@ def search_node_by_study(self, studyID, entity_type, node_id): :return: """ db = self.client[self.db_name] - data_collection = db[DATA_COLlECTION] + data_collection = db[DATA_COLLECTION] try: return data_collection.find_one({STUDY_ID: studyID, ENTITY_TYPE: entity_type, NODE_ID: node_id}) except errors.PyMongoError as pe: @@ -908,7 +964,7 @@ def search_released_node_with_status(self, data_commons, node_type, node_id, sta self.log.exception(e) self.log.exception(f"Failed to find release record for {data_commons}/{node_type}/{node_id}: {get_exception_msg()}") return False - + """ find child node by type and id """ @@ -952,7 +1008,7 @@ def find_released_nodes_by_parent(self, node_type, data_commons, parent_node): self.log.exception(e) self.log.exception(f"Failed to find release record for {data_commons}/{parent_node[PARENT_TYPE]}/{parent_node[PARENT_ID_VAL]}: {get_exception_msg()}") return None - + """ count documents in a given collection and conditions """ @@ -969,23 +1025,88 @@ def count_docs(self, collection, query): self.log.exception(e) self.log.exception(f"Failed to count documents for collection, {collection} at conditions {query}") return False - """ - update validation status - """ - def update_validation_status(self, validation_id, status, validation_end_at, validation_type=None): + + def increment_completed_batches(self, validation_id, 
total_batches, + batch_failed=False, batch_status=None, status_detail=None, + submission_id=None, batch_index=None): + """Atomically increment completedBatches counter for a validation. + + When batch_failed is True, also increments failedBatches. + When batch_status is provided, tracks the worst status via $max. + When status_detail is provided, appends it to batchStatusDetails via $push. + + submission_id and batch_index are optional; when provided (e.g. by batched + metadata validation), they are included in log messages. + + Returns 5-tuple: (completed_count, is_last_batch, failed_count, + worst_status_str, batch_details). + """ + db = self.client[self.db_name] + validation_collection = db[VALIDATION_COLLECTION] + log_ctx = f'validation_id={validation_id}' + if submission_id is not None: + log_ctx += f' submission_id={submission_id}' + if batch_index is not None: + log_ctx += f' batch={batch_index + 1}/{total_batches}' + try: + inc_fields = {COMPLETED_BATCHES: 1} + if batch_failed: + inc_fields[FAILED_BATCHES] = 1 + update_ops = {'$inc': inc_fields} + if batch_status is not None: + if batch_status not in STATUS_PRECEDENCE: + self.log.warning(f"Unknown batch_status '{batch_status}', treating as worst (Error)") + precedence = STATUS_PRECEDENCE.get(batch_status, STATUS_PRECEDENCE[STATUS_ERROR]) + update_ops['$max'] = {WORST_BATCH_STATUS: precedence} + if status_detail: + update_ops['$push'] = {BATCH_STATUS_DETAILS: status_detail} + result = validation_collection.find_one_and_update( + {ID: validation_id}, + update_ops, + return_document=ReturnDocument.AFTER + ) + if result: + completed = result.get(COMPLETED_BATCHES, 0) + failed = result.get(FAILED_BATCHES, 0) + is_last = completed >= total_batches + worst = PRECEDENCE_TO_STATUS.get(result.get(WORST_BATCH_STATUS, 0), STATUS_PASSED) + details = result.get(BATCH_STATUS_DETAILS, []) + self.log.info(f'Validation {log_ctx}: completed {completed}/{total_batches} batches, {failed} failed') + return completed, is_last, 
failed, worst, details + else: + self.log.error(f'Validation document not found: {log_ctx}') + return None, False, 0, None, [] + except errors.PyMongoError as pe: + self.log.exception(pe) + self.log.exception(f"Failed to increment completed batches for {log_ctx}: {get_exception_msg()}") + return None, False, 0, None, [] + except Exception as e: + self.log.exception(e) + self.log.exception(f"Failed to increment completed batches for {log_ctx}: {get_exception_msg()}") + return None, False, 0, None, [] + + def update_validation_status(self, validation_id, status, validation_end_at, validation_type=None, status_detail=None, submission_id=None): + """Update validation status. + + submission_id is optional; when provided (e.g. by batched metadata validation), + it is included in log messages. + """ db = self.client[self.db_name] data_collection = db[VALIDATION_COLLECTION] update_status = True update_status_value = status update_validation_end_at_value = validation_end_at + log_ctx = f'validation_id={validation_id}' + if submission_id is not None: + log_ctx += f' submission_id={submission_id}' try: validation_document = data_collection.find_one({ID: validation_id}) if validation_document is None: - self.log.error(f"No validation document found for ID: {validation_id}") + self.log.error(f"No validation document found for {log_ctx}") return False - #validation_update_dict = {STATUS: update_status_value, "ended": update_validation_end_at_value} validation_update_dict = {} - # add the file_status or metadata_status to the update dict and document + if status_detail is not None: + validation_update_dict[STATUS_DETAIL] = status_detail if validation_type: if validation_type == METADATA_VALIDATION: validation_update_dict[METADATA_ENDED] = update_validation_end_at_value @@ -1019,12 +1140,13 @@ def update_validation_status(self, validation_id, status, validation_end_at, val return True if result.modified_count > 0 and update_status else False except errors.PyMongoError as pe: 
self.log.exception(pe) - self.log.exception(f"Failed to update validation status for {validation_id}: {get_exception_msg()}") + self.log.exception(f"Failed to update validation status for {log_ctx}: {get_exception_msg()}") return False except Exception as e: self.log.exception(e) - self.log.exception(f"Failed to update validation status for {validation_id}: {get_exception_msg()}") + self.log.exception(f"Failed to update validation status for {log_ctx}: {get_exception_msg()}") return False + """ get bucket name based on dataCommons and type """ @@ -1042,7 +1164,7 @@ def get_bucket_name(self, type, dataCommon): except Exception as e: self.log.exception(e) self.log.exception(f"Failed to get bucket name: {get_exception_msg()}") - return None + return None def insert_cde(self, cde_list): db = self.client[self.db_name] @@ -1064,7 +1186,39 @@ def insert_cde(self, cde_list): msg = f"Failed to upsert CDE PV, {get_exception_msg()}" self.log.exception(msg) return False, msg - + + def upsert_property_pv(self, prop_list): + db = self.client[self.db_name] + data_collection = db["propertyPVs"] + commands = [] + try: + for m in list(prop_list): + query = {PROPERTY: m[PROPERTY], VERSION: m[VERSION], MODEL: m[MODEL]} + property = data_collection.find_one(query) + if property: + property[UPDATED_AT] = current_datetime() + property[PROPERTY_PERMISSIBLE_VALUES] = m[PROPERTY_PERMISSIBLE_VALUES] + commands.append(UpdateOne({ID: property[ID]}, {"$set": property})) + else: + m[CREATED_AT] = current_datetime() + m[UPDATED_AT] = current_datetime() + m[ID] = get_uuid_str() + commands.append(InsertOne(m)) + if len(commands) > 0: + result = data_collection.bulk_write(commands) + self.log.info(f'Total {result.inserted_count} property PV are inserted and {result.modified_count} property PV are updated.') + return True, None + except errors.PyMongoError as pe: + self.log.exception(pe) + msg = f"Failed to upsert property PV, {get_exception_msg()}"
+ self.log.exception(msg) + return False, msg + except Exception as e: + self.log.exception(e) + msg = f"Failed to upsert property PV, {get_exception_msg()}" + self.log.exception(msg) + return False, msg + def upsert_cde(self, cde_list): db = self.client[self.db_name] data_collection = db[CDE_COLLECTION] @@ -1097,6 +1251,7 @@ def upsert_cde(self, cde_list): msg = f"Failed to upsert CDE PV, {get_exception_msg()}" self.log.exception(msg) return False, msg + """ set CDE search index, 'CDECode_1_CDEVersion_1' """ @@ -1117,6 +1272,7 @@ def set_search_cde_index(self, cde_search_index): self.log.exception(e) self.log.exception(f"Failed to set search index in CDE collection: {get_exception_msg()}") return False + """ get CDE permissible values """ @@ -1136,6 +1292,27 @@ def get_cde_permissible_values(self, cde_code, cde_version): self.log.exception(e) self.log.exception(f"Failed to get permissible values for {cde_code}/{cde_version}: {get_exception_msg()}") return None + + def get_property_permissible_values(self, model, version, prop): + prop_key = f"{model}_{version}_{prop}" + if self.props.get(prop_key) is not None: + return self.props.get(prop_key) + db = self.client[self.db_name] + data_collection = db["propertyPVs"] + query = {PROPERTY: prop, VERSION: version, MODEL: model} + try: + property_result = data_collection.find_one(query, sort=[( VERSION, DESCENDING )]) #find latest version + self.props[prop_key] = property_result + return property_result + except errors.PyMongoError as pe: + self.log.exception(pe) + self.log.exception(f"Failed to get permissible values for {prop}/{version}: {get_exception_msg()}") + return None + except Exception as e: + self.log.exception(e) + self.log.exception(f"Failed to get permissible values for {prop}/{version}: {get_exception_msg()}") + return None + """ get qc record by qc_id :param qc_id: @@ -1172,6 +1349,7 @@ def delete_qcRecord(self, qc_id): self.log.exception(e) self.log.exception(f"Failed to delete qc record for {qc_id}: 
{get_exception_msg()}") return False + """ delete qc records by qc_id list :param qc_id: @@ -1184,11 +1362,11 @@ def delete_qcRecords(self, qc_ids): return True if result.deleted_count > 0 else False except errors.PyMongoError as pe: self.log.exception(pe) - self.log.exception(f"Failed to delete qc record for {qc_id}: {get_exception_msg()}") + self.log.exception(f"Failed to delete qc records for {qc_ids}: {get_exception_msg()}") return False except Exception as e: self.log.exception(e) - self.log.exception(f"Failed to delete qc record for {qc_id}: {get_exception_msg()}") + self.log.exception(f"Failed to delete qc records for {qc_ids}: {get_exception_msg()}") return False """ @@ -1215,6 +1393,7 @@ def save_qc_results(self, qc_list): msg = f"Failed to upsert QC records, {get_exception_msg()}" self.log.exception(msg) return False, msg + """ get configuration by env var :param env_var: @@ -1233,16 +1412,25 @@ def get_configuration_by_ev_var(self, env_var_list): self.log.exception(e) self.log.exception(f"Failed to get configurations for {env_var_list}: {get_exception_msg()}") return None + """ - find synonym records in synonyms collection by synonym term + find synonym records in synonyms collection by synonym term. + Stored synonym terms are lowercase; lookup normalizes the input the same way as ingest (strip + lower). 
:param synonym: """ def find_pvs_by_synonym(self, synonym): + term = str(synonym).strip().lower() + if not term: + return [] + if term in self._pvs_by_synonym_cache: + return list(self._pvs_by_synonym_cache[term]) db = self.client[self.db_name] data_collection = db[SYNONYM_COLLECTION] - query ={SYNONYM_TERM: {"$regex": f"^{re.escape(synonym)}$", "$options": "i"}} #case-insensitive , + query = {SYNONYM_TERM: term} try: - return list(data_collection.find(query)) + results = list(data_collection.find(query)) + self._pvs_by_synonym_cache[term] = results + return list(results) except errors.PyMongoError as pe: self.log.exception(pe) self.log.exception(f"Failed to get synonyms for {synonym}: {get_exception_msg()}") @@ -1251,6 +1439,7 @@ def find_pvs_by_synonym(self, synonym): self.log.exception(e) self.log.exception(f"Failed to get synonyms for {synonym}: {get_exception_msg()}") return None + """ upsert synonym records :param synonym_list @@ -1282,6 +1471,7 @@ def insert_synonyms(self, synonym_list): msg = f"Failed to upsert synonyms, {get_exception_msg()}" self.log.exception(msg) return None + """ upsert pv concept codes :param concept_codes @@ -1313,16 +1503,50 @@ def insert_concept_codes(self, concept_codes): msg = f"Failed to upsert concept code, {get_exception_msg()}" self.log.exception(msg) return None + + def insert_concept_codes_v2(self, concept_codes): + db = self.client[self.db_name] + data_collection = db[PV_CONCEPT_CODE_COLLECTION] + to_insert = [] + try: + for item in concept_codes: + concept_code = {MODEL: item[0], PROPERTY: item[1], PERMISSIBLE_VALUE: item[2], CONCEPT_CODE: item[3]} + # skip concept codes that already exist + existing_concept_code = data_collection.find_one(concept_code) + if existing_concept_code: + continue + to_insert.append({ID: get_uuid_str(), **concept_code}) + + if len(to_insert) == 0: + return 0 + result = data_collection.insert_many(to_insert) + return len(result.inserted_ids) + except errors.PyMongoError as pe: + self.log.exception(pe) + msg =
f"Failed to upsert concept code, {get_exception_msg()}" + self.log.exception(msg) + return None + except Exception as e: + self.log.exception(e) + msg = f"Failed to upsert concept code, {get_exception_msg()}" + self.log.exception(msg) + return None + """ get concept code by pv :param pv """ - def get_concept_code_by_pv(self, cde, pv): + def get_concept_code_by_pv(self, property, model, pv): + pv_key = f"{property}_{model}_{pv}" + if self.concept_codes.get(pv_key) is not None: + return self.concept_codes.get(pv_key) db = self.client[self.db_name] data_collection = db[PV_CONCEPT_CODE_COLLECTION] - query = {CDE_CODE: cde, PERMISSIBLE_VALUE: pv} + query = {PROPERTY: property, MODEL: model, PERMISSIBLE_VALUE: pv} try: - return data_collection.find_one(query) + pv_result = data_collection.find_one(query) + self.concept_codes[pv_key] = pv_result + return pv_result except errors.PyMongoError as pe: self.log.exception(pe) self.log.exception(f"Failed to get concept code for {pv}: {get_exception_msg()}") @@ -1331,7 +1555,7 @@ def get_concept_code_by_pv(self, cde, pv): self.log.exception(e) self.log.exception(f"Failed to get concept code for {pv}: {get_exception_msg()}") return None - + """ find study by study_id :param study_id @@ -1403,7 +1627,7 @@ def find_user_by_id(self, id): def find_grandparent_by_parent(self, parentType, parentIDValue, submissionID, dataCommon): db = self.client[self.db_name] - data_collection = db[DATA_COLlECTION] + data_collection = db[DATA_COLLECTION] query = {SUBMISSION_ID: submissionID, NODE_TYPE: parentType, NODE_ID: parentIDValue} data_collection_release = db[RELEASE] query_release = {DATA_COMMON_NAME: dataCommon, NODE_TYPE: parentType, NODE_ID: parentIDValue} @@ -1428,10 +1652,8 @@ def find_grandparent_by_parent(self, parentType, parentIDValue, submissionID, da self.log.exception(f"Failed to get grandparent for {parentIDValue}: {get_exception_msg()}") return None -""" -remove _id from records for update -""" def remove_id (data_record): + 
"""Remove _id from records for update.""" data = {} for k in data_record.keys(): if k == ID: diff --git a/src/common/utils.py b/src/common/utils.py index 3a90f866..4d80e5ff 100644 --- a/src/common/utils.py +++ b/src/common/utils.py @@ -12,7 +12,7 @@ from bento.common.utils import get_stream_md5 from datetime import datetime import uuid -from common.constants import QC_SEVERITY, LAST_MODIFIED +from common.constants import QC_SEVERITY, LAST_MODIFIED, PROPERTY_PERMISSIBLE_VALUES VALIDATION_MESSAGE_CONFIG_FILE = "configs/messages_configuration.yml" VALIDATION_MESSAGES = "Messages" @@ -327,7 +327,13 @@ def convert_file_size(size_bytes): p = np.power(1024, i) s = round(size_bytes / p, 2) return "%s %s" % (s, size_name[i]) - - +def has_permissive_value(prop): + """ + check if the property has permissive values + """ + if prop.get(PROPERTY_PERMISSIBLE_VALUES) is not None: + if len(prop[PROPERTY_PERMISSIBLE_VALUES]) > 0: + return True, prop[PROPERTY_PERMISSIBLE_VALUES] + return False, None diff --git a/src/config.py b/src/config.py index 2f05e68a..ddf4c31e 100644 --- a/src/config.py +++ b/src/config.py @@ -3,10 +3,10 @@ import yaml from common.constants import MONGO_DB, SQS_NAME, DB, MODEL_FILE_DIR, SERVICE_TYPE_PV_PULLER,\ LOADER_QUEUE, SERVICE_TYPE, SERVICE_TYPE_ESSENTIAL, SERVICE_TYPE_FILE, SERVICE_TYPE_METADATA, \ - SERVICE_TYPES, DB, FILE_QUEUE, METADATA_QUEUE, TIER, TIER_CONFIG, SERVICE_TYPE_EXPORT, EXPORTER_QUEUE,\ + SERVICE_TYPES, DB, FILE_QUEUE, METADATA_QUEUE, STS_API_ALL_URL_V2, TIER, TIER_CONFIG, SERVICE_TYPE_EXPORT, EXPORTER_QUEUE,\ DM_BUCKET_CONFIG_NAME, PROD_BUCKET_CONFIG_NAME, DATASYNC_ROLE_ARN_CONFIG , DATASYNC_ROLE_ARN_ENV, CONFIG_TYPE, \ - CONFIG_KEY, CDE_API_URL, SYNONYM_API_URL, DATASYNC_LOG_ARN_ENV, DATASYNC_LOG_ARN_CONFIG, STS_RESOURCE_CONFIG_TYPE,\ - STS_DATA_RESOURCE_CONFIG, STS_DUMP_CONFIG, STS_API_ALL_URL, STS_API_ONE_URL + CONFIG_KEY, DATASYNC_LOG_ARN_ENV, DATASYNC_LOG_ARN_CONFIG, STS_RESOURCE_CONFIG_TYPE,\ + STS_DATA_RESOURCE_CONFIG, 
STS_DUMP_CONFIG, STS_API_ALL_URL, STS_API_ONE_URL, STS_API_ONE_URL_V2 from bento.common.utils import get_logger from common.utils import clean_up_key_value, get_exception_msg, load_message_config from common.mongo_dao import MongoDao @@ -120,11 +120,12 @@ def validate(self): self.log.critical(f'No sts resource is configured in both env and args!') return False else: - self.data[STS_DATA_RESOURCE_CONFIG] = sts_resource[STS_DATA_RESOURCE_CONFIG] if sts_resource.get(STS_DATA_RESOURCE_CONFIG) else None - self.data[STS_DUMP_CONFIG] = sts_resource[STS_DUMP_CONFIG] if sts_resource.get(STS_DUMP_CONFIG) else None - self.data[STS_API_ALL_URL] = sts_resource[STS_API_ALL_URL] if sts_resource.get(STS_API_ALL_URL) else None - self.data[STS_API_ONE_URL] = sts_resource[STS_API_ONE_URL] if sts_resource.get(STS_API_ONE_URL) else None - + self.data[STS_DATA_RESOURCE_CONFIG] = sts_resource.get(STS_DATA_RESOURCE_CONFIG) + self.data[STS_DUMP_CONFIG] = sts_resource.get(STS_DUMP_CONFIG) + self.data[STS_API_ALL_URL] = sts_resource.get(STS_API_ALL_URL) + self.data[STS_API_ONE_URL] = sts_resource.get(STS_API_ONE_URL) + self.data[STS_API_ALL_URL_V2] = sts_resource.get(STS_API_ALL_URL_V2) + self.data[STS_API_ONE_URL_V2] = sts_resource.get(STS_API_ONE_URL_V2) # load configured customized message to memory if self.data[SERVICE_TYPE] in [SERVICE_TYPE_METADATA, SERVICE_TYPE_FILE]: message_config = load_message_config() diff --git a/src/essential_validator.py b/src/essential_validator.py index 6f3382c3..7d91921e 100644 --- a/src/essential_validator.py +++ b/src/essential_validator.py @@ -12,7 +12,8 @@ ERRORS, S3_DOWNLOAD_DIR, SQS_NAME, BATCH_ID, BATCH_STATUS_UPLOADED, SQS_TYPE, TYPE_LOAD, STATUS_PASSED,\ BATCH_STATUS_FAILED, ID, FILE_NAME, TYPE, FILE_PREFIX, MODEL_VERSION, MODEL_FILE_DIR, \ TIER_CONFIG, STATUS_ERROR, STATUS_NEW, SERVICE_TYPE_ESSENTIAL, SUBMISSION_ID, SUBMISSION_INTENTION_DELETE, NODE_TYPE, \ - SUBMISSION_INTENTION, TYPE_DELETE, BATCH_BUCKET, METADATA_VALIDATION_STATUS, 
STATUS_WARNING, DCF_PREFIX, NODE_IDS, DELETE_ALL, EXCLUSIVE_IDS + SUBMISSION_INTENTION, TYPE_DELETE, BATCH_BUCKET, METADATA_VALIDATION_STATUS, STATUS_WARNING, DCF_PREFIX, NODE_IDS, DELETE_ALL, EXCLUSIVE_IDS, \ + DELETE_ORPHANED_DATA_FILES, FILE_ERRORS from common.utils import cleanup_s3_download_dir, get_exception_msg, dump_dict_to_json, removeTailingEmptyColumnsAndRows, validate_uuid_by_rex, get_date_time from common.model_store import ModelFactory from metadata_remover import MetadataRemover @@ -117,26 +118,33 @@ def essentialValidate(configs, job_queue, mongo_dao): extender = VisibilityExtender(msg, VISIBILITY_TIMEOUT) submission_id = data.get(SUBMISSION_ID) node_type = data.get(NODE_TYPE) - node_ids = data.get(NODE_IDS) + node_ids = data.get(NODE_IDS) or [] delete_all = data.get(DELETE_ALL) - exclusive_ids = data.get(EXCLUSIVE_IDS) + exclusive_ids = data.get(EXCLUSIVE_IDS) or [] + delete_orphaned_data_files = data.get(DELETE_ORPHANED_DATA_FILES, False) validator = MetadataRemover(mongo_dao, model_store) + orphan_errors = [] + result = None try: if delete_all: # get all node ids for the node type in the submission node_type_ids = mongo_dao.search_nodes_by_type_and_submission(node_type, submission_id, exclusive_ids) node_ids = node_type_ids - result = validator.remove_metadata(submission_id, node_type, node_ids) - except Exception as e: # catch any unhandled errors + result, orphan_errors = validator.remove_metadata(submission_id, node_type, node_ids, delete_orphaned_data_files) + except Exception: error = f'{submission_id}: Failed to delete metadata, {get_exception_msg()}!' log.error(error) + orphan_errors = [] finally: - #5. update submission's metadataValidationStatus - if validator.submission: - status = validator.submission.get(METADATA_VALIDATION_STATUS) - # only need update the status if error or warning. In the dao function will check the count of error or warning to get real time status. 
- status = STATUS_PASSED if status in [STATUS_ERROR, STATUS_WARNING] else status - mongo_dao.set_submission_validation_status(validator.submission, None, status, None, None, True) + # Only update when remove_metadata returned (result is True or False); skip if exception before return + if validator.submission and result is not None: + fresh_submission = mongo_dao.get_submission(submission_id) + submission_for_update = fresh_submission or validator.submission + existing = (fresh_submission.get(FILE_ERRORS) or []) if fresh_submission else (validator.submission.get(FILE_ERRORS) or []) + combined_file_errors = existing + (orphan_errors if result and orphan_errors else []) + status = submission_for_update.get(METADATA_VALIDATION_STATUS) + status = STATUS_PASSED if status in [STATUS_ERROR, STATUS_WARNING] else status + mongo_dao.set_submission_validation_status(submission_for_update, None, status, None, combined_file_errors, True) else: log.error(f'Invalid message: {data}!') diff --git a/src/file_validator.py b/src/file_validator.py index a1922840..d7326f2f 100644 --- a/src/file_validator.py +++ b/src/file_validator.py @@ -8,7 +8,8 @@ from common.constants import ERRORS, WARNINGS, STATUS, S3_FILE_INFO, ID, SIZE, MD5, UPDATED_AT, \ FILE_NAME, SQS_TYPE, SQS_NAME, FILE_ID, STATUS_ERROR, STATUS_WARNING, STATUS_PASSED, SUBMISSION_ID, \ BATCH_BUCKET, SERVICE_TYPE_FILE, LAST_MODIFIED, CREATED_AT, TYPE, SUBMISSION_INTENTION, SUBMISSION_INTENTION_DELETE,\ - VALIDATION_ID, VALIDATION_ENDED, QC_RESULT_ID, VALIDATION_TYPE_FILE, QC_SEVERITY, QC_VALIDATE_DATE, FILE_VALIDATION + VALIDATION_ID, VALIDATION_ENDED, QC_RESULT_ID, VALIDATION_TYPE_FILE, QC_SEVERITY, QC_VALIDATE_DATE, FILE_VALIDATION, \ + DATA_FILE_TYPE, QC_VALIDATION_TYPE, SUBMITTED_ID, BATCH_ID, DISPLAY_ID, UPLOADED_DATE from common.utils import get_exception_msg, current_datetime, get_s3_file_info, get_s3_file_md5, create_error, get_uuid_str from service.ecs_agent import set_scale_in_protection from metadata_validator 
import get_qc_result @@ -330,21 +331,21 @@ def validate_all_files(self, submission_id): file_name = file.key.split('/')[-1] if file_name not in manifest_file_names: - file_batch = self.mongo_dao.find_batch_by_file_name(submission_id, "data file", file_name) + file_batch = self.mongo_dao.find_batch_by_file_name(submission_id, DATA_FILE_TYPE, file_name) batchID = file_batch[ID] if file_batch else "-" - displayID = file_batch["displayID"] if file_batch else None + displayID = file_batch[DISPLAY_ID] if file_batch else None msg = f'Data file “{file_name}”: associated metadata not found. Please upload associated metadata (aka. manifest) file' self.log.error(msg) error = { - TYPE: "data file", - "validationType": "data file", - "submittedID": file_name, - "batchID": batchID, - "displayID": displayID, - "severity": "Error", - "uploadedDate": file.last_modified, - "validatedDate": current_datetime(), - "errors": [create_error("F008", [file_name], "file name", file_name)] + TYPE: DATA_FILE_TYPE, + QC_VALIDATION_TYPE: DATA_FILE_TYPE, + SUBMITTED_ID: file_name, + BATCH_ID: batchID, + DISPLAY_ID: displayID, + QC_SEVERITY: STATUS_ERROR, + UPLOADED_DATE: file.last_modified, + QC_VALIDATE_DATE: current_datetime(), + ERRORS: [create_error("F008", [file_name], "file name", file_name)] } errors.append(error) missing_count += 1 diff --git a/src/metadata_remover.py b/src/metadata_remover.py index a117e870..ea5b3534 100644 --- a/src/metadata_remover.py +++ b/src/metadata_remover.py @@ -6,9 +6,14 @@ import os from bento.common.utils import get_logger from bento.common.s3 import S3Bucket -from common.constants import DATA_COMMON_NAME,NODE_ID, FILE_NAME, MODEL_VERSION, ROOT_PATH, \ - SUBMISSION_ID,NODE_TYPE, S3_FILE_INFO, BATCH_BUCKET, PARENT_TYPE, PARENTS -from common.utils import get_exception_msg +from common.constants import ( + DATA_COMMON_NAME, NODE_ID, FILE_NAME, MODEL_VERSION, ROOT_PATH, + SUBMISSION_ID, NODE_TYPE, S3_FILE_INFO, BATCH_BUCKET, PARENT_TYPE, PARENT_ID_VAL, PARENTS, 
ID, TYPE, + DATA_FILE_TYPE, S3_LIST_ORPHANS_PAGE_SIZE, + SUBMITTED_ID, QC_VALIDATION_TYPE, BATCH_ID, DISPLAY_ID, QC_SEVERITY, + UPLOADED_DATE, QC_VALIDATE_DATE, ERRORS, STATUS_ERROR, +) +from common.utils import get_exception_msg, create_error, current_datetime """ Process delete metadata requests. @@ -17,6 +22,7 @@ class MetadataRemover: def __init__(self, mongo_dao, model_store): self.fileList = [] #list of files object {file_name, file_path, file_size, invalid_reason} + self.errors = [] self.log = get_logger('Essential Validator') self.mongo_dao = mongo_dao self.model_store = model_store @@ -28,19 +34,29 @@ def __init__(self, mongo_dao, model_store): self.bucket = None self.def_file_nodes = None - def remove_metadata(self, submission_id, node_type, node_ids): + def remove_metadata(self, submission_id, node_type, node_ids, delete_orphaned_data_files=False): + """ + Delete metadata dataRecords for the given submission / node type / ids. + + delete_orphaned_data_files: + False (default): Remove Mongo dataRecords only (including cascaded children). + Do not delete S3 objects for removed file nodes; orphan scan still emits F008 + for unreferenced keys and does not delete them from S3. + True: Also delete S3 objects for removed file nodes during the cascade, delete + orphan keys in the post-pass scan, and emit F008 as today. + """ msg = None try: #1 validate submission submission = self.mongo_dao.get_submission(submission_id) - if not submission or not submission.get(DATA_COMMON_NAME): + if not submission: msg = f'Invalid submission, no record found, {submission_id}!' self.log.error(msg) - return False - if not submission.get(DATA_COMMON_NAME): - msg = f'Invalid submission, no datacommon found, {submission_id}!' + return (False, []) + if not submission.get(DATA_COMMON_NAME): + msg = f'Invalid submission, missing {DATA_COMMON_NAME}, {submission_id}!' 
self.log.error(msg) - return False + return (False, []) self.submission = submission self.datacommon = submission.get(DATA_COMMON_NAME) self.submission_id = submission_id @@ -50,19 +66,21 @@ def remove_metadata(self, submission_id, node_type, node_ids): if not self.model.model or not self.model.get_nodes(): msg = f'{self.datacommon} model version "{model_version}" is not available.' self.log.error(msg) - return False + return (False, []) self.def_file_nodes = self.model.get_file_nodes() self.bucket = S3Bucket(submission.get(BATCH_BUCKET)) - #2. validate meatadata for the type and ids + #2. validate metadata for the type and ids existed_nodes = self.validate_data(submission_id, node_type, node_ids) if not existed_nodes or len(existed_nodes) == 0: - return False - return self.delete_nodes(existed_nodes) - except Exception as e: - self.log.exception(e) - msg = f'Failed to delete metadata, {get_exception_msg()}!' - self.log.exception(msg) - return False + return (False, []) + if not self.delete_nodes(existed_nodes, delete_orphaned_data_files): + return (False, []) + #3. after successful delete: find orphaned files, optionally delete them, build F008 errors + orphan_errors = self._find_orphaned_files_and_build_errors(submission_id, delete_orphaned_data_files) + return True, orphan_errors + except Exception: + self.log.exception(f'Failed to delete metadata, {get_exception_msg()}!') + return False, [] def validate_data(self, submission_id, node_type, node_ids): """ @@ -88,16 +106,23 @@ def validate_data(self, submission_id, node_type, node_ids): return existed_nodes - def delete_nodes(self, existed_nodes): + def delete_nodes(self, existed_nodes, delete_orphaned_data_files=False): """ - remove metadata + Remove dataRecords for the given nodes. When delete_orphaned_data_files is True, + also remove their S3 file objects; when False, skip S3 for this batch and rely on + the orphan pass for F008 reporting. 
""" if len(existed_nodes) == 0: return True deleted_file_nodes = [node[S3_FILE_INFO] for node in existed_nodes if node.get(S3_FILE_INFO)] try: if self.mongo_dao.delete_data_records(existed_nodes): - return self.delete_files_in_s3(deleted_file_nodes) and self.process_children(existed_nodes) + s3_ok = ( + self.delete_files_in_s3(deleted_file_nodes) + if delete_orphaned_data_files + else True + ) + return s3_ok and self.process_children(existed_nodes, delete_orphaned_data_files) else: self.errors.append(f'deleting metadata failed with database error. Please try again and contact the helpdesk if this error persists.') return False @@ -106,10 +131,12 @@ def delete_nodes(self, existed_nodes): self.log.exception(msg) return False - """ - process related children record in dataRecords - """ - def process_children(self, deleted_nodes): + def process_children(self, deleted_nodes, delete_orphaned_data_files=False): + """ + Update or delete child dataRecords after parents are removed. + When delete_orphaned_data_files is False, child file nodes are removed from Mongo + but their S3 objects are left in place (orphan scan reports F008). 
+ """ # retrieve child nodes status, child_nodes = self.mongo_dao.get_nodes_by_parents(deleted_nodes, self.submission_id) if not status: # if exception occurred @@ -123,10 +150,13 @@ def process_children(self, deleted_nodes): deleted_child_nodes = [] updated_child_nodes = [] file_nodes = [] - parent_types = [item[NODE_TYPE] for item in deleted_nodes] + deleted_parent_keys = {(item[NODE_TYPE], item[NODE_ID]) for item in deleted_nodes} file_def_types = self.def_file_nodes.keys() for node in child_nodes: - parents = list(filter(lambda x: (x[PARENT_TYPE] not in parent_types), node.get(PARENTS))) + parents = [ + p for p in (node.get(PARENTS) or []) + if (p.get(PARENT_TYPE), p.get(PARENT_ID_VAL)) not in deleted_parent_keys + ] if len(parents) == 0: #delete if no other parents deleted_child_nodes.append(node) if node.get(NODE_TYPE) in file_def_types and node.get(S3_FILE_INFO): @@ -146,10 +176,13 @@ def process_children(self, deleted_nodes): if len(deleted_child_nodes) > 0: deleted_results = self.mongo_dao.delete_data_records(deleted_child_nodes) if updated_results and deleted_results: - #delete files - result = self.delete_files_in_s3(file_nodes) + result = ( + self.delete_files_in_s3(file_nodes) + if delete_orphaned_data_files + else True + ) if result: # delete grand children... - if not self.process_children(deleted_child_nodes): + if not self.process_children(deleted_child_nodes, delete_orphaned_data_files): self.errors.append(f'deleting metadata failed with database error. Please try again and contact the helpdesk if this error persists.') rtn_val = rtn_val and False else: @@ -158,7 +191,82 @@ def process_children(self, deleted_nodes): self.errors.append(f'Deleting metadata failed with database error. 
Please try again and contact the helpdesk if this error persists.') rtn_val = rtn_val and False return rtn_val - + + def _process_s3_list_page(self, response, manifest_file_names, orphan_s3_infos): + """Process one page of list_objects_v2 response; append orphan items to orphan_s3_infos. Return NextContinuationToken.""" + for item in response.get("Contents") or []: + obj_key = item.get("Key") or "" + if "/log" in obj_key: + continue + file_name = obj_key.split("/")[-1] + if not file_name or file_name in manifest_file_names: + continue + orphan_s3_infos.append({ + FILE_NAME: file_name, + "last_modified": item.get("LastModified"), + }) + return response.get("NextContinuationToken") + + def _find_orphaned_files_and_build_errors(self, submission_id, delete_orphaned_data_files): + """ + After metadata deletion: find S3 keys under file/ not referenced by any remaining dataRecord. + Always returns F008-shaped errors for those orphans. + If delete_orphaned_data_files is True, also delete those orphan objects from S3. + If False, only report F008 (objects remain in the bucket). 
+ """ + if not self.bucket or not self.root_path: + return [] + orphan_errors = [] + try: + manifest_info_list = self.mongo_dao.get_files_by_submission(submission_id) or [] + manifest_file_names = set() + for manifest_info in manifest_info_list: + if manifest_info.get(S3_FILE_INFO) and manifest_info[S3_FILE_INFO].get(FILE_NAME): + manifest_file_names.add(manifest_info[S3_FILE_INFO][FILE_NAME]) + + # S3 keys use forward slashes; paginate list_objects_v2 (first page, then while token) + key = (os.path.join(self.root_path, "file") + "/").replace("\\", "/") + orphan_s3_infos = [] + + response = self.bucket.client.list_objects_v2( + Bucket=self.bucket.bucket_name, + Prefix=key, + MaxKeys=S3_LIST_ORPHANS_PAGE_SIZE, + ) + continuation_token = self._process_s3_list_page(response, manifest_file_names, orphan_s3_infos) + while continuation_token: + response = self.bucket.client.list_objects_v2( + Bucket=self.bucket.bucket_name, + Prefix=key, + MaxKeys=S3_LIST_ORPHANS_PAGE_SIZE, + ContinuationToken=continuation_token, + ) + continuation_token = self._process_s3_list_page(response, manifest_file_names, orphan_s3_infos) + + if delete_orphaned_data_files and orphan_s3_infos: + self.delete_files_in_s3([{FILE_NAME: info[FILE_NAME]} for info in orphan_s3_infos]) + + for info in orphan_s3_infos: + file_name = info[FILE_NAME] + file_batch = self.mongo_dao.find_batch_by_file_name(submission_id, DATA_FILE_TYPE, file_name) + batch_id = file_batch[ID] if file_batch else "-" + display_id = file_batch.get(DISPLAY_ID) if file_batch else None + error = { + TYPE: DATA_FILE_TYPE, + QC_VALIDATION_TYPE: DATA_FILE_TYPE, + SUBMITTED_ID: file_name, + BATCH_ID: batch_id, + DISPLAY_ID: display_id, + QC_SEVERITY: STATUS_ERROR, + UPLOADED_DATE: info.get("last_modified"), + QC_VALIDATE_DATE: current_datetime(), + ERRORS: [create_error("F008", [file_name], "file name", file_name)], + } + orphan_errors.append(error) + except Exception: + self.log.exception(f"Failed to find orphaned files or build F008 
errors: {get_exception_msg()}") + return orphan_errors + """ delete files in s3 after deleted file nodes """ diff --git a/src/metadata_validator.py b/src/metadata_validator.py index 31fa2084..a061561f 100644 --- a/src/metadata_validator.py +++ b/src/metadata_validator.py @@ -4,26 +4,207 @@ import re from bento.common.sqs import VisibilityExtender from bento.common.utils import get_logger, DATE_FORMATS -from common.constants import SQS_NAME, SQS_TYPE, SCOPE, SUBMISSION_ID, ERRORS, WARNINGS, STATUS_ERROR, ID, FAILED, \ +from common.constants import SQS_NAME, SQS_TYPE, SCOPE, SUBMISSION_ID, ERRORS, WARNINGS, STATUS_ERROR, FAILED, ID, \ STATUS_WARNING, STATUS_PASSED, STATUS, UPDATED_AT, MODEL_FILE_DIR, TIER_CONFIG, DATA_COMMON_NAME, MODEL_VERSION, \ NODE_TYPE, PROPERTIES, TYPE, MIN, MAX, VALUE_EXCLUSIVE, VALUE_PROP, VALIDATION_RESULT, ORIN_FILE_NAME, \ VALIDATED_AT, SERVICE_TYPE_METADATA, NODE_ID, PROPERTIES, PARENTS, KEY, NODE_ID, PARENT_TYPE, PARENT_ID_NAME, PARENT_ID_VAL, \ SUBMISSION_INTENTION, SUBMISSION_INTENTION_NEW_UPDATE, SUBMISSION_INTENTION_DELETE, TYPE_METADATA_VALIDATE, TYPE_CROSS_SUBMISSION, \ - SUBMISSION_REL_STATUS_RELEASED, VALIDATION_ID, VALIDATION_ENDED, CDE_TERM, TERM_CODE, TERM_VERSION, CDE_PERMISSIVE_VALUES, \ + SUBMISSION_REL_STATUS_RELEASED, VALIDATION_ID, VALIDATION_ENDED, PROPERTY_TERM, \ QC_RESULT_ID, BATCH_IDS, VALIDATION_TYPE_METADATA, S3_FILE_INFO, VALIDATION_TYPE_FILE, QC_SEVERITY, QC_VALIDATE_DATE, QC_ORIGIN, \ QC_ORIGIN_METADATA_VALIDATE_SERVICE, QC_ORIGIN_FILE_VALIDATE_SERVICE, DISPLAY_ID, UPLOADED_DATE, LATEST_BATCH_ID, SUBMITTED_ID, \ LATEST_BATCH_DISPLAY_ID, QC_VALIDATION_TYPE, DATA_RECORD_ID, PV_TERM, STUDY_ID, PROPERTY_PATTERN, DELETE_COMMAND, CONCEPT_CODE, \ - GENERATED_PROPS, DELETE_COMMAND, METADATA_VALIDATION, CONSENT_CODE_NODE_TYPE, CONSENT_CODE, CONSENT_GROUP_NUMBER, DATA_COMMONS, STUDY_ID -from common.utils import current_datetime, get_exception_msg, dump_dict_to_json, create_error, get_uuid_str + GENERATED_PROPS, 
METADATA_VALIDATION, CONSENT_CODE_NODE_TYPE, CONSENT_CODE, CONSENT_GROUP_NUMBER, NAME_PROP, \ + TYPE_METADATA_VALIDATE_BATCH, DATA_RECORD_IDS, TOTAL_BATCHES, BATCH_INDEX +from common.utils import current_datetime, get_exception_msg, create_error, get_uuid_str, has_permissive_value from common.model_store import ModelFactory from common.model_reader import valid_prop_types from service.ecs_agent import set_scale_in_protection from x_submission_validator import CrossSubmissionValidator -from pv_puller import get_pv_by_code_version +from pv_puller_v2 import get_all_pvs_by_version VISIBILITY_TIMEOUT = 20 BATCH_SIZE = 1000 -CDE_NOT_FOUND = "CDE not available" +PROPERTY_NOT_FOUND = "Permissible values not available" + + +def _process_metadata_batch(mongo_dao, model_store, configs, data): + """ + Process a single batched metadata validation message. + + Required message fields (camelCase JSON keys): + validationID, + submissionID, + scope, + dataRecordIds, + totalBatches (must be >= 1), + batchIndex + + If validationID is missing or totalBatches < 1, the message is rejected without calling the DB; + the validation will not complete and may appear stuck. + The backend must always send valid validationID and totalBatches. + + Returns the MetaDataValidator instance for cleanup by the caller, + or None if the batch message is invalid. 
+ """ + log = get_logger(__name__) + validation_id = data.get(VALIDATION_ID) + submission_id = data.get(SUBMISSION_ID) + data_record_ids = data.get(DATA_RECORD_IDS, []) + total_batches = data.get(TOTAL_BATCHES, 1) + batch_index = data.get(BATCH_INDEX, 0) + scope = data.get(SCOPE) + validated = False + validator = None + submission = None + batch_status = FAILED + status_detail = None + + if not validation_id: + log.error(f'Invalid batch message - missing validationID submission_id={submission_id}: {data}') + log.critical('Batch message rejected: missing validationID; validation may be stuck.') + return None + + if total_batches < 1: + log.error(f'Invalid batch message - total_batches must be >= 1, got {total_batches} submission_id={submission_id} validation_id={validation_id}: {data}') + log.critical(f'Batch message rejected: total_batches={total_batches}; validation may be stuck.') + return None + + batch_label = f'batch {batch_index + 1}/{total_batches}' + log.info(f'Processing {batch_label} submission_id={submission_id} validation_id={validation_id} ' + f'({len(data_record_ids)} records)') + + try: + submission = mongo_dao.get_submission(submission_id) + if not submission: + batch_status = FAILED + status_detail = f'Submission not found: {submission_id}' + log.error(f'{status_detail} validation_id={validation_id} {batch_label}') + return None + + if not scope: + batch_status = FAILED + status_detail = 'Missing required field: scope' + log.error(f'Invalid batch message - missing scope submission_id={submission_id} validation_id={validation_id} {batch_label}: {data}') + return None + + if not data_record_ids: + batch_status = FAILED + status_detail = 'Empty dataRecordIds in batch message' + log.error(f'Invalid batch message - empty dataRecordIds submission_id={submission_id} validation_id={validation_id} {batch_label}: {data}') + return None + + data_records = mongo_dao.get_dataRecords_by_ids(data_record_ids) + if not data_records: + batch_status = FAILED + 
status_detail = f'No data records found for provided IDs in batch {batch_index}' + log.error(f'{status_detail} submission_id={submission_id} validation_id={validation_id} {batch_label}') + return None + + validator = MetaDataValidator(mongo_dao, model_store, configs) + init_error = validator._initialize_for_validation(submission, submission_id, scope) + if init_error: + batch_status, status_detail = init_error + return validator + + validator.validate_nodes(data_records) + validated = True + if validator.isError: + batch_status = STATUS_ERROR + elif validator.isWarning: + batch_status = STATUS_WARNING + else: + batch_status = STATUS_PASSED + + except Exception as ve: + log.exception(f'Error validating batch submission_id={submission_id} validation_id={validation_id} {batch_label}: {ve}') + finally: + try: + completed_count, is_last_batch, failed_count, worst_status, batch_details = \ + mongo_dao.increment_completed_batches( + validation_id, total_batches, + batch_failed=(not validated), + batch_status=batch_status, + status_detail=status_detail if not validated else None, + submission_id=submission_id, + batch_index=batch_index, + ) + + if is_last_batch: + log.info(f'All {total_batches} batches complete submission_id={submission_id} validation_id={validation_id}') + final_status = worst_status + # statusDetail is a list of failure messages for batch runs, or None when no failures. 
+ final_detail = batch_details if batch_details else None + if failed_count > 0: + log.error(f'Validation submission_id={submission_id} validation_id={validation_id}: {failed_count} of {total_batches} batches failed') + validation_end_at = current_datetime() + + update_ok = mongo_dao.update_validation_status( + validation_id, final_status, validation_end_at, METADATA_VALIDATION, + status_detail=final_detail, + submission_id=submission_id, + ) + if not update_ok: + log.warning( + f'Validation submission_id={submission_id} validation_id={validation_id}: ' + 'status update reported no modification; validation record may be stale.' + ) + + if submission: + sub_doc = validator.submission if validator else submission + sub_doc[VALIDATION_ENDED] = validation_end_at + mongo_dao.set_submission_validation_status( + sub_doc, None, final_status, None, None, + status_detail=final_detail, + scope=scope, + ) + + log.info(f'Validation completed submission_id={submission_id} validation_id={validation_id} status={final_status}') + elif completed_count is not None: + log.info(f'{batch_label} complete submission_id={submission_id} validation_id={validation_id} ' + f'{total_batches - completed_count} batches remaining') + else: + log.error(f'Failed to record {batch_label} completion submission_id={submission_id} validation_id={validation_id} -- validation may be stuck') + except Exception as fe: + log.exception(f'Failed to finalize batch submission_id={submission_id} validation_id={validation_id} {batch_label}: {fe}') + + return validator + + +def _process_metadata_validation(mongo_dao, model_store, configs, data): + """Handle a standard (non-batched) metadata validation message. + + Returns the MetaDataValidator instance for cleanup by the caller, + or None if SCOPE or VALIDATION_ID is missing from data. 
+ """ + submission_id = data.get(SUBMISSION_ID) + scope = data.get(SCOPE) + validation_id = data.get(VALIDATION_ID) + if not scope or not validation_id: + log = get_logger(__name__) + log.error(f'Missing required field for metadata validation: scope={scope}, validation_id={validation_id}') + return None + validator = MetaDataValidator(mongo_dao, model_store, configs) + status = validator.validate(submission_id, scope) + validation_end_at = current_datetime() + update_status = mongo_dao.update_validation_status(validation_id, status, validation_end_at, METADATA_VALIDATION, status_detail=None, submission_id=submission_id) + if update_status: + validator.submission[VALIDATION_ENDED] = validation_end_at + mongo_dao.set_submission_validation_status(validator.submission, None, status, None, None, status_detail=None, scope=scope) + return validator + + +def _process_cross_submission(mongo_dao, data): + """Handle a cross-submission validation message. + + Returns the CrossSubmissionValidator instance for cleanup by the caller. 
+ """ + submission_id = data.get(SUBMISSION_ID) + validator = CrossSubmissionValidator(mongo_dao) + status = validator.validate(submission_id) + if validator.submission: + mongo_dao.set_submission_validation_status(validator.submission, None, None, status, None) + return validator + def metadataValidate(configs, job_queue, mongo_dao): log = get_logger('Metadata Validation Service') @@ -36,7 +217,6 @@ def metadataValidate(configs, job_queue, mongo_dao): log.exception(f'Error occurred when initialize metadata validation service: {get_exception_msg()}') return 1 - #step 3: run validator as a service log.info(f'{SERVICE_TYPE_METADATA} service started') batches_processed = 0 scale_in_protection_flag = False @@ -63,24 +243,17 @@ def metadataValidate(configs, job_queue, mongo_dao): log.debug(data) extender = VisibilityExtender(msg, VISIBILITY_TIMEOUT) submission_id = data.get(SUBMISSION_ID) + if data.get(SQS_TYPE) == TYPE_METADATA_VALIDATE and submission_id and data.get(SCOPE) and data.get(VALIDATION_ID): - scope = data[SCOPE] - validator = MetaDataValidator(mongo_dao, model_store, configs) - status = validator.validate(submission_id, scope) - validation_id = data[VALIDATION_ID] - validation_end_at = current_datetime() - update_status =mongo_dao.update_validation_status(validation_id, status, validation_end_at, METADATA_VALIDATION) - if update_status: - validator.submission[VALIDATION_ENDED] = validation_end_at - mongo_dao.set_submission_validation_status(validator.submission, None, status, None, None) + validator = _process_metadata_validation(mongo_dao, model_store, configs, data) elif data.get(SQS_TYPE) == TYPE_CROSS_SUBMISSION and submission_id: - validator = CrossSubmissionValidator(mongo_dao) - status = validator.validate(submission_id) - if validator.submission: - mongo_dao.set_submission_validation_status(validator.submission, None, None, status, None) + validator = _process_cross_submission(mongo_dao, data) + elif data.get(SQS_TYPE) == 
TYPE_METADATA_VALIDATE_BATCH and submission_id: + validator = _process_metadata_batch(mongo_dao, model_store, configs, data) + # Log and skip invalid or incomplete message (wrong type or missing required fields). else: log.error(f'Invalid message: {data}!') - log.info(f'Processed {SERVICE_TYPE_METADATA} validation for the submission: {data[SUBMISSION_ID]}!') + log.info(f'Processed {SERVICE_TYPE_METADATA} validation for the submission: {data.get(SUBMISSION_ID)}!') batches_processed += 1 msg.delete() except Exception as e: @@ -96,6 +269,8 @@ def metadataValidate(configs, job_queue, mongo_dao): except KeyboardInterrupt: log.info('Good bye!') return + + class MetaDataValidator: def __init__(self, mongo_dao, model_store, config): @@ -110,48 +285,59 @@ def __init__(self, mongo_dao, model_store, config): self.isError = None self.isWarning = None self.searched_sts = False - self.not_found_cde = False + self.not_found_property = False self.study_name = None self.program_names = None - def validate(self, submission_id, scope): - #1. # get data common from submission - submission = self.mongo_dao.get_submission(submission_id) - if not submission: - msg = f'Invalid submissionID, no submission found, {submission_id}!' - self.log.error(msg) - return FAILED - if not submission.get(DATA_COMMON_NAME): - msg = f'Invalid submission, no datacommon found, {submission_id}!' - self.log.error(msg) - return FAILED + def _initialize_for_validation(self, submission, submission_id, scope): + """Shared initialization for both batch and non-batch validation paths. + + Returns None on success, or a tuple (error_status, detail_message) on failure. 
+ """ + self.submission = submission self.submission_id = submission_id self.scope = scope - self.submission = submission - datacommon = submission.get(DATA_COMMON_NAME) - self.datacommon = datacommon - # get study name and program name(s) from submission and/or study for name validation required in CRDCDH-2431 + self.datacommon = submission.get(DATA_COMMON_NAME) + + if not self.datacommon: + msg = f'Invalid submission, no datacommon found, {submission_id}!' + self.log.error(msg) + return FAILED, msg + + model_version = submission.get(MODEL_VERSION) + self.model = self.model_store.get_model_by_data_common_version(self.datacommon, model_version) + if not self.model.model or not self.model.get_nodes(): + msg = f'{self.datacommon} model version "{model_version}" is not available.' + self.log.error(msg) + return FAILED, msg + study_id = submission.get(STUDY_ID) if not study_id: msg = f'Invalid submission, no study id found, {submission_id}!' self.log.error(msg) - return FAILED + return FAILED, msg + study = self.mongo_dao.find_study_by_id(study_id) if not study: msg = f'Invalid submission, no study found, {submission_id}!' self.log.error(msg) - return FAILED + return FAILED, msg + self.study_name = study.get("studyName") self.program_names = self.mongo_dao.find_organization_name_by_study_id(study_id) - - model_version = submission.get(MODEL_VERSION) - #2 get data model based on datacommon and version - self.model = self.model_store.get_model_by_data_common_version(datacommon, model_version) - if not self.model.model or not self.model.get_nodes(): - msg = f'{self.datacommon} model version "{model_version}" is not available.' 
- self.log.error(msg) - return STATUS_ERROR - #3 retrieve data batch by batch + return None + + def validate(self, submission_id, scope): + submission = self.mongo_dao.get_submission(submission_id) + if not submission: + self.log.error(f'Invalid submissionID, no submission found, {submission_id}!') + return FAILED + + init_error = self._initialize_for_validation(submission, submission_id, scope) + if init_error: + return init_error[0] + + # retrieve data batch by batch start_index = 0 validated_count = 0 while True: @@ -169,7 +355,6 @@ def validate(self, submission_id, scope): start_index += count def validate_nodes(self, data_records): - #2. loop through all records and call validateNode updated_records = [] qc_results = [] validated_count = 0 @@ -202,24 +387,6 @@ def validate_nodes(self, data_records): else: qc_result[WARNINGS] = [] - qc_result[QC_VALIDATE_DATE] = current_datetime() - if not qc_result: - record[QC_RESULT_ID] = None - qc_result = get_qc_result(record, VALIDATION_TYPE_METADATA, self.mongo_dao) - if errors and len(errors) > 0: - self.isError = True - qc_result[ERRORS] = errors - qc_result[QC_SEVERITY] = STATUS_ERROR - else: - qc_result[ERRORS] = [] - if warnings and len(warnings)> 0: - self.isWarning = True - qc_result[WARNINGS] = warnings - if not errors or len(errors) == 0: - qc_result[QC_SEVERITY] = STATUS_WARNING - else: - qc_result[WARNINGS] = [] - qc_result[QC_VALIDATE_DATE] = current_datetime() qc_results.append(qc_result) record[QC_RESULT_ID] = qc_result[ID] @@ -234,7 +401,6 @@ def validate_nodes(self, data_records): self.log.exception(msg) self.isError = True - #3. update data records based on record's _id if len(qc_results) > 0: result = self.mongo_dao.save_qc_results(qc_results) if not result: @@ -243,7 +409,6 @@ def validate_nodes(self, data_records): result = self.mongo_dao.update_data_records_status(updated_records) if not result: - #4. 
set errors in submission msg = f'Failed to update dataRecords for the submission, {self.submission_id} at scope, {self.scope}!' self.log.error(msg) self.isError = True @@ -571,19 +736,23 @@ def validate_relationship(self, data_record, msg_prefix): result["result"] = STATUS_PASSED return result - def get_file_consent_code(self, parent_type, parent_id_value, consent_group_parents): - # find grandparent in array of tuple (parent_type, parentIDPropName, parent_id_value) + def get_file_consent_code(self, parent_type, parent_id_value, consent_group_parents, visited=None): + if visited is None: + visited = set() + node_key = (parent_type, parent_id_value) + if node_key in visited: + self.log.warning(f'Circular parent reference detected at ({parent_type}, {parent_id_value}), skipping') + return + visited.add(node_key) grandparent_nodes = self.mongo_dao.find_grandparent_by_parent(parent_type, parent_id_value, self.submission_id, self.datacommon) if grandparent_nodes: - # check if the grandparent node is of type "consent_group" consent_groups = [item for item in grandparent_nodes if item[0] == CONSENT_CODE_NODE_TYPE] if consent_groups: - # check if the consent group is already in the list consent_group_parents.update(consent_groups) return else: for grandparent in grandparent_nodes: - self.get_file_consent_code(grandparent[0], grandparent[2], consent_group_parents) + self.get_file_consent_code(grandparent[0], grandparent[2], consent_group_parents, visited) return def get_unique_child_node_ids(self, data_common, node_type, parent_node, submission_id): @@ -623,11 +792,10 @@ def validate_prop_value(self, prop_name, value, prop_def, msg_prefix, data_recor minimum = prop_def.get(MIN) maximum = prop_def.get(MAX) - permissive_vals, msg, check_concept_code, cde_code = self.get_permissive_value(prop_def) - if msg and msg == CDE_NOT_FOUND: - errors.append(create_error("M027", [msg_prefix, prop_name], prop_name, value)) + model = self.model.get_data_commons() + permissive_vals, msg, 
check_concept_code = self.get_permissive_value(prop_def) if check_concept_code == True: - self.set_concept_code(data_record, prop_name, value, cde_code) + self.set_concept_code(data_record, prop_name, value, model) if type == "string": val = str(value) result, error, corrected_value = check_permissive(val, permissive_vals, msg_prefix, prop_name, self.mongo_dao) @@ -708,7 +876,7 @@ def validate_prop_value(self, prop_name, value, prop_def, msg_prefix, data_recor return errors - def set_concept_code(self, data_record, prop_name, value, cde_code): + def set_concept_code(self, data_record, prop_name, value, model): """ set concept code for the property """ @@ -722,7 +890,7 @@ def set_concept_code(self, data_record, prop_name, value, cde_code): concept_code_values = [] for val in values: # get concept code by the value - result = self.mongo_dao.get_concept_code_by_pv(cde_code, val.strip()) + result = self.mongo_dao.get_concept_code_by_pv(prop_name, model, val.strip()) if result and result.get(CONCEPT_CODE): concept_code_values.append(result[CONCEPT_CODE]) @@ -730,8 +898,7 @@ def set_concept_code(self, data_record, prop_name, value, cde_code): data_record[GENERATED_PROPS] = {} data_record[GENERATED_PROPS].update({property_concept_code_name: list_delimiter.join(concept_code_values)}) - - + """ get permissible values of a property """ @@ -739,48 +906,37 @@ def get_permissive_value(self, prop_def): permissive_vals = prop_def.get("permissible_values") msg = None check_concept_code = False - cde_code = None - if prop_def.get(CDE_TERM) and len(prop_def.get(CDE_TERM)) > 0: - # retrieve permissible values from DB or cde site - cde_terms = [ct for ct in prop_def[CDE_TERM] if 'caDSR' in ct.get('Origin', '')] - if cde_terms and len(cde_terms) > 0: - cde_code = cde_terms[0].get(TERM_CODE) - cde_version = cde_terms[0].get(TERM_VERSION) - if not cde_code: - return permissive_vals, msg, check_concept_code, cde_code + model = self.model.get_data_commons() + version = 
self.model.get_model_version() + prop_name = prop_def.get(NAME_PROP) + #prop_type = prop_def.get(TYPE) + + if prop_def.get(PROPERTY_TERM) and len(prop_def.get(PROPERTY_TERM)) > 0: + # retrieve permissible values from DB or property site - cde = self.mongo_dao.get_cde_permissible_values(cde_code, cde_version) - if cde: - if cde.get(CDE_PERMISSIVE_VALUES) is not None: - if len(cde.get(CDE_PERMISSIVE_VALUES)) > 0: - permissive_vals = cde[CDE_PERMISSIVE_VALUES] - check_concept_code = True - else: - permissive_vals = None + prop = self.mongo_dao.get_property_permissible_values(model, version, prop_name) + if prop: + check_concept_code, permissive_vals = has_permissive_value(prop) else: if not self.searched_sts: - cde = get_pv_by_code_version(self.config, self.log, cde_code, cde_version, self.mongo_dao) + #if there is no record for the property in DB, call STS to pull all the property under the model and version to get the permissible values and save in DB, then call mongo_dao to get the property record again. 
+ get_all_pvs_by_version(self.config, self.log, version, model, self.mongo_dao) + prop = self.mongo_dao.get_property_permissible_values(model, version, prop_name) self.searched_sts = True - if cde: - if cde.get(CDE_PERMISSIVE_VALUES) is not None: - if len(cde[CDE_PERMISSIVE_VALUES]) > 0: - permissive_vals = cde[CDE_PERMISSIVE_VALUES] - check_concept_code = True - else: - permissive_vals = None #escape validation - + if prop: + check_concept_code, permissive_vals = has_permissive_value(prop) else: - msg = CDE_NOT_FOUND - self.not_found_cde = True + msg = PROPERTY_NOT_FOUND + self.not_found_property = True else: - if self.not_found_cde: - msg = CDE_NOT_FOUND + if self.not_found_property: + msg = PROPERTY_NOT_FOUND # strip white space if the value is string if permissive_vals and len(permissive_vals) > 0 and isinstance(permissive_vals[0], str): permissive_vals = [item.strip() for item in permissive_vals] - return permissive_vals, msg, check_concept_code, cde_code + return permissive_vals, msg, check_concept_code """util functions""" diff --git a/src/pv_puller.py b/src/pv_puller.py deleted file mode 100644 index f70fb3ac..00000000 --- a/src/pv_puller.py +++ /dev/null @@ -1,322 +0,0 @@ -#!/usr/bin/env python3 -from bento.common.utils import get_logger -from common.constants import TIER_CONFIG, CDE_API_URL, CDE_CODE, CDE_VERSION, CDE_FULL_NAME, STS_API_ALL_URL, STS_API_ONE_URL, \ - CDE_PERMISSIVE_VALUES, STS_DATA_RESOURCE_CONFIG, STS_DATA_RESOURCE_API, STS_DATA_RESOURCE_FILE, STS_DUMP_CONFIG, DATA_COMMONS_LIST, HIDDEN_MODELS, KEY -from common.utils import get_exception_msg -from common.api_client import APIInvoker - -MODEL_DEFS = "models" -CADSR_DATA_ELEMENT = "DataElement" -CADSR_VALUE_DOMAIN = "ValueDomain" -CADSR_DATA_ELEMENT_LONG_NAME = "longName" -CADSR_PERMISSIVE_VALUES = "PermissibleValues" -FILE_DOWNLOAD_URL = "download_url" -FILE_NAME = "name" -FILE_TYPE = "type" -CDE_PV_NAME = "permissibleValues" -NCIT_CDE_CONCEPT_CODE = "ncit_concept_code" -NCIT_SYNONYMS = 
"synonyms" -NCIT_VALUE = "value" - -def pull_pv_lists(configs, mongo_dao): - """ - Pull permissible values and synonyms from STS and save them to the database. - - :param configs: Configuration settings for the puller. - :param mongo_dao: Data access object for MongoDB operations. - """ - log = get_logger('Permissive values and synonym puller') - api_client = APIInvoker(configs) - pv_puller = PVPuller(configs, mongo_dao, api_client) - # synonym_puller = SynonymPuller(configs, mongo_dao, api_client) - - try: - # pull pv, cde, synonym, concept codes - pv_puller.pull_cde_pv_synonym_concept_codes() - # test get CDE by code and version - # get_pv_by_code_version(configs, log, "12447172", "1.00", mongo_dao) - except (KeyboardInterrupt, SystemExit): - print("Task is stopped...") - except Exception as e: - log.critical(e) - log.critical( - f'Something wrong happened while pulling permissive values! Check debug log for details.') -class PVPuller: - """ - Class for pulling permissible values from STS and saving them to the database. 
- """ - def __init__(self, configs, mongo_dao, api_client): - self.log = get_logger('Permissive values puller') - self.mongo_dao = mongo_dao - self.configs = configs - self.api_client = api_client - self.config_model_list = self.mongo_dao.get_configuration_by_ev_var([DATA_COMMONS_LIST, HIDDEN_MODELS]) - if self.config_model_list is not None: - if len(self.config_model_list) == 2: #if both data commons list and hidden models are configured - self.pv_models = [x for x in self.config_model_list[0][KEY] if x not in self.config_model_list[1][KEY]] - else: - self.pv_models = [] - else: - self.pv_models = [] - - def pull_cde_pv_synonym_concept_codes(self): - """ - pull cde pv from STS API (CDE_API_URL) and save to db - """ - resource = self.configs[STS_DATA_RESOURCE_CONFIG] if self.configs.get(STS_DATA_RESOURCE_CONFIG) else STS_DATA_RESOURCE_API - # resource = self.configs[STS_DATA_RESOURCE_CONFIG] if self.configs.get(STS_DATA_RESOURCE_CONFIG) else STS_DATA_RESOURCE_FILE - try: - cde_records, synonym_records, concept_codes_records = retrieveAllCDEViaAPI(self.configs, self.pv_models, self.log, self.api_client) if resource == STS_DATA_RESOURCE_API \ - else retrieveAllCDEViaDumpFile(self.configs, self.log, self.api_client) - if not cde_records or len(cde_records) == 0: - self.log.info("No CDE found!") - return - self.log.info(f"{len(cde_records)} unique CDE are retrieved!") - result, msg = self.mongo_dao.upsert_cde(list(cde_records)) - if result: - self.log.info(f"CDE PV are pulled and save successfully!") - else: - self.log.error(f"Failed to pull and save CDE PV! 
{msg}") - - if not synonym_records or len(synonym_records) == 0: - self.log.info("No synonym found!") - return - self.log.info(f"{len(synonym_records)} unique synonyms are retrieved!") - result = self.mongo_dao.insert_synonyms(list(synonym_records)) - if result is not None: - self.log.info(f"CDE Synonyms are pulled and save successfully!") - - if not concept_codes_records or len(concept_codes_records) == 0: - self.log.info("No concept code found!") - return - self.log.info(f"{len(concept_codes_records)} unique concept codes are retrieved!") - result = self.mongo_dao.insert_concept_codes(list(concept_codes_records)) - if result is not None: - self.log.info(f"CDE Concept Codes are pulled and save successfully!") - self.log.info(f"All CDE PVs, Synonyms and Concept Codes are pulled and saved successfully!") - return - except Exception as e: - self.log.exception(e) - self.log.exception(f"Failed to retrieve CDE PVs.") - -def retrieveAllCDEViaAPI(configs, pv_models, log, api_client=None): - """ - extract cde from cde dump file - """ - if len(pv_models) > 0: - sts_api_url = configs[STS_API_ALL_URL] + "&" + "&".join([f"model={model}" for model in pv_models]).replace(" ","%20") - else: - sts_api_url = configs[STS_API_ALL_URL] - log.info(f"Retrieving cde from {sts_api_url}...") - if not api_client: - api_client = APIInvoker(configs) - results = api_client.get_all_data_elements(sts_api_url) - if not results or len(results) == 0: - log.error(f"No cde/pvs retrieve from STS API, {sts_api_url}.") - return None, None, None - cde_records, synonym_set, concept_code_set = process_sts_cde_pv(results, log) - log.info(f"Retrieved CDE PVs from {sts_api_url}.") - return cde_records, synonym_set, concept_code_set - -def retrieveAllCDEViaDumpFile(configs, log, api_client=None): - """ - extract cde from cde dump file - """ - sts_file_url = configs[STS_DUMP_CONFIG].format(configs[TIER_CONFIG]) - log.info(f"Retrieving cde from {sts_file_url}...") - if not api_client: - api_client = 
APIInvoker(configs) - results = api_client.get_synonyms(sts_file_url) - cde_records, synonym_set, concept_code_set = process_sts_cde_pv(results, log) - log.info(f"Retrieved CDE PVs from {sts_file_url}.") - return cde_records, synonym_set, concept_code_set - -def process_sts_cde_pv(sts_results, log, cde_only=False): - """ - get cde pv from sts api - :param sts_api_url: sts api url - """ - cde_set = set() - cde_records = [] - synonym_set = set() - concept_code_set = set() - if not sts_results or len(sts_results) == 0: - log.error(f"No cde/pvs retrieve from STS API.") - return None, None, None - cde_list = [item for item in sts_results if item.get(CDE_CODE) and item.get(CDE_CODE) != 'null'] - if not cde_list or len(cde_list) == 0: - log.error(f"No cde found in STS API results.") - return None, None, None - for item in cde_list: - code = item.get(CDE_CODE) - version = item.get(CDE_VERSION) if item.get(CDE_VERSION) and item.get(CDE_VERSION) != 'null' else None - cde_key = (code, version) - if cde_key in cde_set: - continue - cde_set.add(cde_key) - cde_record = compose_cde_record(item) - cde_records.append( - cde_record - ) - if cde_only: - continue - # extract synonyms - if item.get(CDE_PV_NAME) and len(item.get(CDE_PV_NAME)) > 0 and item.get(CDE_PV_NAME)[0].get(NCIT_SYNONYMS): - compose_synonym_record(item, synonym_set) - - # extract concept codes - if item.get(CDE_PV_NAME) and len(item.get(CDE_PV_NAME)) > 0 and item.get(CDE_PV_NAME)[0].get(NCIT_CDE_CONCEPT_CODE): - compose_concept_code_record(item, concept_code_set) - - return cde_records, synonym_set, concept_code_set - -def extract_pv_list(cde_pv_list): - """ - extract pv list from cde dump file - """ - pv_list = None - if cde_pv_list and len(cde_pv_list) > 0 and cde_pv_list[0].get(NCIT_VALUE): - pv_list = [item.get(NCIT_VALUE) for item in cde_pv_list if item.get(NCIT_VALUE) is not None] - if cde_pv_list and any(item.get(NCIT_VALUE) for item in cde_pv_list): - pv_list = [item[NCIT_VALUE] for item in cde_pv_list if 
NCIT_VALUE in item and item[NCIT_VALUE] is not None] - contains_http = any(s for s in pv_list if isinstance(s, str) and s.startswith(("http:", "https:"))) - if contains_http: - return None - # strip white space if the value is a string - if pv_list and isinstance(pv_list[0], str): - pv_list = [item.strip() for item in pv_list] - - return pv_list - -def compose_cde_record(cde_item): - """ - compose cde record from cde dump file - """ - cde_record = { - CDE_FULL_NAME: cde_item.get(CDE_FULL_NAME), - CDE_CODE: cde_item.get(CDE_CODE), - CDE_VERSION: cde_item.get(CDE_VERSION) if cde_item.get(CDE_VERSION) and cde_item.get(CDE_VERSION) != 'null' else None, - CDE_PERMISSIVE_VALUES: extract_pv_list(cde_item.get(CDE_PV_NAME)) - } - return cde_record - -def compose_synonym_record(cde_item, synonym_set): - """ - compose synonym record from cde dump file - """ - pv_list = cde_item.get(CDE_PV_NAME) - if pv_list: - for pv_item in pv_list: - synonyms = pv_item.get(NCIT_SYNONYMS) - if synonyms: - for synonym in synonyms: - if synonym: - synonym_key = (synonym, pv_item.get(NCIT_VALUE)) - if synonym_key in synonym_set: - continue - synonym_set.add(synonym_key) - return - -def compose_concept_code_record(cde_item, concept_code_set): - """ - compose concept code record from cde dump file - """ - pv_list = cde_item.get(CDE_PV_NAME) - cde_code = cde_item.get(CDE_CODE) - if pv_list: - for pv in pv_list: - value = pv.get(NCIT_VALUE) - concept_code = pv.get(NCIT_CDE_CONCEPT_CODE) - if concept_code: - concept_code_key = (cde_code, value, concept_code) - if concept_code_key in concept_code_set: - continue - concept_code_set.add(concept_code_key) - return - -def get_pv_by_code_version(configs, log, cde_code, cde_version, mongo_dao): - """ - get permissive values by cde code and version in real time - :param cde_code: cde code - :param cde_version: cde version - """ - msg = None - if cde_code is None: - msg = "CDE code is required." 
- log.error(f"Invalid CDE code.") - return None - cde_records = [] - api_client = APIInvoker(configs) - resource = configs[STS_DATA_RESOURCE_CONFIG] if configs.get(STS_DATA_RESOURCE_CONFIG) else STS_DATA_RESOURCE_API - # resource = configs[STS_DATA_RESOURCE_CONFIG] if configs.get(STS_DATA_RESOURCE_CONFIG) else STS_DATA_RESOURCE_FILE - if resource == STS_DATA_RESOURCE_API: - sts_api_url = configs[STS_API_ONE_URL] - if not sts_api_url: - msg = "STS API url is not configured." - log.error(f"Invalid STS API URL.") - return None - if not cde_version: - sts_api_url = sts_api_url.replace("/{cde_version}", "") - sts_api_url = sts_api_url.format(cde_code=cde_code) - cde_version = None - else: - sts_api_url = sts_api_url.format(cde_code=cde_code, cde_version=cde_version) - log.info(f"Retrieving cde from {sts_api_url} for {cde_code}/{cde_version}...") - try: - results = api_client.get_all_data_elements(sts_api_url) - cde_records, _, _ = process_sts_cde_pv(results, log, True) - if not cde_records or len(cde_records) == 0: - msg = f"No CDE found for {cde_code}/{cde_version}." - log.info(msg) - return None - except Exception as e: - log.exception(e) - msg = f"Failed to retrieve CDE PVs for {cde_code}/{cde_version}." - log.exception(msg) - return None - except Exception as e: - log.exception(e) - msg = f"Failed to retrieve CDE PVs for {cde_code}/{cde_version}." - log.exception(msg) - return None - else: - sts_file_url = configs[STS_DUMP_CONFIG].format(configs[TIER_CONFIG]) - log.info(f"Retrieving cde from {sts_file_url} for {cde_code}/{cde_version}...") - try: - results = api_client.get_synonyms(sts_file_url) - cde_records, _, _ = process_sts_cde_pv(results, log, True) - - except Exception as e: - log.exception(e) - msg = f"Failed to retrieve CDE PVs for {cde_code}/{cde_version}." - log.exception(msg) - return None - - if not cde_records or len(cde_records) == 0: - msg = f"No CDE found for {cde_code}/{cde_version}." 
- log.info(msg) - return None - log.info(f"{len(cde_records)} unique CDE are retrieved!") - cde_record = next((item for item in cde_records if item[CDE_CODE] == cde_code and item[CDE_VERSION] == cde_version), None) - if not cde_record: - msg = f"No CDE found for {cde_code}/{cde_version}." - log.info(msg) - return None - if not cde_records or len(cde_records) == 0: - msg = f"No CDE found for {cde_code}/{cde_version}." - log.info(msg) - return None - log.info(f"{len(cde_records)} unique CDE are retrieved!") - cde_record = next((item for item in cde_records if item[CDE_CODE] == cde_code and item[CDE_VERSION] == cde_version), None) - if not cde_record: - msg = f"No CDE found for {cde_code}/{cde_version}." - log.info(msg) - return None - log.info(f"Retrieved CDE for {cde_code}/{cde_version}.") - # save cde pv to db - result, _ = mongo_dao.upsert_cde([cde_record]) - if result: - log.info(f"CED PV are pulled and save successfully!") - else: - log.error(f"Failed to pull and save CDE PV! {msg}") - return cde_record \ No newline at end of file diff --git a/src/pv_puller_v2.py b/src/pv_puller_v2.py new file mode 100644 index 00000000..6e16b155 --- /dev/null +++ b/src/pv_puller_v2.py @@ -0,0 +1,329 @@ +#!/usr/bin/env python3 +from bento.common.utils import get_logger +from common.constants import STS_API_ALL_URL_V2, STS_API_ONE_URL_V2, \ + PROPERTY_PERMISSIBLE_VALUES, STS_DATA_RESOURCE_CONFIG, STS_DATA_RESOURCE_API, DATA_COMMONS_LIST, HIDDEN_MODELS, KEY, PROPERTY, MODEL, VERSION +from common.utils import get_exception_msg +from common.api_client import APIInvoker +import re + +MODEL_DEFS = "models" +CADSR_DATA_ELEMENT = "DataElement" +CADSR_VALUE_DOMAIN = "ValueDomain" +CADSR_DATA_ELEMENT_LONG_NAME = "longName" +CADSR_PERMISSIVE_VALUES = "PermissibleValues" +FILE_DOWNLOAD_URL = "download_url" +FILE_NAME = "name" +FILE_TYPE = "type" +PROPERTY_PV_NAME = "permissibleValues" +NCIT_PROPERTY_CONCEPT_CODE = "ncit_concept_code" +NCIT_SYNONYMS = "synonyms" +NCIT_VALUE = "value" + +def 
pull_pv_lists_v2(configs, mongo_dao): + """ + Pull permissible values and synonyms from STS and save them to the database. + + :param configs: Configuration settings for the puller. + :param mongo_dao: Data access object for MongoDB operations. + """ + log = get_logger('Permissive values and synonym puller') + api_client = APIInvoker(configs) + pv_puller = PVPullerV2(configs, mongo_dao, api_client) + # synonym_puller = SynonymPuller(configs, mongo_dao, api_client) + + try: + # pull pv, property, synonym, concept codes + pv_puller.pull_property_pv_synonym_concept_codes() + except (KeyboardInterrupt, SystemExit): + print("Task is stopped...") + except Exception as e: + log.critical(e) + log.critical( + f'Something wrong happened while pulling permissive values! Check debug log for details.') +class PVPullerV2: + """ + Class for pulling permissible values from STS and saving them to the database. + """ + def __init__(self, configs, mongo_dao, api_client): + self.log = get_logger('Permissive values puller') + self.mongo_dao = mongo_dao + self.configs = configs + self.api_client = api_client + self.config_model_list = self.mongo_dao.get_configuration_by_ev_var([DATA_COMMONS_LIST, HIDDEN_MODELS]) + if self.config_model_list is not None: + if len(self.config_model_list) == 2: #if both data commons list and hidden models are configured + self.pv_models = [x for x in self.config_model_list[0][KEY] if x not in self.config_model_list[1][KEY]] + else: + self.pv_models = [] + else: + self.pv_models = [] + + def pull_property_pv_synonym_concept_codes(self): + """ + pull property pv from STS API (STS_API_ALL_URL_V2) and save to db + """ + resource = self.configs[STS_DATA_RESOURCE_CONFIG] if self.configs.get(STS_DATA_RESOURCE_CONFIG) else STS_DATA_RESOURCE_API + # resource = self.configs[STS_DATA_RESOURCE_CONFIG] if self.configs.get(STS_DATA_RESOURCE_CONFIG) else STS_DATA_RESOURCE_FILE + try: + property_records, synonym_records, concept_codes_records = 
retrieveAllPropertyViaAPI(self.configs, self.pv_models, self.log, self.api_client) + if not property_records or len(property_records) == 0: + self.log.info("No property found!") + return + self.log.info(f"{len(property_records)} unique property are retrieved!") + result, msg = self.mongo_dao.upsert_property_pv(list(property_records)) + if result: + self.log.info(f"Property PV are pulled and save successfully!") + else: + self.log.error(f"Failed to pull and save Property PV! {msg}") + + if not synonym_records or len(synonym_records) == 0: + self.log.info("No synonym found!") + return + self.log.info(f"{len(synonym_records)} unique synonyms are retrieved!") + result = self.mongo_dao.insert_synonyms(list(synonym_records)) + if result is not None: + self.log.info(f"Property Synonyms are pulled and save successfully!") + + if not concept_codes_records or len(concept_codes_records) == 0: + self.log.info("No concept code found!") + return + self.log.info(f"{len(concept_codes_records)} unique concept codes are retrieved!") + result = self.mongo_dao.insert_concept_codes_v2(list(concept_codes_records)) + if result is not None: + self.log.info(f"Property Concept Codes are pulled and save successfully!") + self.log.info(f"All property PVs, Synonyms and Concept Codes are pulled and saved successfully!") + return + except Exception as e: + self.log.exception(e) + self.log.exception(f"Failed to retrieve property PVs.") + +def retrieveAllPropertyViaAPI(configs, pv_models, log, api_client=None): + """ + extract property from api + """ + sts_api_url_list = [] + if len(pv_models) > 0: + for pv_model in pv_models: + sts_api_url_list.append(f"{configs[STS_API_ALL_URL_V2]}/{pv_model}") + else: + raise Exception("No model configured for pulling property PVs.") + log.info(f"Retrieving property from {sts_api_url_list}...") + if not api_client: + api_client = APIInvoker(configs) + results_list = api_client.get_all_data_elements_v2(sts_api_url_list) + results = [] + for result in 
results_list: + if result is not None: + results.extend(result) + if not results or len(results) == 0: + log.error(f"No property/pvs retrieve from STS API, {sts_api_url_list}.") + return None, None, None + property_records, synonym_set, concept_code_set = process_sts_property_pv(results, log) + log.info(f"Retrieved property PVs from {sts_api_url_list}.") + return property_records, synonym_set, concept_code_set + + +def process_sts_property_pv(sts_results, log, property_only=False): + """ + get property pv from sts api + :param sts_api_url: sts api url + """ + property_set = set() + property_records = [] + synonym_set = set() + concept_code_set = set() + if not sts_results or len(sts_results) == 0: + log.error(f"No property/pvs retrieve from STS API.") + return None, None, None + property_list = [item for item in sts_results if item.get(PROPERTY) and item.get(PROPERTY) != 'null'] + if not property_list or len(property_list) == 0: + log.error(f"No property found in STS API results.") + return None, None, None + for item in property_list: + property_name = item.get(PROPERTY) + model = item.get(MODEL) if item.get(MODEL) and item.get(MODEL) != 'null' else None + version = item.get(VERSION) if item.get(VERSION) and item.get(VERSION) != 'null' else None + version = re.match(r'[\d.]+', version).group() + property_key = (property_name, model, version) + if property_key in property_set: + continue + property_set.add(property_key) + property_record = compose_property_record(item) + property_records.append(property_record) + if property_only: + continue + # extract synonyms + if item.get(PROPERTY_PV_NAME) and len(item.get(PROPERTY_PV_NAME)) > 0 and item.get(PROPERTY_PV_NAME)[0].get(NCIT_SYNONYMS): + compose_synonym_record(item, synonym_set) + + # extract concept codes + if item.get(PROPERTY_PV_NAME) and len(item.get(PROPERTY_PV_NAME)) > 0 and item.get(PROPERTY_PV_NAME)[0].get(NCIT_PROPERTY_CONCEPT_CODE): + compose_concept_code_record(item, concept_code_set) + + return 
property_records, synonym_set, concept_code_set + +def extract_pv_list(property_pv_list): + """ + extract pv list from property pv list + """ + pv_list = [] + if property_pv_list is None: + pv_list = None + if property_pv_list and any(item.get(NCIT_VALUE) for item in property_pv_list): + pv_list = [item[NCIT_VALUE] for item in property_pv_list if NCIT_VALUE in item and item[NCIT_VALUE] is not None] + # strip white space if the value is a string + if pv_list and isinstance(pv_list[0], str): + pv_list = [item.strip() for item in pv_list] + + return pv_list + +def compose_property_record(property_item): + """ + compose property record from property item + """ + property_record = { + PROPERTY: property_item.get(PROPERTY), + MODEL: property_item.get(MODEL), + VERSION: re.match(r'[\d.]+', property_item.get(VERSION)).group() if property_item.get(VERSION) and property_item.get(VERSION) != 'null' else None, + PROPERTY_PERMISSIBLE_VALUES: extract_pv_list(property_item.get(PROPERTY_PV_NAME)) + } + return property_record + +def compose_synonym_record(property_item, synonym_set): + """ + compose synonym record from property item + Synonym terms are normalized to lowercase before deduplication and DB insert. 
+ """ + pv_list = property_item.get(PROPERTY_PV_NAME) + if pv_list: + for pv_item in pv_list: + synonyms = pv_item.get(NCIT_SYNONYMS) + if synonyms: + for synonym in synonyms: + if synonym: + term = str(synonym).strip().lower() + if not term: + continue + synonym_key = (term, pv_item.get(NCIT_VALUE)) + if synonym_key in synonym_set: + continue + synonym_set.add(synonym_key) + return + +def compose_concept_code_record(property_item, concept_code_set): + """ + compose concept code record from property item + """ + pv_list = property_item.get(PROPERTY_PV_NAME) + model = property_item.get(MODEL) + property_name = property_item.get(PROPERTY) + if pv_list: + for pv in pv_list: + value = pv.get(NCIT_VALUE) + concept_code = pv.get(NCIT_PROPERTY_CONCEPT_CODE) + if concept_code: + concept_code_key = (model, property_name, value, concept_code) + if concept_code_key in concept_code_set: + continue + concept_code_set.add(concept_code_key) + return + +def get_pv_by_property_version(configs, log, prop, prop_version, prop_model, mongo_dao): + """ + get all permissive values by property,version, and model + """ + msg = None + prop_records = [] + api_client = APIInvoker(configs) + #resource = configs[STS_DATA_RESOURCE_CONFIG] if configs.get(STS_DATA_RESOURCE_CONFIG) else STS_DATA_RESOURCE_API + # resource = configs[STS_DATA_RESOURCE_CONFIG] if configs.get(STS_DATA_RESOURCE_CONFIG) else STS_DATA_RESOURCE_FILE + sts_api_url = configs[STS_API_ONE_URL_V2] + if not sts_api_url: + msg = "STS API url is not configured." 
+ log.error(f"Invalid STS API URL.") + return None + if not prop_version: + sts_api_url = sts_api_url.replace("/version={prop_version}", "") + sts_api_url = sts_api_url.format(property=prop) + prop_version = None + else: + sts_api_url = sts_api_url.format(model=prop_model, property=prop, version=prop_version) + log.info(f"Retrieving property values from {sts_api_url} for {prop}/{prop_version}...") + try: + results = api_client.get_all_data_elements(sts_api_url) + prop_records, _, _ = process_sts_property_pv(results, log, True) + if not prop_records or len(prop_records) == 0: + msg = f"No property found for {prop}/{prop_version}." + log.info(msg) + return None + except Exception as e: + log.exception(e) + msg = f"Failed to retrieve property PVs for {prop}/{prop_version}." + log.exception(msg) + return None + + log.info(f"{len(prop_records)} unique properties are retrieved!") + prop_record = next((item for item in prop_records if item[PROPERTY] == prop and item[MODEL] == prop_model and item[VERSION] == prop_version), None) + if not prop_record: + msg = f"No property found for {prop}/{prop_version}." + log.info(msg) + return None + log.info(f"Retrieved property for {prop}/{prop_version}.") + # save property pv to db + result, _ = mongo_dao.upsert_property_pv([prop_record]) + get_all_pvs_by_version(configs, log, prop_version, prop_model, mongo_dao) + if result: + log.info(f"Property PV are pulled and save successfully!") + else: + log.error(f"Failed to pull and save property PV! 
{msg}") + return prop_record + +def get_all_pvs_by_version(configs, log, version, model, mongo_dao): + api_client = APIInvoker(configs) + sts_api_url = configs[STS_API_ONE_URL_V2] + if not sts_api_url: + log.error(f"Invalid STS API URL.") + return None + sts_api_url = f"{configs[STS_API_ALL_URL_V2]}/{model}/?version={version}" + results = api_client.get_all_data_elements(sts_api_url) + prop_records, synonym_records, concept_codes_records = process_sts_property_pv(results, log, False) + # process_sts_property_pv returns (None, None, None) when nothing is found, so check before saving + if not prop_records or len(prop_records) == 0: + log.info("No property found!") + return None + result_pv, _ = mongo_dao.upsert_property_pv(prop_records) + if result_pv: + log.info(f"Property PV for {model}/{version} are pulled and save successfully!") + else: + log.error(f"Failed to pull and save Property PV!") + + if not synonym_records or len(synonym_records) == 0: + log.info("No synonym found!") + else: + log.info(f"{len(synonym_records)} unique synonyms are retrieved!") + result_synonyms = mongo_dao.insert_synonyms(list(synonym_records)) + if result_synonyms is not None: + log.info(f"Property Synonyms for {model}/{version} are pulled and save successfully!") + + if not concept_codes_records or len(concept_codes_records) == 0: + log.info("No concept code found!") + else: + log.info(f"{len(concept_codes_records)} unique concept codes are retrieved!") + result_concept_codes = mongo_dao.insert_concept_codes_v2(list(concept_codes_records)) + if result_concept_codes is not None: + log.info(f"Property Concept Codes for {model}/{version} are pulled and save successfully!") + log.info(f"All property PVs, Synonyms and Concept Codes for {model}/{version} are pulled and saved successfully!") + + + diff --git a/src/unit_test/__init__.py b/src/test/__init__.py similarity index 100% rename from src/unit_test/__init__.py rename to src/test/__init__.py diff --git a/src/unit_test/file_records.json b/src/test/file_records.json similarity index 100% rename from src/unit_test/file_records.json rename to 
src/test/file_records.json diff --git a/src/test/test_mongo_dao_find_organization.py b/src/test/test_mongo_dao_find_organization.py new file mode 100644 index 00000000..b3fbd63d --- /dev/null +++ b/src/test/test_mongo_dao_find_organization.py @@ -0,0 +1,130 @@ +import pytest +from unittest.mock import MagicMock, patch +from common.mongo_dao import MongoDao +from common.constants import STUDY_COLLECTION, ORGANIZATION_COLLECTION +from pymongo import errors + + +def _setup_mock_db(mock_client_class): + mock_client = MagicMock() + mock_client_class.return_value = mock_client + mock_db = MagicMock() + mock_client.__getitem__.return_value = mock_db + mock_study_collection = MagicMock() + mock_org_collection = MagicMock() + + def db_getitem(key): + if key == STUDY_COLLECTION: + return mock_study_collection + if key == ORGANIZATION_COLLECTION: + return mock_org_collection + return MagicMock() + + mock_db.__getitem__.side_effect = db_getitem + return mock_study_collection, mock_org_collection + + +@patch("common.mongo_dao.MongoClient") +def test_find_organization_name_success(mock_client_class): + study_collection, org_collection = _setup_mock_db(mock_client_class) + + study_id = "study_123" + program_id = "program_456" + org_name = "Test Organization" + + study_collection.find_one.return_value = {"_id": study_id, "programID": program_id} + org_collection.find_one.return_value = {"_id": program_id, "name": org_name} + + dao = MongoDao("mongodb://localhost:27017", "test_db") + result = dao.find_organization_name_by_study_id(study_id) + + assert result == [org_name] + + +@patch("common.mongo_dao.MongoClient") +def test_find_organization_name_with_special_chars(mock_client_class): + study_collection, org_collection = _setup_mock_db(mock_client_class) + + special_name = "University of Science & Technology (UST) - Division" + study_collection.find_one.return_value = {"_id": "study_123", "programID": "prog_1"} + org_collection.find_one.return_value = {"_id": "prog_1", "name": 
special_name} + + dao = MongoDao("mongodb://localhost:27017", "test_db") + assert dao.find_organization_name_by_study_id("study_123") == [special_name] + + +@patch("common.mongo_dao.MongoClient") +def test_find_organization_name_study_not_found(mock_client_class): + study_collection, org_collection = _setup_mock_db(mock_client_class) + study_collection.find_one.return_value = None + + dao = MongoDao("mongodb://localhost:27017", "test_db") + assert dao.find_organization_name_by_study_id("nonexistent_study") is None + + +@patch("common.mongo_dao.MongoClient") +def test_find_organization_name_program_not_found(mock_client_class): + study_collection, org_collection = _setup_mock_db(mock_client_class) + study_collection.find_one.return_value = {"_id": "study_1", "programID": "p1"} + org_collection.find_one.return_value = None + + dao = MongoDao("mongodb://localhost:27017", "test_db") + assert dao.find_organization_name_by_study_id("study_1") is None + + +@patch("common.mongo_dao.MongoClient") +def test_find_organization_name_pymongo_error_on_study(mock_client_class): + study_collection, org_collection = _setup_mock_db(mock_client_class) + study_collection.find_one.side_effect = errors.PyMongoError("Connection error") + + dao = MongoDao("mongodb://localhost:27017", "test_db") + assert dao.find_organization_name_by_study_id("study_123") is None + + +@patch("common.mongo_dao.MongoClient") +def test_find_organization_name_pymongo_error_on_organization(mock_client_class): + study_collection, org_collection = _setup_mock_db(mock_client_class) + study_collection.find_one.return_value = {"_id": "study_123", "programID": "p1"} + org_collection.find_one.side_effect = errors.PyMongoError("Connection error") + + dao = MongoDao("mongodb://localhost:27017", "test_db") + assert dao.find_organization_name_by_study_id("study_123") is None + + +@patch("common.mongo_dao.MongoClient") +def test_find_organization_name_generic_exception_on_study(mock_client_class): + study_collection, 
org_collection = _setup_mock_db(mock_client_class) + study_collection.find_one.side_effect = Exception("Unexpected error") + + dao = MongoDao("mongodb://localhost:27017", "test_db") + assert dao.find_organization_name_by_study_id("study_123") is None + + +@patch("common.mongo_dao.MongoClient") +def test_find_organization_name_generic_exception_on_organization(mock_client_class): + study_collection, org_collection = _setup_mock_db(mock_client_class) + study_collection.find_one.return_value = {"_id": "study_123", "programID": "p1"} + org_collection.find_one.side_effect = Exception("Unexpected error") + + dao = MongoDao("mongodb://localhost:27017", "test_db") + assert dao.find_organization_name_by_study_id("study_123") is None + + +@patch("common.mongo_dao.MongoClient") +def test_find_organization_name_empty_organization_name(mock_client_class): + study_collection, org_collection = _setup_mock_db(mock_client_class) + study_collection.find_one.return_value = {"_id": "study_123", "programID": "p1"} + org_collection.find_one.return_value = {"_id": "p1", "name": ""} + + dao = MongoDao("mongodb://localhost:27017", "test_db") + assert dao.find_organization_name_by_study_id("study_123") == [""] + + +@patch("common.mongo_dao.MongoClient") +def test_find_organization_name_multiple_studies_same_program(mock_client_class): + study_collection, org_collection = _setup_mock_db(mock_client_class) + study_collection.find_one.return_value = {"_id": "study_1", "programID": "program_shared"} + org_collection.find_one.return_value = {"_id": "program_shared", "name": "Shared Organization"} + + dao = MongoDao("mongodb://localhost:27017", "test_db") + assert dao.find_organization_name_by_study_id("study_1") == ["Shared Organization"] \ No newline at end of file diff --git a/src/test/test_mongo_dao_set_submission_validation_status.py b/src/test/test_mongo_dao_set_submission_validation_status.py new file mode 100644 index 00000000..cef11b37 --- /dev/null +++ 
b/src/test/test_mongo_dao_set_submission_validation_status.py @@ -0,0 +1,126 @@ +"""Unit tests for MongoDao.set_submission_validation_status, especially scope='new' only-update-if-worse behavior.""" +import pytest +from unittest.mock import MagicMock, patch + +from common.mongo_dao import MongoDao +from common.constants import ( + SUBMISSION_COLLECTION, + DATA_COLLECTION, + ID, + SUBMISSION_ID, + METADATA_VALIDATION_STATUS, + VALIDATION_ENDED, + FILE_ERRORS, + STATUS_ERROR, + STATUS_PASSED, + STATUS_WARNING, +) + + +def _setup_mock_db(mock_client_class): + mock_client = MagicMock() + mock_client_class.return_value = mock_client + mock_db = MagicMock() + mock_client.__getitem__.return_value = mock_db + mock_submission_collection = MagicMock() + mock_submission_collection.update_one.return_value = MagicMock(matched_count=1) + + def db_getitem(key): + if key == SUBMISSION_COLLECTION: + return mock_submission_collection + return MagicMock() + + mock_db.__getitem__.side_effect = db_getitem + return mock_submission_collection + + +@patch("common.mongo_dao.MongoClient") +def test_scope_new_current_worse_than_new_do_not_overwrite(mock_client_class): + """scope='new', current status Error, new Passed: do not overwrite (keep Error).""" + mock_submission_collection = _setup_mock_db(mock_client_class) + dao = MongoDao("mongodb://localhost:27017", "test_db") + with patch.object(dao, "count_docs", return_value=0): + submission = {ID: "sub_1", METADATA_VALIDATION_STATUS: STATUS_ERROR} + dao.set_submission_validation_status( + submission, None, STATUS_PASSED, None, None, status_detail=None, scope="new" + ) + update_one_call = mock_submission_collection.update_one.call_args + set_payload = update_one_call[0][1]["$set"] + assert set_payload[METADATA_VALIDATION_STATUS] == STATUS_ERROR + + +@patch("common.mongo_dao.MongoClient") +def test_scope_new_new_worse_than_current_update(mock_client_class): + """scope='new', current Passed, new Error: update to Error.""" + 
mock_submission_collection = _setup_mock_db(mock_client_class) + dao = MongoDao("mongodb://localhost:27017", "test_db") + with patch.object(dao, "count_docs", return_value=0): + submission = {ID: "sub_1", METADATA_VALIDATION_STATUS: STATUS_PASSED} + dao.set_submission_validation_status( + submission, None, STATUS_ERROR, None, None, status_detail=None, scope="new" + ) + update_one_call = mock_submission_collection.update_one.call_args + set_payload = update_one_call[0][1]["$set"] + assert set_payload[METADATA_VALIDATION_STATUS] == STATUS_ERROR + + +@patch("common.mongo_dao.MongoClient") +def test_scope_new_current_none_treat_as_passed_passed(mock_client_class): + """scope='new', no current status (treated as Passed): write Passed.""" + mock_submission_collection = _setup_mock_db(mock_client_class) + dao = MongoDao("mongodb://localhost:27017", "test_db") + with patch.object(dao, "count_docs", return_value=0): + submission = {ID: "sub_1"} + dao.set_submission_validation_status( + submission, None, STATUS_PASSED, None, None, status_detail=None, scope="new" + ) + update_one_call = mock_submission_collection.update_one.call_args + set_payload = update_one_call[0][1]["$set"] + assert set_payload[METADATA_VALIDATION_STATUS] == STATUS_PASSED + + +@patch("common.mongo_dao.MongoClient") +def test_scope_new_current_none_treat_as_passed_warning(mock_client_class): + """scope='new', no current status (treated as Passed): write Warning when new is Warning.""" + mock_submission_collection = _setup_mock_db(mock_client_class) + dao = MongoDao("mongodb://localhost:27017", "test_db") + with patch.object(dao, "count_docs", return_value=0): + submission = {ID: "sub_1"} + dao.set_submission_validation_status( + submission, None, STATUS_WARNING, None, None, status_detail=None, scope="new" + ) + update_one_call = mock_submission_collection.update_one.call_args + set_payload = update_one_call[0][1]["$set"] + assert set_payload[METADATA_VALIDATION_STATUS] == STATUS_WARNING + + 
+@patch("common.mongo_dao.MongoClient") +def test_file_errors_persisted_without_file_status(mock_client_class): + """fileErrors updates FILE_ERRORS when file_status is None (e.g. delete-metadata orphan F008).""" + mock_submission_collection = _setup_mock_db(mock_client_class) + dao = MongoDao("mongodb://localhost:27017", "test_db") + file_err = [{"submittedID": "orphan.csv", "errors": [{"code": "F008"}]}] + with patch.object(dao, "count_docs", return_value=0): + with patch.object(dao.s3_service, "submissionHasDataFile", return_value=False): + submission = {ID: "sub_1", METADATA_VALIDATION_STATUS: STATUS_PASSED} + dao.set_submission_validation_status( + submission, None, STATUS_PASSED, None, file_err, is_delete=True + ) + update_one_call = mock_submission_collection.update_one.call_args + set_payload = update_one_call[0][1]["$set"] + assert set_payload[FILE_ERRORS] == file_err + + +@patch("common.mongo_dao.MongoClient") +def test_scope_all_overwrites_regardless(mock_client_class): + """scope='all': overwrite current status (no only-if-worse rule).""" + mock_submission_collection = _setup_mock_db(mock_client_class) + dao = MongoDao("mongodb://localhost:27017", "test_db") + with patch.object(dao, "count_docs", return_value=0): + submission = {ID: "sub_1", METADATA_VALIDATION_STATUS: STATUS_ERROR} + dao.set_submission_validation_status( + submission, None, STATUS_PASSED, None, None, status_detail=None, scope="all" + ) + update_one_call = mock_submission_collection.update_one.call_args + set_payload = update_one_call[0][1]["$set"] + assert set_payload[METADATA_VALIDATION_STATUS] == STATUS_PASSED diff --git a/src/test/test_pv_puller_v2.py b/src/test/test_pv_puller_v2.py new file mode 100644 index 00000000..764e920f --- /dev/null +++ b/src/test/test_pv_puller_v2.py @@ -0,0 +1,1131 @@ +import pytest +from unittest.mock import MagicMock, patch, call +from pv_puller_v2 import ( + PVPullerV2, + pull_pv_lists_v2, + retrieveAllPropertyViaAPI, + process_sts_property_pv, + 
extract_pv_list, + compose_property_record, + compose_synonym_record, + compose_concept_code_record, + get_pv_by_property_version, + get_all_pvs_by_version +) +from common.constants import ( + STS_API_ALL_URL_V2, + STS_API_ONE_URL_V2, + PROPERTY_PERMISSIBLE_VALUES, + STS_DATA_RESOURCE_CONFIG, + STS_DATA_RESOURCE_API, + DATA_COMMONS_LIST, + HIDDEN_MODELS, + KEY, + PROPERTY, + MODEL, + VERSION +) + + +# ==================== Fixtures ==================== + +@pytest.fixture +def mock_mongo_dao(): + """Mock MongoDB Data Access Object""" + dao = MagicMock() + dao.get_configuration_by_ev_var.return_value = [ + {KEY: ["model1", "model2", "model3"]}, + {KEY: ["model3"]} + ] + return dao + + +@pytest.fixture +def mock_api_client(): + """Mock API Client""" + return MagicMock() + + +@pytest.fixture +def mock_configs(): + """Mock configuration dictionary""" + return { + STS_API_ALL_URL_V2: "https://sts-api.example.com/v1/property", + STS_API_ONE_URL_V2: "https://sts-api.example.com/v1/property/{property}/version={version}", + STS_DATA_RESOURCE_CONFIG: "api", + } + + +@pytest.fixture +def mock_logger(): + """Mock logger""" + logger = MagicMock() + return logger + + +# ==================== Test PVPullerV2 Class ==================== + +@patch('pv_puller_v2.get_logger') +def test_pv_puller_v2_init_with_data_commons_and_hidden_models(mock_get_logger, mock_mongo_dao, mock_api_client, mock_configs): + """Test PVPullerV2 initialization with data commons list and hidden models""" + mock_get_logger.return_value = MagicMock() + + puller = PVPullerV2(mock_configs, mock_mongo_dao, mock_api_client) + + assert puller.configs == mock_configs + assert puller.mongo_dao == mock_mongo_dao + assert puller.api_client == mock_api_client + assert puller.pv_models == ["model1", "model2"] # model3 is hidden + + +@patch('pv_puller_v2.get_logger') +def test_pv_puller_v2_init_no_configuration(mock_get_logger, mock_api_client, mock_configs): + """Test PVPullerV2 initialization with no configuration""" + 
mock_mongo_dao = MagicMock() + mock_mongo_dao.get_configuration_by_ev_var.return_value = None + mock_get_logger.return_value = MagicMock() + + puller = PVPullerV2(mock_configs, mock_mongo_dao, mock_api_client) + + assert puller.pv_models == [] + + +@patch('pv_puller_v2.get_logger') +def test_pv_puller_v2_init_only_one_config(mock_get_logger, mock_api_client, mock_configs): + """Test PVPullerV2 initialization with only one configuration item""" + mock_mongo_dao = MagicMock() + mock_mongo_dao.get_configuration_by_ev_var.return_value = [{KEY: ["model1", "model2"]}] + mock_get_logger.return_value = MagicMock() + + puller = PVPullerV2(mock_configs, mock_mongo_dao, mock_api_client) + + assert puller.pv_models == [] + + +@patch('pv_puller_v2.retrieveAllPropertyViaAPI') +@patch('pv_puller_v2.get_logger') +def test_pull_property_pv_synonym_concept_codes_success( + mock_get_logger, + mock_retrieve_api, + mock_mongo_dao, + mock_api_client, + mock_configs +): + """Test successful retrieval and storage of properties, PVs, synonyms, and concept codes""" + mock_logger = MagicMock() + mock_get_logger.return_value = mock_logger + + # Mock successful API retrieval + property_records = [ + {PROPERTY: "prop1", MODEL: "model1", VERSION: "1.0", PROPERTY_PERMISSIBLE_VALUES: ["val1", "val2"]}, + {PROPERTY: "prop2", MODEL: "model1", VERSION: "1.0", PROPERTY_PERMISSIBLE_VALUES: ["val3"]} + ] + synonym_records = {("synonym1", "val1"), ("synonym2", "val2")} + concept_code_records = {("model1", "prop1", "val1", "code1")} + + mock_retrieve_api.return_value = (property_records, synonym_records, concept_code_records) + mock_mongo_dao.upsert_property_pv.return_value = (True, "Success") + mock_mongo_dao.insert_synonyms.return_value = True + mock_mongo_dao.insert_concept_codes_v2.return_value = True + + puller = PVPullerV2(mock_configs, mock_mongo_dao, mock_api_client) + puller.pull_property_pv_synonym_concept_codes() + + # Verify API was called + mock_retrieve_api.assert_called_once() + + # Verify 
MongoDB upsert was called + mock_mongo_dao.upsert_property_pv.assert_called_once_with(property_records) + + # Verify synonyms were inserted + mock_mongo_dao.insert_synonyms.assert_called_once_with(list(synonym_records)) + + # Verify concept codes were inserted + mock_mongo_dao.insert_concept_codes_v2.assert_called_once_with(list(concept_code_records)) + + +@patch('pv_puller_v2.retrieveAllPropertyViaAPI') +@patch('pv_puller_v2.get_logger') +def test_pull_property_pv_synonym_concept_codes_no_properties( + mock_get_logger, + mock_retrieve_api, + mock_mongo_dao, + mock_api_client, + mock_configs +): + """Test behavior when no properties are retrieved""" + mock_logger = MagicMock() + mock_get_logger.return_value = mock_logger + + mock_retrieve_api.return_value = (None, None, None) + + puller = PVPullerV2(mock_configs, mock_mongo_dao, mock_api_client) + puller.pull_property_pv_synonym_concept_codes() + + # Verify no database operations when no properties + mock_mongo_dao.upsert_property_pv.assert_not_called() + + +@patch('pv_puller_v2.retrieveAllPropertyViaAPI') +@patch('pv_puller_v2.get_logger') +def test_pull_property_pv_synonym_concept_codes_upsert_failure( + mock_get_logger, + mock_retrieve_api, + mock_mongo_dao, + mock_api_client, + mock_configs +): + """Test handling of upsert failure""" + mock_logger = MagicMock() + mock_get_logger.return_value = mock_logger + + property_records = [{PROPERTY: "prop1", MODEL: "model1", VERSION: "1.0"}] + + mock_retrieve_api.return_value = (property_records, set(), set()) + mock_mongo_dao.upsert_property_pv.return_value = (False, "Database error") + + puller = PVPullerV2(mock_configs, mock_mongo_dao, mock_api_client) + puller.pull_property_pv_synonym_concept_codes() + + # Verify error logging + mock_logger.error.assert_called() + + +# ==================== Test retrieveAllPropertyViaAPI ==================== + +@patch('pv_puller_v2.APIInvoker') +@patch('pv_puller_v2.process_sts_property_pv') +def 
test_retrieve_all_property_via_api_success( + mock_process_sts, + mock_api_invoker_class, + mock_configs, + mock_logger +): + """Test successful API retrieval of all properties""" + mock_api_client = MagicMock() + mock_api_invoker_class.return_value = mock_api_client + + api_results = [{"property": "prop1"}, {"property": "prop2"}] + processed_records = ( + [{"property": "prop1"}, {"property": "prop2"}], + set(), + set() + ) + + mock_api_client.get_all_data_elements_v2.return_value = [api_results] + mock_process_sts.return_value = processed_records + + pv_models = ["model1", "model2"] + result = retrieveAllPropertyViaAPI(mock_configs, pv_models, mock_logger) + + assert result == processed_records + mock_api_client.get_all_data_elements_v2.assert_called_once() + + +@patch('pv_puller_v2.process_sts_property_pv') +def test_retrieve_all_property_via_api_with_explicit_client( + mock_process_sts, + mock_api_client, + mock_configs, + mock_logger +): + """Test API retrieval with explicitly provided client""" + api_results = [{"property": "prop1"}] + processed_records = ([{"property": "prop1"}], set(), set()) + + mock_api_client.get_all_data_elements_v2.return_value = [api_results] + mock_process_sts.return_value = processed_records + + pv_models = ["model1"] + result = retrieveAllPropertyViaAPI(mock_configs, pv_models, mock_logger, mock_api_client) + + assert result == processed_records + + +@patch('pv_puller_v2.process_sts_property_pv') +def test_retrieve_all_property_via_api_empty_models( + mock_process_sts, + mock_api_client, + mock_configs, + mock_logger +): + """Test error when no models are configured""" + with pytest.raises(Exception) as exc_info: + retrieveAllPropertyViaAPI(mock_configs, [], mock_logger, mock_api_client) + + assert "No model configured" in str(exc_info.value) + + +@patch('pv_puller_v2.process_sts_property_pv') +def test_retrieve_all_property_via_api_no_results( + mock_process_sts, + mock_api_client, + mock_configs, + mock_logger +): + """Test 
handling when API returns no results""" + mock_api_client.get_all_data_elements_v2.return_value = [None, None] + + pv_models = ["model1", "model2"] + + # Should return None from process_sts_property_pv when no results + mock_process_sts.return_value = (None, None, None) + + result = retrieveAllPropertyViaAPI(mock_configs, pv_models, mock_logger, mock_api_client) + + assert result == (None, None, None) + + +# ==================== Test process_sts_property_pv ==================== + +def test_process_sts_property_pv_success(mock_logger): + """Test successful processing of STS API results""" + sts_results = [ + { + PROPERTY: "Property1", + MODEL: "Model1", + VERSION: "1.0.0", + "permissibleValues": [ + { + "value": "value1", + "ncit_concept_code": "C12345", + "synonyms": ["syn1", "syn2"] + } + ] + } + ] + + property_records, synonym_set, concept_code_set = process_sts_property_pv(sts_results, mock_logger) + + assert len(property_records) == 1 + assert property_records[0][PROPERTY] == "Property1" + assert len(synonym_set) == 2 + assert len(concept_code_set) == 1 + + +def test_process_sts_property_pv_empty_results(mock_logger): + """Test processing with empty results""" + result = process_sts_property_pv([], mock_logger) + + assert result == (None, None, None) + + +def test_process_sts_property_pv_null_property(mock_logger): + """Test processing with null property values""" + sts_results = [ + { + PROPERTY: "null", + MODEL: "Model1", + VERSION: "1.0.0" + } + ] + + result = process_sts_property_pv(sts_results, mock_logger) + + assert result == (None, None, None) + + +def test_process_sts_property_pv_duplicate_properties(mock_logger): + """Test that duplicate properties are not added to records""" + sts_results = [ + { + PROPERTY: "Property1", + MODEL: "Model1", + VERSION: "1.0.0", + "permissibleValues": [] + }, + { + PROPERTY: "Property1", + MODEL: "Model1", + VERSION: "1.0.0", + "permissibleValues": [] + } + ] + + property_records, _, _ = 
process_sts_property_pv(sts_results, mock_logger) + + assert len(property_records) == 1 + + +def test_process_sts_property_pv_property_only_flag(mock_logger): + """Test processing with property_only flag""" + sts_results = [ + { + PROPERTY: "Property1", + MODEL: "Model1", + VERSION: "1.0.0", + "permissibleValues": [ + { + "value": "value1", + "synonyms": ["syn1"] + } + ] + } + ] + + property_records, synonym_set, concept_code_set = process_sts_property_pv( + sts_results, mock_logger, property_only=True + ) + + assert len(property_records) == 1 + assert len(synonym_set) == 0 + assert len(concept_code_set) == 0 + + +# ==================== Test extract_pv_list ==================== + +def test_extract_pv_list_valid_values(): + """Test extraction of valid permissible values""" + pv_list = [ + {"value": "value1"}, + {"value": "value2"}, + {"value": "value3"} + ] + + result = extract_pv_list(pv_list) + + assert result == ["value1", "value2", "value3"] + + +def test_extract_pv_list_with_whitespace(): + """Test extraction with whitespace stripping""" + pv_list = [ + {"value": " value1 "}, + {"value": "value2\n"} + ] + + result = extract_pv_list(pv_list) + + assert result == ["value1", "value2"] + + +def test_extract_pv_list_empty(): + """Test extraction from empty list""" + result = extract_pv_list([]) + + assert result == [] + + +def test_extract_pv_list_none_values(): + """Test extraction with None values""" + pv_list = [ + {"value": "value1"}, + {"value": None}, + {"value": "value2"} + ] + + result = extract_pv_list(pv_list) + + assert result == ["value1", "value2"] + + +# ==================== Test compose_property_record ==================== + +def test_compose_property_record_complete(): + """Test composing a complete property record""" + property_item = { + PROPERTY: "Property1", + MODEL: "Model1", + VERSION: "1.0.0", + "permissibleValues": [ + {"value": "val1"}, + {"value": "val2"} + ] + } + + record = compose_property_record(property_item) + + assert 
record[PROPERTY] == "Property1" + assert record[MODEL] == "Model1" + assert record[VERSION] == "1.0.0" + assert len(record[PROPERTY_PERMISSIBLE_VALUES]) == 2 + + +def test_compose_property_record_with_complex_version(): + """Test version extraction from complex version string""" + property_item = { + PROPERTY: "Property1", + MODEL: "Model1", + VERSION: "1.2.3.4.alpha.rc1", + "permissibleValues": [] + } + + record = compose_property_record(property_item) + + # The regex [\d.]+ matches all consecutive digits and dots + assert record[VERSION] == "1.2.3.4." + + +def test_compose_property_record_null_version(): + """Test handling of null version""" + property_item = { + PROPERTY: "Property1", + MODEL: "Model1", + VERSION: "null", + "permissibleValues": [] + } + + record = compose_property_record(property_item) + + assert record[VERSION] is None + + +# ==================== Test compose_synonym_record ==================== + +def test_compose_synonym_record_success(): + """Test successful synonym record composition""" + property_item = { + "permissibleValues": [ + { + "value": "val1", + "synonyms": ["syn1", "syn2"] + }, + { + "value": "val2", + "synonyms": ["syn3"] + } + ] + } + + synonym_set = set() + compose_synonym_record(property_item, synonym_set) + + assert len(synonym_set) == 3 + assert ("syn1", "val1") in synonym_set + assert ("syn2", "val1") in synonym_set + assert ("syn3", "val2") in synonym_set + + +def test_compose_synonym_record_no_synonyms(): + """Test when there are no synonyms""" + property_item = { + "permissibleValues": [ + {"value": "val1"} + ] + } + + synonym_set = set() + compose_synonym_record(property_item, synonym_set) + + assert len(synonym_set) == 0 + + +def test_compose_synonym_record_duplicate_synonyms(): + """Test that duplicate synonyms are not added""" + property_item = { + "permissibleValues": [ + { + "value": "val1", + "synonyms": ["syn1", "syn1"] + } + ] + } + + synonym_set = set() + compose_synonym_record(property_item, synonym_set) + + # 
Sets automatically handle duplicates + assert len(synonym_set) == 1 + + +def test_compose_synonym_record_lowercases_term(): + """Synonym terms are stored lowercase; case variants dedupe.""" + property_item = { + "permissibleValues": [ + { + "value": "val1", + "synonyms": ["Foo", " FOO ", "foo"] + } + ] + } + synonym_set = set() + compose_synonym_record(property_item, synonym_set) + assert len(synonym_set) == 1 + assert ("foo", "val1") in synonym_set + + +def test_compose_synonym_record_none_pv_list(): + """Test with None permissibleValues list""" + property_item = {"permissibleValues": None} + + synonym_set = set() + compose_synonym_record(property_item, synonym_set) + + assert len(synonym_set) == 0 + + +# ==================== Test compose_concept_code_record ==================== + +def test_compose_concept_code_record_success(): + """Test successful concept code record composition""" + property_item = { + MODEL: "Model1", + PROPERTY: "Property1", + "permissibleValues": [ + { + "value": "val1", + "ncit_concept_code": "C12345" + }, + { + "value": "val2", + "ncit_concept_code": "C67890" + } + ] + } + + concept_code_set = set() + compose_concept_code_record(property_item, concept_code_set) + + assert len(concept_code_set) == 2 + assert ("Model1", "Property1", "val1", "C12345") in concept_code_set + assert ("Model1", "Property1", "val2", "C67890") in concept_code_set + + +def test_compose_concept_code_record_no_concept_codes(): + """Test when there are no concept codes""" + property_item = { + MODEL: "Model1", + PROPERTY: "Property1", + "permissibleValues": [ + {"value": "val1"} + ] + } + + concept_code_set = set() + compose_concept_code_record(property_item, concept_code_set) + + assert len(concept_code_set) == 0 + + +def test_compose_concept_code_record_none_pv_list(): + """Test with None permissibleValues list""" + property_item = { + MODEL: "Model1", + PROPERTY: "Property1", + "permissibleValues": None + } + + concept_code_set = set() + 
compose_concept_code_record(property_item, concept_code_set) + + assert len(concept_code_set) == 0 + + +# ==================== Test get_pv_by_property_version ==================== + +@patch('pv_puller_v2.APIInvoker') +@patch('pv_puller_v2.process_sts_property_pv') +def test_get_pv_by_property_version_success( + mock_process_sts, + mock_api_invoker_class, + mock_mongo_dao, + mock_configs, + mock_logger +): + """Test successful retrieval of PVs by property and version""" + mock_api_client = MagicMock() + mock_api_invoker_class.return_value = mock_api_client + + property_records = [ + {PROPERTY: "prop1", MODEL: "model1", VERSION: "1.0", PROPERTY_PERMISSIBLE_VALUES: ["val1"]} + ] + mock_api_client.get_all_data_elements.return_value = [{"property": "prop1"}] + mock_process_sts.return_value = (property_records, set(), set()) + + mock_mongo_dao.upsert_property_pv.return_value = (True, "Success") + + result = get_pv_by_property_version(mock_configs, mock_logger, "prop1", "1.0", "model1", mock_mongo_dao) + + assert result == property_records[0] + + +@patch('pv_puller_v2.APIInvoker') +def test_get_pv_by_property_version_no_api_url( + mock_api_invoker_class, + mock_mongo_dao, + mock_logger +): + """Test error when STS API URL is not configured""" + # Create a config with STS_API_ONE_URL_V2 key but empty value + configs = {STS_API_ONE_URL_V2: ""} + + result = get_pv_by_property_version(configs, mock_logger, "prop1", "1.0", "model1", mock_mongo_dao) + + assert result is None + + +@patch('pv_puller_v2.get_all_pvs_by_version') +@patch('pv_puller_v2.APIInvoker') +@patch('pv_puller_v2.process_sts_property_pv') +def test_get_pv_by_property_version_no_version( + mock_process_sts, + mock_api_invoker_class, + mock_get_all_pvs, + mock_mongo_dao, + mock_logger +): + """When version is None, no record matches (API returns records with concrete version e.g. 
"1.0"), so function returns None.""" + mock_api_client = MagicMock() + mock_api_invoker_class.return_value = mock_api_client + + # Use URL format that matches the replace pattern in the code: /version={prop_version} + configs = { + STS_API_ONE_URL_V2: "https://sts-api.example.com/v1/property/{property}/version={prop_version}", + STS_API_ALL_URL_V2: "https://sts-api.example.com/v1/property" + } + + property_records = [ + {PROPERTY: "prop1", MODEL: "model1", VERSION: "1.0", PROPERTY_PERMISSIBLE_VALUES: ["val1"]} + ] + + mock_api_client.get_all_data_elements.return_value = [{"property": "prop1"}] + mock_process_sts.return_value = (property_records, set(), set()) + mock_mongo_dao.upsert_property_pv.return_value = (True, "Success") + + result = get_pv_by_property_version(configs, mock_logger, "prop1", None, "model1", mock_mongo_dao) + + # When prop_version is None, next() at line 274 finds no match (item[VERSION]="1.0" != None), + # so the function correctly returns None. + assert result is None + + +# ==================== Test pull_pv_lists_v2 ==================== + +@patch('pv_puller_v2.PVPullerV2') +@patch('pv_puller_v2.APIInvoker') +@patch('pv_puller_v2.get_logger') +def test_pull_pv_lists_v2_success( + mock_get_logger, + mock_api_invoker_class, + mock_puller_class, + mock_mongo_dao, + mock_configs +): + """Test successful pull_pv_lists_v2 execution""" + mock_logger = MagicMock() + mock_get_logger.return_value = mock_logger + + mock_puller = MagicMock() + mock_puller_class.return_value = mock_puller + + pull_pv_lists_v2(mock_configs, mock_mongo_dao) + + # Verify PVPullerV2 was instantiated + mock_puller_class.assert_called_once_with(mock_configs, mock_mongo_dao, mock_api_invoker_class.return_value) + + # Verify pull method was called + mock_puller.pull_property_pv_synonym_concept_codes.assert_called_once() + + +@patch('pv_puller_v2.PVPullerV2') +@patch('pv_puller_v2.APIInvoker') +@patch('pv_puller_v2.get_logger') +def test_pull_pv_lists_v2_keyboard_interrupt( + 
mock_get_logger, + mock_api_invoker_class, + mock_puller_class, + mock_mongo_dao, + mock_configs +): + """Test handling of KeyboardInterrupt""" + mock_logger = MagicMock() + mock_get_logger.return_value = mock_logger + + mock_puller = MagicMock() + mock_puller_class.return_value = mock_puller + mock_puller.pull_property_pv_synonym_concept_codes.side_effect = KeyboardInterrupt() + + # Should not raise exception + pull_pv_lists_v2(mock_configs, mock_mongo_dao) + + +@patch('pv_puller_v2.PVPullerV2') +@patch('pv_puller_v2.APIInvoker') +@patch('pv_puller_v2.get_logger') +def test_pull_pv_lists_v2_exception( + mock_get_logger, + mock_api_invoker_class, + mock_puller_class, + mock_mongo_dao, + mock_configs +): + """Test handling of general exceptions""" + mock_logger = MagicMock() + mock_get_logger.return_value = mock_logger + + mock_puller = MagicMock() + mock_puller_class.return_value = mock_puller + mock_puller.pull_property_pv_synonym_concept_codes.side_effect = Exception("Test error") + + # Should catch exception and log it + pull_pv_lists_v2(mock_configs, mock_mongo_dao) + + # Verify error was logged + mock_logger.critical.assert_called() + + +# ==================== Test get_all_pvs_by_version ==================== + +@patch('pv_puller_v2.APIInvoker') +@patch('pv_puller_v2.process_sts_property_pv') +def test_get_all_pvs_by_version_success( + mock_process_sts, + mock_api_invoker_class, + mock_mongo_dao, + mock_configs, + mock_logger +): + """Test successful retrieval and save of all PVs, synonyms, and concept codes""" + mock_api_client = MagicMock() + mock_api_invoker_class.return_value = mock_api_client + + property_records = [ + {PROPERTY: "prop1", MODEL: "model1", VERSION: "1.0", PROPERTY_PERMISSIBLE_VALUES: ["val1", "val2"]} + ] + synonym_records = {("syn1", "val1"), ("syn2", "val2")} + concept_code_records = {("model1", "prop1", "val1", "C123"), ("model1", "prop1", "val2", "C456")} + + mock_api_client.get_all_data_elements.return_value = [{"property": "prop1"}] + 
mock_process_sts.return_value = (property_records, synonym_records, concept_code_records) + + mock_mongo_dao.upsert_property_pv.return_value = (True, "Success") + mock_mongo_dao.insert_synonyms.return_value = True + mock_mongo_dao.insert_concept_codes_v2.return_value = True + + get_all_pvs_by_version(mock_configs, mock_logger, "1.0", "model1", mock_mongo_dao) + + # Verify API was called with correct URL + expected_url = f"{mock_configs[STS_API_ALL_URL_V2]}/model1/?version=1.0" + mock_api_client.get_all_data_elements.assert_called_once_with(expected_url) + + # Verify data was processed + mock_process_sts.assert_called_once() + + # Verify data was saved + mock_mongo_dao.upsert_property_pv.assert_called_once_with(property_records) + mock_mongo_dao.insert_synonyms.assert_called_once_with(list(synonym_records)) + mock_mongo_dao.insert_concept_codes_v2.assert_called_once_with(list(concept_code_records)) + + +@patch('pv_puller_v2.APIInvoker') +def test_get_all_pvs_by_version_no_api_url( + mock_api_invoker_class, + mock_mongo_dao, + mock_logger +): + """Test behavior when STS API URL is not configured""" + configs = {STS_API_ONE_URL_V2: ""} # Empty string means not configured + + result = get_all_pvs_by_version(configs, mock_logger, "1.0", "model1", mock_mongo_dao) + + assert result is None + mock_logger.error.assert_called_with("Invalid STS API URL.") + + +@patch('pv_puller_v2.APIInvoker') +@patch('pv_puller_v2.process_sts_property_pv') +def test_get_all_pvs_by_version_no_properties( + mock_process_sts, + mock_api_invoker_class, + mock_mongo_dao, + mock_configs, + mock_logger +): + """Test behavior when no properties are retrieved""" + mock_api_client = MagicMock() + mock_api_invoker_class.return_value = mock_api_client + + mock_api_client.get_all_data_elements.return_value = [] + mock_process_sts.return_value = (None, set(), set()) + + mock_mongo_dao.upsert_property_pv.return_value = (False, "Failed") + + result = get_all_pvs_by_version(mock_configs, mock_logger, "1.0", 
"model1", mock_mongo_dao) + + # Should return None when no properties + assert result is None + + +@patch('pv_puller_v2.APIInvoker') +@patch('pv_puller_v2.process_sts_property_pv') +def test_get_all_pvs_by_version_no_synonyms( + mock_process_sts, + mock_api_invoker_class, + mock_mongo_dao, + mock_configs, + mock_logger +): + """Test behavior when no synonyms are found""" + mock_api_client = MagicMock() + mock_api_invoker_class.return_value = mock_api_client + + property_records = [ + {PROPERTY: "prop1", MODEL: "model1", VERSION: "1.0", PROPERTY_PERMISSIBLE_VALUES: ["val1"]} + ] + + mock_api_client.get_all_data_elements.return_value = [{"property": "prop1"}] + mock_process_sts.return_value = (property_records, set(), set()) # Empty synonyms set + + mock_mongo_dao.upsert_property_pv.return_value = (True, "Success") + mock_mongo_dao.insert_synonyms.return_value = True + + get_all_pvs_by_version(mock_configs, mock_logger, "1.0", "model1", mock_mongo_dao) + + # Verify "No synonym found!" was logged + assert any("No synonym found!" 
in str(call) for call in mock_logger.info.call_args_list) + + +@patch('pv_puller_v2.APIInvoker') +@patch('pv_puller_v2.process_sts_property_pv') +def test_get_all_pvs_by_version_no_concept_codes( + mock_process_sts, + mock_api_invoker_class, + mock_mongo_dao, + mock_configs, + mock_logger +): + """Test behavior when no concept codes are found""" + mock_api_client = MagicMock() + mock_api_invoker_class.return_value = mock_api_client + + property_records = [ + {PROPERTY: "prop1", MODEL: "model1", VERSION: "1.0", PROPERTY_PERMISSIBLE_VALUES: ["val1"]} + ] + synonym_records = {("syn1", "val1")} + + mock_api_client.get_all_data_elements.return_value = [{"property": "prop1"}] + mock_process_sts.return_value = (property_records, synonym_records, set()) # Empty concept codes set + + mock_mongo_dao.upsert_property_pv.return_value = (True, "Success") + mock_mongo_dao.insert_synonyms.return_value = True + mock_mongo_dao.insert_concept_codes_v2.return_value = True + + get_all_pvs_by_version(mock_configs, mock_logger, "1.0", "model1", mock_mongo_dao) + + # Verify "No concept code found!" was logged + assert any("No concept code found!" 
in str(call) for call in mock_logger.info.call_args_list) + + +@patch('pv_puller_v2.APIInvoker') +@patch('pv_puller_v2.process_sts_property_pv') +def test_get_all_pvs_by_version_upsert_failure( + mock_process_sts, + mock_api_invoker_class, + mock_mongo_dao, + mock_configs, + mock_logger +): + """Test handling when upsert of properties fails""" + mock_api_client = MagicMock() + mock_api_invoker_class.return_value = mock_api_client + + property_records = [ + {PROPERTY: "prop1", MODEL: "model1", VERSION: "1.0", PROPERTY_PERMISSIBLE_VALUES: ["val1"]} + ] + + mock_api_client.get_all_data_elements.return_value = [{"property": "prop1"}] + mock_process_sts.return_value = (property_records, set(), set()) + + mock_mongo_dao.upsert_property_pv.return_value = (False, "Database error") + + get_all_pvs_by_version(mock_configs, mock_logger, "1.0", "model1", mock_mongo_dao) + + # Verify error was logged + mock_logger.error.assert_called_with("Failed to pull and save Property PV!") + + +@patch('pv_puller_v2.APIInvoker') +@patch('pv_puller_v2.process_sts_property_pv') +def test_get_all_pvs_by_version_insert_synonyms_none( + mock_process_sts, + mock_api_invoker_class, + mock_mongo_dao, + mock_configs, + mock_logger +): + """Test handling when insert_synonyms returns None""" + mock_api_client = MagicMock() + mock_api_invoker_class.return_value = mock_api_client + + property_records = [ + {PROPERTY: "prop1", MODEL: "model1", VERSION: "1.0", PROPERTY_PERMISSIBLE_VALUES: ["val1"]} + ] + synonym_records = {("syn1", "val1")} + + mock_api_client.get_all_data_elements.return_value = [{"property": "prop1"}] + mock_process_sts.return_value = (property_records, synonym_records, set()) + + mock_mongo_dao.upsert_property_pv.return_value = (True, "Success") + mock_mongo_dao.insert_synonyms.return_value = None # Returns None indicating failure + + get_all_pvs_by_version(mock_configs, mock_logger, "1.0", "model1", mock_mongo_dao) + + # Verify insert_synonyms was called + 
mock_mongo_dao.insert_synonyms.assert_called_once() + + +@patch('pv_puller_v2.APIInvoker') +@patch('pv_puller_v2.process_sts_property_pv') +def test_get_all_pvs_by_version_insert_concept_codes_none( + mock_process_sts, + mock_api_invoker_class, + mock_mongo_dao, + mock_configs, + mock_logger +): + """Test handling when insert_concept_codes_v2 returns None""" + mock_api_client = MagicMock() + mock_api_invoker_class.return_value = mock_api_client + + property_records = [ + {PROPERTY: "prop1", MODEL: "model1", VERSION: "1.0", PROPERTY_PERMISSIBLE_VALUES: ["val1"]} + ] + synonym_records = {("syn1", "val1")} + concept_code_records = {("model1", "prop1", "val1", "C123")} + + mock_api_client.get_all_data_elements.return_value = [{"property": "prop1"}] + mock_process_sts.return_value = (property_records, synonym_records, concept_code_records) + + mock_mongo_dao.upsert_property_pv.return_value = (True, "Success") + mock_mongo_dao.insert_synonyms.return_value = True + mock_mongo_dao.insert_concept_codes_v2.return_value = None # Returns None indicating failure + + get_all_pvs_by_version(mock_configs, mock_logger, "1.0", "model1", mock_mongo_dao) + + # Verify insert_concept_codes_v2 was called + mock_mongo_dao.insert_concept_codes_v2.assert_called_once() + + +@patch('pv_puller_v2.APIInvoker') +@patch('pv_puller_v2.process_sts_property_pv') +def test_get_all_pvs_by_version_api_exception( + mock_process_sts, + mock_api_invoker_class, + mock_mongo_dao, + mock_configs, + mock_logger +): + """Test handling of API exceptions""" + mock_api_client = MagicMock() + mock_api_invoker_class.return_value = mock_api_client + + # API throws exception + mock_api_client.get_all_data_elements.side_effect = Exception("API connection error") + + # Should raise exception (not caught by this function) + with pytest.raises(Exception) as exc_info: + get_all_pvs_by_version(mock_configs, mock_logger, "1.0", "model1", mock_mongo_dao) + + assert "API connection error" in str(exc_info.value) + + 
+@patch('pv_puller_v2.APIInvoker') +@patch('pv_puller_v2.process_sts_property_pv') +def test_get_all_pvs_by_version_multiple_records( + mock_process_sts, + mock_api_invoker_class, + mock_mongo_dao, + mock_configs, + mock_logger +): + """Test with multiple properties, synonyms, and concept codes""" + mock_api_client = MagicMock() + mock_api_invoker_class.return_value = mock_api_client + + property_records = [ + {PROPERTY: "prop1", MODEL: "model1", VERSION: "1.0", PROPERTY_PERMISSIBLE_VALUES: ["val1"]}, + {PROPERTY: "prop2", MODEL: "model1", VERSION: "1.0", PROPERTY_PERMISSIBLE_VALUES: ["val2"]}, + {PROPERTY: "prop3", MODEL: "model1", VERSION: "1.0", PROPERTY_PERMISSIBLE_VALUES: ["val3"]} + ] + synonym_records = {("syn1", "val1"), ("syn2", "val1"), ("syn3", "val2"), ("syn4", "val3")} + concept_code_records = { + ("model1", "prop1", "val1", "C111"), + ("model1", "prop2", "val2", "C222"), + ("model1", "prop3", "val3", "C333") + } + + mock_api_client.get_all_data_elements.return_value = [ + {"property": "prop1"}, + {"property": "prop2"}, + {"property": "prop3"} + ] + mock_process_sts.return_value = (property_records, synonym_records, concept_code_records) + + mock_mongo_dao.upsert_property_pv.return_value = (True, "Success") + mock_mongo_dao.insert_synonyms.return_value = True + mock_mongo_dao.insert_concept_codes_v2.return_value = True + + get_all_pvs_by_version(mock_configs, mock_logger, "1.0", "model1", mock_mongo_dao) + + # Verify all data was saved + mock_mongo_dao.upsert_property_pv.assert_called_once_with(property_records) + assert len(mock_mongo_dao.insert_synonyms.call_args[0][0]) == 4 + assert len(mock_mongo_dao.insert_concept_codes_v2.call_args[0][0]) == 3 + + +@patch('pv_puller_v2.APIInvoker') +@patch('pv_puller_v2.process_sts_property_pv') +def test_get_all_pvs_by_version_empty_results_from_api( + mock_process_sts, + mock_api_invoker_class, + mock_mongo_dao, + mock_configs, + mock_logger +): + """Test handling when API returns empty results""" + 
mock_api_client = MagicMock() + mock_api_invoker_class.return_value = mock_api_client + + mock_api_client.get_all_data_elements.return_value = None + mock_process_sts.return_value = (None, None, None) + + mock_mongo_dao.upsert_property_pv.return_value = (False, "No data") + + result = get_all_pvs_by_version(mock_configs, mock_logger, "1.0", "model1", mock_mongo_dao) + + # Should return None when no properties + assert result is None + + +# ==================== Integration Tests ==================== + +@patch('pv_puller_v2.get_logger') +def test_pv_puller_integration_full_flow( + mock_get_logger, + mock_mongo_dao, + mock_api_client, + mock_configs +): + """Integration test for full PVPuller flow""" + mock_logger = MagicMock() + mock_get_logger.return_value = mock_logger + + # Setup API client mock to return realistic data + api_results = [{ + PROPERTY: "test_property", + MODEL: "test_model", + VERSION: "1.0.0", + "permissibleValues": [ + { + "value": "test_value", + "synonyms": ["test_synonym"], + "ncit_concept_code": "C123" + } + ] + }] + + mock_api_client.get_all_data_elements_v2.return_value = [api_results] + + # Setup MongoDB mock + mock_mongo_dao.upsert_property_pv.return_value = (True, "Success") + mock_mongo_dao.insert_synonyms.return_value = True + mock_mongo_dao.insert_concept_codes_v2.return_value = True + + # Create puller and run + puller = PVPullerV2(mock_configs, mock_mongo_dao, mock_api_client) + puller.pull_property_pv_synonym_concept_codes() + + # Verify expected calls + assert mock_mongo_dao.upsert_property_pv.called + assert mock_mongo_dao.insert_synonyms.called + assert mock_mongo_dao.insert_concept_codes_v2.called diff --git a/src/test/validators/test_cross_submission_validator.py b/src/test/validators/test_cross_submission_validator.py index 00a14a6d..45c9a784 100644 --- a/src/test/validators/test_cross_submission_validator.py +++ b/src/test/validators/test_cross_submission_validator.py @@ -8,9 +8,9 @@ current_directory = os.getcwd() 
sys.path.insert(0, current_directory + '/src') -from src.x_submission_validator import CrossSubmissionValidator -from src.common.constants import ( - STATUS_ERROR, STATUS_PASSED, FAILED, DATA_COMMON_NAME, STATUS, +from x_submission_validator import CrossSubmissionValidator +from common.constants import ( + STATUS_ERROR, STATUS_PASSED, DATA_COMMON_NAME, STATUS, SUBMISSION_STATUS_SUBMITTED, SUBMISSION_REL_STATUS_RELEASED, STUDY_ID, NODE_TYPE, NODE_ID, ORIN_FILE_NAME, ADDITION_ERRORS, UPDATED_AT, VALIDATED_AT ) @@ -72,7 +72,7 @@ def test_validate_submission_not_found(self, validator, mock_mongo_dao): result = validator.validate('non-existent-submission') - assert result == FAILED + assert result == STATUS_ERROR mock_mongo_dao.get_submission.assert_called_once_with('non-existent-submission') def test_validate_submission_no_data_commons(self, validator, mock_mongo_dao): @@ -86,7 +86,7 @@ def test_validate_submission_no_data_commons(self, validator, mock_mongo_dao): result = validator.validate('test-submission') - assert result == FAILED + assert result == STATUS_ERROR def test_validate_submission_invalid_status(self, validator, mock_mongo_dao): """Test validate when submission has invalid status.""" @@ -100,7 +100,7 @@ def test_validate_submission_invalid_status(self, validator, mock_mongo_dao): result = validator.validate('test-submission') - assert result == FAILED + assert result == STATUS_ERROR def test_validate_submission_no_metadata(self, validator, mock_mongo_dao, valid_submission): """Test validate when submission has no metadata records.""" @@ -109,7 +109,7 @@ def test_validate_submission_no_metadata(self, validator, mock_mongo_dao, valid_ result = validator.validate('test-submission') - assert result == FAILED + assert result == STATUS_ERROR mock_mongo_dao.get_dataRecords_chunk.assert_called_once() def test_validate_success_no_conflicts(self, validator, mock_mongo_dao, valid_submission, sample_data_records): @@ -267,7 +267,7 @@ def 
test_validate_released_submission(self, validator, mock_mongo_dao, sample_da assert result == STATUS_PASSED - @patch('src.x_submission_validator.current_datetime') + @patch('x_submission_validator.current_datetime') def test_validate_nodes_timestamps(self, mock_datetime, validator, mock_mongo_dao, valid_submission, sample_data_records): """Test that validate_nodes sets proper timestamps.""" mock_datetime.return_value = '2024-01-01T12:00:00Z' @@ -297,7 +297,7 @@ def test_validate_empty_data_commons(self, validator, mock_mongo_dao): result = validator.validate('test-submission') - assert result == FAILED + assert result == STATUS_ERROR def test_validate_none_data_commons(self, validator, mock_mongo_dao): """Test validate when dataCommons field is None.""" @@ -311,7 +311,7 @@ def test_validate_none_data_commons(self, validator, mock_mongo_dao): result = validator.validate('test-submission') - assert result == FAILED + assert result == STATUS_ERROR def test_validate_missing_data_commons_key(self, validator, mock_mongo_dao): """Test validate when dataCommons key is missing from submission.""" @@ -324,4 +324,4 @@ def test_validate_missing_data_commons_key(self, validator, mock_mongo_dao): result = validator.validate('test-submission') - assert result == FAILED + assert result == STATUS_ERROR diff --git a/src/test/validators/test_essential_validator_delete_metadata.py b/src/test/validators/test_essential_validator_delete_metadata.py new file mode 100644 index 00000000..f04edad8 --- /dev/null +++ b/src/test/validators/test_essential_validator_delete_metadata.py @@ -0,0 +1,219 @@ +"""Unit tests for essential_validator TYPE_DELETE (Delete Metadata) flow: deleteOrphanedDataFiles and F008 orphan errors.""" +import json +import os +import sys +from unittest.mock import MagicMock, patch + +import pytest + +_this_dir = os.path.dirname(os.path.abspath(__file__)) +_project_root = os.path.dirname(os.path.dirname(os.path.dirname(_this_dir))) +sys.path.insert(0, 
os.path.join(_project_root, "src")) + +from common import constants +from essential_validator import essentialValidate + + +def _make_delete_message(overrides=None): + payload = { + constants.SQS_TYPE: constants.TYPE_DELETE, + constants.SUBMISSION_ID: "sub-1", + constants.NODE_TYPE: "Subject", + constants.NODE_IDS: ["n1", "n2"], + constants.DELETE_ALL: False, + constants.EXCLUSIVE_IDS: [], + } + if overrides is not None: + payload.update(overrides) + msg = MagicMock() + msg.body = json.dumps(payload) + return msg + + +def _run_one_delete_message(configs, job_queue, mongo_dao, msg): + """Run essentialValidate until one Delete Metadata message is processed, then break.""" + job_queue.receiveMsgs.side_effect = [[msg], KeyboardInterrupt] + with patch("essential_validator.ModelFactory"): + with patch("essential_validator.set_scale_in_protection"): + try: + essentialValidate(configs, job_queue, mongo_dao) + except KeyboardInterrupt: + pass + + +@pytest.fixture +def mock_configs(): + return { + constants.SQS_NAME: "test-queue", + constants.MODEL_FILE_DIR: "/tmp/models", + constants.TIER_CONFIG: None, + } + + +@pytest.fixture +def mock_mongo_dao(): + dao = MagicMock() + dao.search_nodes_by_type_and_submission.return_value = ["n1", "n2"] + return dao + + +class TestDeleteMessageDeleteOrphanedDataFiles: + """deleteOrphanedDataFiles default and passing to MetadataRemover.""" + + def test_message_without_delete_orphaned_data_files_calls_remove_metadata_with_false( + self, mock_configs, mock_mongo_dao + ): + """When message omits deleteOrphanedDataFiles, remove_metadata is called with delete_orphaned_data_files=False.""" + msg = _make_delete_message() + job_queue = MagicMock() + job_queue.receiveMsgs.side_effect = [[msg], KeyboardInterrupt] + + with patch("essential_validator.ModelFactory"): + with patch("essential_validator.set_scale_in_protection"): + with patch("essential_validator.MetadataRemover") as mock_remover_class: + mock_remover = MagicMock() + 
mock_remover.submission = {constants.ID: "sub-1", constants.METADATA_VALIDATION_STATUS: constants.STATUS_PASSED} + mock_remover.remove_metadata.return_value = (True, []) + mock_remover_class.return_value = mock_remover + try: + essentialValidate(mock_configs, job_queue, mock_mongo_dao) + except KeyboardInterrupt: + pass + + mock_remover.remove_metadata.assert_called_once() + # remove_metadata(submission_id, node_type, node_ids, delete_orphaned_data_files) + call_args = mock_remover.remove_metadata.call_args[0] + assert call_args[3] is False + + def test_message_with_delete_orphaned_data_files_true_calls_remove_metadata_with_true( + self, mock_configs, mock_mongo_dao + ): + """When message has deleteOrphanedDataFiles true, remove_metadata is called with True.""" + msg = _make_delete_message({constants.DELETE_ORPHANED_DATA_FILES: True}) + job_queue = MagicMock() + job_queue.receiveMsgs.side_effect = [[msg], KeyboardInterrupt] + + with patch("essential_validator.ModelFactory"): + with patch("essential_validator.set_scale_in_protection"): + with patch("essential_validator.MetadataRemover") as mock_remover_class: + mock_remover = MagicMock() + mock_remover.submission = {constants.ID: "sub-1", constants.METADATA_VALIDATION_STATUS: constants.STATUS_PASSED} + mock_remover.remove_metadata.return_value = (True, []) + mock_remover_class.return_value = mock_remover + try: + essentialValidate(mock_configs, job_queue, mock_mongo_dao) + except KeyboardInterrupt: + pass + + call_args = mock_remover.remove_metadata.call_args[0] + assert call_args[3] is True + + +class TestDeleteSuccessAppendsOrphanErrors: + """set_submission_validation_status receives combined fileErrors (existing + orphan F008).""" + + def test_set_submission_validation_status_called_with_combined_file_errors( + self, mock_configs, mock_mongo_dao + ): + """On successful delete with orphan_errors, set_submission_validation_status is called with fileErrors = existing + orphan_errors.""" + msg = 
_make_delete_message() + job_queue = MagicMock() + job_queue.receiveMsgs.side_effect = [[msg], KeyboardInterrupt] + + existing_error = {"submittedID": "old.csv", "errors": []} + orphan_error = {"submittedID": "orphan.csv", "errors": [{"code": "F008"}]} + # Re-fetch returns fresh submission with existing fileErrors; combined = existing + orphan_errors + mock_mongo_dao.get_submission.return_value = { + constants.ID: "sub-1", + constants.METADATA_VALIDATION_STATUS: constants.STATUS_PASSED, + constants.FILE_ERRORS: [existing_error], + } + + with patch("essential_validator.ModelFactory"): + with patch("essential_validator.set_scale_in_protection"): + with patch("essential_validator.MetadataRemover") as mock_remover_class: + mock_remover = MagicMock() + mock_remover.submission = { + constants.ID: "sub-1", + constants.METADATA_VALIDATION_STATUS: constants.STATUS_PASSED, + constants.FILE_ERRORS: [existing_error], + } + mock_remover.remove_metadata.return_value = (True, [orphan_error]) + mock_remover_class.return_value = mock_remover + try: + essentialValidate(mock_configs, job_queue, mock_mongo_dao) + except KeyboardInterrupt: + pass + + mock_mongo_dao.set_submission_validation_status.assert_called_once() + call_args = mock_mongo_dao.set_submission_validation_status.call_args[0] + # set_submission_validation_status(submission, file_status, metadata_status, cross_submission_status, fileErrors, is_delete, ...) 
+ file_errors = call_args[4] + assert file_errors == [existing_error, orphan_error] + assert call_args[5] is True + + def test_set_submission_validation_status_with_no_existing_file_errors( + self, mock_configs, mock_mongo_dao + ): + """When submission has no fileErrors, combined is just orphan_errors.""" + msg = _make_delete_message() + job_queue = MagicMock() + job_queue.receiveMsgs.side_effect = [[msg], KeyboardInterrupt] + + orphan_error = {"submittedID": "orphan.csv", "errors": [{"code": "F008"}]} + # Re-fetch returns None so we use validator.submission (no FILE_ERRORS); existing = [] + mock_mongo_dao.get_submission.return_value = None + + with patch("essential_validator.ModelFactory"): + with patch("essential_validator.set_scale_in_protection"): + with patch("essential_validator.MetadataRemover") as mock_remover_class: + mock_remover = MagicMock() + mock_remover.submission = { + constants.ID: "sub-1", + constants.METADATA_VALIDATION_STATUS: constants.STATUS_PASSED, + } + mock_remover.remove_metadata.return_value = (True, [orphan_error]) + mock_remover_class.return_value = mock_remover + try: + essentialValidate(mock_configs, job_queue, mock_mongo_dao) + except KeyboardInterrupt: + pass + + call_args = mock_mongo_dao.set_submission_validation_status.call_args[0] + file_errors = call_args[4] + assert file_errors == [orphan_error] + + def test_set_submission_validation_status_no_orphan_errors_keeps_existing_only( + self, mock_configs, mock_mongo_dao + ): + """When remove_metadata returns no orphan_errors, fileErrors is only existing.""" + msg = _make_delete_message() + job_queue = MagicMock() + job_queue.receiveMsgs.side_effect = [[msg], KeyboardInterrupt] + + existing_error = {"submittedID": "old.csv", "errors": []} + mock_mongo_dao.get_submission.return_value = { + constants.ID: "sub-1", + constants.METADATA_VALIDATION_STATUS: constants.STATUS_PASSED, + constants.FILE_ERRORS: [existing_error], + } + + with patch("essential_validator.ModelFactory"): + with 
patch("essential_validator.set_scale_in_protection"): + with patch("essential_validator.MetadataRemover") as mock_remover_class: + mock_remover = MagicMock() + mock_remover.submission = { + constants.ID: "sub-1", + constants.METADATA_VALIDATION_STATUS: constants.STATUS_PASSED, + constants.FILE_ERRORS: [existing_error], + } + mock_remover.remove_metadata.return_value = (True, []) + mock_remover_class.return_value = mock_remover + try: + essentialValidate(mock_configs, job_queue, mock_mongo_dao) + except KeyboardInterrupt: + pass + + call_args = mock_mongo_dao.set_submission_validation_status.call_args[0] + file_errors = call_args[4] + assert file_errors == [existing_error] diff --git a/src/test/validators/test_metadata_batch_validation.py b/src/test/validators/test_metadata_batch_validation.py new file mode 100644 index 00000000..83ad7f0e --- /dev/null +++ b/src/test/validators/test_metadata_batch_validation.py @@ -0,0 +1,744 @@ +import pytest +import json +import sys +import os +from unittest.mock import MagicMock, patch + +# Resolve project root from this file (src/test/validators/...) and add src to path. 
+_this_dir = os.path.dirname(os.path.abspath(__file__)) +_project_root = os.path.dirname(os.path.dirname(os.path.dirname(_this_dir))) +sys.path.insert(0, os.path.join(_project_root, 'src')) + +from metadata_validator import metadataValidate, MetaDataValidator +from common import constants +from pymongo import errors + + +# --------------------------------------------------------------------------- +# increment_completed_batches (mongo_dao) +# --------------------------------------------------------------------------- + +class TestIncrementCompletedBatches: + + def _setup_mock_db(self, mock_client_class): + mock_client = MagicMock() + mock_client_class.return_value = mock_client + mock_db = MagicMock() + mock_client.__getitem__.return_value = mock_db + mock_validation_col = MagicMock() + + def db_getitem(key): + if key == constants.VALIDATION_COLLECTION: + return mock_validation_col + return MagicMock() + + mock_db.__getitem__.side_effect = db_getitem + return mock_validation_col + + @patch("common.mongo_dao.MongoClient") + def test_increments_and_returns_count(self, mock_client_class): + from common.mongo_dao import MongoDao + col = self._setup_mock_db(mock_client_class) + col.find_one_and_update.return_value = {constants.COMPLETED_BATCHES: 2} + + dao = MongoDao("mongodb://localhost", "test_db") + count, is_last, failed, worst, details = dao.increment_completed_batches("val-1", 5) + + assert count == 2 + assert is_last is False + assert failed == 0 + assert worst == constants.STATUS_PASSED + assert details == [] + + @patch("common.mongo_dao.MongoClient") + def test_is_last_batch_when_count_equals_total(self, mock_client_class): + from common.mongo_dao import MongoDao + col = self._setup_mock_db(mock_client_class) + col.find_one_and_update.return_value = {constants.COMPLETED_BATCHES: 5} + + dao = MongoDao("mongodb://localhost", "test_db") + count, is_last, failed, worst, details = dao.increment_completed_batches("val-1", 5) + + assert count == 5 + assert is_last is 
True + assert failed == 0 + + @patch("common.mongo_dao.MongoClient") + def test_is_last_batch_when_count_exceeds_total(self, mock_client_class): + from common.mongo_dao import MongoDao + col = self._setup_mock_db(mock_client_class) + col.find_one_and_update.return_value = {constants.COMPLETED_BATCHES: 6} + + dao = MongoDao("mongodb://localhost", "test_db") + count, is_last, failed, worst, details = dao.increment_completed_batches("val-1", 5) + + assert count == 6 + assert is_last is True + assert failed == 0 + + @patch("common.mongo_dao.MongoClient") + def test_batch_failed_increments_both_counters(self, mock_client_class): + from common.mongo_dao import MongoDao + col = self._setup_mock_db(mock_client_class) + col.find_one_and_update.return_value = {constants.COMPLETED_BATCHES: 3, constants.FAILED_BATCHES: 1} + + dao = MongoDao("mongodb://localhost", "test_db") + count, is_last, failed, worst, details = dao.increment_completed_batches("val-1", 5, batch_failed=True) + + assert count == 3 + assert is_last is False + assert failed == 1 + update_arg = col.find_one_and_update.call_args[0][1] + assert update_arg['$inc'] == {constants.COMPLETED_BATCHES: 1, constants.FAILED_BATCHES: 1} + + @patch("common.mongo_dao.MongoClient") + def test_batch_success_does_not_increment_failed(self, mock_client_class): + from common.mongo_dao import MongoDao + col = self._setup_mock_db(mock_client_class) + col.find_one_and_update.return_value = {constants.COMPLETED_BATCHES: 3} + + dao = MongoDao("mongodb://localhost", "test_db") + count, is_last, failed, worst, details = dao.increment_completed_batches("val-1", 5, batch_failed=False) + + assert failed == 0 + update_arg = col.find_one_and_update.call_args[0][1] + assert update_arg['$inc'] == {constants.COMPLETED_BATCHES: 1} + + @patch("common.mongo_dao.MongoClient") + def test_batch_status_uses_max_operator(self, mock_client_class): + from common.mongo_dao import MongoDao + col = self._setup_mock_db(mock_client_class) + 
col.find_one_and_update.return_value = { + constants.COMPLETED_BATCHES: 2, constants.WORST_BATCH_STATUS: 2, + } + + dao = MongoDao("mongodb://localhost", "test_db") + count, is_last, failed, worst, details = dao.increment_completed_batches( + "val-1", 5, batch_status=constants.STATUS_ERROR + ) + + update_arg = col.find_one_and_update.call_args[0][1] + assert update_arg['$max'] == {constants.WORST_BATCH_STATUS: constants.STATUS_PRECEDENCE[constants.STATUS_ERROR]} + assert worst == constants.STATUS_ERROR + + @patch("common.mongo_dao.MongoClient") + def test_worst_status_maps_back_from_numeric(self, mock_client_class): + from common.mongo_dao import MongoDao + col = self._setup_mock_db(mock_client_class) + col.find_one_and_update.return_value = { + constants.COMPLETED_BATCHES: 3, constants.WORST_BATCH_STATUS: 2, + } + + dao = MongoDao("mongodb://localhost", "test_db") + _, _, _, worst, _ = dao.increment_completed_batches("val-1", 5, batch_status=constants.STATUS_ERROR) + + assert worst == constants.STATUS_ERROR + + @patch("common.mongo_dao.MongoClient") + def test_status_detail_uses_push_operator(self, mock_client_class): + from common.mongo_dao import MongoDao + col = self._setup_mock_db(mock_client_class) + col.find_one_and_update.return_value = { + constants.COMPLETED_BATCHES: 1, + constants.BATCH_STATUS_DETAILS: ['Submission not found: sub-1'], + } + + dao = MongoDao("mongodb://localhost", "test_db") + _, _, _, _, details = dao.increment_completed_batches( + "val-1", 5, status_detail='Submission not found: sub-1' + ) + + update_arg = col.find_one_and_update.call_args[0][1] + assert update_arg['$push'] == {constants.BATCH_STATUS_DETAILS: 'Submission not found: sub-1'} + assert details == ['Submission not found: sub-1'] + + @patch("common.mongo_dao.MongoClient") + def test_no_push_when_status_detail_is_none(self, mock_client_class): + from common.mongo_dao import MongoDao + col = self._setup_mock_db(mock_client_class) + col.find_one_and_update.return_value = 
{constants.COMPLETED_BATCHES: 1} + + dao = MongoDao("mongodb://localhost", "test_db") + dao.increment_completed_batches("val-1", 5, status_detail=None) + + update_arg = col.find_one_and_update.call_args[0][1] + assert '$push' not in update_arg + + @patch("common.mongo_dao.MongoClient") + def test_no_max_when_batch_status_is_none(self, mock_client_class): + from common.mongo_dao import MongoDao + col = self._setup_mock_db(mock_client_class) + col.find_one_and_update.return_value = {constants.COMPLETED_BATCHES: 1} + + dao = MongoDao("mongodb://localhost", "test_db") + dao.increment_completed_batches("val-1", 5, batch_status=None) + + update_arg = col.find_one_and_update.call_args[0][1] + assert '$max' not in update_arg + + @patch("common.mongo_dao.MongoClient") + def test_validation_not_found_returns_none(self, mock_client_class): + from common.mongo_dao import MongoDao + col = self._setup_mock_db(mock_client_class) + col.find_one_and_update.return_value = None + + dao = MongoDao("mongodb://localhost", "test_db") + count, is_last, failed, worst, details = dao.increment_completed_batches("val-missing", 5) + + assert count is None + assert is_last is False + assert failed == 0 + assert worst is None + assert details == [] + + @patch("common.mongo_dao.MongoClient") + def test_pymongo_error_returns_none(self, mock_client_class): + from common.mongo_dao import MongoDao + col = self._setup_mock_db(mock_client_class) + col.find_one_and_update.side_effect = errors.PyMongoError("db error") + + dao = MongoDao("mongodb://localhost", "test_db") + count, is_last, failed, worst, details = dao.increment_completed_batches("val-1", 5) + + assert count is None + assert is_last is False + assert failed == 0 + assert worst is None + assert details == [] + + +# --------------------------------------------------------------------------- +# get_dataRecords_by_ids (mongo_dao) +# --------------------------------------------------------------------------- + +class TestGetDataRecordsByIds: + + 
def _setup_mock_db(self, mock_client_class): + mock_client = MagicMock() + mock_client_class.return_value = mock_client + mock_db = MagicMock() + mock_client.__getitem__.return_value = mock_db + mock_data_col = MagicMock() + + def db_getitem(key): + if key == constants.DATA_COLLECTION: + return mock_data_col + return MagicMock() + + mock_db.__getitem__.side_effect = db_getitem + return mock_data_col + + @patch("common.mongo_dao.MongoClient") + def test_full_match(self, mock_client_class): + from common.mongo_dao import MongoDao + col = self._setup_mock_db(mock_client_class) + records = [{constants.ID: 'r1'}, {constants.ID: 'r2'}] + col.find.return_value = records + + dao = MongoDao("mongodb://localhost", "test_db") + result = dao.get_dataRecords_by_ids(['r1', 'r2']) + + assert len(result) == 2 + + @patch("common.mongo_dao.MongoClient") + def test_partial_match_logs_warning(self, mock_client_class, caplog): + import logging + from common.mongo_dao import MongoDao + col = self._setup_mock_db(mock_client_class) + col.find.return_value = [{constants.ID: 'r1'}] + + dao = MongoDao("mongodb://localhost", "test_db") + with caplog.at_level(logging.WARNING): + result = dao.get_dataRecords_by_ids(['r1', 'r2', 'r3']) + + assert len(result) == 1 + warning_msgs = [r.getMessage() for r in caplog.records if r.levelno == logging.WARNING] + assert any('Partial match' in m for m in warning_msgs) + + @patch("common.mongo_dao.MongoClient") + def test_empty_result(self, mock_client_class): + from common.mongo_dao import MongoDao + col = self._setup_mock_db(mock_client_class) + col.find.return_value = [] + + dao = MongoDao("mongodb://localhost", "test_db") + result = dao.get_dataRecords_by_ids(['r1']) + + assert result == [] + + @patch("common.mongo_dao.MongoClient") + def test_pymongo_error_returns_none(self, mock_client_class): + from common.mongo_dao import MongoDao + col = self._setup_mock_db(mock_client_class) + col.find.side_effect = errors.PyMongoError("db error") + + dao = 
MongoDao("mongodb://localhost", "test_db") + result = dao.get_dataRecords_by_ids(['r1']) + + assert result is None + + +# --------------------------------------------------------------------------- +# Batch handler (metadataValidate) — integration tests +# --------------------------------------------------------------------------- + +class TestBatchHandler: + """Tests for the constants.TYPE_METADATA_VALIDATE_BATCH branch in metadataValidate.""" + + @pytest.fixture + def mock_configs(self): + from common.constants import MODEL_FILE_DIR, TIER_CONFIG, SQS_NAME + return { + MODEL_FILE_DIR: '/fake/models', + TIER_CONFIG: '/fake/tier.json', + SQS_NAME: 'test-queue', + } + + @pytest.fixture + def mock_model_store(self): + store = MagicMock() + model = MagicMock() + model.model = True + model.get_nodes.return_value = ['node1'] + store.get_model_by_data_common_version.return_value = model + return store + + @pytest.fixture + def mock_mongo_dao(self): + dao = MagicMock() + dao.get_dataRecords_by_ids.return_value = [{constants.ID: 'r1'}, {constants.ID: 'r2'}] + dao.get_submission.return_value = { + '_id': 'sub-1', + constants.DATA_COMMON_NAME: 'CDS', + constants.MODEL_VERSION: '1.0', + constants.STUDY_ID: 'study-1', + } + dao.find_study_by_id.return_value = {'studyName': 'Test Study'} + dao.find_organization_name_by_study_id.return_value = ['Test Org'] + dao.increment_completed_batches.return_value = (1, False, 0, constants.STATUS_PASSED, []) + dao.update_validation_status.return_value = True + dao.set_submission_validation_status.return_value = True + return dao + + def _make_batch_msg(self, overrides=None): + payload = { + constants.SQS_TYPE: constants.TYPE_METADATA_VALIDATE_BATCH, + constants.SUBMISSION_ID: 'sub-1', + constants.VALIDATION_ID: 'val-1', + constants.DATA_RECORD_IDS: ['r1', 'r2'], + constants.TOTAL_BATCHES: 3, + constants.BATCH_INDEX: 0, + constants.SCOPE: 'New', + } + if overrides: + payload.update(overrides) + msg = MagicMock() + msg.body = 
json.dumps(payload) + return msg + + def _run_one_message(self, mock_configs, mock_model_store, mock_mongo_dao, msg): + """Run the service loop for exactly one message then break.""" + job_queue = MagicMock() + job_queue.receiveMsgs.side_effect = [[msg], KeyboardInterrupt] + + with patch('metadata_validator.ModelFactory', return_value=mock_model_store): + with patch('metadata_validator.set_scale_in_protection'): + metadataValidate(mock_configs, job_queue, mock_mongo_dao) + + # -- happy path (non-last batch) -- + + def test_happy_path_non_last_batch(self, mock_configs, mock_model_store, mock_mongo_dao): + msg = self._make_batch_msg() + mock_mongo_dao.increment_completed_batches.return_value = (1, False, 0, constants.STATUS_PASSED, []) + + self._run_one_message(mock_configs, mock_model_store, mock_mongo_dao, msg) + + mock_mongo_dao.get_dataRecords_by_ids.assert_called_once_with(['r1', 'r2']) + mock_mongo_dao.increment_completed_batches.assert_called_once() + call_kwargs = mock_mongo_dao.increment_completed_batches.call_args[1] + assert call_kwargs['batch_failed'] is False + assert call_kwargs['status_detail'] is None + mock_mongo_dao.update_validation_status.assert_not_called() + msg.delete.assert_called_once() + + # -- happy path (last batch) -- + + def test_happy_path_last_batch_sets_final_status(self, mock_configs, mock_model_store, mock_mongo_dao): + msg = self._make_batch_msg({constants.BATCH_INDEX: 2}) + mock_mongo_dao.increment_completed_batches.return_value = (3, True, 0, constants.STATUS_PASSED, []) + + self._run_one_message(mock_configs, mock_model_store, mock_mongo_dao, msg) + + mock_mongo_dao.update_validation_status.assert_called_once() + args = mock_mongo_dao.update_validation_status.call_args[0] + kwargs = mock_mongo_dao.update_validation_status.call_args[1] + assert args[0] == 'val-1' + assert args[1] == constants.STATUS_PASSED + assert kwargs['status_detail'] is None + + mock_mongo_dao.set_submission_validation_status.assert_called_once() + 
sub_kwargs = mock_mongo_dao.set_submission_validation_status.call_args[1] + assert sub_kwargs['status_detail'] is None + msg.delete.assert_called_once() + + # -- validate_nodes throws, counter still increments -- + + def test_validate_nodes_exception_still_increments(self, mock_configs, mock_model_store, mock_mongo_dao): + msg = self._make_batch_msg() + + with patch.object(MetaDataValidator, 'validate_nodes', side_effect=RuntimeError('boom')): + self._run_one_message(mock_configs, mock_model_store, mock_mongo_dao, msg) + + call_kwargs = mock_mongo_dao.increment_completed_batches.call_args[1] + assert call_kwargs['batch_failed'] is True + assert call_kwargs['batch_status'] == constants.FAILED + + def test_validate_nodes_exception_last_batch_uses_worst_status(self, mock_configs, mock_model_store, mock_mongo_dao): + msg = self._make_batch_msg({constants.BATCH_INDEX: 2}) + mock_mongo_dao.increment_completed_batches.return_value = (3, True, 1, constants.STATUS_ERROR, []) + + with patch.object(MetaDataValidator, 'validate_nodes', side_effect=RuntimeError('boom')): + self._run_one_message(mock_configs, mock_model_store, mock_mongo_dao, msg) + + args = mock_mongo_dao.update_validation_status.call_args[0] + assert args[1] == constants.STATUS_ERROR + + # -- model not available -- + + def test_model_unavailable_still_increments(self, mock_configs, mock_model_store, mock_mongo_dao): + bad_model = MagicMock() + bad_model.model = None + bad_model.get_nodes.return_value = [] + mock_model_store.get_model_by_data_common_version.return_value = bad_model + + msg = self._make_batch_msg() + self._run_one_message(mock_configs, mock_model_store, mock_mongo_dao, msg) + + mock_mongo_dao.increment_completed_batches.assert_called_once() + + def test_model_unavailable_last_batch_preserves_status_error(self, mock_configs, mock_model_store, mock_mongo_dao): + bad_model = MagicMock() + bad_model.model = None + bad_model.get_nodes.return_value = [] + 
mock_model_store.get_model_by_data_common_version.return_value = bad_model + mock_mongo_dao.increment_completed_batches.return_value = ( + 1, True, 1, constants.FAILED, ['CDS model version "1.0" is not available.'] + ) + + msg = self._make_batch_msg({constants.TOTAL_BATCHES: 1}) + self._run_one_message(mock_configs, mock_model_store, mock_mongo_dao, msg) + + args = mock_mongo_dao.update_validation_status.call_args[0] + kwargs = mock_mongo_dao.update_validation_status.call_args[1] + assert args[1] == constants.FAILED + assert any('not available' in d for d in kwargs['status_detail']) + # Caller passes FAILED; DAO does not write it to submission + sub_args = mock_mongo_dao.set_submission_validation_status.call_args[0] + assert sub_args[2] == constants.FAILED + + # -- missing study -- + + def test_missing_study_id_still_increments(self, mock_configs, mock_model_store, mock_mongo_dao): + mock_mongo_dao.get_submission.return_value = { + '_id': 'sub-1', + constants.DATA_COMMON_NAME: 'CDS', + constants.MODEL_VERSION: '1.0', + } + msg = self._make_batch_msg() + self._run_one_message(mock_configs, mock_model_store, mock_mongo_dao, msg) + + mock_mongo_dao.increment_completed_batches.assert_called_once() + + def test_study_not_found_still_increments(self, mock_configs, mock_model_store, mock_mongo_dao): + mock_mongo_dao.find_study_by_id.return_value = None + msg = self._make_batch_msg() + self._run_one_message(mock_configs, mock_model_store, mock_mongo_dao, msg) + + mock_mongo_dao.increment_completed_batches.assert_called_once() + + def test_missing_study_last_batch_sets_failed(self, mock_configs, mock_model_store, mock_mongo_dao): + mock_mongo_dao.find_study_by_id.return_value = None + mock_mongo_dao.increment_completed_batches.return_value = ( + 1, True, 1, constants.FAILED, ['Invalid submission, no study found, sub-1!'] + ) + msg = self._make_batch_msg({constants.TOTAL_BATCHES: 1}) + + self._run_one_message(mock_configs, mock_model_store, mock_mongo_dao, msg) + + args = 
mock_mongo_dao.update_validation_status.call_args[0] + kwargs = mock_mongo_dao.update_validation_status.call_args[1] + assert args[1] == constants.FAILED + assert any('no study found' in d for d in kwargs['status_detail']) + sub_args = mock_mongo_dao.set_submission_validation_status.call_args[0] + assert sub_args[2] == constants.FAILED + + # -- submission not found -- + + def test_submission_not_found_still_increments(self, mock_configs, mock_model_store, mock_mongo_dao): + mock_mongo_dao.get_submission.return_value = None + msg = self._make_batch_msg() + self._run_one_message(mock_configs, mock_model_store, mock_mongo_dao, msg) + + mock_mongo_dao.increment_completed_batches.assert_called_once() + + def test_submission_not_found_last_batch_sets_failed(self, mock_configs, mock_model_store, mock_mongo_dao): + mock_mongo_dao.get_submission.return_value = None + mock_mongo_dao.increment_completed_batches.return_value = ( + 1, True, 1, constants.FAILED, ['Submission not found: sub-1'] + ) + msg = self._make_batch_msg({constants.TOTAL_BATCHES: 1}) + + self._run_one_message(mock_configs, mock_model_store, mock_mongo_dao, msg) + + args = mock_mongo_dao.update_validation_status.call_args[0] + kwargs = mock_mongo_dao.update_validation_status.call_args[1] + assert args[1] == constants.FAILED + assert any('Submission not found' in d for d in kwargs['status_detail']) + mock_mongo_dao.set_submission_validation_status.assert_not_called() + + # -- no data records found -- + + def test_no_data_records_still_increments(self, mock_configs, mock_model_store, mock_mongo_dao): + mock_mongo_dao.get_dataRecords_by_ids.return_value = [] + msg = self._make_batch_msg() + self._run_one_message(mock_configs, mock_model_store, mock_mongo_dao, msg) + + mock_mongo_dao.increment_completed_batches.assert_called_once() + + def test_no_data_records_last_batch_updates_submission_status(self, mock_configs, mock_model_store, mock_mongo_dao): + mock_mongo_dao.get_dataRecords_by_ids.return_value = [] + 
mock_mongo_dao.increment_completed_batches.return_value = ( + 1, True, 1, constants.FAILED, ['No data records found for provided IDs in batch 0'] + ) + msg = self._make_batch_msg({constants.TOTAL_BATCHES: 1}) + + self._run_one_message(mock_configs, mock_model_store, mock_mongo_dao, msg) + + mock_mongo_dao.update_validation_status.assert_called_once() + args = mock_mongo_dao.update_validation_status.call_args[0] + kwargs = mock_mongo_dao.update_validation_status.call_args[1] + assert args[1] == constants.FAILED + assert any('No data records found' in d for d in kwargs['status_detail']) + + mock_mongo_dao.set_submission_validation_status.assert_called_once() + sub_args = mock_mongo_dao.set_submission_validation_status.call_args[0] + sub_kwargs = mock_mongo_dao.set_submission_validation_status.call_args[1] + assert sub_args[2] == constants.FAILED + assert any('No data records found' in d for d in sub_kwargs['status_detail']) + + # -- missing scope treated as invalid message -- + + def test_missing_scope_treated_as_invalid(self, mock_configs, mock_model_store, mock_mongo_dao): + payload = { + constants.SQS_TYPE: constants.TYPE_METADATA_VALIDATE_BATCH, + constants.SUBMISSION_ID: 'sub-1', + constants.VALIDATION_ID: 'val-1', + constants.DATA_RECORD_IDS: ['r1', 'r2'], + constants.TOTAL_BATCHES: 3, + constants.BATCH_INDEX: 0, + } + msg = MagicMock() + msg.body = json.dumps(payload) + + self._run_one_message(mock_configs, mock_model_store, mock_mongo_dao, msg) + + mock_mongo_dao.get_submission.assert_called_once() + mock_mongo_dao.get_dataRecords_by_ids.assert_not_called() + call_kwargs = mock_mongo_dao.increment_completed_batches.call_args[1] + assert call_kwargs['batch_failed'] is True + assert call_kwargs['batch_status'] == constants.FAILED + msg.delete.assert_called_once() + + def test_missing_scope_last_batch_updates_submission_status(self, mock_configs, mock_model_store, mock_mongo_dao): + payload = { + constants.SQS_TYPE: constants.TYPE_METADATA_VALIDATE_BATCH, + 
constants.SUBMISSION_ID: 'sub-1', + constants.VALIDATION_ID: 'val-1', + constants.DATA_RECORD_IDS: ['r1', 'r2'], + constants.TOTAL_BATCHES: 1, + constants.BATCH_INDEX: 0, + } + msg = MagicMock() + msg.body = json.dumps(payload) + mock_mongo_dao.increment_completed_batches.return_value = ( + 1, True, 1, constants.FAILED, ['Missing required field: scope'] + ) + + self._run_one_message(mock_configs, mock_model_store, mock_mongo_dao, msg) + + mock_mongo_dao.update_validation_status.assert_called_once() + args = mock_mongo_dao.update_validation_status.call_args[0] + kwargs = mock_mongo_dao.update_validation_status.call_args[1] + assert args[1] == constants.FAILED + assert 'Missing required field: scope' in kwargs['status_detail'] + + mock_mongo_dao.set_submission_validation_status.assert_called_once() + sub_args = mock_mongo_dao.set_submission_validation_status.call_args[0] + sub_kwargs = mock_mongo_dao.set_submission_validation_status.call_args[1] + assert sub_args[2] == constants.FAILED + assert 'Missing required field: scope' in sub_kwargs['status_detail'] + + # -- zero total_batches treated as invalid message -- + + def test_zero_total_batches_treated_as_invalid(self, mock_configs, mock_model_store, mock_mongo_dao): + msg = self._make_batch_msg({constants.TOTAL_BATCHES: 0}) + + self._run_one_message(mock_configs, mock_model_store, mock_mongo_dao, msg) + + mock_mongo_dao.increment_completed_batches.assert_not_called() + mock_mongo_dao.get_dataRecords_by_ids.assert_not_called() + msg.delete.assert_called_once() + + # -- worst-status precedence -- + + def test_worst_status_precedence_used_as_final(self, mock_configs, mock_model_store, mock_mongo_dao): + """Final status comes from worst_status in the 5-tuple, not from local batch state.""" + msg = self._make_batch_msg({constants.BATCH_INDEX: 2}) + mock_mongo_dao.increment_completed_batches.return_value = (3, True, 0, constants.STATUS_WARNING, []) + + self._run_one_message(mock_configs, mock_model_store, mock_mongo_dao, 
msg) + + args = mock_mongo_dao.update_validation_status.call_args[0] + assert args[1] == constants.STATUS_WARNING + + def test_last_batch_succeeds_but_prior_batch_failed_uses_worst_status(self, mock_configs, mock_model_store, mock_mongo_dao): + """When the finalizing batch succeeds locally but worst_status is Failed, + the final status must reflect the worst across all batches.""" + msg = self._make_batch_msg({constants.BATCH_INDEX: 2}) + mock_mongo_dao.increment_completed_batches.return_value = ( + 3, True, 1, constants.FAILED, ['Batch 1: Submission not found'] + ) + + self._run_one_message(mock_configs, mock_model_store, mock_mongo_dao, msg) + + args = mock_mongo_dao.update_validation_status.call_args[0] + kwargs = mock_mongo_dao.update_validation_status.call_args[1] + assert args[1] == constants.FAILED + assert kwargs['status_detail'] == ['Batch 1: Submission not found'] + + def test_multiple_prior_failures_accumulated_in_details(self, mock_configs, mock_model_store, mock_mongo_dao): + msg = self._make_batch_msg({constants.BATCH_INDEX: 2}) + accumulated = ['Batch 0: No data records found', 'Batch 1: Missing scope'] + mock_mongo_dao.increment_completed_batches.return_value = (3, True, 2, constants.STATUS_ERROR, accumulated) + + self._run_one_message(mock_configs, mock_model_store, mock_mongo_dao, msg) + + args = mock_mongo_dao.update_validation_status.call_args[0] + kwargs = mock_mongo_dao.update_validation_status.call_args[1] + assert args[1] == constants.STATUS_ERROR + assert kwargs['status_detail'] == accumulated + + # -- accumulated details written as array -- + + def test_accumulated_details_written_as_array(self, mock_configs, mock_model_store, mock_mongo_dao): + msg = self._make_batch_msg({constants.BATCH_INDEX: 2}) + details_array = ['Batch 0: error A', 'Batch 1: error B'] + mock_mongo_dao.increment_completed_batches.return_value = (3, True, 2, constants.STATUS_ERROR, details_array) + + self._run_one_message(mock_configs, mock_model_store, mock_mongo_dao, 
msg) + + kwargs = mock_mongo_dao.update_validation_status.call_args[1] + assert isinstance(kwargs['status_detail'], list) + assert kwargs['status_detail'] == details_array + + sub_kwargs = mock_mongo_dao.set_submission_validation_status.call_args[1] + assert isinstance(sub_kwargs['status_detail'], list) + assert sub_kwargs['status_detail'] == details_array + + # -- empty batch_details on success yields None -- + + def test_success_with_empty_details_sets_none(self, mock_configs, mock_model_store, mock_mongo_dao): + msg = self._make_batch_msg({constants.BATCH_INDEX: 2}) + mock_mongo_dao.increment_completed_batches.return_value = (3, True, 0, constants.STATUS_PASSED, []) + + self._run_one_message(mock_configs, mock_model_store, mock_mongo_dao, msg) + + kwargs = mock_mongo_dao.update_validation_status.call_args[1] + assert kwargs['status_detail'] is None + + # -- increment_completed_batches exception skips finalization -- + + def test_increment_exception_skips_finalization(self, mock_configs, mock_model_store, mock_mongo_dao): + mock_mongo_dao.increment_completed_batches.side_effect = RuntimeError("db down") + msg = self._make_batch_msg() + + self._run_one_message(mock_configs, mock_model_store, mock_mongo_dao, msg) + + mock_mongo_dao.update_validation_status.assert_not_called() + mock_mongo_dao.set_submission_validation_status.assert_not_called() + msg.delete.assert_called_once() + + # -- increment_completed_batches returns None (DB error, no exception) -- + + def test_increment_returns_none_logs_error(self, mock_configs, mock_model_store, mock_mongo_dao, caplog): + import logging + mock_mongo_dao.increment_completed_batches.return_value = (None, False, 0, None, []) + msg = self._make_batch_msg() + + with caplog.at_level(logging.ERROR): + self._run_one_message(mock_configs, mock_model_store, mock_mongo_dao, msg) + + mock_mongo_dao.update_validation_status.assert_not_called() + mock_mongo_dao.set_submission_validation_status.assert_not_called() + assert 
any('validation may be stuck' in r.getMessage() for r in caplog.records) + + # -- missing validation_id -- + + def test_missing_validation_id_skips_processing(self, mock_configs, mock_model_store, mock_mongo_dao): + """When validation_id is absent, the function returns before try/finally, + so no increment and no data record fetch should occur.""" + msg = self._make_batch_msg({constants.VALIDATION_ID: None}) + + self._run_one_message(mock_configs, mock_model_store, mock_mongo_dao, msg) + + mock_mongo_dao.increment_completed_batches.assert_not_called() + mock_mongo_dao.get_dataRecords_by_ids.assert_not_called() + msg.delete.assert_called_once() + + # -- missing constants.DATA_COMMON_NAME on last batch -- + + def test_missing_datacommon_last_batch_sets_failed(self, mock_configs, mock_model_store, mock_mongo_dao): + mock_mongo_dao.get_submission.return_value = { + '_id': 'sub-1', + constants.MODEL_VERSION: '1.0', + constants.STUDY_ID: 'study-1', + } + mock_mongo_dao.increment_completed_batches.return_value = ( + 1, True, 1, constants.FAILED, ['Invalid submission, no datacommon found, sub-1!'] + ) + msg = self._make_batch_msg({constants.TOTAL_BATCHES: 1}) + + self._run_one_message(mock_configs, mock_model_store, mock_mongo_dao, msg) + + args = mock_mongo_dao.update_validation_status.call_args[0] + kwargs = mock_mongo_dao.update_validation_status.call_args[1] + assert args[1] == constants.FAILED + assert any('no datacommon found' in d for d in kwargs['status_detail']) + sub_args = mock_mongo_dao.set_submission_validation_status.call_args[0] + assert sub_args[2] == constants.FAILED + + # -- get_dataRecords_by_ids returns None (DB error) -- + + def test_data_records_db_error_returns_none_still_increments(self, mock_configs, mock_model_store, mock_mongo_dao): + mock_mongo_dao.get_dataRecords_by_ids.return_value = None + msg = self._make_batch_msg() + + self._run_one_message(mock_configs, mock_model_store, mock_mongo_dao, msg) + + call_kwargs = 
mock_mongo_dao.increment_completed_batches.call_args[1] + assert call_kwargs['batch_failed'] is True + assert call_kwargs['batch_status'] == constants.FAILED + + def test_data_records_db_error_last_batch_sets_failed(self, mock_configs, mock_model_store, mock_mongo_dao): + mock_mongo_dao.get_dataRecords_by_ids.return_value = None + mock_mongo_dao.increment_completed_batches.return_value = ( + 1, True, 1, constants.FAILED, ['No data records found for provided IDs in batch 0'] + ) + msg = self._make_batch_msg({constants.TOTAL_BATCHES: 1}) + + self._run_one_message(mock_configs, mock_model_store, mock_mongo_dao, msg) + + args = mock_mongo_dao.update_validation_status.call_args[0] + kwargs = mock_mongo_dao.update_validation_status.call_args[1] + assert args[1] == constants.FAILED + assert any('No data records found' in d for d in kwargs['status_detail']) diff --git a/src/test/validators/test_metadata_remover.py b/src/test/validators/test_metadata_remover.py new file mode 100644 index 00000000..bce0c60c --- /dev/null +++ b/src/test/validators/test_metadata_remover.py @@ -0,0 +1,443 @@ +"""Unit tests for MetadataRemover: delete metadata SQS flow, deleteOrphanedDataFiles, and F008 orphan errors.""" +import os +import sys +from unittest.mock import MagicMock, patch + +import pytest + +_this_dir = os.path.dirname(os.path.abspath(__file__)) +_project_root = os.path.dirname(os.path.dirname(os.path.dirname(_this_dir))) +sys.path.insert(0, os.path.join(_project_root, "src")) + +from common import constants +from metadata_remover import MetadataRemover + + +# --------------------------------------------------------------------------- +# __init__ and remove_metadata return contract +# --------------------------------------------------------------------------- + +def test_errors_initialized(): + """MetadataRemover initializes self.errors to empty list.""" + mock_dao = MagicMock() + mock_store = MagicMock() + remover = MetadataRemover(mock_dao, mock_store) + assert remover.errors == 
[] + assert isinstance(remover.errors, list) + + +def test_remove_metadata_returns_tuple_on_invalid_submission(): + """When get_submission returns None, remove_metadata returns (False, []).""" + mock_dao = MagicMock() + mock_dao.get_submission.return_value = None + mock_store = MagicMock() + + with patch("metadata_remover.S3Bucket"): + remover = MetadataRemover(mock_dao, mock_store) + result, orphan_errors = remover.remove_metadata("sub-1", "Subject", ["n1"]) + + assert result is False + assert orphan_errors == [] + mock_dao.get_submission.assert_called_once_with("sub-1") + + +def test_remove_metadata_logs_no_record_when_submission_missing(caplog): + """Missing submission document logs no-record-found (not missing dataCommons).""" + mock_dao = MagicMock() + mock_dao.get_submission.return_value = None + mock_store = MagicMock() + with patch("metadata_remover.S3Bucket"): + remover = MetadataRemover(mock_dao, mock_store) + with caplog.at_level("ERROR", logger="Essential Validator"): + remover.remove_metadata("sub-1", "Subject", ["n1"]) + assert "no record found" in caplog.text + assert f"missing {constants.DATA_COMMON_NAME}" not in caplog.text + + +def test_remove_metadata_logs_missing_datacommons(caplog): + """Submission without dataCommons field logs missing field (not no-record-found).""" + mock_dao = MagicMock() + mock_dao.get_submission.return_value = {"_id": "sub-1"} + mock_store = MagicMock() + with patch("metadata_remover.S3Bucket"): + remover = MetadataRemover(mock_dao, mock_store) + with caplog.at_level("ERROR", logger="Essential Validator"): + remover.remove_metadata("sub-1", "Subject", ["n1"]) + assert f"missing {constants.DATA_COMMON_NAME}" in caplog.text + assert "no record found" not in caplog.text + + +def test_remove_metadata_returns_tuple_on_no_datacommon(): + """When submission has no dataCommons, remove_metadata returns (False, []).""" + mock_dao = MagicMock() + mock_dao.get_submission.return_value = {"_id": "sub-1"} # no dataCommons + mock_store = 
MagicMock() + + with patch("metadata_remover.S3Bucket"): + remover = MetadataRemover(mock_dao, mock_store) + result, orphan_errors = remover.remove_metadata("sub-1", "Subject", ["n1"]) + + assert result is False + assert orphan_errors == [] + + +def test_remove_metadata_returns_tuple_on_model_unavailable(): + """When model is not available, remove_metadata returns (False, []).""" + mock_dao = MagicMock() + mock_dao.get_submission.return_value = { + "_id": "sub-1", + constants.DATA_COMMON_NAME: "dc1", + constants.ROOT_PATH: "r", + constants.MODEL_VERSION: "v1", + constants.BATCH_BUCKET: "b1", + } + mock_store = MagicMock() + bad_model = MagicMock() + bad_model.model = None + bad_model.get_nodes.return_value = [] + mock_store.get_model_by_data_common_version.return_value = bad_model + + with patch("metadata_remover.S3Bucket"): + remover = MetadataRemover(mock_dao, mock_store) + result, orphan_errors = remover.remove_metadata("sub-1", "Subject", ["n1"]) + + assert result is False + assert orphan_errors == [] + + +def test_remove_metadata_returns_tuple_on_no_existed_nodes(): + """When validate_data returns None/empty, remove_metadata returns (False, []).""" + mock_dao = MagicMock() + mock_dao.get_submission.return_value = { + "_id": "sub-1", + constants.DATA_COMMON_NAME: "dc1", + constants.ROOT_PATH: "r", + constants.MODEL_VERSION: "v1", + constants.BATCH_BUCKET: "b1", + } + mock_store = MagicMock() + mock_model = MagicMock() + mock_model.model = {"nodes": []} + mock_model.get_nodes.return_value = [] + mock_model.get_file_nodes.return_value = {} + mock_store.get_model_by_data_common_version.return_value = mock_model + mock_dao.check_metadata_ids.return_value = [] # no nodes to delete + + with patch("metadata_remover.S3Bucket"): + remover = MetadataRemover(mock_dao, mock_store) + result, orphan_errors = remover.remove_metadata("sub-1", "Subject", ["n1"]) + + assert result is False + assert orphan_errors == [] + + +def 
test_remove_metadata_success_returns_true_and_orphan_errors(): + """On successful delete, remove_metadata returns (True, orphan_errors) from _find_orphaned_files_and_build_errors.""" + mock_dao = MagicMock() + mock_dao.get_submission.return_value = { + "_id": "sub-1", + constants.DATA_COMMON_NAME: "dc1", + constants.ROOT_PATH: "r", + constants.MODEL_VERSION: "v1", + constants.BATCH_BUCKET: "b1", + } + mock_store = MagicMock() + mock_model = MagicMock() + mock_model.model = {"nodes": [{}]} + mock_model.get_nodes.return_value = [{}] # truthy so "model available" check passes + mock_model.get_file_nodes.return_value = {} + mock_store.get_model_by_data_common_version.return_value = mock_model + mock_dao.check_metadata_ids.return_value = [{constants.NODE_ID: "n1", constants.NODE_TYPE: "Subject"}] + mock_dao.delete_data_records.return_value = True + + mock_bucket = MagicMock() + with patch("metadata_remover.S3Bucket", return_value=mock_bucket): + remover = MetadataRemover(mock_dao, mock_store) + with patch.object(remover, "process_children", return_value=True): + with patch.object( + remover, + "_find_orphaned_files_and_build_errors", + return_value=[{"submittedID": "orphan.csv", "errors": [{"code": "F008"}]}], + ): + result, orphan_errors = remover.remove_metadata( + "sub-1", "Subject", ["n1"], delete_orphaned_data_files=False + ) + + assert result is True + assert len(orphan_errors) == 1 + assert orphan_errors[0]["submittedID"] == "orphan.csv" + assert orphan_errors[0]["errors"][0]["code"] == "F008" + + +def test_delete_nodes_skips_s3_when_delete_orphaned_false(): + """When delete_orphaned_data_files is False, delete_nodes does not call delete_files_in_s3.""" + mock_dao = MagicMock() + mock_dao.delete_data_records.return_value = True + s3_info = {constants.FILE_NAME: "data.tsv"} + nodes = [ + { + constants.NODE_ID: "n1", + constants.NODE_TYPE: "File", + constants.S3_FILE_INFO: s3_info, + } + ] + with patch("metadata_remover.S3Bucket"): + remover = 
MetadataRemover(mock_dao, MagicMock()) + remover.submission_id = "sub-1" + remover.def_file_nodes = {} + with patch.object(remover, "delete_files_in_s3") as del_s3: + with patch.object(remover, "process_children", return_value=True): + assert remover.delete_nodes(nodes, delete_orphaned_data_files=False) is True + del_s3.assert_not_called() + mock_dao.delete_data_records.assert_called_once_with(nodes) + + +def test_process_children_skips_s3_when_delete_orphaned_false(): + """Cascaded child file node: Mongo delete runs but S3 is skipped when flag is False.""" + mock_dao = MagicMock() + deleted_parent = {constants.NODE_TYPE: "Study", constants.NODE_ID: "p1"} + child = { + constants.NODE_TYPE: "CDSFile", + constants.NODE_ID: "f1", + constants.PARENTS: [{constants.PARENT_TYPE: "Study", constants.PARENT_ID_VAL: "p1"}], + constants.S3_FILE_INFO: {constants.FILE_NAME: "child.tsv"}, + } + mock_dao.get_nodes_by_parents.side_effect = [(True, [child]), (True, [])] + mock_dao.delete_data_records.return_value = True + with patch("metadata_remover.S3Bucket"): + remover = MetadataRemover(mock_dao, MagicMock()) + remover.submission_id = "sub-1" + remover.def_file_nodes = {"CDSFile": {}} + with patch.object(remover, "delete_files_in_s3") as del_s3: + assert remover.process_children([deleted_parent], delete_orphaned_data_files=False) is True + del_s3.assert_not_called() + mock_dao.delete_data_records.assert_called_once_with([child]) + + +def test_process_children_removes_only_matching_parent_when_same_parent_type_twice(): + """Child may have multiple parents of the same type; deleting one removes only that edge.""" + mock_dao = MagicMock() + deleted_parent = {constants.NODE_TYPE: "Study", constants.NODE_ID: "study_a"} + child = { + constants.NODE_TYPE: "Sample", + constants.NODE_ID: "s1", + constants.PARENTS: [ + {constants.PARENT_TYPE: "Study", constants.PARENT_ID_VAL: "study_a"}, + {constants.PARENT_TYPE: "Study", constants.PARENT_ID_VAL: "study_b"}, + ], + } + 
mock_dao.get_nodes_by_parents.return_value = (True, [child]) + mock_dao.delete_data_records.return_value = True + with patch("metadata_remover.S3Bucket"): + remover = MetadataRemover(mock_dao, MagicMock()) + remover.submission_id = "sub-1" + remover.def_file_nodes = {} + assert remover.process_children([deleted_parent], delete_orphaned_data_files=False) is True + + mock_dao.update_data_records.assert_called_once() + updated = mock_dao.update_data_records.call_args[0][0] + assert len(updated) == 1 + assert updated[0][constants.PARENTS] == [ + {constants.PARENT_TYPE: "Study", constants.PARENT_ID_VAL: "study_b"}, + ] + mock_dao.delete_data_records.assert_not_called() + + +def test_process_children_calls_s3_when_delete_orphaned_true(): + """Cascaded child file node: delete_files_in_s3 runs when flag is True.""" + mock_dao = MagicMock() + deleted_parent = {constants.NODE_TYPE: "Study", constants.NODE_ID: "p1"} + s3_info = {constants.FILE_NAME: "child.tsv"} + child = { + constants.NODE_TYPE: "CDSFile", + constants.NODE_ID: "f1", + constants.PARENTS: [{constants.PARENT_TYPE: "Study", constants.PARENT_ID_VAL: "p1"}], + constants.S3_FILE_INFO: s3_info, + } + mock_dao.get_nodes_by_parents.side_effect = [(True, [child]), (True, [])] + mock_dao.delete_data_records.return_value = True + with patch("metadata_remover.S3Bucket"): + remover = MetadataRemover(mock_dao, MagicMock()) + remover.submission_id = "sub-1" + remover.def_file_nodes = {"CDSFile": {}} + with patch.object(remover, "delete_files_in_s3", return_value=True) as del_s3: + assert remover.process_children([deleted_parent], delete_orphaned_data_files=True) is True + del_s3.assert_called_once_with([s3_info]) + + +def test_delete_nodes_calls_s3_when_delete_orphaned_true(): + """When delete_orphaned_data_files is True, delete_nodes calls delete_files_in_s3 for file nodes.""" + mock_dao = MagicMock() + mock_dao.delete_data_records.return_value = True + s3_info = {constants.FILE_NAME: "data.tsv"} + nodes = [ + { + 
constants.NODE_ID: "n1", + constants.NODE_TYPE: "File", + constants.S3_FILE_INFO: s3_info, + } + ] + with patch("metadata_remover.S3Bucket"): + remover = MetadataRemover(mock_dao, MagicMock()) + remover.submission_id = "sub-1" + remover.def_file_nodes = {} + with patch.object(remover, "delete_files_in_s3", return_value=True) as del_s3: + with patch.object(remover, "process_children", return_value=True): + assert remover.delete_nodes(nodes, delete_orphaned_data_files=True) is True + del_s3.assert_called_once_with([s3_info]) + + +def test_remove_metadata_passes_delete_orphaned_data_files_to_find_orphans(): + """remove_metadata passes delete_orphaned_data_files flag to _find_orphaned_files_and_build_errors.""" + mock_dao = MagicMock() + mock_dao.get_submission.return_value = { + "_id": "sub-1", + constants.DATA_COMMON_NAME: "dc1", + constants.ROOT_PATH: "r", + constants.MODEL_VERSION: "v1", + constants.BATCH_BUCKET: "b1", + } + mock_store = MagicMock() + mock_model = MagicMock() + mock_model.model = {"nodes": [{}]} + mock_model.get_nodes.return_value = [{}] + mock_model.get_file_nodes.return_value = {} + mock_store.get_model_by_data_common_version.return_value = mock_model + mock_dao.check_metadata_ids.return_value = [{constants.NODE_ID: "n1", constants.NODE_TYPE: "Subject"}] + mock_dao.delete_data_records.return_value = True + + with patch("metadata_remover.S3Bucket"): + remover = MetadataRemover(mock_dao, mock_store) + with patch.object(remover, "process_children", return_value=True): + find_orphans = MagicMock(return_value=[]) + with patch.object(remover, "_find_orphaned_files_and_build_errors", find_orphans): + remover.remove_metadata("sub-1", "Subject", ["n1"], delete_orphaned_data_files=True) + + find_orphans.assert_called_once_with("sub-1", True) + + +# --------------------------------------------------------------------------- +# _find_orphaned_files_and_build_errors +# --------------------------------------------------------------------------- + +def 
test_find_orphaned_files_returns_empty_when_bucket_or_root_path_missing(): + """_find_orphaned_files_and_build_errors returns [] when bucket or root_path is not set.""" + mock_dao = MagicMock() + with patch("metadata_remover.S3Bucket"): + remover = MetadataRemover(mock_dao, MagicMock()) + remover.bucket = None + remover.root_path = "root" + assert remover._find_orphaned_files_and_build_errors("sub-1", False) == [] + remover.bucket = MagicMock() + remover.root_path = None + assert remover._find_orphaned_files_and_build_errors("sub-1", False) == [] + + +def test_find_orphaned_files_builds_f008_shape(): + """_find_orphaned_files_and_build_errors returns error dicts with F008 and file_validator shape.""" + mock_dao = MagicMock() + mock_dao.get_files_by_submission.return_value = [] # no manifest files + mock_dao.find_batch_by_file_name.return_value = None + + mock_bucket = MagicMock() + mock_bucket.bucket_name = "test-bucket" + mock_bucket.client.list_objects_v2.return_value = { + "Contents": [{"Key": "root/file/orphan.csv", "LastModified": "2024-01-01T00:00:00Z"}], + "NextContinuationToken": None, + } + + with patch("metadata_remover.S3Bucket"): + remover = MetadataRemover(mock_dao, MagicMock()) + remover.bucket = mock_bucket + remover.root_path = "root" + + errors = remover._find_orphaned_files_and_build_errors("sub-1", delete_orphaned_data_files=False) + + assert len(errors) == 1 + err = errors[0] + assert err[constants.TYPE] == constants.DATA_FILE_TYPE + assert err[constants.QC_VALIDATION_TYPE] == constants.DATA_FILE_TYPE + assert err[constants.SUBMITTED_ID] == "orphan.csv" + assert err[constants.BATCH_ID] == "-" + assert err[constants.QC_SEVERITY] == constants.STATUS_ERROR + assert err[constants.ERRORS][0]["code"] == "F008" + + +def test_find_orphaned_files_skips_log_keys(): + """S3 keys containing /log are not considered orphaned files.""" + mock_dao = MagicMock() + mock_dao.get_files_by_submission.return_value = [] + mock_dao.find_batch_by_file_name.return_value 
= None + + mock_bucket = MagicMock() + mock_bucket.bucket_name = "test-bucket" + mock_bucket.client.list_objects_v2.return_value = { + "Contents": [{"Key": "root/file/log/foo.txt", "LastModified": None}], + "NextContinuationToken": None, + } + + with patch("metadata_remover.S3Bucket"): + remover = MetadataRemover(mock_dao, MagicMock()) + remover.bucket = mock_bucket + remover.root_path = "root" + + errors = remover._find_orphaned_files_and_build_errors("sub-1", delete_orphaned_data_files=False) + + assert len(errors) == 0 + + +def test_find_orphaned_files_excludes_manifest_files(): + """Files still in get_files_by_submission are not reported as orphaned.""" + mock_dao = MagicMock() + mock_dao.get_files_by_submission.return_value = [ + {constants.S3_FILE_INFO: {constants.FILE_NAME: "in_manifest.csv"}} + ] + mock_dao.find_batch_by_file_name.return_value = None + + mock_bucket = MagicMock() + mock_bucket.bucket_name = "test-bucket" + mock_bucket.client.list_objects_v2.return_value = { + "Contents": [ + {"Key": "root/file/in_manifest.csv", "LastModified": None}, + {"Key": "root/file/orphan.csv", "LastModified": None}, + ], + "NextContinuationToken": None, + } + + with patch("metadata_remover.S3Bucket"): + remover = MetadataRemover(mock_dao, MagicMock()) + remover.bucket = mock_bucket + remover.root_path = "root" + + errors = remover._find_orphaned_files_and_build_errors("sub-1", delete_orphaned_data_files=False) + + assert len(errors) == 1 + assert errors[0][constants.SUBMITTED_ID] == "orphan.csv" + + +def test_find_orphaned_files_calls_delete_files_in_s3_when_flag_true(): + """When delete_orphaned_data_files is True and orphans exist, delete_files_in_s3 is called.""" + mock_dao = MagicMock() + mock_dao.get_files_by_submission.return_value = [] + mock_dao.find_batch_by_file_name.return_value = None + + mock_bucket = MagicMock() + mock_bucket.bucket_name = "test-bucket" + mock_bucket.client.list_objects_v2.return_value = { + "Contents": [{"Key": "root/file/orphan.csv", 
"LastModified": None}], + "NextContinuationToken": None, + } + + with patch("metadata_remover.S3Bucket"): + remover = MetadataRemover(mock_dao, MagicMock()) + remover.bucket = mock_bucket + remover.root_path = "root" + delete_files = MagicMock() + remover.delete_files_in_s3 = delete_files + + remover._find_orphaned_files_and_build_errors("sub-1", delete_orphaned_data_files=True) + + delete_files.assert_called_once() + call_arg = delete_files.call_args[0][0] + assert len(call_arg) == 1 + assert call_arg[0][constants.FILE_NAME] == "orphan.csv" diff --git a/src/test/validators/validate_relationship_test.py b/src/test/validators/validate_relationship_test.py index d3404424..e27d8a64 100644 --- a/src/test/validators/validate_relationship_test.py +++ b/src/test/validators/validate_relationship_test.py @@ -5,10 +5,10 @@ current_directory = os.getcwd() sys.path.insert(0, current_directory + '/src') -from src.metadata_validator import MetaDataValidator -from src.common.constants import STATUS_WARNING, ERRORS, WARNINGS, STATUS_PASSED, STATUS_ERROR, DB, MONGO_DB -from src.common.error_messages import FAILED_VALIDATE_RECORDS -from src.common.model import DataModel +from metadata_validator import MetaDataValidator +from common.constants import STATUS_WARNING, ERRORS, WARNINGS, STATUS_PASSED, STATUS_ERROR, DB, MONGO_DB +from common.error_messages import FAILED_VALIDATE_RECORDS +from common.model import DataModel # from src.common.mongo_dao import MongoDao # needs to modify for dev database test diff --git a/src/test/validators/validate_required_props_test.py b/src/test/validators/validate_required_props_test.py index fe31cabf..de883ad8 100644 --- a/src/test/validators/validate_required_props_test.py +++ b/src/test/validators/validate_required_props_test.py @@ -5,10 +5,10 @@ current_directory = os.getcwd() sys.path.insert(0, current_directory + '/src') -from src.metadata_validator import MetaDataValidator -from src.common.constants import STATUS_ERROR, STATUS_PASSED, 
STATUS_WARNING, ERRORS, WARNINGS -from src.common.error_messages import FAILED_VALIDATE_RECORDS -from src.common.model import DataModel +from metadata_validator import MetaDataValidator +from common.constants import STATUS_ERROR, STATUS_PASSED, STATUS_WARNING, ERRORS, WARNINGS +from common.error_messages import FAILED_VALIDATE_RECORDS +from common.model import DataModel @pytest.fixture diff --git a/src/unit_test/test_mongo_dao_find_organization.py b/src/unit_test/test_mongo_dao_find_organization.py deleted file mode 100644 index b5cab3cb..00000000 --- a/src/unit_test/test_mongo_dao_find_organization.py +++ /dev/null @@ -1,367 +0,0 @@ -""" -Unit tests for MongoDao.find_organization_name_by_study_id method -Tests cover success cases, error handling, and edge cases -""" - -import unittest -from unittest.mock import MagicMock, patch -from common.mongo_dao import MongoDao -from common.constants import STUDY_COLLECTION, ORGANIZATION_COLLECTION -from pymongo import errors - - -class TestFindOrganizationNameByStudyId(unittest.TestCase): - """Test cases for find_organization_name_by_study_id method""" - - def setUp(self): - """Set up test fixtures""" - self.study_id = "study_123" - self.program_id = "program_456" - self.organization_name = "Test Organization" - - @patch('common.mongo_dao.MongoClient') - def test_find_organization_name_success(self, mock_client_class): - """Test successful retrieval of organization name""" - # Setup mocks - mock_client = MagicMock() - mock_client_class.return_value = mock_client - - mock_db = MagicMock() - mock_client.__getitem__.return_value = mock_db - - mock_study_collection = MagicMock() - mock_org_collection = MagicMock() - - def db_getitem(key): - if key == STUDY_COLLECTION: - return mock_study_collection - elif key == ORGANIZATION_COLLECTION: - return mock_org_collection - return MagicMock() - - mock_db.__getitem__.side_effect = db_getitem - - # Mock study query result - mock_study_result = { - '_id': self.study_id, - 'name': 'Test 
Study', - 'programID': self.program_id - } - mock_study_collection.find_one.return_value = mock_study_result - - # Mock organization query result - mock_org_result = { - '_id': self.program_id, - 'name': self.organization_name - } - mock_org_collection.find_one.return_value = mock_org_result - - # Create instance - mongo_dao = MongoDao("mongodb://localhost:27017", "test_db") - - # Execute - result = mongo_dao.find_organization_name_by_study_id(self.study_id) - - # Verify - self.assertIsNotNone(result) - self.assertEqual(result, [self.organization_name]) - - @patch('common.mongo_dao.MongoClient') - def test_find_organization_name_with_special_chars(self, mock_client_class): - """Test organization name retrieval with special characters""" - mock_client = MagicMock() - mock_client_class.return_value = mock_client - - mock_db = MagicMock() - mock_client.__getitem__.return_value = mock_db - - mock_study_collection = MagicMock() - mock_org_collection = MagicMock() - - def db_getitem(key): - if key == STUDY_COLLECTION: - return mock_study_collection - elif key == ORGANIZATION_COLLECTION: - return mock_org_collection - return MagicMock() - - mock_db.__getitem__.side_effect = db_getitem - - special_org_name = "University of Science & Technology (UST) - Division" - - mock_study_collection.find_one.return_value = { - '_id': self.study_id, - 'programID': self.program_id - } - mock_org_collection.find_one.return_value = { - '_id': self.program_id, - 'name': special_org_name - } - - mongo_dao = MongoDao("mongodb://localhost:27017", "test_db") - result = mongo_dao.find_organization_name_by_study_id(self.study_id) - - self.assertEqual(result, [special_org_name]) - - @patch('common.mongo_dao.MongoClient') - def test_find_organization_name_study_not_found(self, mock_client_class): - """Test when study is not found""" - mock_client = MagicMock() - mock_client_class.return_value = mock_client - - mock_db = MagicMock() - mock_client.__getitem__.return_value = mock_db - - 
mock_study_collection = MagicMock() - mock_org_collection = MagicMock() - mock_study_collection.find_one.return_value = None - - def db_getitem(key): - if key == STUDY_COLLECTION: - return mock_study_collection - elif key == ORGANIZATION_COLLECTION: - return mock_org_collection - return MagicMock() - - mock_db.__getitem__.side_effect = db_getitem - - mongo_dao = MongoDao("mongodb://localhost:27017", "test_db") - - # When study is not found, method catches the error and returns None - result = mongo_dao.find_organization_name_by_study_id("nonexistent_study") - self.assertIsNone(result) - - @patch('common.mongo_dao.MongoClient') - def test_find_organization_name_program_not_found(self, mock_client_class): - """Test when program/organization is not found""" - mock_client = MagicMock() - mock_client_class.return_value = mock_client - - mock_db = MagicMock() - mock_client.__getitem__.return_value = mock_db - - mock_study_collection = MagicMock() - mock_org_collection = MagicMock() - - def db_getitem(key): - if key == STUDY_COLLECTION: - return mock_study_collection - elif key == ORGANIZATION_COLLECTION: - return mock_org_collection - return MagicMock() - - mock_db.__getitem__.side_effect = db_getitem - - mock_study_collection.find_one.return_value = { - '_id': self.study_id, - 'programID': self.program_id - } - # Organization not found - mock_org_collection.find_one.return_value = None - - mongo_dao = MongoDao("mongodb://localhost:27017", "test_db") - - # When organization is not found, method catches the error and returns None - result = mongo_dao.find_organization_name_by_study_id(self.study_id) - self.assertIsNone(result) - - @patch('common.mongo_dao.MongoClient') - def test_find_organization_name_pymongo_error_on_study(self, mock_client_class): - """Test PyMongoError when querying study collection""" - mock_client = MagicMock() - mock_client_class.return_value = mock_client - - mock_db = MagicMock() - mock_client.__getitem__.return_value = mock_db - - 
mock_study_collection = MagicMock() - mock_org_collection = MagicMock() - mock_study_collection.find_one.side_effect = errors.PyMongoError("Connection error") - - def db_getitem(key): - if key == STUDY_COLLECTION: - return mock_study_collection - elif key == ORGANIZATION_COLLECTION: - return mock_org_collection - return MagicMock() - - mock_db.__getitem__.side_effect = db_getitem - - mongo_dao = MongoDao("mongodb://localhost:27017", "test_db") - - result = mongo_dao.find_organization_name_by_study_id(self.study_id) - - # Should return None on error - self.assertIsNone(result) - - @patch('common.mongo_dao.MongoClient') - def test_find_organization_name_pymongo_error_on_organization(self, mock_client_class): - """Test PyMongoError when querying organization collection""" - mock_client = MagicMock() - mock_client_class.return_value = mock_client - - mock_db = MagicMock() - mock_client.__getitem__.return_value = mock_db - - mock_study_collection = MagicMock() - mock_org_collection = MagicMock() - - def db_getitem(key): - if key == STUDY_COLLECTION: - return mock_study_collection - elif key == ORGANIZATION_COLLECTION: - return mock_org_collection - return MagicMock() - - mock_db.__getitem__.side_effect = db_getitem - - mock_study_collection.find_one.return_value = { - '_id': self.study_id, - 'programID': self.program_id - } - mock_org_collection.find_one.side_effect = errors.PyMongoError("Connection error") - - mongo_dao = MongoDao("mongodb://localhost:27017", "test_db") - - result = mongo_dao.find_organization_name_by_study_id(self.study_id) - - # Should return None on error - self.assertIsNone(result) - - @patch('common.mongo_dao.MongoClient') - def test_find_organization_name_generic_exception_on_study(self, mock_client_class): - """Test generic Exception when querying study collection""" - mock_client = MagicMock() - mock_client_class.return_value = mock_client - - mock_db = MagicMock() - mock_client.__getitem__.return_value = mock_db - - mock_study_collection = 
MagicMock() - mock_org_collection = MagicMock() - mock_study_collection.find_one.side_effect = Exception("Unexpected error") - - def db_getitem(key): - if key == STUDY_COLLECTION: - return mock_study_collection - elif key == ORGANIZATION_COLLECTION: - return mock_org_collection - return MagicMock() - - mock_db.__getitem__.side_effect = db_getitem - - mongo_dao = MongoDao("mongodb://localhost:27017", "test_db") - - result = mongo_dao.find_organization_name_by_study_id(self.study_id) - - # Should return None on error - self.assertIsNone(result) - - @patch('common.mongo_dao.MongoClient') - def test_find_organization_name_generic_exception_on_organization(self, mock_client_class): - """Test generic Exception when querying organization collection""" - mock_client = MagicMock() - mock_client_class.return_value = mock_client - - mock_db = MagicMock() - mock_client.__getitem__.return_value = mock_db - - mock_study_collection = MagicMock() - mock_org_collection = MagicMock() - - def db_getitem(key): - if key == STUDY_COLLECTION: - return mock_study_collection - elif key == ORGANIZATION_COLLECTION: - return mock_org_collection - return MagicMock() - - mock_db.__getitem__.side_effect = db_getitem - - mock_study_collection.find_one.return_value = { - '_id': self.study_id, - 'programID': self.program_id - } - mock_org_collection.find_one.side_effect = Exception("Unexpected error") - - mongo_dao = MongoDao("mongodb://localhost:27017", "test_db") - - result = mongo_dao.find_organization_name_by_study_id(self.study_id) - - # Should return None on error - self.assertIsNone(result) - - @patch('common.mongo_dao.MongoClient') - def test_find_organization_name_empty_organization_name(self, mock_client_class): - """Test when organization name field is empty""" - mock_client = MagicMock() - mock_client_class.return_value = mock_client - - mock_db = MagicMock() - mock_client.__getitem__.return_value = mock_db - - mock_study_collection = MagicMock() - mock_org_collection = MagicMock() - - 
def db_getitem(key): - if key == STUDY_COLLECTION: - return mock_study_collection - elif key == ORGANIZATION_COLLECTION: - return mock_org_collection - return MagicMock() - - mock_db.__getitem__.side_effect = db_getitem - - mock_study_collection.find_one.return_value = { - '_id': self.study_id, - 'programID': self.program_id - } - mock_org_collection.find_one.return_value = { - '_id': self.program_id, - 'name': "" - } - - mongo_dao = MongoDao("mongodb://localhost:27017", "test_db") - - result = mongo_dao.find_organization_name_by_study_id(self.study_id) - - self.assertEqual(result, [""]) - - @patch('common.mongo_dao.MongoClient') - def test_find_organization_name_multiple_studies_same_program(self, mock_client_class): - """Test correct organization retrieval for multiple studies with same program""" - mock_client = MagicMock() - mock_client_class.return_value = mock_client - - mock_db = MagicMock() - mock_client.__getitem__.return_value = mock_db - - mock_study_collection = MagicMock() - mock_org_collection = MagicMock() - - def db_getitem(key): - if key == STUDY_COLLECTION: - return mock_study_collection - elif key == ORGANIZATION_COLLECTION: - return mock_org_collection - return MagicMock() - - mock_db.__getitem__.side_effect = db_getitem - - mock_study_collection.find_one.return_value = { - '_id': "study_1", - 'programID': "program_shared" - } - mock_org_collection.find_one.return_value = { - '_id': "program_shared", - 'name': "Shared Organization" - } - - mongo_dao = MongoDao("mongodb://localhost:27017", "test_db") - - result = mongo_dao.find_organization_name_by_study_id("study_1") - - self.assertEqual(result, ["Shared Organization"]) - -if __name__ == '__main__': - unittest.main() diff --git a/src/unit_test/test_s3_list_object.py b/src/unit_test/test_s3_list_object.py deleted file mode 100644 index 4e48cb64..00000000 --- a/src/unit_test/test_s3_list_object.py +++ /dev/null @@ -1,41 +0,0 @@ -#!/usr/bin/env python3 -import os -from bento.common.utils import 
get_logger, get_md5 -from bento.common.s3 import S3Bucket -from boto3.s3.transfer import TransferConfig -from common.constants import STATUS, BATCH_TYPE_METADATA, DATA_COMMON_NAME, ERRORS, DB, \ - SUCCEEDED, ERRORS, S3_DOWNLOAD_DIR, SQS_NAME, FILE_ID, BATCH_STATUS_LOADED, \ - BATCH_STATUS_REJECTED -from common.utils import cleanup_s3_download_dir, get_exception_msg - -TRANSFER_UNIT_MB = 1024 * 1024 -MULTI_PART_THRESHOLD = 100 * TRANSFER_UNIT_MB -MULTI_PART_CHUNK_SIZE = MULTI_PART_THRESHOLD -PARTS_LIMIT = 900 -SINGLE_PUT_LIMIT = 4_500_000_000 -BUCKET_OWNER_ACL = 'bucket-owner-full-control' - -def test_s3_list(): - bucket = S3Bucket("crdcdh-test-submission") - key = "6681e23e-c091-40b0-9dfe-b1e415d97cd7/1f42b5f1-5ea4-4923-a9bb-f496c63362ce/file/Archive.zip" - fileName="/Users/gup2/workspace/testdata/PRECINCT01-2023-08-21/Archive.zip" - org_md5 = get_md5(fileName) - org_size = os.path.getsize(fileName) - print(f"Orig MD5: {org_md5}, Orig_size: {org_size}") - _upload_obj(bucket, fileName, key, org_size) - fileList = bucket.bucket.objects.filter(Prefix="6681e23e-c091-40b0-9dfe-b1e415d97cd7/1f42b5f1-5ea4-4923-a9bb-f496c63362ce/file/Archive.zip") - for file in fileList: - file_dict = {"fileName":file.key.split('/')[-1], "size": file.size, "md5": file.e_tag.strip('"')} - print(file_dict) - -def _upload_obj(bucket, org_url, key, org_size): - parts = int(org_size) // MULTI_PART_CHUNK_SIZE - chunk_size = MULTI_PART_CHUNK_SIZE if parts < PARTS_LIMIT else int(org_size) // PARTS_LIMIT - t_config = TransferConfig(multipart_threshold=MULTI_PART_THRESHOLD,multipart_chunksize=chunk_size) - with open(org_url, 'rb') as stream: - bucket._upload_file_obj( key, stream, t_config, {'ACL': BUCKET_OWNER_ACL}) - -# _upload_file_obj(self, key, data, config=None, extra_args={'ACL': BUCKET_OWNER_ACL}): - -if __name__ == '__main__': - test_s3_list() \ No newline at end of file diff --git a/src/validator.py b/src/validator.py index 14aeff95..6c893a37 100644 --- a/src/validator.py +++ 
b/src/validator.py @@ -6,7 +6,7 @@ from bento.common.utils import get_logger, LOG_PREFIX # from bento.common.sqs import Queue from common.sqs_queue import Queue -from common.constants import SQS_NAME, SERVICE_TYPE, SERVICE_TYPE_ESSENTIAL, \ +from common.constants import SQS_NAME, SERVICE_TYPE, SERVICE_TYPE_ESSENTIAL, \ SERVICE_TYPE_FILE, SERVICE_TYPE_METADATA, SERVICE_TYPE_EXPORT, SERVICE_TYPE_PV_PULLER from common.utils import get_exception_msg from config import Config @@ -14,7 +14,7 @@ from file_validator import fileValidate from metadata_validator import metadataValidate from metadata_export import metadata_export -from pv_puller import pull_pv_lists +from pv_puller_v2 import pull_pv_lists_v2 DATA_RECORDS_SEARCH_INDEX = "submissionID_nodeType_nodeID" DATA_RECORDS_CRDC_SEARCH_INDEX = "dataCommons_nodeType_nodeID" @@ -63,7 +63,8 @@ def controller(): if configs[SERVICE_TYPE] == SERVICE_TYPE_PV_PULLER and not mongo_dao.set_search_cde_index(CDE_SEARCH_INDEX): log.error("Failed to set cde index!") - return 1 + return 1 + except Exception as e: log.exception(e) @@ -80,7 +81,7 @@ def controller(): elif configs[SERVICE_TYPE] == SERVICE_TYPE_EXPORT: metadata_export(configs, job_queue, mongo_dao) elif configs[SERVICE_TYPE] == SERVICE_TYPE_PV_PULLER: - pull_pv_lists(configs, mongo_dao) + pull_pv_lists_v2(configs, mongo_dao) else: log.error(f'Invalid service type: {configs[SERVICE_TYPE]}!') return 1 diff --git a/src/x_submission_validator.py b/src/x_submission_validator.py index 23188783..1c6c4f67 100644 --- a/src/x_submission_validator.py +++ b/src/x_submission_validator.py @@ -1,7 +1,7 @@ #!/usr/bin/env python3 import json from bento.common.utils import get_logger -from common.constants import ADDITION_ERRORS, STATUS_ERROR, FAILED, STATUS_PASSED, STATUS, UPDATED_AT, DATA_COMMON_NAME, \ +from common.constants import ADDITION_ERRORS, STATUS_ERROR, STATUS_PASSED, STATUS, UPDATED_AT, DATA_COMMON_NAME, \ NODE_TYPE, NODE_ID, VALIDATED_AT, ORIN_FILE_NAME, STUDY_ID, ID, 
SUBMISSION_STATUS_SUBMITTED, SUBMISSION_REL_STATUS_RELEASED from common.utils import current_datetime, create_error @@ -27,11 +27,11 @@ def validate(self, submission_id): submission_id (str): The ID of the submission to validate Returns: - str: Validation status - either FAILED or other status constants as appropriate + str: Validation status - either STATUS_ERROR or other status constants as appropriate Behavior: - Uses the submission's dataCommons field to scope validation - - If submission has no dataCommons: Returns FAILED with error message + - If submission has no dataCommons: Returns STATUS_ERROR with error message - Cross-validation only compares submissions within the same study AND data commons scope @@ -43,19 +43,19 @@ def validate(self, submission_id): if not submission: msg = f'Invalid submissionID, no submission found, {submission_id}!' self.log.error(msg) - return FAILED + return STATUS_ERROR # Get data commons from submission data_commons = submission.get(DATA_COMMON_NAME) if not data_commons: msg = f'Invalid submission, no dataCommons found, {submission_id}!' self.log.error(msg) - return FAILED + return STATUS_ERROR if submission.get(STATUS) not in [SUBMISSION_STATUS_SUBMITTED, SUBMISSION_REL_STATUS_RELEASED]: msg = f'Invalid submission, wrong status, {submission_id}!' self.log.error(msg) - return FAILED + return STATUS_ERROR self.submission = submission self.data_commons = data_commons @@ -67,7 +67,7 @@ def validate(self, submission_id): if start_index == 0 and (not data_records or len(data_records) == 0): msg = f'No metadata to be validated.' self.log.error(msg) - return FAILED + return STATUS_ERROR count = len(data_records) validated_count += self.validate_nodes(data_records, submission_id)