Refactor Import Automation #353

mbthornton-lbl · 2025-01-16T17:26:14Z

This PR addresses the issues described in #332
by providing a rewrite of the import automation code.

The changes in project import behavior include:

Separating import file mapping, NMDC ID creation, and Database update operations
Re-using minted NMDC IDs
Checking for existing Workflow Executions
Validating import data via the json:validate endpoint
Creating an update query for DataGeneration.has_output
Making the DB update optional vs. writing JSON output
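
As a rough illustration of the last three items, here is a minimal sketch of validating first and then either writing JSON or updating the database. The function name `finish_import` and the `validate` and `submit` callables are hypothetical stand-ins (for a call to the json:validate endpoint and for the database submission); they are not taken from this PR's code.

```python
import json
from pathlib import Path
from typing import Any, Callable, Dict, Optional


def finish_import(
    db_update: Dict[str, Any],
    output_path: Path,
    validate: Callable[[Dict[str, Any]], bool],
    submit: Optional[Callable[[Dict[str, Any]], None]] = None,
) -> str:
    """Validate the import payload, then write JSON or submit it.

    `validate` and `submit` are hypothetical callables supplied by the caller.
    """
    # Refuse to write or submit anything that does not validate.
    if not validate(db_update):
        raise ValueError("import data failed validation")

    # Without a submit callable, the DB update is skipped and JSON is written.
    if submit is None:
        output_path.write_text(json.dumps(db_update, indent=2))
        return "written"

    submit(db_update)
    return "submitted"
```

Keeping the database update behind an optional callable means a dry run only produces a JSON file that can be reviewed before anything is changed.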

The legacy activity_mapper package is replaced by a simplified import_mapper consisting of:

ImportMapper
FileMapping

The import_projects command has been rewritten to separate the file mapping, file linking, and NMDC Database update operations.

Additional Changes

Fix typos in import.yaml files
Add test data files in tests/import_project_dir
Unit tests for import_mapper

Additional changes - not directly related to import

Make tests for JGI file staging accessible to the test runner
There are a number of failing staging tests that have been commented out for now


nmdc_automation/api/nmdcapi.py

aclum · 2025-01-31T00:51:42Z

nmdc_automation/import_automation/import_mapper.py

+            if object_type in ids:
+                return ids[object_type]
+            else:
+                workflow_obj_id = self.runtime_api.minter(object_type) + ".1"


This is missing a logic check for whether it should mint a new ID + .1 or increment the version of an existing ID.
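
A minimal sketch of the kind of check described here, assuming that existing workflow IDs end in a numeric `.N` version suffix and that all IDs passed in share the same base. The function name `next_workflow_id` and the `mint` callable are hypothetical; `mint` stands in for the `self.runtime_api.minter(object_type)` call in the diff.

```python
import re
from typing import Callable, Iterable, Optional, Tuple

# Matches "<base>.<version>", where version is an integer suffix.
_VERSION_SUFFIX = re.compile(r"^(?P<base>.+)\.(?P<version>\d+)$")


def next_workflow_id(existing_ids: Iterable[str], mint: Callable[[], str]) -> str:
    """Increment the highest existing version, or mint a new ID with '.1'."""
    latest: Optional[Tuple[int, str]] = None  # (version, base) of highest version
    for existing in existing_ids:
        match = _VERSION_SUFFIX.match(existing)
        if match is None:
            continue  # ignore IDs without a numeric version suffix
        candidate = (int(match["version"]), match["base"])
        if latest is None or candidate[0] > latest[0]:
            latest = candidate

    if latest is None:
        # Nothing to increment: mint a fresh base ID and start at version 1.
        return f"{mint()}.1"

    version, base = latest
    return f"{base}.{version + 1}"
```

With this shape, a new base ID is only minted when no versioned ID exists yet, so repeated imports do not consume additional minted IDs.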

aclum · 2025-01-31T01:06:21Z

nmdc_automation/jgi_file_staging/jgi_file_metadata.py

                                  proj['apType'] in ["Metagenome Analysis", "Metatranscriptome Analysis"]]
+    else:
+        ap_type_gold_analysis_data = []


Consider making this an error, @mflynn-lanl
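
One possible shape for this, as a sketch only: raise instead of silently falling back to an empty list. The function name `filter_analysis_projects` and the emptiness check are assumptions, since the condition guarding the `else` branch is not visible in this hunk.

```python
from typing import Any, Dict, List

SUPPORTED_AP_TYPES = ("Metagenome Analysis", "Metatranscriptome Analysis")


def filter_analysis_projects(gold_analysis_data: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Keep supported analysis projects; fail loudly when there is no input."""
    # Assumption: the silent `else` branch corresponds to missing or empty data.
    if not gold_analysis_data:
        raise ValueError("No GOLD analysis project data available to filter")
    return [proj for proj in gold_analysis_data if proj["apType"] in SUPPORTED_AP_TYPES]
```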

aclum · 2025-01-31T01:11:06Z

nmdc_automation/run_process/run_import.py

+            data_generation_update_query['updates'].append(update)
+
+        # Already has nmdc output
+        elif dg_output and dg_output[0].startswith('nmdc:dobj'):


This would be more robust if there were a join on the data_object_set record, making sure the 'name' of the data_object_set record matches what is in the import directory.
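
A minimal sketch of that extra check, assuming the data_object_set records have already been fetched and keyed by ID, and that the import directory has been listed into a collection of file names. The function name `has_matching_nmdc_output` and both inputs are hypothetical, not part of the existing code.

```python
from typing import Any, Dict, Iterable, Mapping


def has_matching_nmdc_output(
    dg_output_ids: Iterable[str],
    data_objects_by_id: Mapping[str, Dict[str, Any]],
    import_file_names: Iterable[str],
) -> bool:
    """True if an nmdc:dobj output's 'name' matches a file in the import directory."""
    names = set(import_file_names)
    for dobj_id in dg_output_ids:
        if not dobj_id.startswith("nmdc:dobj"):
            continue  # not an NMDC data object ID
        record = data_objects_by_id.get(dobj_id)
        # Only treat the output as already imported when the names agree.
        if record is not None and record.get("name") in names:
            return True
    return False
```

This replaces the prefix-only test with a prefix test plus a name comparison, so an unrelated nmdc:dobj output would not be mistaken for the imported file.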

nmdc_automation/run_process/run_import.py

tests/test_jgi_file_staging/test_file_metadata.py

a data object is output by either a workflow execution or data generation ID, both of which are planned processes

mbthornton-lbl added 30 commits December 17, 2024 16:37

add get_planned_process api wrapper method

a653094

add link_sequencing_data_file method

12cd1ca

simplify map_sequenc_file - we only support unique metagenome raw reads

ed246e9

add find_data_object method

e13a238

implement a simplified mapper and basic file mappings

232f69f

add str method

8e20f02

updates to run_import

f7946dc

version updates

5c9a04e

update mapper

3694683

checkpoint

922f458

add unit test and test fixtures for get_or_create_minted_id

b69bcf4

add unit tests

508dfb1

get file mapping working

4e1620d

get import file destination working

22e168c

update logging

c747fa3

add data file linking

0db4c82

add basic data object creation

892814b

Merge branch 'main' into 332-issues-with-rerunning-import-automation

eb7711c

Merge branch 'main' into 332-issues-with-rerunning-import-automation

d2b6e52

logging

d170a33

Merge branch '332-issues-with-rerunning-import-automation' of https://github.com/microbiomedata/nmdc_automation into 332-issues-with-rerunning-import-automation

3bf9902

get db_workflow_items_by_type working

6254d2a

poetry update

9f1ffca

add validate_metadata api method

b91ffee

fix unit test failure

22137ce

Merge branch 'main' into 332-issues-with-rerunning-import-automation

8e77df1

fix log level issues

2b09525

distinguish between raw and filtered reads

f309a83

Get basic workflow exec record creation working

6baed1d

fix find_planned_processes to return results as list

b06eec9

mbthornton-lbl added 6 commits January 24, 2025 11:27

update to use workflows endpoint

262e2b4

Merge branch 'main' into 332-issues-with-rerunning-import-automation

243d46f

Add update query to output

d3abbd9

remove unused imports

bc38499

add is_multiple to FileMappings

4cafedc

fix unit tests

d66e648

mbthornton-lbl requested review from AmitBinf and aclum January 30, 2025 17:08

mbthornton-lbl added 5 commits January 30, 2025 11:33

Add is_multiple test files, implement file renaming

f0ff438

handle binning files

11b1709

delete obsolete activity_mapper package and tests

ca6a0e7

removed obsolete code from utile

e3ff59f

update pytest results and badges

26e9bd9

mbthornton-lbl marked this pull request as ready for review January 30, 2025 21:06

mbthornton-lbl requested review from mflynn-lanl and kaijli January 30, 2025 21:06

aclum reviewed Jan 31, 2025

nmdc_automation/api/nmdcapi.py (outdated, resolved)

nmdc_automation/api/nmdcapi.py (resolved)

nmdc_automation/run_process/run_import.py (outdated, resolved)

tests/test_jgi_file_staging/test_file_metadata.py (outdated, resolved)

mbthornton-lbl and others added 7 commits January 31, 2025 12:22

remove @refresh_token decorators where authentication is not required

44fce99

Merge branch 'main' into 332-issues-with-rerunning-import-automation

a277e05

Merge branch 'main' into 332-issues-with-rerunning-import-automation

c570304

updated test coverage artefacts

89aa34b

Use NMDC Process ID for file mappings

5725430

a data object is output by either a workflow execution or data generation ID, both of which are planned processes

solving import issues with _proteins.faa

036c73b

This fixes #361 to make sure we are picking up the correct proteins file

64c4ee6
