Refactor Import Automation #353

Open
wants to merge 52 commits into
base: main

Contributor

@mbthornton-lbl mbthornton-lbl commented Jan 16, 2025

This PR addresses the issues described in #332
by providing a rewrite of the import automation code.

The changes in project import behavior include:

  • Separating import file mapping, NMDC ID creation, and Database update operations
  • Re-using minted NMDC IDs
  • Checking for existing Workflow Executions
  • Validating import data via the json:validate endpoint
  • Creating an update query for DataGeneration.has_output
  • Making the DB update optional (vs. writing JSON output)

The legacy activity_mapper package is replaced by a simplified import_mapper consisting of:

  • ImportMapper
  • FileMapping

The import_projects command has been rewritten to separate the file mapping, file linking, and NMDC Database update operations.

Additional Changes

  • Fix typos in import.yaml files
  • Add test data files in tests/import_project_dir
  • Unit tests for import_mapper

Additional changes - not directly related to import

  • Tests for JGI file staging are accessible to the test runner
  • A number of failing staging tests have been commented out for now

@mbthornton-lbl mbthornton-lbl marked this pull request as ready for review January 30, 2025 21:06
if object_type in ids:
    return ids[object_type]
else:
    workflow_obj_id = self.runtime_api.minter(object_type) + ".1"
Contributor

@aclum aclum Jan 31, 2025

This is missing the logic to check whether it should mint a new ID + .1 versus increment an existing ID.
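
One way such a check might look, as a minimal sketch and not the PR's actual code: `next_workflow_id`, `existing_id`, and `mint` are hypothetical names, and the assumption is that workflow IDs end in a numeric `.N` version suffix, as the `+ ".1"` in the snippet above suggests.

```python
from typing import Callable, Optional


def next_workflow_id(existing_id: Optional[str], mint: Callable[[], str]) -> str:
    """Return a fresh '<minted>.1' ID, or bump the version of an existing ID.

    Hypothetical helper: `mint` stands in for a call such as
    self.runtime_api.minter(object_type).
    """
    if existing_id is None:
        # No prior workflow execution: mint a new base ID at version 1.
        return f"{mint()}.1"

    base, sep, version = existing_id.rpartition(".")
    if not sep or not version.isdigit():
        raise ValueError(f"ID has no numeric version suffix: {existing_id!r}")

    # A prior execution exists: keep the base ID and increment the version.
    return f"{base}.{int(version) + 1}"
```

How `existing_id` is looked up (for example, from the check for existing Workflow Executions) is left open here.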

proj['apType'] in ["Metagenome Analysis", "Metatranscriptome Analysis"]]
else:
    ap_type_gold_analysis_data = []
Contributor

Consider making this an error, @mflynn-lanl.

    data_generation_update_query['updates'].append(update)

# Already has nmdc output
elif dg_output and dg_output[0].startswith('nmdc:dobj'):
Contributor

This would be more robust if there were a join on the data_object_set record, making sure the 'name' of the data_object_set record matches what is in the import directory.
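
A rough sketch of the kind of check described here, under stated assumptions: `outputs_match_import_dir` is a hypothetical helper, data_object_set records are assumed to be dicts with `id` and `name` keys, and `import_file_names` is assumed to hold the file names found in the import directory. How the records are fetched is not shown.

```python
from typing import Iterable, Mapping, Sequence


def outputs_match_import_dir(
    dg_output: Sequence[str],
    data_objects: Iterable[Mapping[str, str]],
    import_file_names: Iterable[str],
) -> bool:
    """Check that every has_output ID resolves to a record whose name is in the import dir."""
    # Index the (assumed) data_object_set records by ID for the join.
    by_id = {rec["id"]: rec for rec in data_objects}
    names = set(import_file_names)

    if not dg_output:
        return False

    for dobj_id in dg_output:
        rec = by_id.get(dobj_id)
        # Fail if the record is missing or its name is not an imported file.
        if rec is None or rec.get("name") not in names:
            return False
    return True
```

The `elif` branch above could then require both the `nmdc:dobj` prefix and a successful name match before treating the output as already present.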

Development

Successfully merging this pull request may close these issues.

issues with rerunning import automation
3 participants