Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

AN-214 Redirect workflow existence queries from metadata to summary #7575

Merged
merged 6 commits into from
Oct 25, 2024

Conversation

aednichols
Copy link
Collaborator

Description

We make this query constantly, looks like the single most frequent one against metadata. All it does is check whether the workflow ID is valid by checking whether >1 metadatum exists for it.

We already started down the path of checking summary instead of metadata, see #4617. It just makes way more sense to me to check a table with 77M rows than 36B.

select 
  exists(
    select 
      `CALL_FQN`, 
      `METADATA_KEY`, 
      `WORKFLOW_EXECUTION_UUID`, 
      `METADATA_TIMESTAMP`, 
      `JOB_SCATTER_INDEX`, 
      `METADATA_JOURNAL_ID`, 
      `JOB_RETRY_ATTEMPT`, 
      `METADATA_VALUE_TYPE`, 
      `METADATA_VALUE` 
    from 
      `METADATA_ENTRY` 
    where 
      `WORKFLOW_EXECUTION_UUID` = '602a4913-d666-4182-b2f1-242fbda817d2'
  );

It is potentially implicated in the 2021 database migration that failed at the very end, you can see a bunch of them in this screenshot (2021-11-09):

Screen Shot 2021-11-09 at 1 14 03 AM
Lock wait timeout exceeded; try restarting transaction
  [for Statement "RENAME TABLE `cromwell`.`METADATA_ENTRY` TO `cromwell`.`_METADATA_ENTRY_old`,
  `cromwell`.`_METADATA_ENTRY_new` TO `cromwell`.`METADATA_ENTRY`"]
  at /usr/bin/pt-online-schema-change line 10922.

Slack link to contemporary discussion.
Contemporary analysis in JIRA.

Release Notes Confirmation

CHANGELOG.md

  • I updated CHANGELOG.md in this PR
  • I assert that this change shouldn't be included in CHANGELOG.md because it doesn't impact community users

Terra Release Notes

  • I added a suggested release notes entry in this Jira ticket
  • I assert that this change doesn't need Jira release notes because it doesn't impact Terra users

@aednichols aednichols requested a review from a team as a code owner October 24, 2024 13:59
@aednichols aednichols changed the title Aen an 213 AN-213 Redirect workflow existence queries from metadata to summaries Oct 24, 2024
@aednichols aednichols changed the title AN-213 Redirect workflow existence queries from metadata to summaries AN-213 Redirect workflow existence queries from metadata to summary Oct 24, 2024
@aednichols aednichols requested a review from mcovarr October 24, 2024 14:08
@aednichols
Copy link
Collaborator Author

aednichols commented Oct 24, 2024

Note:

As it stands, this PR could return false positives for a workflow that has been archived from metadata but still has a summary. I think this is already true for the function converted in #4617

I'm not sure whether it's a problem per se, but certainly notable.

Update: based on the analysis in #7575 (comment) the old function validateWorkflowIdInMetadata already returns true for archived workflows.

@mcovarr
Copy link
Contributor

mcovarr commented Oct 24, 2024

Serious cobwebs here, but could there be false negative issues here if metadata rows exist for the workflow but a summary row hasn't been created for it yet? Like if a workflow is submitted and status is polled shortly thereafter.

@aednichols
Copy link
Collaborator Author

We definitely have had complaints about the existing code that a just-submitted workflow returns a 404 on status endpoint. But Terra doesn't care because Rawls polls every 3 minutes or whatever and it catches up.

The summarization interval is 1 second, both by default and in Terra. So the window for issues would be extremely small.

@salonishah11
Copy link
Contributor

As it stands, this PR could return false positives for a workflow that has been archived from metadata but still has a summary.

@aednichols since we ran out of time mobbing meeting earlier today, what do you think about adding a status check in the query for this situation?

@aednichols aednichols changed the title AN-213 Redirect workflow existence queries from metadata to summary AN-214 Redirect workflow existence queries from metadata to summary Oct 24, 2024
@aednichols
Copy link
Collaborator Author

@salonishah11 it looks like we are already doing an archive status check immediately after the existence check:

validateWorkflowIdInMetadata(possibleWorkflowId, serviceRegistryActor) flatMap { id =>
/*
for requests made to one of /metadata, /logs or /outputs endpoints, perform an additional check to see
if metadata for the workflow has been archived and deleted or not (as they interact with metadata table)
*/
request(id) match {
case m: BuildWorkflowMetadataJsonWithOverridableSourceAction => checkIfMetadataDeletedAndRespond(id, m)
case _ => serviceRegistryActor.ask(request(id)).mapTo[MetadataJsonResponse]
}
}

Today on Prod when I open the page for an archived workflow like this one (canary workspace) Cromwell responds with the following JSON:

{
    "id": "63ac4bc1-388c-430d-86f5-d123a7073e3c",
    "message": "Cromwell has archived this workflow's metadata according to the lifecycle policy. The workflow completed at 2023-08-30T16:39:09.168Z, which was 36384533045 milliseconds ago. It is available in the archive bucket, or via a support request in the case of a managed instance.",
    "metadataArchiveStatus": "ArchivedAndDeleted"
}

It comes from checkIfMetadataDeletedAndRespond so it seems like the workflow is somehow passing the validateWorkflowIdInMetadata existence check despite being archived.

@aednichols aednichols merged commit c4093a0 into develop Oct 25, 2024
36 of 37 checks passed
@aednichols aednichols deleted the aen_an_213 branch October 25, 2024 16:21
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants