
CROM-6718: FR: Call Caching - Add flag for minimizing chance of GCP cross-region network egress charges being incurred #6432

Open · wants to merge 16 commits into develop

Conversation


@wnojopra wnojopra commented Jul 2, 2021

Extending mcovarr's work in #6366. Big shoutout to mcovarr!!!

[Per @mbookman]
This pull request is an initial update to address:

CROM-6718: FR: Add flag for minimizing chance of GCP cross-region network egress charges being incurred

This PR specifically focuses on the risks of egress charges incurred due to call caching. The framing of the approach here, which is a bit broader than originally noted in CROM-6718, is:

  • Make call caching location-aware, prioritizing copies that minimize egress charges.
  • Add a workflow option enabling control of what egress charges can be incurred for call cache copying.
The new workflow option would be:

call_cache_egress: [none, continental, global]

where the values determine which call cache copies can incur egress charges:

  • none: only within-region copies are allowed, which incur no egress charges
  • continental: cross-region copies within a continent are also allowed; these have reduced costs, such as $0.01/GB in the US
  • global: copies across all regions are allowed; cross-continent egress charges can be much higher (ranging from $0.08/GB up to $0.23/GB)
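For illustration, a minimal workflow options file using the new flag might look like the following. This is just a sketch: the option name comes from this PR, but the surrounding file contents are only an example of the standard Cromwell workflow options JSON format.

```json
{
  "call_cache_egress": "none"
}
```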

CURRENT STATUS OF PR:

With the changes in this PR, Cromwell checks the locations of the source and destination files to be copied, compares them, and decides whether the copy should proceed based on the call_cache_egress option. If the copy is allowed, the files are copied as normal. If it is not, the cache attempt fails and the job runs instead.
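To make the decision concrete, here is a minimal sketch of the kind of policy check described above. This is illustrative only: the type and method names (EgressPolicy, copyAllowed, continentOf) and the continent-derivation heuristic are assumptions, not the PR's actual code.

```scala
// Illustrative sketch of the egress policy check; names and the continent
// heuristic are assumptions, not the PR's actual implementation.
sealed trait EgressPolicy
case object EgressNone extends EgressPolicy        // call_cache_egress: "none"
case object EgressContinental extends EgressPolicy // call_cache_egress: "continental"
case object EgressGlobal extends EgressPolicy      // call_cache_egress: "global"

object CallCacheEgressCheck {
  // Hypothetical helper: derive a coarse continent key from a GCS bucket
  // location such as "US-CENTRAL1", "EUROPE-WEST1", or the multi-region "US".
  private def continentOf(location: String): String =
    location.toUpperCase.split("-").head

  // Decide whether copying a cached output from sourceLocation to
  // destLocation is permitted under the configured policy.
  def copyAllowed(policy: EgressPolicy, sourceLocation: String, destLocation: String): Boolean =
    policy match {
      case EgressGlobal      => true
      case EgressContinental => continentOf(sourceLocation) == continentOf(destLocation)
      case EgressNone        => sourceLocation.equalsIgnoreCase(destLocation)
    }
}
```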


@mcovarr mcovarr left a comment


Overall comments:

  • Destination location should really be checked once per root workflow and not for every job.
  • Codecov has identified a lot of untested code. Some of this we'll probably want to integration test with Centaur, but there's also quite a bit that could easily be reached with unit tests (e.g. PipelinesApiBackendCacheHitCopyingActor.scala)
  • Lots of debug logging, printlns, and commented-out code to be cleaned up / removed
  • Needs docs and a CHANGELOG.md entry (which might already be in your other PR, I don't remember)

Inline comment on this diff context:

```scala
private def issueDestinationLocationCommand(sourceLocation: String, copyCommand: CopyOutputsCommand): State = {
```
mcovarr (Contributor):

This location check should happen once at the root workflow level rather than once for every job in the workflow. The destination bucket will be the same for every job in the workflow.

wnojopra (Author):

Thanks for the review, Miguel! Do you have a suggestion on how best to achieve this? It's not clear to me how to avoid doing this for every job.

wnojopra (Author):

After discussing with Miguel today, we've agreed to leave the destination location checking as-is (at the job level). A few reasons for this:

  1. At the workflow level, we don't know the backend type.
  2. We could cache locations per bucket at the job level, but adding a caching layer across all JobActors would introduce unnecessary complexity.
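For context on point 2, the per-bucket caching idea could be as simple as a memoized map shared across actors, sketched below for illustration. This is hypothetical: lookupBucketLocation stands in for the real GCS bucket metadata call, and a shared singleton like this is exactly the cross-actor complexity the discussion decided to avoid.

```scala
import java.util.concurrent.ConcurrentHashMap

// Hypothetical sketch of the idea we decided against: memoize
// bucket -> location lookups in a map shared across JobActors.
object BucketLocationCache {
  private val cache = new ConcurrentHashMap[String, String]()

  // lookupBucketLocation is a stand-in for the real GCS metadata call;
  // it runs at most once per bucket, after which the result is reused.
  def locationOf(bucket: String)(lookupBucketLocation: String => String): String =
    cache.computeIfAbsent(bucket, b => lookupBucketLocation(b))
}
```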


wnojopra commented Aug 4, 2021

I did performance testing with a WDL that has 50 outputs. On a cache hit, Cromwell attempts to copy the 50 files, so with my change it would also perform 50 × 2 = 100 location lookups (one each for the source and destination of every copy). I ran this WDL 10 times without my changes and 10 times with my changes, with call_cache_egress set to "none". For each run, I measured the time from when job hashing is initialized to when the workflow completes.

Without my changes, the difference between those timestamps was on average 9 seconds.

With my changes, it was on average 16 seconds. In other words, the extra 100 lookups added roughly 7 seconds, or about 70 ms per lookup.


wnojopra commented Aug 4, 2021

There is one test failing in Travis CI, but it looks unrelated to my change. I believe this PR is ready for another review, @mcovarr.


@kshakir kshakir left a comment


I believe the next steps for this PR and its sibling are being discussed outside of GitHub.


wnojopra commented Aug 9, 2021

> I believe the next steps for this PR and its sibling are being discussed outside of GitHub

Somewhat. We have a thread open with Kyle on high-level details of performance, complexity, and support. I'd still appreciate a review on the code submitted so far. Thanks!

@mcovarr mcovarr removed their request for review April 4, 2022 12:34