[WX-1670] Don't allocate job tokens for hog groups experiencing quota exhaustion #7520

Merged
salonishah11 merged 18 commits into develop from sps_use_new_quota_table on Sep 9, 2024

salonishah11
Contributor

@salonishah11 salonishah11 commented Aug 30, 2024

Jira: https://broadworkbench.atlassian.net/browse/WX-1670

Description

Cromwell will use the information in the new GROUP_METRICS_ENTRY table to allocate new tokens only to job requests whose hog group is not experiencing cloud quota exhaustion. This applies only to jobs seeking "execution" tokens; jobs seeking "restart" tokens are not affected by this change.

TODO:

  • test changes in BEE
  • fix unit tests
  • add new unit tests
  • update Changelog?

Release Notes Confirmation

CHANGELOG.md

  • I updated CHANGELOG.md in this PR
  • I assert that this change shouldn't be included in CHANGELOG.md because it doesn't impact community users

Terra Release Notes

  • I added a suggested release notes entry in this Jira ticket
  • I assert that this change doesn't need Jira release notes because it doesn't impact Terra users

@@ -145,6 +181,7 @@ class JobTokenDispenserActor(override val serviceRegistryActor: ActorRef,
val hogGroupCounts =
nextTokens.groupBy(t => t.queuePlaceholder.hogGroup).map { case (hogGroup, list) => s"$hogGroup: ${list.size}" }
log.info(s"Assigned new job $dispenserType tokens to the following groups: ${hogGroupCounts.mkString(", ")}")
// System.out.println(s"##### FIND ME $dispenserType tokens for actors: ${nextTokens.map(t => t.queuePlaceholder.actor.path).mkString(",")}")

Contributor Author

@salonishah11 salonishah11 Aug 30, 2024

TODO: remove this after testing

Contributor Author

This is now removed


class GroupMetricsActor(engineDbInterface: EngineSqlDatabase) extends Actor with ActorLogging {

implicit val ec: MessageDispatcher = context.system.dispatchers.lookup(Dispatcher.EngineDispatcher)

final private val QUOTA_EXHAUSTION_THRESHOLD_IN_SECS = 15 * 60 // 15 minutes

Contributor Author

Question for reviewers: what should this threshold be? Is 15 minutes reasonable?

Collaborator

How hard would it be to make this configurable? I think 15m is a good place to start, if anything a little long.

Contributor Author

I don't think it would be that hard. I can try that.

Collaborator

Configurable sounds good, yeah.

How long is our polling interval for cloud jobs? It seems like the quota backoff time should be more than the poll interval.

Contributor Author

@salonishah11 salonishah11 Aug 30, 2024

> How long is our polling interval for cloud jobs?

It seems the interval starts at 30 seconds and backs off to a maximum of 10 minutes. So maybe 15 minutes is a good starting point?

Collaborator

Agreed, sounds great to me.
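
To make the threshold discussion concrete, here is a minimal sketch of how a 15-minute window could decide which groups count as quota exhausted. The `GroupMetricsEntry` case class, its `quotaExhaustionDetected` field, and the `exhaustedGroups` helper are assumptions for illustration only; the actual GROUP_METRICS_ENTRY schema and the query used by GroupMetricsActor are not shown in this thread.

```scala
import java.time.{Duration, Instant}

object QuotaExhaustionWindow {
  // Hypothetical stand-in for a row in GROUP_METRICS_ENTRY: a group and the
  // last time quota exhaustion was observed for it.
  final case class GroupMetricsEntry(group: String, quotaExhaustionDetected: Instant)

  // A group counts as exhausted if exhaustion was observed strictly after
  // (now - threshold). Older observations are treated as stale.
  def exhaustedGroups(entries: Seq[GroupMetricsEntry],
                      now: Instant,
                      threshold: Duration
  ): List[String] = {
    val cutoff = now.minus(threshold)
    entries.collect {
      case e if e.quotaExhaustionDetected.isAfter(cutoff) => e.group
    }.toList
  }
}
```

Because the polling interval discussed above backs off to at most 10 minutes, a 15-minute window leaves room for at least one more poll to refresh the observation before it goes stale.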

@salonishah11 salonishah11 changed the title from "[WX-1670] Don't allocate tokens for hog groups experiencing quota exhaustion" to "[WX-1670] Don't allocate job tokens for hog groups experiencing quota exhaustion" on Aug 30, 2024

Collaborator

@jgainerdewar jgainerdewar left a comment

This looks like a great approach!

Comment on lines 149 to 154
case Success(GetQuotaExhaustedGroupsFailure(errorMsg)) =>
log.error(s"Failed to fetch quota exhausted groups. Error: $errorMsg")
dispense(n, List.empty)
case Failure(exception) =>
log.error(s"Unexpected failure while fetching quota exhausted groups. Error: ${exception.getMessage}")
dispense(n, List.empty)

Collaborator

I appreciate the fail-safe approach here.
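
The fail-safe behavior in the snippet above can be sketched in isolation: any failure to learn which groups are exhausted degrades to "exclude nothing", so token dispensing carries on as it did before this change. The response types below reuse the names visible in the diff, but their definitions here (and the `GetQuotaExhaustedGroupsSuccess` shape in particular) are assumptions for illustration, as is the `groupsToExclude` helper.

```scala
import scala.util.{Failure, Success, Try}

object FailOpenDispense {
  // Hypothetical reply types; only the failure case name appears in the diff.
  sealed trait QuotaLookupResponse
  final case class GetQuotaExhaustedGroupsSuccess(groups: List[String]) extends QuotaLookupResponse
  final case class GetQuotaExhaustedGroupsFailure(errorMsg: String) extends QuotaLookupResponse

  // Returns the groups to exclude from token allocation. Both a reported
  // lookup failure and an unexpected exception fall back to an empty list.
  def groupsToExclude(result: Try[QuotaLookupResponse]): List[String] = result match {
    case Success(GetQuotaExhaustedGroupsSuccess(groups)) => groups
    case Success(GetQuotaExhaustedGroupsFailure(_))      => List.empty
    case Failure(_)                                      => List.empty
  }
}
```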

extends Iterator[LeasedActor] {
final class RoundRobinQueueIterator(initialTokenQueue: List[TokenQueue],
initialPointer: Int,
quotaExhaustedGroups: List[String]

Collaborator

Internally to this class, there's nothing specific about quota. We could suggest future extension with a generic name.

Suggested change
- quotaExhaustedGroups: List[String]
+ excludedGroups: List[String]
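
As a rough illustration of why a generic name fits, here is a simplified round-robin selection that skips a list of excluded groups without knowing why they are excluded. This is not the RoundRobinQueueIterator from the PR: `GroupQueue` and `take` are hypothetical, and the real class tracks a pointer across token queues rather than rotating a list.

```scala
import scala.annotation.tailrec
import scala.collection.immutable.Queue

object RoundRobinSketch {
  // Hypothetical simplification: one FIFO of job names per group.
  final case class GroupQueue(group: String, jobs: Queue[String])

  // Takes up to n jobs, visiting groups in order and taking one job per
  // visit. Groups named in excludedGroups are never visited.
  def take(queues: List[GroupQueue], n: Int, excludedGroups: List[String]): List[String] = {
    val excluded = excludedGroups.toSet

    @tailrec
    def loop(pending: List[GroupQueue], taken: Vector[String]): Vector[String] =
      pending match {
        case _ if taken.size >= n => taken
        case Nil                  => taken
        case q :: rest =>
          q.jobs.dequeueOption match {
            // Take one job, then move the group to the back of the rotation.
            case Some((job, remainingJobs)) =>
              loop(rest :+ q.copy(jobs = remainingJobs), taken :+ job)
            // Drained group: drop it from the rotation.
            case None => loop(rest, taken)
          }
      }

    loop(queues.filterNot(q => excluded.contains(q.group)), Vector.empty).toList
  }
}
```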

@salonishah11 salonishah11 marked this pull request as ready for review September 4, 2024 23:05
@salonishah11 salonishah11 requested a review from a team as a code owner September 4, 2024 23:05

Collaborator

@jgainerdewar jgainerdewar left a comment

This looks great! I do think we want both a changelog entry and a Terra release note for this.

Since this is a big change to the way we start jobs, I'm wondering if we want to include an "off switch" in the initial release. If we discovered a problem with this behavior and want to quickly revert to the old behavior, can we do that by setting the config to a 0 minute threshold? Should we build in an enabled flag for this behavior in config?

@@ -17,15 +17,33 @@ class RoundRobinQueueIteratorSpec extends TestKitSuite with AnyFlatSpecLike with
val tokenEventLogger = NullTokenEventLogger

it should "be empty if there's no queue" in {
- new RoundRobinQueueIterator(List.empty, 0).hasNext shouldBe false
+ new RoundRobinQueueIterator(List.empty, 0, List.empty).hasNext shouldBe false
}

Collaborator

Nice tests!

@salonishah11
Contributor Author

> Since this is a big change to the way we start jobs, I'm wondering if we want to include an "off switch" in the initial release. If we discovered a problem with this behavior and want to quickly revert to the old behavior, can we do that by setting the config to a 0 minute threshold? Should we build in an enabled flag for this behavior in config?

@jgainerdewar yes, I had thought about that, and setting the config to 0 should work. But I like your suggestion of having an actual config value like enabled instead, which should make it clearer. Using the enabled flag would also be better because, if it is set to false, the JobTokenDispenserActor won't ask GroupMetricsActor about exhausted groups at all.
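
A minimal sketch of the difference an explicit flag makes: when the feature is disabled, the lookup is never constructed, so there is nothing to query. `QuotaControlSettings`, `GroupMetricsLookup`, and `lookupFor` are hypothetical names used only for this illustration and are not taken from the PR.

```scala
import scala.concurrent.duration.FiniteDuration

object QuotaControlToggle {
  // Hypothetical settings holder; field names are illustrative only.
  final case class QuotaControlSettings(enabled: Boolean, threshold: FiniteDuration)

  // Hypothetical stand-in for whatever answers "which groups are exhausted?".
  trait GroupMetricsLookup {
    def exhaustedGroups(threshold: FiniteDuration): List[String]
  }

  // Builds the lookup only when the feature is enabled. A disabled feature
  // yields None, so callers have nothing to ask.
  def lookupFor(settings: QuotaControlSettings,
                make: () => GroupMetricsLookup
  ): Option[GroupMetricsLookup] =
    if (settings.enabled) Some(make()) else None
}
```

With a zero threshold alone, the lookup would still be built and queried on every dispense cycle and would simply return no groups; the flag removes that round trip entirely.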

Collaborator

@aednichols aednichols left a comment

LGTM 👍

CHANGELOG.md Outdated
@@ -2,6 +2,14 @@

## 88 Release Notes

### New feature: Prevent Job start during Cloud Quota exhaustion

This optional feature prevents Cromwell from starting new jobs in a billing project that is currently experiencing

Collaborator

Should we say "in a hog group" here?

Contributor Author

I can replace billing project with hog group 👍

Collaborator

I'm of the belief that we should not use "hog group" in any new, external-facing places.

Collaborator

What do you think would be clearer? I don't think "billing project" is either accurate in Terra or useful to other users.

Contributor Author

Maybe just "group" would work?

Collaborator

Yeah I usually just use group.

CHANGELOG.md Outdated
cloud quota exhaustion. Jobs will be started once the project's quota becomes available. To enable this feature,
set `quota-exhaustion-job-start-control.enabled` to true.

Note: Jobs that are being restarted will not be affected by this feature, even if it is enabled.

Collaborator

I suspect that this note will confuse people, since there are so many kinds of job restarts. I think we can just take it out; users would need to understand a lot about Cromwell internals to even have this question. 😂

Contributor Author

I see. I added this so that users know the feature won't affect jobs that are being restarted, but I didn't know there could be different kinds of restarts. I am happy to remove this.

@@ -270,6 +270,15 @@ system {
# token-log-interval-seconds = 300
}

# If enabled, Cromwell will not allocate new execution tokens to jobs whose hog groups is actively

Collaborator

grammar nit - "hog groups are actively"

if (tokenQueues.nonEmpty) {
// don't fetch cloud quota exhausted groups for token dispenser allocating 'restart' tokens
if (dispenserType == "execution" && groupMetricsActor.nonEmpty) {
groupMetricsActor.get

Collaborator

We try to avoid unprotected gets - could use a case statement instead here, like

(dispenserType, groupMetricsActor) match {
    case ("execution", Some(a)) => a.ask...
    case _ => dispense(...)
}

Contributor Author

@salonishah11 salonishah11 Sep 6, 2024

I did add a groupMetricsActor.nonEmpty check on L44 before doing the .get. Would you still recommend using pattern matching here instead of what I have?

Collaborator

Yeah, it's true that the current code state is guaranteed to not cause a problem, but the compiler isn't checking that, and it's easy for the association between the nonEmpty check and the .get to degrade over time as the code is changed.

Contributor Author

I see your point. I will change it to not use .get and refactor it the way you suggested 👍
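
For reference, a self-contained sketch of the pattern-matching shape discussed in this thread. It matches on the dispenser type and the optional collaborator together, so the compiler enforces that the value is only used when present. `MetricsClient` and `excludedFor` are hypothetical stand-ins; the real code asks an actor and dispenses tokens, which is omitted here.

```scala
object SafeOptionDispatch {
  // Hypothetical stand-in for the optional GroupMetricsActor reference.
  final case class MetricsClient(exhausted: List[String])

  // Only an "execution" dispenser with a client present consults the client.
  // Every other combination falls through to "exclude nothing", with no .get.
  def excludedFor(dispenserType: String, metrics: Option[MetricsClient]): List[String] =
    (dispenserType, metrics) match {
      case ("execution", Some(client)) => client.exhausted
      case _                           => List.empty
    }
}
```

Unlike a nonEmpty check followed by .get, this form cannot drift out of sync: removing the `Some(client)` pattern also removes the only binding through which the client can be reached.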

@salonishah11 salonishah11 merged commit c1d1302 into develop Sep 9, 2024
37 checks passed
@salonishah11 salonishah11 deleted the sps_use_new_quota_table branch September 9, 2024 21:07