[WX-1670] Don't allocate job tokens for hog groups experiencing quota exhaustion #7520
Conversation
@@ -145,6 +181,7 @@ class JobTokenDispenserActor(override val serviceRegistryActor: ActorRef,
    val hogGroupCounts =
      nextTokens.groupBy(t => t.queuePlaceholder.hogGroup).map { case (hogGroup, list) => s"$hogGroup: ${list.size}" }
    log.info(s"Assigned new job $dispenserType tokens to the following groups: ${hogGroupCounts.mkString(", ")}")
    // System.out.println(s"##### FIND ME $dispenserType tokens for actors: ${nextTokens.map(t => t.queuePlaceholder.actor.path).mkString(",")}")
TODO: remove this after testing
This is now removed
class GroupMetricsActor(engineDbInterface: EngineSqlDatabase) extends Actor with ActorLogging {

  implicit val ec: MessageDispatcher = context.system.dispatchers.lookup(Dispatcher.EngineDispatcher)

  final private val QUOTA_EXHAUSTION_THRESHOLD_IN_SECS = 15 * 60 // 15 minutes
Question for reviewers - what should this threshold be? Is 15 mins good?
How hard would it be to make this configurable? I think 15m is a good place to start, if anything a little long.
I don't think it would be that hard. I can try that.
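As a sketch of what the configurable version might look like: Cromwell reads its configuration with the Typesafe Config library, so the hard-coded constant could fall back to a default when no key is set. The key name below is purely illustrative, not necessarily the one this PR ends up using.

```scala
import com.typesafe.config.{Config, ConfigFactory}

object QuotaExhaustionConfig {
  // Illustrative key name; falls back to the current hard-coded 15-minute default.
  private val ThresholdKey = "system.quota-exhaustion-threshold-minutes"

  def thresholdInSecs(config: Config): Int =
    if (config.hasPath(ThresholdKey)) config.getInt(ThresholdKey) * 60
    else 15 * 60
}

// Example: an operator overrides the default to 20 minutes.
val config: Config = ConfigFactory.parseString("system.quota-exhaustion-threshold-minutes = 20")
val quotaExhaustionThresholdInSecs: Int = QuotaExhaustionConfig.thresholdInSecs(config)
```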
Configurable sounds good, yeah.
How long is our polling interval for cloud jobs? It seems like the quota backoff time should be more than the poll interval.
> How long is our polling interval for cloud jobs?
It seems it initially starts with 30 seconds and the max interval is 10 mins. So maybe 15 mins is a good starting point?
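The arithmetic behind that choice can be sketched quickly. Only the 30-second starting interval and 10-minute cap come from the comment above; the doubling schedule is an assumption for illustration.

```scala
// Assumed exponential backoff: start at 30s, double each poll, cap at 600s (10 min).
val pollIntervals: List[Double] =
  Iterator.iterate(30.0)(_ * 2).map(math.min(_, 600.0)).take(6).toList
// 30, 60, 120, 240, 480, 600 seconds

// The proposed 15-minute threshold exceeds the longest possible poll gap,
// so a job's status will be re-checked at least once within the threshold window.
val thresholdSecs: Int = 15 * 60
assert(thresholdSecs > pollIntervals.max)
```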
Agreed, sounds great to me.
This looks like a great approach!
case Success(GetQuotaExhaustedGroupsFailure(errorMsg)) =>
  log.error(s"Failed to fetch quota exhausted groups. Error: $errorMsg")
  dispense(n, List.empty)
case Failure(exception) =>
  log.error(s"Unexpected failure while fetching quota exhausted groups. Error: ${exception.getMessage}")
  dispense(n, List.empty)
I appreciate the fail-safe approach here.
final class RoundRobinQueueIterator(initialTokenQueue: List[TokenQueue],
                                    initialPointer: Int,
                                    quotaExhaustedGroups: List[String]
) extends Iterator[LeasedActor] {
Internally to this class, there's nothing specific about quota. We could suggest future extension with a generic name.
Suggested change:
-  quotaExhaustedGroups: List[String]
+  excludedGroups: List[String]
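A minimal sketch of the generic version of that idea. The types here are simplified stand-ins (the real `TokenQueue` and `LeasedActor` are richer); the point is that the iterator itself knows nothing about quota, and callers decide what "excluded" means.

```scala
// Simplified stand-in for the real TokenQueue: a hog group plus its queued jobs.
final case class TokenQueue(hogGroup: String, queue: List[String])

// The iterator only needs a generic exclusion list, leaving room for future uses.
final class RoundRobinQueueIterator(initialTokenQueue: List[TokenQueue],
                                    initialPointer: Int,
                                    excludedGroups: List[String]) {
  private val eligibleQueues =
    initialTokenQueue.filterNot(q => excludedGroups.contains(q.hogGroup))

  def hasNext: Boolean = eligibleQueues.exists(_.queue.nonEmpty)
}

// Usage: the caller passes quota-exhausted groups as the excluded groups.
val queues = List(TokenQueue("groupA", List("jobA1")), TokenQueue("groupB", List("jobB1")))
val it = new RoundRobinQueueIterator(queues, 0, excludedGroups = List("groupA"))
```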
This looks great! I do think we want both a changelog entry and a Terra release note for this.
Since this is a big change to the way we start jobs, I'm wondering if we want to include an "off switch" in the initial release. If we discovered a problem with this behavior and want to quickly revert to the old behavior, can we do that by setting the config to a 0 minute threshold? Should we build in an `enabled` flag for this behavior in config?
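One possible shape for such a switch, using the `quota-exhaustion-job-start-control.enabled` key that the changelog in this PR ends up naming (the threshold key below is an illustrative placeholder). An explicit boolean avoids having to reason about a 0-minute threshold as an implicit off switch.

```scala
import com.typesafe.config.ConfigFactory

val conf = ConfigFactory.parseString(
  """
    |quota-exhaustion-job-start-control {
    |  # Off switch: false restores the old token-dispensing behavior.
    |  enabled = false
    |  # Illustrative placeholder for a configurable threshold.
    |  threshold-minutes = 15
    |}
  """.stripMargin
)

val featureEnabled: Boolean = conf.getBoolean("quota-exhaustion-job-start-control.enabled")
```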
@@ -17,15 +17,33 @@ class RoundRobinQueueIteratorSpec extends TestKitSuite with AnyFlatSpecLike with
   val tokenEventLogger = NullTokenEventLogger

   it should "be empty if there's no queue" in {
-    new RoundRobinQueueIterator(List.empty, 0).hasNext shouldBe false
+    new RoundRobinQueueIterator(List.empty, 0, List.empty).hasNext shouldBe false
   }
Nice tests!
@jgainerdewar yes I had thought about that, and setting the config to 0 should work. But I like your suggestion about having an actual config value like `enabled`.
LGTM 👍
CHANGELOG.md
Outdated
@@ -2,6 +2,14 @@

 ## 88 Release Notes

 ### New feature: Prevent Job start during Cloud Quota exhaustion

 This optional feature prevents Cromwell from starting new jobs in a billing project that is currently experiencing
Should we say "in a hog group" here?
I can replace `billing project` with `hog group` 👍
I'm of the belief that we should not use "hog group" in any new, external-facing places.
What do you think would be clearer? I don't think "billing project" is either accurate in Terra or useful to other users.
Maybe just `group` might work?
Yeah, I usually just use `group`.
CHANGELOG.md
Outdated
cloud quota exhaustion. Jobs will be started once the project's quota becomes available. To enable this feature,
set `quota-exhaustion-job-start-control.enabled` to true.

Note: Jobs that are being restarted will not be affected by this feature, even if it is enabled.
I suspect that this note will confuse people, since there are so many kinds of job restarts. I think we can probably just take it out, users probably need to understand a lot about Cromwell internals to even have this question. 😂
I see. I added this so that users would know that restarts won't affect their jobs, but I didn't know there could be different kinds of restarts. I am happy to remove this.
@@ -270,6 +270,15 @@ system {
   # token-log-interval-seconds = 300
 }

 # If enabled, Cromwell will not allocate new execution tokens to jobs whose hog groups is actively
grammar nit - "hog groups are actively"
if (tokenQueues.nonEmpty) {
  // don't fetch cloud quota exhausted groups for token dispenser allocating 'restart' tokens
  if (dispenserType == "execution" && groupMetricsActor.nonEmpty) {
    groupMetricsActor.get
We try to avoid unprotected `get`s - could use a case statement instead here, like

(dispenserType, groupMetricsActor) match {
  case ("execution", Some(a)) => a.ask...
  case _ => dispense(...)
}
I did add a `groupMetricsActor.nonEmpty` check on L44 before doing the `.get`. Would you still recommend using pattern matching here instead of what I have?
Yeah, it's true that the current code state is guaranteed to not cause a problem, but the compiler isn't checking that, and it's easy for the association between the `nonEmpty` check and the `.get` to degrade over time as the code is changed.
I see your point. I will change it to not use `.get` and refactor the way you suggested 👍
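A sketch of the agreed refactor, with the actor types reduced to simple stand-ins. Pattern matching makes the compiler enforce that the `Option` is non-empty exactly where its contents are used, instead of relying on a nearby `.nonEmpty` check staying in sync with a `.get`.

```scala
// Both conditions are matched at once; no unprotected .get is possible.
def route(dispenserType: String, groupMetricsActor: Option[String]): String =
  (dispenserType, groupMetricsActor) match {
    case ("execution", Some(actor)) => s"ask $actor for quota-exhausted groups"
    case _                          => "dispense without exclusions"
  }
```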
Jira: https://broadworkbench.atlassian.net/browse/WX-1670
Description
Cromwell will use information provided in the new GROUP_METRICS_ENTRY table to allocate new tokens for job requests whose hog group is not experiencing any cloud quota exhaustion. Note that this will be applied to jobs seeking "execution" tokens. Jobs seeking "restart" tokens are not affected by this change.
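The token-type rule described above can be condensed into a small predicate. This is a sketch only: the real implementation works through actors and the GROUP_METRICS_ENTRY table rather than a pure function, and `JobRequest` is a hypothetical stand-in.

```scala
final case class JobRequest(hogGroup: String, tokenType: String)

// Only "execution" token requests are subject to the quota-exhaustion check;
// "restart" token requests always pass through unaffected.
def mayAllocateToken(req: JobRequest, quotaExhaustedGroups: Set[String]): Boolean =
  req.tokenType != "execution" || !quotaExhaustedGroups.contains(req.hogGroup)
```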
TODO:
Release Notes Confirmation

CHANGELOG.md
- [ ] I updated `CHANGELOG.md` in this PR
- [ ] This change shouldn't be included in `CHANGELOG.md` because it doesn't impact community users

Terra Release Notes