Conversation

@tych0 tych0 commented Sep 9, 2025

We are interested in using memory.oom.group, but need it to be set by systemd because of [1], so let's set it. There are a few caveats, in no particular order:

A. systemd does not allow OOMPolicy to be set on units that have already
started, so we must do this in Apply() instead of Set().
B. As the comment suggests, OOMPolicy has three states (continue, stop,
kill), where kill maps to memory.oom.group=1, and continue maps to =0.
However, the bit about runc update doesn't quite make sense: the
values will only ever be expressed in terms of memory.oom.group, so we
only need to map the continue and kill values, which have direct
mappings.

Note that runc update here doesn't make sense anyway: because of (A),
we cannot update these values. Perhaps we should reject these updates
since systemd will? (Or maybe we try to update and just error out, in
the event that systemd eventually allows this? The kernel allows
updating it, the reason the systemd semantics have diverged is unclear.)
C. systemd only gained support for setting OOMPolicy on scopes in
versions >= 253; versions before this will fail.

So, let's add a bit allowing the setup of OOMPolicy in Apply(), and ignore it in Set() -> genV2ResourcesProperties() -> unifiedResToSystemdProps(). (A rough sketch of the value mapping appears after this description.)

[1]: This arguably is more important than the debug-level warning would suggest: if someone does the equivalent of a systemctl daemon-reload, systemd will reset our manually-via-cgroupfs set value to 0, because we did not explicitly set it in the service / scope definition, meaning that
individual tasks will not actually oom the whole cgroup when they oom.
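
To make caveat B concrete, here is a minimal sketch of the mapping in Go, using the go-systemd dbus types runc already depends on; the helper name oomGroupToOOMPolicy and the package placement are assumptions for illustration, not the actual code in this PR:

package systemd

import (
	"fmt"

	systemdDbus "github.com/coreos/go-systemd/v22/dbus"
	dbus "github.com/godbus/dbus/v5"
)

// oomGroupToOOMPolicy maps a unified memory.oom.group value ("0"/"1") onto
// the corresponding OOMPolicy property, which must be passed to systemd when
// the unit is created (i.e. in Apply(), per caveat A). The third OOMPolicy
// state, "stop", has no memory.oom.group equivalent and is never produced.
func oomGroupToOOMPolicy(v string) (systemdDbus.Property, error) {
	var policy string
	switch v {
	case "0":
		policy = "continue" // only the faulting task is OOM-killed
	case "1":
		policy = "kill" // the whole cgroup is killed together on OOM
	default:
		return systemdDbus.Property{}, fmt.Errorf("invalid memory.oom.group value: %q", v)
	}
	return systemdDbus.Property{Name: "OOMPolicy", Value: dbus.MakeVariant(policy)}, nil
}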

@tych0 tych0 force-pushed the memory-oom-group-support branch 2 times, most recently from d0609a7 to e2f6711 on September 9, 2025 at 17:23
// values for OOMPolicy (continue/stop).
fallthrough
// This was set before the unit started, so no need to
// warn about it here.

I wonder if the warning should be louder (maybe just more explicitly stating that systemd might override this on a daemon-reload) for other uses that hit it, since systemd will stomp on it as you highlight, or is that only for a subset of things that systemd has a knob for? I realize that's a bit out of scope for this PR, but it's sort of a tricky thing for folks to debug down to.

Member Author

For memory.oom.group specifically, I think we can (have to?) assume that it was set up in Apply() via this patch (modulo bugs). But re: your comment, I wonder if we should make the logging in the default case at least warning level, or maybe explicitly generate an error?
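
For illustration only (a sketch, not the current runc code; the helper and its arguments are made up), bumping the severity could look something like this:

package systemd

import "github.com/sirupsen/logrus"

// warnUnknownUnified is a hypothetical helper: log at warning rather than
// debug level when a unified key has no systemd equivalent, since systemd
// may reset the cgroupfs value on the next daemon-reload.
func warnUnknownUnified(k, v string) {
	logrus.Warnf("unified resource %s=%s has no systemd equivalent; systemd may reset it on daemon-reload", k, v)
}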

// to 0 in runc update, as there are two other possible
// values for OOMPolicy (continue/stop).
fallthrough
// This was set before the unit started, so no need to

Dumb question, do we know this to actually be true?

i.e. we've made it so we also process memory.oom.group in Apply(), which IIUC gets its config from when you set up a new manager, but for Set() we're sort of hoping the user already set it up initially and we just avoid warning on it. The downside then is: if someone doesn't have it configured for Apply(), and then adds it later with Set(), systemd won't be in the loop and a daemon-reload will cause systemd to overwrite it with whatever its OOMPolicy is set to be for the unit?

I guess I don't have a better answer for this sort of "must be done on unit creation" setting, unless we can warn iff we know the systemd prop isn't already aligned.

Member Author

but for Set() we're sort of hoping the user already set it up initially and just avoid warning on it

Oh, I suppose what we should do is query the existing value and warn/error if they mismatch? Because you're right: if someone does a `runc update` with a new value for this, we cannot actually set it.
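
Something like this rough sketch could implement the query-and-compare idea (the helper name and the cgroupPath argument are assumptions, not code from this PR):

package systemd

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

// checkOOMGroupMatches compares the value requested in runc update against
// what is already set in the cgroup, since OOMPolicy cannot be changed via
// systemd once the unit has started.
func checkOOMGroupMatches(cgroupPath, want string) error {
	data, err := os.ReadFile(filepath.Join(cgroupPath, "memory.oom.group"))
	if err != nil {
		return err
	}
	if have := strings.TrimSpace(string(data)); have != want {
		return fmt.Errorf("memory.oom.group is %q and cannot be updated to %q after the unit has started", have, want)
	}
	return nil
}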


halaney commented Sep 10, 2025

super nit:

[1]: This arguably is more important than the debug-level warning would
suggest: if someone does the equivalent of a `systemctl daemon-reload`,
systemd will reset our manually-via-cgroupfs set value to 0, because we did
not explicitly set it in the service / scope definition, meaning

left me on the edge of my seat waiting for that conclusion!

We are interested in using memory.oom.group, but need it to be set by systemd
because of [1], so let's set it. There are a few caveats, in no particular
order:

A. systemd does not allow OOMPolicy to be set on units that have already
   started, so we must do this in Apply() instead of Set().
B. As the comment suggests, OOMPolicy has three states (continue, stop,
   kill), where kill maps to memory.oom.group=1, and continue maps to =0.
   However, the bit about `runc update` doesn't quite make sense: the
   values will only ever be expressed in terms of memory.oom.group, so we
   only need to map the continue and kill values, which have direct
   mappings.

   Note that `runc update` here doesn't make sense anyway: because of (A),
   we cannot update these values. Perhaps we should reject these updates
   since systemd will? (Or maybe we try to update and just error out, in
   the event that systemd eventually allows this? The kernel allows
   updating it, the reason the systemd semantics have diverged is unclear.)
C. systemd only gained support for setting OOMPolicy on scopes in versions
   >= 253; versions before this will fail.

So, let's add a bit allowing the setup of OOMPolicy to Apply(), and ignore
it in Set() -> genV2ResourcesProperties() -> unifiedResToSystemdProps().

[1]: This arguably is more important than the debug-level warning would
suggest: if someone does the equivalent of a `systemctl daemon-reload`,
systemd will reset our manually-via-cgroupfs set value to 0, because we did
not explicitly set it in the service / scope definition, meaning that
individual tasks will not actually oom the whole cgroup when they oom.

Co-authored-by: Ethan Adams <[email protected]>
Signed-off-by: Tycho Andersen <[email protected]>
@tych0 tych0 force-pushed the memory-oom-group-support branch from e2f6711 to 0080fa4 on September 10, 2025 at 18:10

tych0 commented Sep 10, 2025

left me on the edge of my seat waiting for that conclusion!

ha, derp. I updated, thanks.

}

unitName := getUnitName(config)
conn, err := systemdDbus.NewSystemdConnectionContext(context.Background())
Contributor

nit: you can probably use t.Context() here.
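
For example (assuming Go 1.24+, where testing.T gained a Context() method whose context is cancelled when the test ends):

// Hypothetical replacement for the context.Background() call shown above.
conn, err := systemdDbus.NewSystemdConnectionContext(t.Context())
if err != nil {
	t.Fatal(err)
}
defer conn.Close()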

}
defer conn.Close()

properties, err := conn.GetUnitPropertiesContext(context.Background(), unitName)
Contributor

ditto

}
if os.Geteuid() != 0 {
t.Skip("Test requires root.")
}
Contributor

nit: do you need to add a check for systemd version here, to skip the test for older versions?

Or, perhaps, get the unit properties, check if the OOMPolicy is there (if it's always there on newer systemd, of course), otherwise t.Skip("no OOMPolicy property; older systemd?")
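
One possible shape for that skip, reusing the conn and unitName already set up in the test (an assumption about what the final code might look like, not necessarily what the PR does):

// Skip when the unit has no OOMPolicy property, which suggests a systemd
// too old to support it on scopes (< 253).
properties, err := conn.GetUnitPropertiesContext(t.Context(), unitName)
if err != nil {
	t.Fatal(err)
}
if _, ok := properties["OOMPolicy"]; !ok {
	t.Skip("no OOMPolicy property; older systemd?")
}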

Contributor

@kolyshkin kolyshkin left a comment

Overall this makes sense; left some nits. Also, I wish we had a place to document that.
