runc exec: use CLONE_INTO_CGROUP #4812

kolyshkin · 2025-07-16T00:47:14Z

Requires (and currently includes) PR #4822; draft until that one is merged.

It makes sense to make runc exec benefit from clone3(CLONE_INTO_CGROUP) when
available. Since it requires a recent kernel (Linux 5.7 or later, cgroup v2 only) and might not work, implement a fallback.

Based on work done in https://go-review.googlesource.com/c/go/+/417695.

Closes: #4782.
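
For readers unfamiliar with the mechanism, here is a minimal sketch of the "try CLONE_INTO_CGROUP, else fall back" idea. It is not the code from this PR. It relies on the Go standard library fields SysProcAttr.UseCgroupFD and SysProcAttr.CgroupFD (Go 1.20 and later, Linux only); the helper name startInCgroup, the cgroup path in main, and the choice to fall back on any error are assumptions made for illustration.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
	"path/filepath"
	"strconv"
	"syscall"
)

// startInCgroup (hypothetical helper) starts name inside the cgroup v2
// directory cgroupDir and reports whether clone3(CLONE_INTO_CGROUP) was used.
func startInCgroup(cgroupDir, name string, args ...string) (*exec.Cmd, bool, error) {
	dir, err := os.Open(cgroupDir)
	if err != nil {
		return nil, false, fmt.Errorf("can't open cgroup: %w", err)
	}
	defer dir.Close() // the fd only needs to stay open until Start returns

	// Fast path: the kernel creates the child directly in the target cgroup.
	cmd := exec.Command(name, args...)
	cmd.SysProcAttr = &syscall.SysProcAttr{
		UseCgroupFD: true,
		CgroupFD:    int(dir.Fd()),
	}
	if err := cmd.Start(); err == nil {
		return cmd, true, nil
	}

	// Fallback (assumption: taken on any error, e.g. an old kernel).
	// The child briefly runs in the parent's cgroup before being moved;
	// a real runtime would synchronize with the child to close that window.
	cmd = exec.Command(name, args...)
	if err := cmd.Start(); err != nil {
		return nil, false, err
	}
	procs := filepath.Join(cgroupDir, "cgroup.procs")
	pid := strconv.Itoa(cmd.Process.Pid)
	if err := os.WriteFile(procs, []byte(pid), 0o644); err != nil {
		_ = cmd.Process.Kill()
		_ = cmd.Wait()
		return nil, false, fmt.Errorf("can't join cgroup: %w", err)
	}
	return cmd, false, nil
}

func main() {
	// The cgroup path below is only an example and must already exist.
	cmd, viaClone3, err := startInCgroup("/sys/fs/cgroup/test_busybox", "true")
	if err != nil {
		fmt.Println("start failed:", err)
		return
	}
	fmt.Println("started via clone3:", viaClone3)
	_ = cmd.Wait()
}
```

The main attraction of the fast path is that the child never exists outside the target cgroup, so there is no window in which it runs without the container's limits.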

kolyshkin · 2025-07-16T02:37:26Z

OK I did some debugging and have very bad news to share.

Apparently GHA moves the process we create (container's init) to a different cgroup. Here's an excerpt from debug logs (using fs2 cgroup driver):

runc run -d --console-socket /tmp/bats-run-X8QSrN/runc.7IFV0b/tty/sock test_busybox (status=0)
time="2025-07-16T02:31:13Z" level=info msg="XXX container init cgroup /sys/fs/cgroup/system.slice/test_busybox"

Here ^^^ runc created a container and put its init into the /system.slice/test_busybox cgroup.

runc exec test_busybox stat /tmp/mount-1/foo.txt /tmp/mount-2/foo.txt (status=255)
XXX container test_busybox init cgroup: /system.slice/hosted-compute-agent.service (present)

Here ^^^ the same container init is unexpectedly in the /system.slice/hosted-compute-agent.service cgroup.

time="2025-07-16T02:31:13Z" level=error msg="exec failed: unable to start container process: can't open cgroup: open /sys/fs/cgroup/system.slice/test_busybox: no such file or directory"

And here ^^^ runc exec failed because the container's cgroup no longer exists.

Maybe this is what systemd does? But it doesn't do that on my machine.

I need some time to digest this. Any feedback is welcome.
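
For anyone who wants to reproduce the "which cgroup is init in" check from the log above, here is a minimal sketch, assuming a cgroup v2 (unified) hierarchy, where the entry in /proc/<pid>/cgroup has the form "0::<path>". The function name cgroupV2Path is made up for this example and is not taken from runc.

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

// cgroupV2Path (hypothetical helper) extracts the unified-hierarchy path
// from the contents of a /proc/<pid>/cgroup file.
func cgroupV2Path(procCgroup string) (string, bool) {
	sc := bufio.NewScanner(strings.NewReader(procCgroup))
	for sc.Scan() {
		line := sc.Text()
		// The cgroup v2 entry always has hierarchy ID 0 and no controller list.
		if strings.HasPrefix(line, "0::") {
			return strings.TrimPrefix(line, "0::"), true
		}
	}
	return "", false
}

func main() {
	// Inspect our own process; replace "self" with a PID to inspect another one.
	data, err := os.ReadFile("/proc/self/cgroup")
	if err != nil {
		fmt.Println("read failed:", err)
		return
	}
	if path, ok := cgroupV2Path(string(data)); ok {
		fmt.Println("cgroup:", path)
	} else {
		fmt.Println("no cgroup v2 entry found")
	}
}
```

Running this once right after runc run and once right before runc exec would show whether the process was moved in between.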

cyphar · 2025-07-16T03:33:02Z

We mark the transient unit as Delegate=yes, so I don't think systemd should be moving it to a different cgroup. Very strange...

kolyshkin · 2025-07-16T04:54:15Z

> We mark the transient unit as Delegate=yes, so I don't think systemd should be moving it to a different cgroup. Very strange...

This is the fs2 driver: no transient unit is created, and there is no way to say Delegate=yes.

The cgroup is placed under system.slice because of func defaultDirPath in the fs2 code, initially added by commit 88e8350 (PR #2169).
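
A simplified sketch of how a relative cgroup path can end up under a slice. It assumes that a relative path is resolved against the parent of runc's own cgroup, and that the runner process itself lives in /system.slice/hosted-compute-agent.service; both are assumptions that happen to be consistent with the log above. resolveCgroupDir is a hypothetical stand-in, and the real defaultDirPath may differ in its details.

```go
package main

import (
	"fmt"
	"path/filepath"
)

// resolveCgroupDir (hypothetical) maps a container's cgroup path to a
// directory under the cgroup v2 mount point root.
//   - An absolute path is used as-is, relative to root.
//   - A relative path is placed next to the caller's own cgroup, that is,
//     under the parent of ownCgroup (assumed behavior).
func resolveCgroupDir(root, ownCgroup, cgPath string) string {
	cgPath = filepath.Clean(cgPath)
	if filepath.IsAbs(cgPath) {
		return filepath.Join(root, cgPath)
	}
	return filepath.Join(root, filepath.Dir(ownCgroup), cgPath)
}

func main() {
	dir := resolveCgroupDir(
		"/sys/fs/cgroup",
		"/system.slice/hosted-compute-agent.service", // assumed cgroup of the caller
		"test_busybox",
	)
	fmt.Println(dir)
}
```

Under these assumptions the container lands in a sibling of the runner's service cgroup, in a part of the tree that systemd considers its own.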

When I do the same on my machine (Fedora 42, systemd v257), this does not happen. I wonder whether this is specific to Ubuntu, or maybe even to Azure/GHA.

lifubang · 2025-07-16T10:25:23Z

> I need some time to digest this. Any feedback is welcome.

I notice that all the failures occurred in rootless container tests. This might be related to:

runc/libcontainer/process_linux.go, line 205 in dfcf22a:

    // On cgroup v2 + nesting + domain controllers, WriteCgroupProc may fail with EBUSY.

However, you mentioned we're seeing an ENOENT error here, so that may not be the cause.

cyphar · 2025-07-17T01:56:30Z

@kolyshkin Wait, I thought we always communicated with systemd when using cgroup2 -- systemd is very happy to mess with our cgroups (including clearing limits and various other quite dangerous behaviour) if we don't tell it that we are managing the cgroup with Delegate=yes. Maybe this has changed over the years, but I'm fairly certain the initial implementations of this stuff all communicated something with systemd regardless of the cgroup driver used.

Is this just for our testing, or are users actually using this? Because we will need to fix that if we have users on systemd-based systems using cgroups directly without transient units...

kolyshkin · 2025-07-24T21:22:28Z

> @kolyshkin Wait, I thought we always communicated with systemd when using cgroup2 -- systemd is very happy to mess with our cgroups (including clearing limits and various other quite dangerous behaviour) if we don't tell it that we are managing the cgroup with Delegate=yes. Maybe this has changed over the years, but I'm fairly certain the initial implementations of this stuff all communicated something with systemd regardless of the cgroup driver used.
>
> Is this just for our testing, or are users actually using this? Because we will need to fix that if we have users on systemd-based systems using cgroups directly without transient units...

When you use runc directly, unless --systemd-cgroup is explicitly specified, the fs/fs2 driver is used and runc does not communicate with systemd in any way. That might be just fine if systemd is configured not to touch a specific cgroup path and everything under it, and runc creates its cgroups under that path. Having said that, runc with the fs/fs2 driver neither configures such a thing nor checks whether it is configured.

I'm pretty sure it has been that way from the very beginning.

One other thing: when using the systemd driver, we configure everything via systemd and then use the fs/fs2 driver to write to the cgroup directly. This is also how things have always been. One reason is that we did not care much about translating the OCI spec into systemd settings, which is now mostly fixed. Another reason is that systemd doesn't support all the per-cgroup settings that the kernel has, so some of them can't be expressed as systemd unit properties.

kolyshkin · 2025-07-24T21:24:42Z

> > I need some time to digest this. Any feedback is welcome.
>
> I notice that all the failures occurred in rootless container tests. This might be related to:
>
> runc/libcontainer/process_linux.go, line 205 in dfcf22a:
>
>     // On cgroup v2 + nesting + domain controllers, WriteCgroupProc may fail with EBUSY.
>
> However, you mentioned we're seeing an ENOENT error here, so that may not be the cause.

The thing is, while the comment says "EBUSY", the actual code doesn't check for a particular error; it takes this fallback on any error (including ENOENT).

My guess is that, with the systemd driver, we actually need an AddPid cgroup driver method to add a PID (such as an exec PID) to a pre-created cgroup (as opposed to Apply, which creates the cgroup). I'm working on adding it.

Remove cgroupPaths field from struct setnsProcess, because:
- we can get base cgroup paths from p.manager.GetPaths();
- we can get sub-cgroup paths from p.process.SubCgroupPaths.

But mostly because we are going to need separate cgroup paths when adopting cgroups.AddPid.

Signed-off-by: Kir Kolyshkin <[email protected]>

The main benefit here is that, when we are using the systemd cgroup driver, we actually ask systemd to add a PID rather than doing it ourselves. This way, we can add a rootless exec PID to a cgroup. The implementation requires opencontainers/cgroups#26.

Signed-off-by: Kir Kolyshkin <[email protected]>

kolyshkin · 2025-08-09T05:58:47Z

Apparently, we are also not placing rootless container execs into the proper cgroup (which is still possible when using the cgroup v2 systemd driver, but we'd need to use AttachProcessesToUnit). As a result, the container init and the exec run in different cgroups. This could be a problem because rootless + cgroup v2 + systemd driver can still set resource limits, and the exec runs without them.

Tackling it in #4822.

kolyshkin force-pushed the exec-clone-into-cgroup branch from aa873c8 to 115aa1f on July 16, 2025 02:27

kolyshkin force-pushed the exec-clone-into-cgroup branch 4 times, most recently from 467d16c to dfcf22a on July 16, 2025 07:47

kolyshkin force-pushed the exec-clone-into-cgroup branch from dfcf22a to 6095b61 on July 16, 2025 22:46

kolyshkin force-pushed the exec-clone-into-cgroup branch 5 times, most recently from b98e111 to 6e3bf36 on July 27, 2025 23:22

kolyshkin added 7 commits July 28, 2025 16:24

script/setup_rootless.sh: chown nit

882d1fb

This fixes the following warning (seen on Fedora 42 and Ubuntu 24.04):

    + sudo chown -R rootless.rootless /home/rootless
    chown: warning: '.' should be ':': ‘rootless.rootless’

Signed-off-by: Kir Kolyshkin <[email protected]>

libct: factor out addIntoCgroup from setnsProcess.start

32b08e6

Signed-off-by: Kir Kolyshkin <[email protected]>

libct: split addIntoCgroup into V1 and V2

0b3056b

The main idea is to maintain the code separately (and eventually kill the V1 implementation).

Signed-off-by: Kir Kolyshkin <[email protected]>

runc exec: use CLONE_INTO_CGROUP when available

fa2cf3d

This is based on work done in [1]. Since the functionality requires a recent kernel and might not work, implement a fallback.

[1]: https://go-review.googlesource.com/c/go/+/417695

Signed-off-by: Kir Kolyshkin <[email protected]>

[test] ci/gha: run tests in own systemd slice

9489925

Signed-off-by: Kir Kolyshkin <[email protected]>

kolyshkin force-pushed the exec-clone-into-cgroup branch from 6e3bf36 to 9489925 on July 29, 2025 01:04
