ArangoGutierrez
Collaborator

No description provided.

@Copilot Copilot AI left a comment

Pull Request Overview

This PR adds comprehensive E2E testing for containerd runtime configuration alongside existing Docker testing infrastructure. The changes introduce a nested container testing framework that allows running tests inside containers to validate NVIDIA Container Toolkit behavior in containerized environments.

  • Adds new E2E tests for containerd drop-in configuration functionality
  • Introduces nvidia-cdi-refresh systemd unit testing
  • Implements nested container runner infrastructure for isolated testing

Reviewed Changes

Copilot reviewed 9 out of 32 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| tests/go.mod | Adds new dependencies for UUID generation and test utilities |
| tests/e2e/runner.go | Implements nested container runner with Docker installation and CTK setup |
| tests/e2e/nvidia-ctk_containerd_test.go | New comprehensive containerd E2E test suite |
| tests/e2e/nvidia-ctk_docker_test.go | Refactors to use shared runner infrastructure and fixes macOS compatibility |
| tests/e2e/nvidia-cdi-refresh_test.go | New systemd unit tests for CDI refresh functionality |
| tests/e2e/nvidia-container-cli_test.go | Refactors to use nested container runner |
| tests/e2e/installer.go | Adds containerd installation template and additional flags support |
| tests/e2e/e2e_test.go | Centralizes test runner initialization in BeforeSuite |
| tests/e2e/Makefile | Documents new test categories |

@ArangoGutierrez
Collaborator Author

Builds on #1235

Doesn't include #1311; tests for that should be added as a follow-up.

@coveralls

coveralls commented Sep 23, 2025

Pull Request Test Coverage Report for Build 18005738357

Details

  • 0 of 1 (0.0%) changed or added relevant line in 1 file are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage increased (+0.006%) to 36.277%

| Changes Missing Coverage | Covered Lines | Changed/Added Lines | % |
| --- | --- | --- | --- |
| pkg/config/engine/containerd/config_drop_in.go | 0 | 1 | 0.0% |

| Totals | Coverage Status |
| --- | --- |
| Change from base Build 17981864462: | 0.006% |
| Covered Lines: | 4827 |
| Relevant Lines: | 13306 |

💛 - Coveralls

@ArangoGutierrez
Collaborator Author

I'll mark this PR as ready for review once #1235 is merged

@ArangoGutierrez
Collaborator Author

> I'll mark this PR as ready for review once #1235 is merged

Rebased

@ArangoGutierrez ArangoGutierrez force-pushed the e2e_containerd branch 3 times, most recently from 029af03 to 1899001 Compare September 25, 2025 11:16
@ArangoGutierrez ArangoGutierrez marked this pull request as ready for review September 25, 2025 11:26
@elezar elezar marked this pull request as draft October 13, 2025 12:10
@ArangoGutierrez ArangoGutierrez force-pushed the e2e_containerd branch 2 times, most recently from 51ad031 to c65e468 Compare October 13, 2025 13:57
@ArangoGutierrez
Collaborator Author

Rebased

@ArangoGutierrez ArangoGutierrez marked this pull request as ready for review October 13, 2025 14:01
AfterAll(func(ctx context.Context) {
// Cleanup: remove the container and the temporary script on the host.
// Use || true to ensure cleanup doesn't fail the test
runner.Run(fmt.Sprintf("docker rm -f %s 2>/dev/null || true", containerName)) //nolint:errcheck
Member

Instead of the nolint let's just drop the return values.

Suggested change
runner.Run(fmt.Sprintf("docker rm -f %s 2>/dev/null || true", containerName)) //nolint:errcheck
_, _, _ = runner.Run(fmt.Sprintf("docker rm -f %s 2>/dev/null || true", containerName))

Does it make sense to at least WARN if the cleanup fails? The || true doesn't ensure that the test doesn't fail; the fact that we don't check the return value does that.
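A minimal sketch of what warning on a failed cleanup could look like, instead of discarding all three return values. The `Runner` interface is an assumption modelled on the three-value `runner.Run` call in the diff, and `cleanupContainer`, `failingRunner`, and the `warn` hook are hypothetical names introduced only for illustration. In a Ginkgo suite the hook could write to `GinkgoWriter`.

```go
package main

import (
	"errors"
	"fmt"
	"log"
)

// Runner mirrors the three-value Run call seen in the diff
// (stdout, stderr, error). The interface itself is an assumption.
type Runner interface {
	Run(script string) (string, string, error)
}

// cleanupContainer removes the container and reports a failure through
// warn instead of failing the caller. warn is a hypothetical hook.
func cleanupContainer(r Runner, name string, warn func(format string, args ...any)) {
	_, stderr, err := r.Run(fmt.Sprintf("docker rm -f %s", name))
	if err != nil {
		warn("WARNING: failed to remove container %q: %v (stderr: %s)", name, err, stderr)
	}
}

// failingRunner is a stand-in used only to exercise the warning path.
type failingRunner struct{}

func (failingRunner) Run(string) (string, string, error) {
	return "", "no such container", errors.New("exit status 1")
}

func main() {
	// The cleanup never returns an error, but the failure is still visible.
	cleanupContainer(failingRunner{}, "ctk-e2e", log.Printf)
}
```

Dropping `|| true` from the command lets the exit status reach the caller, so the warning reflects what actually happened.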

Comment on lines 56 to 59
# Remove any imports line from the config (reset to original state)
if [ -f /etc/containerd/config.toml ]; then
grep -v "^imports = " /etc/containerd/config.toml > /tmp/config.toml.tmp && mv /tmp/config.toml.tmp /etc/containerd/config.toml || true
fi
Member

Why not just make a copy of the original config and restore that after / before each test?
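One way the copy-and-restore idea could be sketched, assuming the commands are passed to the nested runner as shell strings. The `.orig` suffix and the helper names `shellQuote`, `backupScript`, and `restoreScript` are hypothetical and not part of the PR.

```go
package main

import (
	"fmt"
	"strings"
)

// shellQuote wraps s in single quotes so the shell receives it verbatim.
func shellQuote(s string) string {
	return "'" + strings.ReplaceAll(s, "'", `'\''`) + "'"
}

// backupScript returns a command that saves a pristine copy of path,
// but only if no copy exists yet. The ".orig" suffix is a hypothetical choice.
func backupScript(path string) string {
	q, b := shellQuote(path), shellQuote(path+".orig")
	return fmt.Sprintf("[ -f %s ] || cp -p %s %s", b, q, b)
}

// restoreScript returns a command that puts the saved copy back in place.
func restoreScript(path string) string {
	return fmt.Sprintf("cp -p %s %s", shellQuote(path+".orig"), shellQuote(path))
}

func main() {
	const cfg = "/etc/containerd/config.toml"
	fmt.Println(backupScript(cfg))  // run once, e.g. before the first test
	fmt.Println(restoreScript(cfg)) // run before or after each test
}
```

Restoring a saved copy avoids having to know which lines a test added, which the `grep -v` approach depends on.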


# Restart containerd to pick up the clean config
systemctl restart containerd
sleep 2
Member

Is there a way to check containerd health?
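A sketch of replacing the fixed `sleep 2` with a bounded poll. `pollUntil` and `errNotReady` are hypothetical names. The probe here is a stand-in: in the real suite it could run a command through the nested runner that only succeeds once the daemon answers (for example `ctr version`, which, as far as I know, fails while the server is unreachable, but treat that as an assumption).

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errNotReady is a placeholder error for the stand-in probe below.
var errNotReady = errors.New("not ready")

// pollUntil calls check up to attempts times, sleeping interval between
// failed tries, and returns nil as soon as check succeeds.
func pollUntil(check func() error, attempts int, interval time.Duration) error {
	err := errors.New("no attempts made")
	for i := 0; i < attempts; i++ {
		if err = check(); err == nil {
			return nil
		}
		if i < attempts-1 {
			time.Sleep(interval)
		}
	}
	return fmt.Errorf("not healthy after %d attempts: %w", attempts, err)
}

func main() {
	calls := 0
	// Stand-in probe that succeeds on the third call.
	probe := func() error {
		calls++
		if calls < 3 {
			return errNotReady
		}
		return nil
	}
	fmt.Println(pollUntil(probe, 10, 10*time.Millisecond), calls)
}
```

Gomega's `Eventually` offers the same behaviour inside a Ginkgo suite and may be the more natural fit here.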

Comment on lines 80 to 84
output, _, err := nestedContainerRunner.Run(`cat /etc/containerd/conf.d/99-nvidia.toml`)
Expect(err).ToNot(HaveOccurred())
Expect(output).To(ContainSubstring(`nvidia`))
Expect(output).To(ContainSubstring(`nvidia-cdi`))
Expect(output).To(ContainSubstring(`nvidia-legacy`))
Member

As I mentioned in person, we are no longer triggering the configuration of containerd with the current installation mechanism.

output, _, err = nestedContainerRunner.Run(`containerd config dump`)
Expect(err).ToNot(HaveOccurred())
// Verify imports section is in the merged config
Expect(output).To(ContainSubstring(`imports = ['/etc/containerd/conf.d/*.toml']`))
Member

I actually think that config dump prints the ACTUAL paths of all files processed.

Comment on lines +97 to +188
ContainSubstring(`default_runtime_name = "nvidia"`),
ContainSubstring(`default_runtime_name = 'nvidia'`),
Member

These should definitely be VERSION specific checks.

ContainSubstring(`default_runtime_name = "nvidia"`),
ContainSubstring(`default_runtime_name = 'nvidia'`),
))
Expect(output).To(ContainSubstring(`enable_cdi = true`))
Member

Where do we toggle this behaviour? It is disabled by default.

Comment on lines +105 to +196
ContainSubstring(`[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]`),
ContainSubstring(`[plugins.'io.containerd.cri.v1.runtime'.containerd.runtimes.nvidia]`),
Member

Once again, this should be version-specific.
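A sketch of how the `default_runtime_name` and runtime-table checks could be keyed on the config version rather than accepting either form. `configVersion` and `expectedSnippets` are hypothetical helpers. The pairing of quote style and plugin ID with each version is an assumption read off the two alternatives in the diff and would need confirming against real `containerd config dump` output.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

var versionLine = regexp.MustCompile(`(?m)^version\s*=\s*(\d+)`)

// configVersion extracts the top-level "version = N" value from the
// output of `containerd config dump`.
func configVersion(dump string) (int, error) {
	m := versionLine.FindStringSubmatch(dump)
	if m == nil {
		return 0, fmt.Errorf("no version line found in config dump")
	}
	return strconv.Atoi(m[1])
}

// expectedSnippets returns the strings a test should look for per config
// version. The version-to-form mapping is an assumption (see lead-in).
func expectedSnippets(version int) ([]string, error) {
	switch version {
	case 2:
		return []string{
			`default_runtime_name = "nvidia"`,
			`[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]`,
		}, nil
	case 3:
		return []string{
			`default_runtime_name = 'nvidia'`,
			`[plugins.'io.containerd.cri.v1.runtime'.containerd.runtimes.nvidia]`,
		}, nil
	}
	return nil, fmt.Errorf("unsupported config version %d", version)
}

func main() {
	v, err := configVersion("version = 3\nroot = '/var/lib/containerd'\n")
	if err != nil {
		panic(err)
	}
	snippets, _ := expectedSnippets(v)
	fmt.Println(v, snippets)
}
```

Each returned string would then go into its own `ContainSubstring` assertion, so a mismatch points at the exact form expected for the version under test.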

})

When("containerd already has a custom default runtime configured", func() {
It("should preserve the existing default runtime when --set-as-default=false is specified", func(ctx context.Context) {
Member

--set-as-default=false is not specified. It is the default.

`)
Expect(err).ToNot(HaveOccurred())

// Configure containerd with drop-in config (explicitly not setting as default)
Member

How are we configuring with the drop-in config in this case?

})

When("containerd has multiple custom runtimes and plugins configured", func() {
It("should add NVIDIA runtime alongside existing runtimes like kata", func(ctx context.Context) {
Member

As a matter of interest, how is this different from an arbitrary "custom" runtime?

Expect(err).ToNot(HaveOccurred())

// Verify kata runtime was added
output, _, err := nestedContainerRunner.Run(`systemctl restart containerd && sleep 2 && containerd config dump | grep -A5 kata`)
Member

Why the -A5? Also, please split the different steps.
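A sketch of what splitting the steps might look like: each command runs on its own, a failure names the step that broke, and the substring check happens in Go rather than through `grep -A5`. The `Runner` interface is an assumption modelled on the three-value `Run` call in the diff; `runSteps` and `fakeRunner` are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// Runner mirrors the three-value Run call seen in the diff (assumption).
type Runner interface {
	Run(script string) (string, string, error)
}

// runSteps runs each command separately and stops at the first failure,
// naming the step that failed. It returns the stdout of the last step.
func runSteps(r Runner, steps ...string) (string, error) {
	var out string
	for i, step := range steps {
		stdout, stderr, err := r.Run(step)
		if err != nil {
			return "", fmt.Errorf("step %d (%q) failed: %w (stderr: %s)", i+1, step, err, stderr)
		}
		out = stdout
	}
	return out, nil
}

// fakeRunner is a stand-in that returns canned output per command.
type fakeRunner map[string]string

func (f fakeRunner) Run(script string) (string, string, error) {
	out, ok := f[script]
	if !ok {
		return "", "command not found", errors.New("exit status 127")
	}
	return out, "", nil
}

func main() {
	r := fakeRunner{
		"systemctl restart containerd": "",
		"containerd config dump":       "runtimes.kata\n",
	}
	out, err := runSteps(r, "systemctl restart containerd", "containerd config dump")
	fmt.Println(strings.Contains(out, "kata"), err)
}
```

With the full dump available in Go, the test can assert on the exact runtime table instead of relying on a fixed number of context lines.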

Expect(err).ToNot(HaveOccurred())
Expect(output).To(ContainSubstring(`kata`))

// Configure containerd with drop-in config and set NVIDIA as default
Member

Where are we setting these options?

Comment on lines 246 to 252
_, _, err := nestedContainerRunner.Run(`
rm -f /etc/containerd/config.toml
rm -rf /etc/containerd/conf.d
mkdir -p /etc/containerd/conf.d

cat > /etc/containerd/config.toml <<'EOF'
version = 3
Member

How do we ensure that this is only run on containerd versions that support it?
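A sketch of gating the `version = 3` cases on the installed containerd, assuming config version 3 needs containerd 2.0 or newer and that `containerd --version` prints an `x.y.z` token somewhere in its output. `containerdMajor` and `supportsConfigV3` are hypothetical helpers, and the sample version strings in `main` are illustrative only.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

var semver = regexp.MustCompile(`v?(\d+)\.(\d+)\.(\d+)`)

// containerdMajor extracts the major version from `containerd --version`
// output. The output layout is an assumption; only the first x.y.z token
// is used.
func containerdMajor(versionOutput string) (int, error) {
	m := semver.FindStringSubmatch(versionOutput)
	if m == nil {
		return 0, fmt.Errorf("no version found in %q", versionOutput)
	}
	return strconv.Atoi(m[1])
}

// supportsConfigV3 reports whether a "version = 3" config can be used,
// assuming it requires containerd 2.0 or newer.
func supportsConfigV3(versionOutput string) (bool, error) {
	major, err := containerdMajor(versionOutput)
	if err != nil {
		return false, err
	}
	return major >= 2, nil
}

func main() {
	for _, out := range []string{
		"containerd github.com/containerd/containerd/v2 v2.0.0 207ad711eabd375a01713109a8a197d197ff6542",
		"containerd containerd.io 1.7.27 05044ec0a9a75232cad458027ca83437aae3f4da",
	} {
		ok, err := supportsConfigV3(out)
		fmt.Println(ok, err)
	}
}
```

In the suite, a `false` result could feed Ginkgo's `Skip` at the top of the `When` block so that older containerd versions skip these cases rather than fail them.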


# Create a custom config that will be imported
# Use the correct plugin path for containerd v2/v3
cat > /etc/containerd/custom.d/10-custom.toml <<'EOF'
Member

I thought the correct path was: /etc/containerd/conf.d? https://github.com/containerd/containerd/pull/12323/files

Expect(err).ToNot(HaveOccurred())

// Verify containerd can load the custom import before installer
_, _, err = nestedContainerRunner.Run(`systemctl restart containerd && sleep 2 && containerd config dump | grep -i myregistry`)
Member

Why are we randomly checking for myregistry? Is this sufficient?

// Run a container with NVIDIA runtime
// Note: We use native snapshotter because overlay doesn't work in nested containers
output, stderr, err = nestedContainerRunner.Run(`
ctr run --rm --snapshotter native docker.io/library/busybox:latest test-nvidia echo "Hello from NVIDIA runtime"
Member

How does this use the nvidia runtime?

Member

Also, if no devices are requested, then the nvidia-runtime is a no-op.
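A sketch of a command that would address both points, under stated assumptions: `ctr` talks to containerd's API directly rather than through the CRI plugin, so a CRI `default_runtime_name` would not apply to it, and the runtime binary has to be selected explicitly. The builder below assumes `ctr run` accepts `--runc-binary` and `--env`, and that the runtime is installed at `/usr/bin/nvidia-container-runtime`. `ctrRunCommand` is a hypothetical helper, and whether `NVIDIA_VISIBLE_DEVICES=all` is the right device request for this suite is also an assumption.

```go
package main

import (
	"fmt"
	"strings"
)

// ctrRunCommand builds a `ctr run` invocation that selects the NVIDIA
// runtime binary explicitly and requests devices through the environment.
// Flag names and the binary path are assumptions (see lead-in).
func ctrRunCommand(image, id string, args ...string) string {
	parts := []string{
		"ctr", "run", "--rm",
		"--snapshotter", "native",
		"--runc-binary", "/usr/bin/nvidia-container-runtime",
		"--env", "NVIDIA_VISIBLE_DEVICES=all",
		image, id,
	}
	parts = append(parts, args...)
	return strings.Join(parts, " ")
}

func main() {
	fmt.Println(ctrRunCommand("docker.io/library/busybox:latest", "test-nvidia", "ls", "/dev"))
}
```

Listing `/dev` (or running `nvidia-smi` in an image that ships it) would give the test an observable effect of the runtime, which an `echo` cannot provide.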


// Pull a test image
output, stderr, err := nestedContainerRunner.Run(`
ctr image pull docker.io/library/busybox:latest
Member

Does ctr automatically use the containerd config?

Signed-off-by: Carlos Eduardo Arango Gutierrez <[email protected]>