Skip to content

Conversation

@ArangoGutierrez
Copy link
Collaborator

This pull request introduces significant updates to the end-to-end (E2E) testing framework for the NVIDIA Kubernetes Device Plugin. The changes include enhancements to the CI workflow, a migration to the Ginkgo v2 testing framework, restructuring of test-related files, and updates to licensing information. Below is a categorized summary of the most important changes:

Workflow and CI Updates:

  • Added a new HELM_CHART environment variable to the E2E CI workflow (.github/workflows/e2e.yaml) and updated the make command to use a specific Makefile for E2E tests. This improves configurability and aligns the workflow with the new test structure.
  • Replaced the Slack alert notification step with a new Ginkgo log archiving step and updated the Slack notification configuration to use a JSON payload for better integration.

Migration to Ginkgo v2:

  • Updated the Makefile to include a ginkgo target for installing the Ginkgo v2 CLI and a new test-e2e target for running tests with Ginkgo. This replaces the previous custom test logic.
  • Renamed and restructured test files to align with Ginkgo's conventions, such as renaming common/kubernetes.go to cleanup_test.go and removing redundant utility functions. [1] [2]

Documentation Enhancements:

  • Added a comprehensive README.md for the E2E test suite, detailing prerequisites, environment variables, execution flow, and troubleshooting steps. This improves developer onboarding and test maintainability.

Licensing and Copyright:

  • Updated all test-related files to use the SPDX license headers, reflecting the transition to a 2025 copyright year and the Apache 2.0 license. [1] [2] [3]

Codebase Simplification:

  • Removed unused test utilities, such as gpu_job.go and taints.go, consolidating functionality into more focused test files. This reduces redundancy and simplifies the codebase. [1] [2]

These changes collectively modernize the testing framework, improve CI integration, and enhance developer experience.

@ArangoGutierrez ArangoGutierrez self-assigned this Apr 24, 2025
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR modernizes the NVIDIA Kubernetes Device Plugin’s E2E test framework by upgrading to Ginkgo v2, updating CI workflows, and adjusting file structure and licensing information.

  • Migrates tests from legacy framework patterns to Ginkgo v2 constructs
  • Updates CI workflow to use environment variables and a dedicated Makefile target
  • Removes deprecated utility files and refactors test helpers for consistency

Reviewed Changes

Copilot reviewed 405 out of 407 changed files in this pull request and generated no comments.

Show a summary per file
File Description
tests/e2e/utils.go Updated license headers and refactored helper functions with adjusted timeouts
tests/e2e/infra/aws.yaml Modified configuration for helm chart deployments and toolkit settings
tests/e2e/gpu-feature-discovery_test.go Migration to Ginkgo v2 and changes to use environment variables and updated client references
tests/e2e/e2e_test.go Overhauled test entry point to use environment variables and new helper functions
tests/e2e/device-plugin_test.go Refactored test structure and naming for Helm release names and client usage
tests/e2e/cleanup_test.go Consolidated cleanup functions for node, resource, and CRD removal
tests/e2e/README.md Added comprehensive documentation for running and configuring the E2E suite
.github/workflows/e2e.yaml Updated workflow steps to use a Makefile in the E2E directory and improved artifact handling
Files not reviewed (2)
  • tests/e2e/Makefile: Language not supported
  • tests/go.mod: Language not supported
Comments suppressed due to low confidence (3)

tests/e2e/device-plugin_test.go:80

  • [nitpick] Switching to randomSuffix() for generating helmReleaseName improves uniqueness; consider updating any accompanying comments or documentation to reflect this change in naming strategy.
helmReleaseName = "nvdp-e2e-test-" + randomSuffix()

.github/workflows/e2e.yaml:66

  • [nitpick] The workflow now explicitly references a Makefile in the tests/e2e directory; verify that this path remains accurate and that all necessary targets are defined in the referenced Makefile.
make -f tests/e2e/Makefile test-e2e

tests/e2e/utils.go:45

  • The timeout for eventuallyNonControlPlaneNodes has been increased from 10 seconds to 1 minute; please confirm that this longer duration aligns with the overall test responsiveness and cluster performance expectations.
}).WithPolling(1 * time.Second).WithTimeout(1 * time.Minute).WithContext(ctx)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the reason for this change just consistency with the DRA driver?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, I am now working to bring more consistency across all the repos we manage

Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This pull request modernizes the E2E test framework by migrating to Ginkgo v2, reorganizing test files, updating CI workflows, and refreshing licensing and documentation. Key changes include the restructuring and renaming of test files (e.g. renaming common/kubernetes.go to cleanup_test.go), enhancement of CI configurations (e.g. new Makefile target and Slack payload update), and removal of obsolete utilities.

Reviewed Changes

Copilot reviewed 406 out of 408 changed files in this pull request and generated 1 comment.

Show a summary per file
File Description
tests/e2e/infra/aws.yaml Adjusted configuration ordering and added new keys for CDI enablement.
tests/e2e/gpu-feature-discovery_test.go Migrated test syntax and variable usage to align with Ginkgo v2 conventions.
tests/e2e/device-plugin_test.go Updated Helm chart installation and job creation logic.
tests/e2e/cleanup_test.go Refactored cleanup functions; moved functions from package common to e2e.
tests/e2e/README.md Added comprehensive usage and troubleshooting documentation.
.github/workflows/e2e.yaml Updated environment variables, Makefile target, and Slack notification steps.
Files not reviewed (2)
  • tests/e2e/Makefile: Language not supported
  • tests/go.mod: Language not supported
Comments suppressed due to low confidence (3)

tests/e2e/cleanup_test.go:267

  • The 'ctx' variable is used in the cleanupNodeFeatureRules function without being defined. Consider passing a context parameter to this function to correctly scope API calls.
nfrs, err := cli.NfdV1alpha1().NodeFeatureRules().List(ctx, metav1.ListOptions{})

tests/e2e/cleanup_test.go:69

  • The variable 'ctx' is used in the cleanupTestPods function without being defined or passed as an argument. Consider either passing a context parameter or declaring 'ctx' locally.
podList, err := clientSet.CoreV1().Pods(namespace).List(ctx, metav1.ListOptions{

tests/e2e/cleanup_test.go:114

  • The variable 'ctx' is used without being defined in this cleanup helper. Ensure that a valid context is passed to this function or declared within its scope to avoid runtime errors.
err := clientSet.CoreV1().Namespaces().Delete(ctx, testNamespace.Name, metav1.DeleteOptions{})

tariq1890
tariq1890 previously approved these changes May 2, 2025
@ArangoGutierrez
Copy link
Collaborator Author

Thanks for the review and approval @tariq1890 , any additional comments @elezar ?

@ArangoGutierrez
Copy link
Collaborator Author

Now that we have merged the PR in CTK can we do a final round here @elezar

@ArangoGutierrez ArangoGutierrez added the testing issue/PR to fix/edit/create/enhance a project unit/e2e test label May 26, 2025
@ArangoGutierrez
Copy link
Collaborator Author

Branch rebased on main, PR ready for review
@elezar

@ArangoGutierrez ArangoGutierrez force-pushed the ginkgo-e2e branch 2 times, most recently from 68f3bea to 474583c Compare July 24, 2025 10:37
@ArangoGutierrez ArangoGutierrez force-pushed the ginkgo-e2e branch 4 times, most recently from b90ff8e to a9b9002 Compare August 4, 2025 10:34
@ArangoGutierrez
Copy link
Collaborator Author

@elezar Now E2E pass
This PR includes the bump to Holodeck as well, to provide a full update to the E2E suite

// Note: DaemonSet names are dynamically generated with the Helm release prefix,
// so we wait for all DaemonSets in the namespace rather than specific names
By("Waiting for all DaemonSets to be ready")
err = internal.WaitForAllDaemonSetsReady(ctx, clientSet, testNamespace.Name)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do the DaemonSets also have app labels, or some other identifier that we could use to further restrict this set?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, the DaemonSets have labels set during Helm install. But since the namespace is unique for the test run, this is enough to capture the objects we are monitoring.

But I have added label filtering

Comment on lines 217 to 218
Kubeconfig = os.Getenv("KUBECONFIG")
Expect(Kubeconfig).NotTo(BeEmpty(), "KUBECONFIG must be set")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: In nvidia-container-toolkit we have the concept of a REQUIRED envvar that we encapsulated in a function.


// CreateOrUpdateJobsFromFile creates or updates jobs from a file
func CreateOrUpdateJobsFromFile(ctx context.Context, cli clientset.Interface, filename, namespace string) ([]string, error) {
jobs, err := newJobFromfile(filepath.Join(packagePath, "data", filename))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My one question would be why don't we provide the full relative path to the file instead of ASSUMING data? Searching for data/job-1.yaml when we do decide to move these files (or add new ones) will be simpler than searching for each path component separately.

Comment on lines 549 to 522
if err != nil {
if apierrors.IsNotFound(err) {
_, err = cli.BatchV1().Jobs(namespace).Create(ctx, job, metav1.CreateOptions{})
if err != nil {
return nil, fmt.Errorf("failed to create Job %s: %w", job.Name, err)
}
} else {
return nil, fmt.Errorf("failed to get Job %s: %w", job.Name, err)
}
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this nesting patter common in other projects?
We could use:

Suggested change
if err != nil {
if apierrors.IsNotFound(err) {
_, err = cli.BatchV1().Jobs(namespace).Create(ctx, job, metav1.CreateOptions{})
if err != nil {
return nil, fmt.Errorf("failed to create Job %s: %w", job.Name, err)
}
} else {
return nil, fmt.Errorf("failed to get Job %s: %w", job.Name, err)
}
}
if apierrors.IsNotFound(err) {
if _, cerr := cli.BatchV1().Jobs(namespace).Create(ctx, job, metav1.CreateOptions{}); cerr != nil {
err = errors.Join(err, fmt.Errorf("failed to create job: %w", cerr))
}
}
if err != nil {
return nil, fmt.Errorf("failed to get Job %s: %w", job.Name, err)
}

return nil, err
}

// TODO: find out a nicer way to decode multiple api objects from a single
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How does kubectl handle this?

Copilot

This comment was marked as outdated.

Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This pull request introduces significant updates to the end-to-end (E2E) testing framework for the NVIDIA Kubernetes Device Plugin, migrating from a custom test framework to Ginkgo v2 and enhancing CI integration.

  • Migration to Ginkgo v2 testing framework with modern patterns and improved test structure
  • Enhanced CI workflow configuration with updated Helm chart environment variables and improved logging
  • Comprehensive codebase refactoring with removal of legacy framework components and consolidation of test utilities

Reviewed Changes

Copilot reviewed 18 out of 52 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
tests/go.mod Updates Go version to 1.24.0 and refreshes dependency versions including Ginkgo v2 and test infrastructure libraries
tests/e2e/e2e_test.go Complete rewrite implementing Ginkgo v2 patterns with suite-level setup, global clients, and comprehensive test utilities
tests/e2e/device-plugin_test.go Modernizes device plugin tests using new Ginkgo v2 syntax and shared test infrastructure
tests/e2e/gpu-feature-discovery_test.go Updates GFD tests to use new framework patterns and removes deprecated dependencies
tests/e2e/internal/kube.go Adds new Kubernetes utility functions for DaemonSet and Deployment management with proper context handling
tests/e2e/infra/aws.yaml Simplifies AWS infrastructure configuration by removing hardcoded IP ranges and updating container toolkit settings
tests/e2e/README.md Provides comprehensive documentation for the updated E2E test suite with setup instructions and troubleshooting
tests/e2e/Makefile Replaces custom test logic with Ginkgo CLI integration and adds proper artifact management
.github/workflows/e2e.yaml Updates CI workflow to use new Makefile targets and improves artifact collection with Ginkgo log archiving

Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.
You can also share your feedback on Copilot code review for a chance to win a $100 gift card. Take the survey.

suiteName := "E2E K8s Device Plugin"

RegisterFailHandler(Fail)

Copy link

Copilot AI Aug 14, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TestMain should take *testing.M as parameter, not *testing.T. The correct signature is func TestMain(m *testing.M). This will cause compilation errors.

Suggested change
func TestMain(m *testing.M) {

Copilot uses AI. Check for mistakes.
// Log random seed for reproducibility
GinkgoWriter.Printf("Random seed: %d\n", GinkgoRandomSeed())

RunSpecs(t,
Copy link

Copilot AI Aug 14, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since TestMain receives *testing.T (which should be *testing.M), this call to RunSpecs with t parameter will fail. RunSpecs expects *testing.T but TestMain should provide *testing.M and call m.Run().

Copilot uses AI. Check for mistakes.

// eventuallyNonControlPlaneNodes is a helper for asserting node properties
//
//nolint:unused
Copy link

Copilot AI Aug 14, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The eventuallyNonControlPlaneNodes function is marked as unused but appears to be used in other test files. Consider removing the unused annotation or verifying if this function is actually needed.

Suggested change
//nolint:unused

Copilot uses AI. Check for mistakes.
@copy-pr-bot
Copy link

copy-pr-bot bot commented Sep 15, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@elezar elezar force-pushed the ginkgo-e2e branch 3 times, most recently from 9052993 to 00b810a Compare September 17, 2025 12:24
ArangoGutierrez and others added 3 commits September 17, 2025 14:42
Signed-off-by: Carlos Eduardo Arango Gutierrez <[email protected]>
Signed-off-by: Evan Lezar <[email protected]>
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

Copilot reviewed 15 out of 18 changed files in this pull request and generated 5 comments.


Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.

By("Creating a GPU job")
jobNames, err := CreateOrUpdateJobsFromFile(ctx, clientSet, testNamespace.Name, filepath.Join(projectRoot, "testdata", "job-1.yaml"))
Expect(err).NotTo(HaveOccurred())
Expect(jobNames).NotTo(HaveLen(1))
Copy link

Copilot AI Sep 17, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The assertion logic is inverted. This will fail when exactly one job is created (the expected case). Should be Expect(jobNames).To(HaveLen(1)) to verify that exactly one job was created.

Suggested change
Expect(jobNames).NotTo(HaveLen(1))
Expect(jobNames).To(HaveLen(1))

Copilot uses AI. Check for mistakes.
Comment on lines +82 to +84
fmt.Sprintf("image.repository=%s", ImageRepo),
fmt.Sprintf("image.tag=%s", ImageTag),
fmt.Sprintf("image.pullPolicy=%s", ImagePullPolicy),
Copy link

Copilot AI Sep 17, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These variables (ImageRepo, ImageTag, ImagePullPolicy) are referenced without pointer dereference, but they appear to be global variables that should be dereferenced. The original code used *ImageRepo, *ImageTag, *ImagePullPolicy.

Suggested change
fmt.Sprintf("image.repository=%s", ImageRepo),
fmt.Sprintf("image.tag=%s", ImageTag),
fmt.Sprintf("image.pullPolicy=%s", ImagePullPolicy),
fmt.Sprintf("image.repository=%s", *ImageRepo),
fmt.Sprintf("image.tag=%s", *ImageTag),
fmt.Sprintf("image.pullPolicy=%s", *ImagePullPolicy),

Copilot uses AI. Check for mistakes.
Comment on lines +66 to +68
fmt.Sprintf("image.repository=%s", ImageRepo),
fmt.Sprintf("image.tag=%s", ImageTag),
fmt.Sprintf("image.pullPolicy=%s", ImagePullPolicy),
Copy link

Copilot AI Sep 17, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These variables (ImageRepo, ImageTag, ImagePullPolicy) are referenced without pointer dereference, but they appear to be global variables that should be dereferenced. The original code used *ImageRepo, *ImageTag, *ImagePullPolicy.

Suggested change
fmt.Sprintf("image.repository=%s", ImageRepo),
fmt.Sprintf("image.tag=%s", ImageTag),
fmt.Sprintf("image.pullPolicy=%s", ImagePullPolicy),
fmt.Sprintf("image.repository=%s", *ImageRepo),
fmt.Sprintf("image.tag=%s", *ImageTag),
fmt.Sprintf("image.pullPolicy=%s", *ImagePullPolicy),

Copilot uses AI. Check for mistakes.
var err error
diagnosticsCollector, err = diagnostics.New(
diagnostics.WithNamespace(testNamespace.Name),
diagnostics.WithArtifactDir(LogArtifactDir),
Copy link

Copilot AI Sep 17, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LogArtifactDir variable is referenced without pointer dereference, but appears to be a global variable that should be dereferenced. The original code used *LogArtifactDir.

Suggested change
diagnostics.WithArtifactDir(LogArtifactDir),
diagnostics.WithArtifactDir(*LogArtifactDir),

Copilot uses AI. Check for mistakes.
var err error
diagnosticsCollector, err = diagnostics.New(
diagnostics.WithNamespace(testNamespace.Name),
diagnostics.WithArtifactDir(LogArtifactDir),
Copy link

Copilot AI Sep 17, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LogArtifactDir variable is referenced without pointer dereference, but appears to be a global variable that should be dereferenced. The original code used *LogArtifactDir.

Suggested change
diagnostics.WithArtifactDir(LogArtifactDir),
diagnostics.WithArtifactDir(*LogArtifactDir),

Copilot uses AI. Check for mistakes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

testing issue/PR to fix/edit/create/enhance a project unit/e2e test

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants