-
Notifications
You must be signed in to change notification settings - Fork 745
[no-relnote] Update E2E test suite #1242
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR modernizes the NVIDIA Kubernetes Device Plugin’s E2E test framework by upgrading to Ginkgo v2, updating CI workflows, and adjusting file structure and licensing information.
- Migrates tests from legacy framework patterns to Ginkgo v2 constructs
- Updates CI workflow to use environment variables and a dedicated Makefile target
- Removes deprecated utility files and refactors test helpers for consistency
Reviewed Changes
Copilot reviewed 405 out of 407 changed files in this pull request and generated no comments.
Show a summary per file
| File | Description |
|---|---|
| tests/e2e/utils.go | Updated license headers and refactored helper functions with adjusted timeouts |
| tests/e2e/infra/aws.yaml | Modified configuration for helm chart deployments and toolkit settings |
| tests/e2e/gpu-feature-discovery_test.go | Migration to Ginkgo v2 and changes to use environment variables and updated client references |
| tests/e2e/e2e_test.go | Overhauled test entry point to use environment variables and new helper functions |
| tests/e2e/device-plugin_test.go | Refactored test structure and naming for Helm release names and client usage |
| tests/e2e/cleanup_test.go | Consolidated cleanup functions for node, resource, and CRD removal |
| tests/e2e/README.md | Added comprehensive documentation for running and configuring the E2E suite |
| .github/workflows/e2e.yaml | Updated workflow steps to use a Makefile in the E2E directory and improved artifact handling |
Files not reviewed (2)
- tests/e2e/Makefile: Language not supported
- tests/go.mod: Language not supported
Comments suppressed due to low confidence (3)
tests/e2e/device-plugin_test.go:80
- [nitpick] Switching to randomSuffix() for generating helmReleaseName improves uniqueness; consider updating any accompanying comments or documentation to reflect this change in naming strategy.
helmReleaseName = "nvdp-e2e-test-" + randomSuffix()
.github/workflows/e2e.yaml:66
- [nitpick] The workflow now explicitly references a Makefile in the tests/e2e directory; verify that this path remains accurate and that all necessary targets are defined in the referenced Makefile.
make -f tests/e2e/Makefile test-e2e
tests/e2e/utils.go:45
- The timeout for eventuallyNonControlPlaneNodes has been increased from 10 seconds to 1 minute; please confirm that this longer duration aligns with the overall test responsiveness and cluster performance expectations.
}).WithPolling(1 * time.Second).WithTimeout(1 * time.Minute).WithContext(ctx)
9037244 to
2c5aadd
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is the reason for this change just consistency with the DRA driver?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, I am now working to bring more consistency across all the repos we manage
2c5aadd to
047d5f4
Compare
b508d98 to
4937906
Compare
4937906 to
60a5ecf
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This pull request modernizes the E2E test framework by migrating to Ginkgo v2, reorganizing test files, updating CI workflows, and refreshing licensing and documentation. Key changes include the restructuring and renaming of test files (e.g. renaming common/kubernetes.go to cleanup_test.go), enhancement of CI configurations (e.g. new Makefile target and Slack payload update), and removal of obsolete utilities.
Reviewed Changes
Copilot reviewed 406 out of 408 changed files in this pull request and generated 1 comment.
Show a summary per file
| File | Description |
|---|---|
| tests/e2e/infra/aws.yaml | Adjusted configuration ordering and added new keys for CDI enablement. |
| tests/e2e/gpu-feature-discovery_test.go | Migrated test syntax and variable usage to align with Ginkgo v2 conventions. |
| tests/e2e/device-plugin_test.go | Updated Helm chart installation and job creation logic. |
| tests/e2e/cleanup_test.go | Refactored cleanup functions; moved functions from package common to e2e. |
| tests/e2e/README.md | Added comprehensive usage and troubleshooting documentation. |
| .github/workflows/e2e.yaml | Updated environment variables, Makefile target, and Slack notification steps. |
Files not reviewed (2)
- tests/e2e/Makefile: Language not supported
- tests/go.mod: Language not supported
Comments suppressed due to low confidence (3)
tests/e2e/cleanup_test.go:267
- The 'ctx' variable is used in the cleanupNodeFeatureRules function without being defined. Consider passing a context parameter to this function to correctly scope API calls.
nfrs, err := cli.NfdV1alpha1().NodeFeatureRules().List(ctx, metav1.ListOptions{})
tests/e2e/cleanup_test.go:69
- The variable 'ctx' is used in the cleanupTestPods function without being defined or passed as an argument. Consider either passing a context parameter or declaring 'ctx' locally.
podList, err := clientSet.CoreV1().Pods(namespace).List(ctx, metav1.ListOptions{
tests/e2e/cleanup_test.go:114
- The variable 'ctx' is used without being defined in this cleanup helper. Ensure that a valid context is passed to this function or declared within its scope to avoid runtime errors.
err := clientSet.CoreV1().Namespaces().Delete(ctx, testNamespace.Name, metav1.DeleteOptions{})
|
Thanks for the review and approval @tariq1890 , any additional comments @elezar ? |
|
Now that we have merged the PR in CTK can we do a final round here @elezar |
60a5ecf to
f98f08c
Compare
|
Branch rebased on main, PR ready for review |
68f3bea to
474583c
Compare
b90ff8e to
a9b9002
Compare
|
@elezar Now E2E pass |
tests/e2e/device-plugin_test.go
Outdated
| // Note: DaemonSet names are dynamically generated with the Helm release prefix, | ||
| // so we wait for all DaemonSets in the namespace rather than specific names | ||
| By("Waiting for all DaemonSets to be ready") | ||
| err = internal.WaitForAllDaemonSetsReady(ctx, clientSet, testNamespace.Name) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do the DaemonSets also have app labels, or some other identifier that we could use to further restrict this set?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, the DaemonSets have labels set during Helm install. But since the namespace is unique for the test run, this is enough to capture the objects we are monitoring.
But I have added label filtering
tests/e2e/e2e_test.go
Outdated
| Kubeconfig = os.Getenv("KUBECONFIG") | ||
| Expect(Kubeconfig).NotTo(BeEmpty(), "KUBECONFIG must be set") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: In nvidia-container-toolkit we have the concept of a REQUIRED envvar that we encapsulated in a function.
tests/e2e/e2e_test.go
Outdated
|
|
||
| // CreateOrUpdateJobsFromFile creates or updates jobs from a file | ||
| func CreateOrUpdateJobsFromFile(ctx context.Context, cli clientset.Interface, filename, namespace string) ([]string, error) { | ||
| jobs, err := newJobFromfile(filepath.Join(packagePath, "data", filename)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My one question would be why don't we provide the full relative path to the file instead of ASSUMING data? Searching for data/job-1.yaml when we do decide to move these files (or add new ones) will be simpler than searching for each path component separately.
tests/e2e/e2e_test.go
Outdated
| if err != nil { | ||
| if apierrors.IsNotFound(err) { | ||
| _, err = cli.BatchV1().Jobs(namespace).Create(ctx, job, metav1.CreateOptions{}) | ||
| if err != nil { | ||
| return nil, fmt.Errorf("failed to create Job %s: %w", job.Name, err) | ||
| } | ||
| } else { | ||
| return nil, fmt.Errorf("failed to get Job %s: %w", job.Name, err) | ||
| } | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this nesting patter common in other projects?
We could use:
| if err != nil { | |
| if apierrors.IsNotFound(err) { | |
| _, err = cli.BatchV1().Jobs(namespace).Create(ctx, job, metav1.CreateOptions{}) | |
| if err != nil { | |
| return nil, fmt.Errorf("failed to create Job %s: %w", job.Name, err) | |
| } | |
| } else { | |
| return nil, fmt.Errorf("failed to get Job %s: %w", job.Name, err) | |
| } | |
| } | |
| if apierrors.IsNotFound(err) { | |
| if _, cerr := cli.BatchV1().Jobs(namespace).Create(ctx, job, metav1.CreateOptions{}); cerr != nil { | |
| err = errors.Join(err, fmt.Errorf("failed to create job: %w", cerr)) | |
| } | |
| } | |
| if err != nil { | |
| return nil, fmt.Errorf("failed to get Job %s: %w", job.Name, err) | |
| } |
tests/e2e/e2e_test.go
Outdated
| return nil, err | ||
| } | ||
|
|
||
| // TODO: find out a nicer way to decode multiple api objects from a single |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How does kubectl handle this?
4a340b3 to
2322a06
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This pull request introduces significant updates to the end-to-end (E2E) testing framework for the NVIDIA Kubernetes Device Plugin, migrating from a custom test framework to Ginkgo v2 and enhancing CI integration.
- Migration to Ginkgo v2 testing framework with modern patterns and improved test structure
- Enhanced CI workflow configuration with updated Helm chart environment variables and improved logging
- Comprehensive codebase refactoring with removal of legacy framework components and consolidation of test utilities
Reviewed Changes
Copilot reviewed 18 out of 52 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
| tests/go.mod | Updates Go version to 1.24.0 and refreshes dependency versions including Ginkgo v2 and test infrastructure libraries |
| tests/e2e/e2e_test.go | Complete rewrite implementing Ginkgo v2 patterns with suite-level setup, global clients, and comprehensive test utilities |
| tests/e2e/device-plugin_test.go | Modernizes device plugin tests using new Ginkgo v2 syntax and shared test infrastructure |
| tests/e2e/gpu-feature-discovery_test.go | Updates GFD tests to use new framework patterns and removes deprecated dependencies |
| tests/e2e/internal/kube.go | Adds new Kubernetes utility functions for DaemonSet and Deployment management with proper context handling |
| tests/e2e/infra/aws.yaml | Simplifies AWS infrastructure configuration by removing hardcoded IP ranges and updating container toolkit settings |
| tests/e2e/README.md | Provides comprehensive documentation for the updated E2E test suite with setup instructions and troubleshooting |
| tests/e2e/Makefile | Replaces custom test logic with Ginkgo CLI integration and adds proper artifact management |
| .github/workflows/e2e.yaml | Updates CI workflow to use new Makefile targets and improves artifact collection with Ginkgo log archiving |
Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.
You can also share your feedback on Copilot code review for a chance to win a $100 gift card. Take the survey.
| suiteName := "E2E K8s Device Plugin" | ||
|
|
||
| RegisterFailHandler(Fail) | ||
|
|
Copilot
AI
Aug 14, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TestMain should take *testing.M as parameter, not *testing.T. The correct signature is func TestMain(m *testing.M). This will cause compilation errors.
| func TestMain(m *testing.M) { |
| // Log random seed for reproducibility | ||
| GinkgoWriter.Printf("Random seed: %d\n", GinkgoRandomSeed()) | ||
|
|
||
| RunSpecs(t, |
Copilot
AI
Aug 14, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since TestMain receives *testing.T (which should be *testing.M), this call to RunSpecs with t parameter will fail. RunSpecs expects *testing.T but TestMain should provide *testing.M and call m.Run().
tests/e2e/e2e_test.go
Outdated
|
|
||
| // eventuallyNonControlPlaneNodes is a helper for asserting node properties | ||
| // | ||
| //nolint:unused |
Copilot
AI
Aug 14, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The eventuallyNonControlPlaneNodes function is marked as unused but appears to be used in other test files. Consider removing the unused annotation or verifying if this function is actually needed.
| //nolint:unused |
8baf4e8 to
0ae8c6c
Compare
0ae8c6c to
d98f962
Compare
9052993 to
00b810a
Compare
Signed-off-by: Carlos Eduardo Arango Gutierrez <[email protected]>
Signed-off-by: Evan Lezar <[email protected]>
Signed-off-by: Evan Lezar <[email protected]>
Signed-off-by: Evan Lezar <[email protected]>
Signed-off-by: Evan Lezar <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
Copilot reviewed 15 out of 18 changed files in this pull request and generated 5 comments.
Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.
tests/e2e/device-plugin_test.go
Outdated
| By("Creating a GPU job") | ||
| jobNames, err := CreateOrUpdateJobsFromFile(ctx, clientSet, testNamespace.Name, filepath.Join(projectRoot, "testdata", "job-1.yaml")) | ||
| Expect(err).NotTo(HaveOccurred()) | ||
| Expect(jobNames).NotTo(HaveLen(1)) |
Copilot
AI
Sep 17, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The assertion logic is inverted. This will fail when exactly one job is created (the expected case). Should be Expect(jobNames).To(HaveLen(1)) to verify that exactly one job was created.
| Expect(jobNames).NotTo(HaveLen(1)) | |
| Expect(jobNames).To(HaveLen(1)) |
| fmt.Sprintf("image.repository=%s", ImageRepo), | ||
| fmt.Sprintf("image.tag=%s", ImageTag), | ||
| fmt.Sprintf("image.pullPolicy=%s", ImagePullPolicy), |
Copilot
AI
Sep 17, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These variables (ImageRepo, ImageTag, ImagePullPolicy) are referenced without pointer dereference, but they appear to be global variables that should be dereferenced. The original code used *ImageRepo, *ImageTag, *ImagePullPolicy.
| fmt.Sprintf("image.repository=%s", ImageRepo), | |
| fmt.Sprintf("image.tag=%s", ImageTag), | |
| fmt.Sprintf("image.pullPolicy=%s", ImagePullPolicy), | |
| fmt.Sprintf("image.repository=%s", *ImageRepo), | |
| fmt.Sprintf("image.tag=%s", *ImageTag), | |
| fmt.Sprintf("image.pullPolicy=%s", *ImagePullPolicy), |
| fmt.Sprintf("image.repository=%s", ImageRepo), | ||
| fmt.Sprintf("image.tag=%s", ImageTag), | ||
| fmt.Sprintf("image.pullPolicy=%s", ImagePullPolicy), |
Copilot
AI
Sep 17, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These variables (ImageRepo, ImageTag, ImagePullPolicy) are referenced without pointer dereference, but they appear to be global variables that should be dereferenced. The original code used *ImageRepo, *ImageTag, *ImagePullPolicy.
| fmt.Sprintf("image.repository=%s", ImageRepo), | |
| fmt.Sprintf("image.tag=%s", ImageTag), | |
| fmt.Sprintf("image.pullPolicy=%s", ImagePullPolicy), | |
| fmt.Sprintf("image.repository=%s", *ImageRepo), | |
| fmt.Sprintf("image.tag=%s", *ImageTag), | |
| fmt.Sprintf("image.pullPolicy=%s", *ImagePullPolicy), |
| var err error | ||
| diagnosticsCollector, err = diagnostics.New( | ||
| diagnostics.WithNamespace(testNamespace.Name), | ||
| diagnostics.WithArtifactDir(LogArtifactDir), |
Copilot
AI
Sep 17, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LogArtifactDir variable is referenced without pointer dereference, but appears to be a global variable that should be dereferenced. The original code used *LogArtifactDir.
| diagnostics.WithArtifactDir(LogArtifactDir), | |
| diagnostics.WithArtifactDir(*LogArtifactDir), |
| var err error | ||
| diagnosticsCollector, err = diagnostics.New( | ||
| diagnostics.WithNamespace(testNamespace.Name), | ||
| diagnostics.WithArtifactDir(LogArtifactDir), |
Copilot
AI
Sep 17, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LogArtifactDir variable is referenced without pointer dereference, but appears to be a global variable that should be dereferenced. The original code used *LogArtifactDir.
| diagnostics.WithArtifactDir(LogArtifactDir), | |
| diagnostics.WithArtifactDir(*LogArtifactDir), |
Signed-off-by: Evan Lezar <[email protected]>
This pull request introduces significant updates to the end-to-end (E2E) testing framework for the NVIDIA Kubernetes Device Plugin. The changes include enhancements to the CI workflow, a migration to the Ginkgo v2 testing framework, restructuring of test-related files, and updates to licensing information. Below is a categorized summary of the most important changes:
Workflow and CI Updates:
HELM_CHARTenvironment variable to the E2E CI workflow (.github/workflows/e2e.yaml) and updated themakecommand to use a specific Makefile for E2E tests. This improves configurability and aligns the workflow with the new test structure.Migration to Ginkgo v2:
Makefileto include aginkgotarget for installing the Ginkgo v2 CLI and a newtest-e2etarget for running tests with Ginkgo. This replaces the previous custom test logic.common/kubernetes.gotocleanup_test.goand removing redundant utility functions. [1] [2]Documentation Enhancements:
README.mdfor the E2E test suite, detailing prerequisites, environment variables, execution flow, and troubleshooting steps. This improves developer onboarding and test maintainability.Licensing and Copyright:
Codebase Simplification:
gpu_job.goandtaints.go, consolidating functionality into more focused test files. This reduces redundancy and simplifies the codebase. [1] [2]These changes collectively modernize the testing framework, improve CI integration, and enhance developer experience.