Conformance and Functional Tests Failing Inconsistently #3433

Closed
mpstefan opened this issue Jun 2, 2025 · 3 comments · Fixed by #3446
Labels
refined Requirements are refined and the issue is ready to be implemented. tests Pull requests that update tests
Milestone

Comments

@mpstefan
Member

mpstefan commented Jun 2, 2025

Both conformance and functional tests are failing inconsistently. We need to investigate whether the failures share a common root cause. We see timeouts as well as many other types of failures.

Acceptance

  • Conformance and functional tests pass consistently on a known good PR.

Possible Causes

  • Pods not ready in time
  • NGINX fails to start
  • Probably not infrastructure inconsistencies
@mpstefan mpstefan added this to the v2.1.0 milestone Jun 2, 2025
@mpstefan mpstefan added the enhancement New feature or request label Jun 2, 2025
@mpstefan mpstefan added tests Pull requests that update tests refined Requirements are refined and the issue is ready to be implemented. and removed enhancement New feature or request labels Jun 2, 2025
@salonichf5 salonichf5 moved this from 🆕 New to 🏗 In Progress in NGINX Gateway Fabric Jun 2, 2025
@bjee19
Contributor

bjee19 commented Jun 5, 2025

After a couple of days of investigating the conformance tests, my findings were limited and I could not pinpoint the cause of the flaky conformance tests. Here are some details on my findings:

All of about 50-60 local test runs passed, across variations including: keeping the same NGF instance between test runs, deleting and restarting the NGF instance between test runs, experimental tests on/off, and NGINX OSS or Plus.

Compiling information from 15 or so pipeline runs, it seems that Kubernetes version, NGINX OSS or Plus, and experimental tests on/off did not have a major influence on the success of a conformance test run. Slightly more pipeline runs with experimental tests on failed. However, I think that could just be because they run more tests, which increases the likelihood of a flaky failure.
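
The point about longer suites can be made concrete with a little arithmetic. The sketch below assumes every test flakes independently with the same small probability; the function name, the per-test rate, and the test counts are hypothetical values chosen for illustration, not measurements from the pipeline.

```python
def prob_run_fails(per_test_flake_rate: float, num_tests: int) -> float:
    """Probability that at least one of num_tests independent tests flakes."""
    # A run passes only if every test passes, so the run-level failure
    # probability is 1 - (1 - p)^n.
    return 1.0 - (1.0 - per_test_flake_rate) ** num_tests


# Hypothetical numbers: same per-test flake rate, different suite sizes.
standard = prob_run_fails(per_test_flake_rate=0.001, num_tests=100)
experimental = prob_run_fails(per_test_flake_rate=0.001, num_tests=130)

print(f"standard suite:     {standard:.3f}")
print(f"experimental suite: {experimental:.3f}")
```

Under these assumptions the larger suite fails more often even though no individual test is any flakier, which is consistent with the experimental runs failing slightly more often.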

There was no pattern of a specific conformance test case failing among failed conformance test runs.

Some cases of errors that I saw:

Some pipeline runs would pass completely, so I am setting a current failure rate of a conformance pipeline run at 1/12 ~= 8%

The current thinking is that a bug fix on the NGINX Agent side might resolve some of these issues.

@bjee19
Contributor

bjee19 commented Jun 5, 2025

Note: On local and pipeline runs with NGINX Plus, the conformance test still passes even if these errors exist:

msg: ; error: failed to preform API action, NGINX Plus API is not configured

However, an excessive number of these errors has also led to the conformance test timing out.

Example job that has those errors, and yet still passes: https://github.com/nginx/nginx-gateway-fabric/actions/runs/15473165211/job/43562767814

@bjee19 bjee19 removed their assignment Jun 5, 2025
@salonichf5
Contributor

salonichf5 commented Jun 6, 2025

After investigating for 2 days, I am not able to pinpoint exactly why the functional tests are failing. I ran NGF manually with OSS and Plus many times to reproduce the issue locally, but couldn't get the tests that were failing in the pipeline to fail locally.

Some of the tests that failed in the pipeline:

  1. Upstream settings policy - https://github.com/nginx/nginx-gateway-fabric/actions/runs/15488898408/job/43610283915?pr=3470
    https://github.com/nginx/nginx-gateway-fabric/actions/runs/15496805715/job/43635663627

  2. SnippetsFilter

  3. Graceful recovery tests, which fail most often - https://github.com/nginx/nginx-gateway-fabric/actions/runs/15488921174/job/43610406013?pr=3426

Upon further investigation using print statements in the pipeline, the graceful recovery tests fail because upstreams are not available at the moment the failing or working traffic is checked. I did a manual run with Plus and different test scenarios in the suite but couldn't reproduce the original error.
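
One common way to guard against this kind of race is to poll for readiness before asserting on traffic. The following is a minimal sketch of that idea, not the code used in the test suite; `wait_until` and the simulated readiness check are hypothetical names introduced for illustration.

```python
import time
from typing import Callable


def wait_until(condition: Callable[[], bool],
               timeout_s: float = 30.0,
               interval_s: float = 1.0) -> bool:
    """Poll condition until it returns True or timeout_s elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        if condition():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)


# Simulated readiness check (hypothetical): the upstream reports ready
# on the third poll.
polls = {"count": 0}


def upstream_ready() -> bool:
    polls["count"] += 1
    return polls["count"] >= 3


if wait_until(upstream_ready, timeout_s=5.0, interval_s=0.01):
    print("upstream ready, safe to check traffic")
else:
    print("timed out waiting for upstream")
```

The condition is always evaluated at least once, so a check that is already true returns immediately without sleeping.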

I have opened a PR that ignores the upstream error message so that it no longer fails the tests. Once we have a bug fix from the Agent team, we can stop ignoring this error message and re-verify whether the issue still exists.
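
As a rough illustration of what ignoring a known error message can look like in a log check, here is a small sketch. It is not the change made in the PR: the function name, the log lines, and the ignore pattern are all placeholders, since the exact upstream error text is not quoted in this thread.

```python
import re
from typing import Iterable, List


def unexpected_errors(log_lines: Iterable[str],
                      ignored_patterns: Iterable[str]) -> List[str]:
    """Return error lines that do not match any ignored pattern."""
    compiled = [re.compile(p) for p in ignored_patterns]
    return [
        line for line in log_lines
        if "error" in line.lower()
        and not any(c.search(line) for c in compiled)
    ]


# Placeholder log lines and pattern, for illustration only.
sample_logs = [
    'level=info msg="reload complete"',
    'level=error msg="placeholder upstream error"',
    'level=error msg="config reload failed"',
]
ignored = [r"placeholder upstream error"]

for line in unexpected_errors(sample_logs, ignored):
    print(line)
```

Keeping the ignore list explicit makes it easy to delete the entry again once the underlying bug is fixed.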

@salonichf5 salonichf5 removed their assignment Jun 6, 2025
@bjee19 bjee19 moved this from 🏗 In Progress to 🆕 New in NGINX Gateway Fabric Jun 6, 2025
@github-project-automation github-project-automation bot moved this from 🆕 New to ✅ Done in NGINX Gateway Fabric Jun 9, 2025
Projects
Status: Done
3 participants