Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Consider high risk evaluation result from other unrelated prs #2317

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

xueqzhan
Copy link
Contributor

@xueqzhan xueqzhan commented Feb 5, 2025

With this there are still a few caveats:

  1. Some jobs are missing risk analysis file. Those jobs failed to contact sippy during analysis time. I assume those are the ones without external network access. A few examples:
  • pull-ci-openshift-installer-main-e2e-vsphere-host-groups-ovn-custom-no-upgrade
  • pull-ci-openshift-console-master-okd-scos-e2e-aws-ovn
  1. RA does not deal with junits created by sippy (e.g. install units) or the ones created by gather extra (e.g. operator related). But PR commenting still picks them up. Eventually we probably want to filter them out.
  2. RA does not deal with some non-openshift-tests tests. Some examples:
  • Hypershift tests: pull-ci-openshift-cluster-version-operator-main-e2e-hypershift
  • pull-ci-openshift-cluster-ingress-operator-master-e2e-aws-operator

@openshift-merge-robot openshift-merge-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Feb 5, 2025
Copy link
Contributor

openshift-ci bot commented Feb 5, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: xueqzhan

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 5, 2025
if summary.RiskLevel == api.FailureRiskLevelHigh {
riskAnalysisPRTestRiskMetric.WithLabelValues(org, repo, number, jobName, jobID, testSummary.Name).Set(float64(testSummary.Risk.Level.Level))
} else {
riskAnalysisPRTestRiskMetric.DeleteLabelValues(org, repo, number, jobName, jobID, testSummary.Name)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I struggled to unset the metric gauge. At first I thought this would work. But in reality, failed test might pass in the next run and will not be evaluated for risk analysis. That means they will not be in the summary.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if we need a separate daemon processor for metrics. Something that could look at the previous 24 hours and track the active count until it dropped off to 0 for 24 hours or so.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could be better to leave the metrics out and focus that work on TRT-1704

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At the moment, we only set this for FailureRiskLevelHigh. I would be interested in a metric that published the count for how many failures we have seen recently for a High Risk Failure, the greater the count the more likely we are detecting a regression. Again, this work would likely be better in the other card.

@openshift-merge-robot openshift-merge-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Feb 5, 2025
Copy link
Contributor

openshift-ci bot commented Feb 5, 2025

@xueqzhan: all tests passed!

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@@ -280,6 +287,7 @@ func JobRunRiskAnalysis(dbc *db.DB, jobRun *models.ProwJobRun, jobRunTestCount i
logger.WithError(err).Errorf("Failed to find matching jobIds for: %s", jobRun.ProwJob.Name)
}
}
logger.Infof("Found %d unfilered matching jobs %d runs for: %s\njobs %+v", len(jobNames), totalJobRuns, jobRun.ProwJob.Name, jobNames)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/unfilered/unfiltered

@@ -553,6 +572,55 @@ func runTestRunAnalysis(failedTest models.ProwJobRunTest, jobRun *models.ProwJob
return analysis, nil
}

func isHighRiskInOtherPRs(bqc *bigquery.Client, failedTest models.ProwJobRunTest, jobRun *models.ProwJobRun) bool {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have you tried running locally? I would be curious if we could find a case or fake one out to run through this logic.

@@ -161,7 +167,7 @@ func findReleaseMatchJobNames(dbc *db.DB, jobRun *models.ProwJobRun, compareRele
}

if len(jobs) > 0 {
logger.Infof("Found %d matches with: %s", len(jobs), name)
logger.Infof("Found %d matches with: %s\njobs %+v", len(jobs), name, jobs)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a lot of logging isn't it? An array of jobs each time within a loop...

}
rowCount = values[0].(int64)
if rowCount > 0 {
log.Infof("High risk items found in other PRs for job %s test '%s'", jobRun.ProwJob.Name, failedTest.Test.Name)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about adding the count to the logging? Maybe even returning the count so we could use it in a metric

analysis.Risk = apitype.TestFailureRisk{
Level: apitype.FailureRiskLevelNone,
Reasons: []string{
"High risk was identified in other PRs first",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm wondering if this shouldn't go straight to none but instead be set to Medium with the reason indicating that the test may have regressed external to the current PR. Potential external regression detected for High Risk Test analysis

@@ -106,6 +111,7 @@ func FetchJobRun(dbc *db.DB, jobRunID int64, logger *log.Entry) (*models.ProwJob
// Load the ProwJobRun, ProwJob, and failed tests:
// TODO: we may want to expand to analyzing flakes here in the future
res := dbc.DB.Joins("ProwJob").
Preload("PullRequests").
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Was this something that is being used already and is performance boost or left over from previous work (I didn't see a new reference to pull requests but might be missing it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants