fix(analysis): detect infrastructure errors early and mark job analysis as inconclusive by MohammedShetaya · Pull Request #4692 · argoproj/argo-rollouts

MohammedShetaya · 2026-04-14T22:23:50Z

Description

This PR introduces an early-detection mechanism in the job metric provider for terminal pod errors, such as ErrImagePull. Currently, jobs that can't pull their image wait indefinitely without ever setting a Failed condition, causing the analysis to silently succeed and promote the canary.

Changes

The JobProvider.Resume function is now actively called during analysis reconciliation. Previously, it was only invoked on manual resumes.
Resume now inspects pod container statuses (including init containers) for terminal waiting states (ErrImagePull, ImagePullBackOff, InvalidImageName) and marks the measurement as Inconclusive immediately.
The ResumeAt interval grows with the job execution time using √(2t). This matches the nature of the targeted errors — they surface early, so frequent checks are needed at job start, but not later.
In-progress measurements are no longer skipped during termination due to a future ResumeAt, ensuring Terminate() is always called.
Unit/E2E tests were added/updated to cover the new cases. The E2E fixture was made deterministic by using an indefinite pause instead of a timed one.

What is out of scope?

Long-running jobs are still an issue in the current implementation; they report an inconclusive state without any effect on the rollout lifecycle.
A change needs to be discussed and implemented on how background analysis should be treated for jobs since they are different in nature than other metric types, which are not expected to run for much time (either immediately return or time out). Job metrics, on the other hand, are expected to run for some time.
Job metrics are the only ones hitting the Provider.Terminate function, so a change can be made to do extra analysis after the metric is terminated, for example, (condition xyz should be met to promote the rollout after termination of the running metrics)

Closes: #3562
Help with: #2250

Checklist:

Either (a) I've created an enhancement proposal and discussed it with the community, (b) this is a bug fix, or (c) this is a chore.
The title of the PR is (a) conventional with a list of types and scopes found here, (b) states what changed, and (c) suffixes the related issues number. E.g. "fix(controller): Updates such and such. Fixes #1234".
I've signed my commits with DCO
My builds are green. Try syncing with master if they are not.
I have written unit and/or e2e tests for my change. PRs without these are unlikely to be merged.
I have run all tests locally (including the flaky ones) and they pass on my workstation
I have used LLM/AI/Agent tools for this PR but I am responsible for all code of this PR
I understand what the code does and WHY/HOW it works in several scenarios
I know if my code is just adding new functionality or changing old functionality for existing users
My organization is added to USERS.md.

github-actions · 2026-04-14T23:02:48Z

Published E2E Test Results

4 files 4 suites 3h 51m 58s ⏱️
122 tests 113 ✅ 7 💤 2 ❌
490 runs 460 ✅ 28 💤 2 ❌

For more details on these failures, see this check.

Results for commit 0ef642d.

♻️ This comment has been updated with latest results.

codecov · 2026-04-14T23:02:53Z

Codecov Report

❌ Patch coverage is 86.00000% with 7 lines in your changes missing coverage. Please review.
✅ Project coverage is 85.08%. Comparing base (2ccdae8) to head (0ef642d).

Files with missing lines	Patch %	Lines
metricproviders/job/job.go	88.88%	2 Missing and 2 partials ⚠️
controller/controller.go	66.66%	1 Missing and 1 partial ⚠️
cmd/rollouts-controller/main.go	50.00%	1 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##           master    #4692      +/-   ##
==========================================
+ Coverage   85.03%   85.08%   +0.04%     
==========================================
  Files         164      164              
  Lines       18989    19032      +43     
==========================================
+ Hits        16148    16194      +46     
+ Misses       1993     1992       -1     
+ Partials      848      846       -2

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

github-actions · 2026-04-14T23:02:55Z

Published Unit Test Results

2 472 tests 2 472 ✅ 3m 21s ⏱️
129 suites 0 💤
1 files 0 ❌

Results for commit 0ef642d.

♻️ This comment has been updated with latest results.

kostis-codefresh · 2026-04-23T11:56:04Z

I think we agreed that a job that cannot start would be inconclusive. Correct?

Now this is a major breaking change.

Initially we had failed job -> Succesful analysis. But there are several bugs (and I agree with them ) that this doesn't make any sense.
So with my PR we said it is inconclusive (which matches user expectations)

But with this PR (please correct me if I am missing something), a job that does not start fails the analysis metric.

I think that pausing the Rollout when an inconclusive analysis happens is a good design choice.

If I run a deployment with Argo Rollouts using smoke tests for an analysis (in a Job metric), I think there is a big difference between

The smoke tests run, finished, and returned a business error (fail the deployment)
The smoke tests had an error and could not run. The main deployment might actually be ok

In the second case I would expect an inconclusive run that allows me to intervene.

kostis-codefresh · 2026-04-23T11:57:36Z

-		PromoteRollout().
-		WaitForRolloutStatus("Healthy").
-		WaitForBackgroundAnalysisRunPhase("Inconclusive").
+		WaitForBackgroundAnalysisRunPhase("Failed").


This is an important change. I think that now users will open new issues complaining that when an analysis has failed (as a job) it is not clear if the job itself has a "terminal error" or if the job actually run and it found an error (i.e. the smoke tests did not pass)

Thanks! I agree, it will cause more confusion. I've changed the behavior to be inconclusive. There's still one unhandled case where the job is a long-running job (it still promotes), however, this shouldn't block this PR. I will follow up on it later.

sonarqubecloud · 2026-04-28T00:32:54Z

Quality Gate passed

Issues
0 New issues
0 Accepted issues

Measures
0 Security Hotspots
0.0% Coverage on New Code
3.8% Duplication on New Code

See analysis details on SonarQube Cloud

kostis-codefresh

I added some specific comments on the code. But I also have two general comments.

The job documentation page should also be updated to explain what the code assumes and how it behaves. At the very least we should now mention that job errors are instanly detected and make the analysis inconclosive

I am also really confused about the tests added/update from this PR. I would expect to see only brand new tests and no changes in existing tests. The main behavior regarding the result was already updated by the PR here https://github.com/argoproj/argo-rollouts/pull/4602/files#diff-84b1a3050cb2b346abd90a02e6c89c812bf8917dd0306c18199a0928d57f3d67L225

As a sidenote, here are all the tests that I run locally and that I would expect to see for this area https://github.com/kostis-codefresh/testing-argo-rollouts/tree/main/job-analysis in respective e2e tests

The tests this PR should add would be Faster detection of failed jobs (instead of waiting the timeout or forever). But not changing existing tests. Unless I am missing something here.

kostis-codefresh · 2026-05-07T13:38:34Z

+	}
+	elapsed := timeutil.MetaNow().Sub(measurement.StartedAt.Time)
+	// growing delay with respect to the elapsed time since the job started
+	delay := time.Duration(math.Sqrt(2*elapsed.Seconds()) * float64(time.Second))


Why do we use math.Sqrt here instead of the standard wait/backoff k8s package?

Wait.Backoff would require persisting its state on Measurement

kostis-codefresh · 2026-05-07T13:44:55Z

+		if err != nil {
+			measurement.FinishedAt = &now
+			measurement.Message = err.Error()
+			measurement.Phase = v1alpha1.AnalysisPhaseInconclusive


If we are finished and have an error and have decided already about the measurement being inconclosive don't we need to return straigth away? Do we miss a return statement here?

Yes, this was missing

kostis-codefresh · 2026-05-07T13:46:26Z

 	return measurement
 }

+var terminalWaitingReasons = map[string]bool{


I would add a specific comment here explaining that for these types of errors we clearly decide that the job is inconclusive and we want a human to intervene instead of having the rollout progress.

The "terminal" wording here implies that the rollout itself is/will be terminated and this is not the case.

kostis-codefresh · 2026-05-07T13:47:14Z

+// such as ErrImagePull in one of the pod containers
+func processJobPods(pods *v1.PodList) error {
+	for _, pod := range pods.Items {
+		for _, cs := range pod.Status.InitContainerStatuses {


Have you also tested the code with an init container example? Or just the main container of a job?

Never mind. Just saw the unit test.

kostis-codefresh · 2026-05-07T13:48:00Z

 	assert.Nil(t, measurement.FinishedAt)
 }

+func TestResumeRunningJob(t *testing.T) {


what does this test exactly? Happy path scenario? Can we add a comment?

kostis-codefresh · 2026-05-07T13:51:45Z

-# The job has a deadline of 10 seconds which is less than the 30 second pause, so the job should be terminated 
-# and the analysis should be failed.
+# Rollout with an inline analysis that uses a non-existent docker image.
+# The job pod stays in ErrImagePull/ImagePullBackOff/InvalidImageName, which causes the


This seems to change the test of another scenario. Please keep the previous test (which was specifically about having a deadline) and add a brand new test for your scenario.

kostis-codefresh · 2026-05-07T13:53:02Z

  selector:
    matchLabels:
-      app: rollout-background-long-running-job
+      app: rollout-background-invalid-job


This seems to change the test of another scenario. Please keep the previous test (which was specifically about having a deadline) and add a brand new test for your scenario.

It is two separate tests

a Job the never finished (long-running) but hasn't thrown an error just yet

A Job that has an error that we detect during the course of the rollouit

For this test case, I did two changes:

fixed the wrong selector

needed something to make the RO wait until the inconclusive state pops up, without adding a pause step to make sure the pause state is because of the inconclusive analysis.

For the two cases you mentioned above

a Job that never finished (long-running) but hasn't thrown an error just yet is covered in TestCanaryBackgroundAnalysisLongRunningJob.

A Job that has an error that we detect during the course of the rollout is covered in TestCanaryBackgroundAnalysisInvalidImageJob

kostis-codefresh · 2026-05-07T13:53:44Z

-    provider:
-      job:
-        spec:
-          activeDeadlineSeconds: 10


We need this test. Why was it removed? It tests the scenario of having a long running job AND defining activeDeadlineSeconds in the rollout analysis

kostis-codefresh · 2026-05-07T13:56:02Z

+				assert.Equal(s.T(), v1alpha1.AnalysisPhaseInconclusive, metricResult.Phase)
 				if len(metricResult.Measurements) > 0 {
 					measurement := metricResult.Measurements[len(metricResult.Measurements)-1]
-					// The job that was terminated due to deadline should be Failed


Here we seem to change a test of a scenario to another scenario. I think this was testing having a deadline AND also having a long running job. I think this test should stay the same and any new behavior should be tested on another test.

kostis-codefresh · 2026-05-07T13:59:20Z

+
+	labelSelector := metav1.FormatLabelSelector(job.Spec.Selector)
+	if labelSelector != "" && labelSelector != "<none>" {
+		podList, err := p.kubeclientset.CoreV1().Pods(jobName.Namespace).List(context.TODO(), metav1.ListOptions{


do we know the performance implications for hitting directly the K8s API every time this code runs? Maybe we need to use a watcher/lister/caching mechanism instead?

The intuition was that it is already scoped down to job pods only, but yeah, I think it makes more sense to have an informer for this.
Implemented ✅

MohammedShetaya · 2026-05-16T14:54:58Z

All the comments related to tests point to deleting the analysis template rollout-background-invalid-job and changing the behavior of its test case. Here's why I did these changes:

invalid-image-job-deadline was deleted because its only purpose was to give the old test a way to eventually terminate an ErrImagePull pod via activeDeadlineSeconds. With the new short-circuit, an invalid image is detected and the analysis is marked Inconclusive long before the deadline can fire, so a deadline-bound invalid-image template no longer has any observable effect that's distinct from any other failing template. In other words, the test it supported was really testing "deadline kills a stuck Job → analysis Failed" ("always" because the deadline was less than the pause time), and the invalid image was just a contrived way to keep the Job stuck.
Knowing that, I don't think this test case will be different from a test case that just returns a failure. I still added it back as an actual long-running job with a deadline and included it in the TestCanaryInlineAnalysisLongRunningJobWithDeadline test case, and used clearer naming for it.
TestCanary{Inline,Background}AnalysisInvalidImageJob now exercises the actual new behavior > immediate Inconclusive from an ErrImagePull pod with the rollout transitioning to Paused for InconclusiveAnalysisRun — without needing a deadline as scaffolding.

Signed-off-by: Mohammed Shetaya <mohammed.shetaya@procore.com>

sonarqubecloud · 2026-05-16T15:09:54Z

Quality Gate passed

Issues
2 New issues
0 Accepted issues

Measures
0 Security Hotspots
0.0% Coverage on New Code
8.4% Duplication on New Code

See analysis details on SonarQube Cloud

MohammedShetaya force-pushed the fix-job-metrics branch from 5aba81d to 248be45 Compare April 14, 2026 23:17

kostis-codefresh reviewed Apr 23, 2026

View reviewed changes

kostis-codefresh self-assigned this Apr 23, 2026

MohammedShetaya force-pushed the fix-job-metrics branch from 0262c87 to 9a75548 Compare April 28, 2026 00:32

MohammedShetaya changed the title ~~fix(analysis): fail early on failed jobs because of infrastructure errors~~ fix(analysis): detect infrastructure errors early and mark job analysis as inconclusive Apr 28, 2026

MohammedShetaya requested a review from kostis-codefresh April 28, 2026 00:43

kostis-codefresh added the needs-follow-up Used when a maintainer needs to follow up label May 5, 2026

kostis-codefresh suggested changes May 7, 2026

View reviewed changes

kostis-codefresh removed the needs-follow-up Used when a maintainer needs to follow up label May 7, 2026

MohammedShetaya force-pushed the fix-job-metrics branch from b55c350 to 8aea9d6 Compare May 16, 2026 14:57

fix: fail early on failed jobs

0ef642d

Signed-off-by: Mohammed Shetaya <mohammed.shetaya@procore.com>

MohammedShetaya force-pushed the fix-job-metrics branch from 8aea9d6 to 0ef642d Compare May 16, 2026 15:08

MohammedShetaya requested a review from kostis-codefresh May 16, 2026 16:05

Conversation

MohammedShetaya commented Apr 14, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

Changes

What is out of scope?

Checklist:

Uh oh!

github-actions Bot commented Apr 14, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Published E2E Test Results

Uh oh!

codecov Bot commented Apr 14, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

github-actions Bot commented Apr 14, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Published Unit Test Results

Uh oh!

kostis-codefresh commented Apr 23, 2026

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

sonarqubecloud Bot commented Apr 28, 2026

Quality Gate passed

Uh oh!

kostis-codefresh left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

MohammedShetaya commented May 16, 2026

Uh oh!

sonarqubecloud Bot commented May 16, 2026

Quality Gate passed

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

MohammedShetaya commented Apr 14, 2026 •

edited

Loading

github-actions Bot commented Apr 14, 2026 •

edited

Loading

codecov Bot commented Apr 14, 2026 •

edited

Loading

github-actions Bot commented Apr 14, 2026 •

edited

Loading