Skip to content

fix(analysis): detect infrastructure errors early and mark job analysis as inconclusive#4692

Open
MohammedShetaya wants to merge 1 commit into
argoproj:masterfrom
MohammedShetaya:fix-job-metrics
Open

fix(analysis): detect infrastructure errors early and mark job analysis as inconclusive#4692
MohammedShetaya wants to merge 1 commit into
argoproj:masterfrom
MohammedShetaya:fix-job-metrics

Conversation

@MohammedShetaya
Copy link
Copy Markdown

@MohammedShetaya MohammedShetaya commented Apr 14, 2026

Description

This PR introduces an early-detection mechanism in the job metric provider for terminal pod errors, such as ErrImagePull. Currently, jobs that can't pull their image wait indefinitely without ever setting a Failed condition, causing the analysis to silently succeed and promote the canary.

Changes

  • The JobProvider.Resume function is now actively called during analysis reconciliation. Previously, it was only invoked on manual resumes.
  • Resume now inspects pod container statuses (including init containers) for terminal waiting states (ErrImagePull, ImagePullBackOff, InvalidImageName) and marks the measurement as Inconclusive immediately.
  • The ResumeAt interval grows with the job execution time using √(2t). This matches the nature of the targeted errors — they surface early, so frequent checks are needed at job start, but not later.
  • In-progress measurements are no longer skipped during termination due to a future ResumeAt, ensuring Terminate() is always called.
  • Unit/E2E tests were added/updated to cover the new cases. The E2E fixture was made deterministic by using an indefinite pause instead of a timed one.

What is out of scope?

  • Long-running jobs are still an issue in the current implementation; they report an inconclusive state without any effect on the rollout lifecycle.
  • A change needs to be discussed and implemented on how background analysis should be treated for jobs since they are different in nature than other metric types, which are not expected to run for much time (either immediately return or time out). Job metrics, on the other hand, are expected to run for some time.
  • Job metrics are the only ones hitting the Provider.Terminate function, so a change can be made to do extra analysis after the metric is terminated, for example, (condition xyz should be met to promote the rollout after termination of the running metrics)

Closes: #3562
Help with: #2250

Checklist:

  • Either (a) I've created an enhancement proposal and discussed it with the community, (b) this is a bug fix, or (c) this is a chore.
  • The title of the PR is (a) conventional with a list of types and scopes found here, (b) states what changed, and (c) suffixes the related issues number. E.g. "fix(controller): Updates such and such. Fixes #1234".
  • I've signed my commits with DCO
  • My builds are green. Try syncing with master if they are not.
  • I have written unit and/or e2e tests for my change. PRs without these are unlikely to be merged.
  • I have run all tests locally (including the flaky ones) and they pass on my workstation
  • I have used LLM/AI/Agent tools for this PR but I am responsible for all code of this PR
  • I understand what the code does and WHY/HOW it works in several scenarios
  • I know if my code is just adding new functionality or changing old functionality for existing users
  • My organization is added to USERS.md.

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Apr 14, 2026

Published E2E Test Results

  4 files    4 suites   3h 51m 58s ⏱️
122 tests 113 ✅  7 💤 2 ❌
490 runs  460 ✅ 28 💤 2 ❌

For more details on these failures, see this check.

Results for commit 0ef642d.

♻️ This comment has been updated with latest results.

@codecov
Copy link
Copy Markdown

codecov Bot commented Apr 14, 2026

Codecov Report

❌ Patch coverage is 86.00000% with 7 lines in your changes missing coverage. Please review.
✅ Project coverage is 85.08%. Comparing base (2ccdae8) to head (0ef642d).

Files with missing lines Patch % Lines
metricproviders/job/job.go 88.88% 2 Missing and 2 partials ⚠️
controller/controller.go 66.66% 1 Missing and 1 partial ⚠️
cmd/rollouts-controller/main.go 50.00% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #4692      +/-   ##
==========================================
+ Coverage   85.03%   85.08%   +0.04%     
==========================================
  Files         164      164              
  Lines       18989    19032      +43     
==========================================
+ Hits        16148    16194      +46     
+ Misses       1993     1992       -1     
+ Partials      848      846       -2     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Apr 14, 2026

Published Unit Test Results

2 472 tests   2 472 ✅  3m 21s ⏱️
  129 suites      0 💤
    1 files        0 ❌

Results for commit 0ef642d.

♻️ This comment has been updated with latest results.

@kostis-codefresh
Copy link
Copy Markdown
Member

I think we agreed that a job that cannot start would be inconclusive. Correct?

Now this is a major breaking change.

Initially we had failed job -> Succesful analysis. But there are several bugs (and I agree with them ) that this doesn't make any sense.
So with my PR we said it is inconclusive (which matches user expectations)

But with this PR (please correct me if I am missing something), a job that does not start fails the analysis metric.

I think that pausing the Rollout when an inconclusive analysis happens is a good design choice.

If I run a deployment with Argo Rollouts using smoke tests for an analysis (in a Job metric), I think there is a big difference between

  1. The smoke tests run, finished, and returned a business error (fail the deployment)
  2. The smoke tests had an error and could not run. The main deployment might actually be ok

In the second case I would expect an inconclusive run that allows me to intervene.

Comment thread test/e2e/analysis_test.go Outdated
PromoteRollout().
WaitForRolloutStatus("Healthy").
WaitForBackgroundAnalysisRunPhase("Inconclusive").
WaitForBackgroundAnalysisRunPhase("Failed").
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is an important change. I think that now users will open new issues complaining that when an analysis has failed (as a job) it is not clear if the job itself has a "terminal error" or if the job actually run and it found an error (i.e. the smoke tests did not pass)

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! I agree, it will cause more confusion. I've changed the behavior to be inconclusive. There's still one unhandled case where the job is a long-running job (it still promotes), however, this shouldn't block this PR. I will follow up on it later.

@sonarqubecloud
Copy link
Copy Markdown

@MohammedShetaya MohammedShetaya changed the title fix(analysis): fail early on failed jobs because of infrastructure errors fix(analysis): detect infrastructure errors early and mark job analysis as inconclusive Apr 28, 2026
@kostis-codefresh kostis-codefresh added the needs-follow-up Used when a maintainer needs to follow up label May 5, 2026
Copy link
Copy Markdown
Member

@kostis-codefresh kostis-codefresh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added some specific comments on the code. But I also have two general comments.

The job documentation page should also be updated to explain what the code assumes and how it behaves. At the very least we should now mention that job errors are instanly detected and make the analysis inconclosive

I am also really confused about the tests added/update from this PR. I would expect to see only brand new tests and no changes in existing tests. The main behavior regarding the result was already updated by the PR here https://github.com/argoproj/argo-rollouts/pull/4602/files#diff-84b1a3050cb2b346abd90a02e6c89c812bf8917dd0306c18199a0928d57f3d67L225

As a sidenote, here are all the tests that I run locally and that I would expect to see for this area https://github.com/kostis-codefresh/testing-argo-rollouts/tree/main/job-analysis in respective e2e tests

The tests this PR should add would be Faster detection of failed jobs (instead of waiting the timeout or forever). But not changing existing tests. Unless I am missing something here.

}
elapsed := timeutil.MetaNow().Sub(measurement.StartedAt.Time)
// growing delay with respect to the elapsed time since the job started
delay := time.Duration(math.Sqrt(2*elapsed.Seconds()) * float64(time.Second))
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we use math.Sqrt here instead of the standard wait/backoff k8s package?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wait.Backoff would require persisting its state on Measurement

Comment thread metricproviders/job/job.go Outdated
if err != nil {
measurement.FinishedAt = &now
measurement.Message = err.Error()
measurement.Phase = v1alpha1.AnalysisPhaseInconclusive
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we are finished and have an error and have decided already about the measurement being inconclosive don't we need to return straigth away? Do we miss a return statement here?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this was missing

return measurement
}

var terminalWaitingReasons = map[string]bool{
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would add a specific comment here explaining that for these types of errors we clearly decide that the job is inconclusive and we want a human to intervene instead of having the rollout progress.

The "terminal" wording here implies that the rollout itself is/will be terminated and this is not the case.

// such as ErrImagePull in one of the pod containers
func processJobPods(pods *v1.PodList) error {
for _, pod := range pods.Items {
for _, cs := range pod.Status.InitContainerStatuses {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have you also tested the code with an init container example? Or just the main container of a job?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Never mind. Just saw the unit test.

assert.Nil(t, measurement.FinishedAt)
}

func TestResumeRunningJob(t *testing.T) {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what does this test exactly? Happy path scenario? Can we add a comment?

# The job has a deadline of 10 seconds which is less than the 30 second pause, so the job should be terminated
# and the analysis should be failed.
# Rollout with an inline analysis that uses a non-existent docker image.
# The job pod stays in ErrImagePull/ImagePullBackOff/InvalidImageName, which causes the
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems to change the test of another scenario. Please keep the previous test (which was specifically about having a deadline) and add a brand new test for your scenario.

selector:
matchLabels:
app: rollout-background-long-running-job
app: rollout-background-invalid-job
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems to change the test of another scenario. Please keep the previous test (which was specifically about having a deadline) and add a brand new test for your scenario.

It is two separate tests

  1. a Job the never finished (long-running) but hasn't thrown an error just yet
  2. A Job that has an error that we detect during the course of the rollouit

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For this test case, I did two changes:

  • fixed the wrong selector
  • needed something to make the RO wait until the inconclusive state pops up, without adding a pause step to make sure the pause state is because of the inconclusive analysis.

For the two cases you mentioned above

  • a Job that never finished (long-running) but hasn't thrown an error just yet is covered in TestCanaryBackgroundAnalysisLongRunningJob.
  • A Job that has an error that we detect during the course of the rollout is covered in TestCanaryBackgroundAnalysisInvalidImageJob

provider:
job:
spec:
activeDeadlineSeconds: 10
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need this test. Why was it removed? It tests the scenario of having a long running job AND defining activeDeadlineSeconds in the rollout analysis

Comment thread test/e2e/analysis_test.go
assert.Equal(s.T(), v1alpha1.AnalysisPhaseInconclusive, metricResult.Phase)
if len(metricResult.Measurements) > 0 {
measurement := metricResult.Measurements[len(metricResult.Measurements)-1]
// The job that was terminated due to deadline should be Failed
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here we seem to change a test of a scenario to another scenario. I think this was testing having a deadline AND also having a long running job. I think this test should stay the same and any new behavior should be tested on another test.

Comment thread metricproviders/job/job.go Outdated

labelSelector := metav1.FormatLabelSelector(job.Spec.Selector)
if labelSelector != "" && labelSelector != "<none>" {
podList, err := p.kubeclientset.CoreV1().Pods(jobName.Namespace).List(context.TODO(), metav1.ListOptions{
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we know the performance implications for hitting directly the K8s API every time this code runs? Maybe we need to use a watcher/lister/caching mechanism instead?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The intuition was that it is already scoped down to job pods only, but yeah, I think it makes more sense to have an informer for this.
Implemented ✅

@kostis-codefresh kostis-codefresh removed the needs-follow-up Used when a maintainer needs to follow up label May 7, 2026
@MohammedShetaya
Copy link
Copy Markdown
Author

All the comments related to tests point to deleting the analysis template rollout-background-invalid-job and changing the behavior of its test case. Here's why I did these changes:

  • invalid-image-job-deadline was deleted because its only purpose was to give the old test a way to eventually terminate an ErrImagePull pod via activeDeadlineSeconds. With the new short-circuit, an invalid image is detected and the analysis is marked Inconclusive long before the deadline can fire, so a deadline-bound invalid-image template no longer has any observable effect that's distinct from any other failing template. In other words, the test it supported was really testing "deadline kills a stuck Job → analysis Failed" ("always" because the deadline was less than the pause time), and the invalid image was just a contrived way to keep the Job stuck.

  • Knowing that, I don't think this test case will be different from a test case that just returns a failure. I still added it back as an actual long-running job with a deadline and included it in the TestCanaryInlineAnalysisLongRunningJobWithDeadline test case, and used clearer naming for it.

  • TestCanary{Inline,Background}AnalysisInvalidImageJob now exercises the actual new behavior > immediate Inconclusive from an ErrImagePull pod with the rollout transitioning to Paused for InconclusiveAnalysisRun — without needing a deadline as scaffolding.

Signed-off-by: Mohammed Shetaya <mohammed.shetaya@procore.com>
@sonarqubecloud
Copy link
Copy Markdown

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Job Analysis Run never fails when the job image can't be pulled (it just succeeds and promotes).

2 participants