fix(analysis): detect infrastructure errors early and mark job analysis as inconclusive#4692
fix(analysis): detect infrastructure errors early and mark job analysis as inconclusive#4692MohammedShetaya wants to merge 1 commit into
Conversation
Published E2E Test Results 4 files 4 suites 3h 51m 58s ⏱️ For more details on these failures, see this check. Results for commit 0ef642d. ♻️ This comment has been updated with latest results. |
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## master #4692 +/- ##
==========================================
+ Coverage 85.03% 85.08% +0.04%
==========================================
Files 164 164
Lines 18989 19032 +43
==========================================
+ Hits 16148 16194 +46
+ Misses 1993 1992 -1
+ Partials 848 846 -2 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
Published Unit Test Results2 472 tests 2 472 ✅ 3m 21s ⏱️ Results for commit 0ef642d. ♻️ This comment has been updated with latest results. |
5aba81d to
248be45
Compare
|
I think we agreed that a job that cannot start would be inconclusive. Correct? Now this is a major breaking change. Initially we had failed job -> Succesful analysis. But there are several bugs (and I agree with them ) that this doesn't make any sense. But with this PR (please correct me if I am missing something), a job that does not start fails the analysis metric. I think that pausing the Rollout when an inconclusive analysis happens is a good design choice. If I run a deployment with Argo Rollouts using smoke tests for an analysis (in a Job metric), I think there is a big difference between
In the second case I would expect an inconclusive run that allows me to intervene. |
| PromoteRollout(). | ||
| WaitForRolloutStatus("Healthy"). | ||
| WaitForBackgroundAnalysisRunPhase("Inconclusive"). | ||
| WaitForBackgroundAnalysisRunPhase("Failed"). |
There was a problem hiding this comment.
This is an important change. I think that now users will open new issues complaining that when an analysis has failed (as a job) it is not clear if the job itself has a "terminal error" or if the job actually run and it found an error (i.e. the smoke tests did not pass)
There was a problem hiding this comment.
Thanks! I agree, it will cause more confusion. I've changed the behavior to be inconclusive. There's still one unhandled case where the job is a long-running job (it still promotes), however, this shouldn't block this PR. I will follow up on it later.
0262c87 to
9a75548
Compare
|
kostis-codefresh
left a comment
There was a problem hiding this comment.
I added some specific comments on the code. But I also have two general comments.
The job documentation page should also be updated to explain what the code assumes and how it behaves. At the very least we should now mention that job errors are instanly detected and make the analysis inconclosive
I am also really confused about the tests added/update from this PR. I would expect to see only brand new tests and no changes in existing tests. The main behavior regarding the result was already updated by the PR here https://github.com/argoproj/argo-rollouts/pull/4602/files#diff-84b1a3050cb2b346abd90a02e6c89c812bf8917dd0306c18199a0928d57f3d67L225
As a sidenote, here are all the tests that I run locally and that I would expect to see for this area https://github.com/kostis-codefresh/testing-argo-rollouts/tree/main/job-analysis in respective e2e tests
The tests this PR should add would be Faster detection of failed jobs (instead of waiting the timeout or forever). But not changing existing tests. Unless I am missing something here.
| } | ||
| elapsed := timeutil.MetaNow().Sub(measurement.StartedAt.Time) | ||
| // growing delay with respect to the elapsed time since the job started | ||
| delay := time.Duration(math.Sqrt(2*elapsed.Seconds()) * float64(time.Second)) |
There was a problem hiding this comment.
Why do we use math.Sqrt here instead of the standard wait/backoff k8s package?
There was a problem hiding this comment.
Wait.Backoff would require persisting its state on Measurement
| if err != nil { | ||
| measurement.FinishedAt = &now | ||
| measurement.Message = err.Error() | ||
| measurement.Phase = v1alpha1.AnalysisPhaseInconclusive |
There was a problem hiding this comment.
If we are finished and have an error and have decided already about the measurement being inconclosive don't we need to return straigth away? Do we miss a return statement here?
| return measurement | ||
| } | ||
|
|
||
| var terminalWaitingReasons = map[string]bool{ |
There was a problem hiding this comment.
I would add a specific comment here explaining that for these types of errors we clearly decide that the job is inconclusive and we want a human to intervene instead of having the rollout progress.
The "terminal" wording here implies that the rollout itself is/will be terminated and this is not the case.
| // such as ErrImagePull in one of the pod containers | ||
| func processJobPods(pods *v1.PodList) error { | ||
| for _, pod := range pods.Items { | ||
| for _, cs := range pod.Status.InitContainerStatuses { |
There was a problem hiding this comment.
Have you also tested the code with an init container example? Or just the main container of a job?
There was a problem hiding this comment.
Never mind. Just saw the unit test.
| assert.Nil(t, measurement.FinishedAt) | ||
| } | ||
|
|
||
| func TestResumeRunningJob(t *testing.T) { |
There was a problem hiding this comment.
what does this test exactly? Happy path scenario? Can we add a comment?
| # The job has a deadline of 10 seconds which is less than the 30 second pause, so the job should be terminated | ||
| # and the analysis should be failed. | ||
| # Rollout with an inline analysis that uses a non-existent docker image. | ||
| # The job pod stays in ErrImagePull/ImagePullBackOff/InvalidImageName, which causes the |
There was a problem hiding this comment.
This seems to change the test of another scenario. Please keep the previous test (which was specifically about having a deadline) and add a brand new test for your scenario.
| selector: | ||
| matchLabels: | ||
| app: rollout-background-long-running-job | ||
| app: rollout-background-invalid-job |
There was a problem hiding this comment.
This seems to change the test of another scenario. Please keep the previous test (which was specifically about having a deadline) and add a brand new test for your scenario.
It is two separate tests
- a Job the never finished (long-running) but hasn't thrown an error just yet
- A Job that has an error that we detect during the course of the rollouit
There was a problem hiding this comment.
For this test case, I did two changes:
- fixed the wrong selector
- needed something to make the RO wait until the inconclusive state pops up, without adding a pause step to make sure the pause state is because of the inconclusive analysis.
For the two cases you mentioned above
a Job that never finished (long-running) but hasn't thrown an error just yetis covered inTestCanaryBackgroundAnalysisLongRunningJob.A Job that has an error that we detect during the course of the rolloutis covered inTestCanaryBackgroundAnalysisInvalidImageJob
| provider: | ||
| job: | ||
| spec: | ||
| activeDeadlineSeconds: 10 |
There was a problem hiding this comment.
We need this test. Why was it removed? It tests the scenario of having a long running job AND defining activeDeadlineSeconds in the rollout analysis
| assert.Equal(s.T(), v1alpha1.AnalysisPhaseInconclusive, metricResult.Phase) | ||
| if len(metricResult.Measurements) > 0 { | ||
| measurement := metricResult.Measurements[len(metricResult.Measurements)-1] | ||
| // The job that was terminated due to deadline should be Failed |
There was a problem hiding this comment.
Here we seem to change a test of a scenario to another scenario. I think this was testing having a deadline AND also having a long running job. I think this test should stay the same and any new behavior should be tested on another test.
|
|
||
| labelSelector := metav1.FormatLabelSelector(job.Spec.Selector) | ||
| if labelSelector != "" && labelSelector != "<none>" { | ||
| podList, err := p.kubeclientset.CoreV1().Pods(jobName.Namespace).List(context.TODO(), metav1.ListOptions{ |
There was a problem hiding this comment.
do we know the performance implications for hitting directly the K8s API every time this code runs? Maybe we need to use a watcher/lister/caching mechanism instead?
There was a problem hiding this comment.
The intuition was that it is already scoped down to job pods only, but yeah, I think it makes more sense to have an informer for this.
Implemented ✅
|
All the comments related to tests point to deleting the analysis template
|
b55c350 to
8aea9d6
Compare
Signed-off-by: Mohammed Shetaya <mohammed.shetaya@procore.com>
8aea9d6 to
0ef642d
Compare
|



Description
This PR introduces an early-detection mechanism in the job metric provider for terminal pod errors, such as
ErrImagePull. Currently, jobs that can't pull their image wait indefinitely without ever setting aFailedcondition, causing the analysis to silently succeed and promote the canary.Changes
JobProvider.Resumefunction is now actively called during analysis reconciliation. Previously, it was only invoked on manual resumes.Resumenow inspects pod container statuses (including init containers) for terminal waiting states (ErrImagePull,ImagePullBackOff,InvalidImageName) and marks the measurement asInconclusiveimmediately.ResumeAtinterval grows with the job execution time using√(2t). This matches the nature of the targeted errors — they surface early, so frequent checks are needed at job start, but not later.ResumeAt, ensuringTerminate()is always called.What is out of scope?
Provider.Terminatefunction, so a change can be made to do extra analysis after the metric is terminated, for example, (condition xyz should be met to promote the rollout after termination of the running metrics)Closes: #3562
Help with: #2250
Checklist:
"fix(controller): Updates such and such. Fixes #1234".