fix(metricprovider): parse grouped Datadog queries (by {tag}) instead of crashing. Fixes #4276#4746
fix(metricprovider): parse grouped Datadog queries (by {tag}) instead of crashing. Fixes #4276#4746cgreeno wants to merge 4 commits into
Conversation
… of crashing. Fixes argoproj#4276 Today, any Datadog v2 scalar query that uses `by {tag}` crashes in parseResponseV2. The v2 scalar API returns multiple columns for grouped queries - a string "group" column holding the tag values and a "number" column holding the per-group scalars - but the existing parser typed Values as []*float64, so the very first JSON unmarshal panics on the string column. This change makes the parser locate the number column by Type, drop nulls, and return: - A scalar (float64) when the response has a single value, preserving the existing successCondition: result < X style for ungrouped queries. - A []float64 when the response has multiple values, mirroring how the Prometheus provider surfaces vector results. Users compose conditions with standard Expr functions like max(result) < X, mean(result), all(result, # < X), or any(result, # >= X). When the query is grouped, the measurement also exposes a metadata.groups field with comma-separated name=value pairs so operators can map an outlier in result back to the entity that produced it. No new API field, no CRD/proto changes. Pre-existing ungrouped queries are byte-identical in behavior; pre-existing grouped queries crashed and now work. Signed-off-by: Chris Greeno <cgreeno@gmail.com>
…error on bad numeric; JSON-encode metadata.groups
Address three issues found in code review of the grouped-query parse fix:
1. Null alignment bug. With separate Numeric() and Strings() helpers,
null entries in the number column were skipped but the
corresponding group entry was kept, shifting the labels off by one
for any value after a null. Replaced with a single extractValues
walk that drops the matching group entry alongside any null number
and keeps the parallel slices in sync.
2. Silent drop on bad numeric. A non-null, non-numeric entry in a
number column used to be quietly dropped, which would cause the
analysis to evaluate against an incomplete result and potentially
ship a bad rollout. Now returns an explicit error naming the
offending column.
3. CSV-encoded metadata.groups is fragile. Datadog tag values can
legally contain `,` and `=`, which would corrupt the previous
`name=value,name=value` format. Switched to a JSON array of
{name, value} objects so notification templates can fromJson it
cleanly.
Also removes a vestigial reflect.ValueOf(...).IsZero() check that was
made dead by the new column-length and numColIdx logic.
New tests cover: null-leading number column with correct pairing,
non-numeric entries returning an error, group-only response (no
number column), reversed column ordering, legacy responses without a
type field. Package coverage rises to 92.6%; new helpers at 100%.
Signed-off-by: Chris Greeno <cgreeno@gmail.com>
Published E2E Test Results 4 files 4 suites 3h 51m 59s ⏱️ For more details on these failures, see this check. Results for commit a1bdf16. ♻️ This comment has been updated with latest results. |
Published Unit Test Results2 483 tests 2 483 ✅ 3m 20s ⏱️ Results for commit a1bdf16. ♻️ This comment has been updated with latest results. |
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## master #4746 +/- ##
==========================================
+ Coverage 85.01% 85.12% +0.10%
==========================================
Files 164 164
Lines 18989 19026 +37
==========================================
+ Hits 16144 16195 +51
+ Misses 1995 1988 -7
+ Partials 850 843 -7 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
…ve complexity
SonarCloud flagged extractValues as cognitive complexity 20 (allowed
15, rule go:S3776) and noted an unnecessary named-variable in the
inner json.Unmarshal block (rule godre:S8193). Both are addressed by
extracting two small helpers:
- findColumns: locates the first numeric and first group column in
one pass. Removes the dual-purpose loop from the top of
extractValues.
- groupNameAt: returns the tag name at an index in a group column, or
("", false) if the column is missing/short/non-string. Removes the
three-level nesting (nil check + bounds check + unmarshal-error
check) inside extractValues.
extractValues itself is now a single linear walk over the number
column. No behavior change; all existing tests pass and the new
groupNameAt helper is covered by 4 focused subtests (nil column,
index past end, valid string entry, non-string entry).
Package coverage 92.7%, all new helpers at 100%.
Signed-off-by: Chris Greeno <cgreeno@gmail.com>
|
|
@kostis-codefresh - Thanks again for the previous review and guidance! Hopefully this looks a bit better? If not feedback more than welcome. :) |



Checklist:
Either (a) I've created an enhancement proposal and discussed it with the community, (b) this is a bug fix, or (c) this is a chore.
The title of the PR is (a) conventional with a list of types and scopes found here, (b) states what changed, and (c) suffixes the related issues number. E.g.
"fix(controller): Updates such and such. Fixes #1234".I've signed my commits with DCO
My builds are green. Try syncing with master if they are not.
I have written unit and/or e2e tests for my change. PRs without these are unlikely to be merged.
I have run all tests locally (including the flaky ones) and they pass on my workstation
I have used LLM/AI/Agent tools for this PR but I am responsible for all code of this PR
I understand what the code does and WHY/HOW it works in several scenarios
I know if my code is just adding new functionality or changing old functionality for existing users
My organization is added to USERS.md.
Changes
Background: previous attempt was #4731. Closed after @kostis-codefresh's review — he correctly pointed out that a custom
reducerfield reinvents what Argo Rollouts already does in other providers using the Expr library. This PR is the refactor.Why it currently crashes
Datadog v2's
/api/v2/query/scalarendpoint returns multiple columns when a query usesby {tag}— a stringgroupcolumn holding the tag values and anumbercolumn holding the per-group scalars. The current parser typesValuesas[]*float64, so the very first JSON unmarshal fails when it hits the group column's strings:The fix
Values []json.RawMessageper column, withNumeric()andStrings()helpers that filter to the typed values caller cares about. Nulls (Datadog's "no point" marker) are dropped.parseResponseV2locates theType=="number"column and returns a scalar for single-value responses (preserving today'ssuccessCondition: result < Xsemantics) or[]float64for grouped responses, matching how the Prometheus provider passes a[]float64toEvaluateResultfor vector results (metricproviders/prometheus/prometheus.go:155-188). Note: unlike Prometheus, single-value Datadog queries still return a scalar so the existingsuccessCondition: result < Xconfigurations keep working unchanged.Example
If any endpoint exceeds 5% errors, the canary fails. Any Expr function works (
mean,all,any,sum,median,len, etc.) — full list in the Expr docs.Per-group identity in metadata
For grouped queries, the measurement exposes a
metadata.groupsfield withtag=valuepairs so operators can map an outlier inresultback to the producing entity:This is purely additive — visible in
kubectl describe analysisrun, the rollouts dashboard, and notification templates. If reviewers would prefer themetadata.groupspiece split into a focused follow-up PR, happy to do that and keep this PR strictly the parse fix.Backward compatibility
typefield on columns (existing test fixtures and any older caller) are treated as numeric, preserving the previous "first column wins" behavior.Valuesis now[]json.RawMessage.Files
metricproviders/datadog/datadog.go— response struct +Numeric/Strings/labelGroups/formatValueSlicehelpers +parseResponseV2refactor.metricproviders/datadog/datadog_test.go,datadogV2_test.go— tests.docs/analysis/datadog.md— documents grouped-query usage and themetadata.groupsfield.