fix(metricprovider): parse grouped Datadog queries (by {tag}) instead of crashing. Fixes #4276 by cgreeno · Pull Request #4746 · argoproj/argo-rollouts

cgreeno · 2026-05-27T15:14:36Z

Changes

Background: previous attempt was #4731. Closed after @kostis-codefresh's review — he correctly pointed out that a custom reducer field reinvents what Argo Rollouts already does in other providers using the Expr library. This PR is the refactor.

Why it currently crashes

Datadog v2's /api/v2/query/scalar endpoint returns multiple columns when a query uses by {tag} — a string group column holding the tag values and a number column holding the per-group scalars. The current parser types Values as []*float64, so the very first JSON unmarshal fails when it hits the group column's strings:

Could not parse JSON body: json: cannot unmarshal string into Go struct field
.Data.Attributes.Columns.Values of type []*float64
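
The failure can be reproduced outside the provider. The following is a minimal sketch, not the provider's code: `legacyColumn` and `decodeLegacy` are hypothetical names, and the JSON fixtures are hand-written simplifications of the column layout described above rather than captured Datadog responses.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// legacyColumn mirrors the old shape described above, where every entry in
// Values is assumed to be a number (or null).
type legacyColumn struct {
	Name   string     `json:"name"`
	Type   string     `json:"type"`
	Values []*float64 `json:"values"`
}

// decodeLegacy unmarshals a JSON array of columns into the legacy shape.
func decodeLegacy(body string) ([]legacyColumn, error) {
	var cols []legacyColumn
	err := json.Unmarshal([]byte(body), &cols)
	return cols, err
}

func main() {
	// Hand-written fixtures (assumption), not real API output.
	ungrouped := `[{"name":"query1","type":"number","values":[0.012]}]`
	grouped := `[{"name":"resource_name","type":"group","values":["POST /a","POST /b"]},
	             {"name":"query1","type":"number","values":[0.012,0.087]}]`

	// The ungrouped body decodes cleanly.
	_, err := decodeLegacy(ungrouped)
	fmt.Println("ungrouped error:", err)

	// The grouped body fails as soon as a string meets *float64.
	_, err = decodeLegacy(grouped)
	fmt.Println("grouped error:", err)
}
```

Because the type mismatch is reported by `json.Unmarshal` itself, no code after the decode step gets a chance to pick the right column, which is why the fix has to change the struct rather than the post-processing.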

The fix

Response struct now keeps Values []json.RawMessage per column, with Numeric() and Strings() helpers that filter to the typed values the caller cares about. Nulls (Datadog's "no point" marker) are dropped.
parseResponseV2 locates the Type=="number" column and returns a scalar for single-value responses (preserving today's successCondition: result < X semantics) or []float64 for grouped responses, matching how the Prometheus provider passes a []float64 to EvaluateResult for vector results (metricproviders/prometheus/prometheus.go:155-188). Note: unlike Prometheus, single-value Datadog queries still return a scalar so the existing successCondition: result < X configurations keep working unchanged.
No new API field. Users compose conditions in Expr just like with Prometheus/Wavefront vector results.
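
The scalar-versus-slice behavior described above can be sketched as follows. This is an illustration under assumptions, not the code in datadog.go: `column` and `numericResult` are hypothetical names, the input is a bare JSON array of columns rather than the full v2 envelope, and the choice to return an error when no number column exists is a guess.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// column keeps each value as raw JSON so string and number columns both decode.
type column struct {
	Name   string            `json:"name"`
	Type   string            `json:"type"`
	Values []json.RawMessage `json:"values"`
}

// numericResult returns a float64 when the number column holds exactly one
// non-null value, and a []float64 otherwise.
func numericResult(body string) (any, error) {
	var cols []column
	if err := json.Unmarshal([]byte(body), &cols); err != nil {
		return nil, err
	}
	for _, c := range cols {
		// Columns without a type are treated as numeric (legacy fixtures).
		if c.Type != "number" && c.Type != "" {
			continue
		}
		vals := make([]float64, 0, len(c.Values))
		for _, raw := range c.Values {
			if len(raw) == 0 || string(raw) == "null" {
				continue // null marks "no point" and is dropped
			}
			var f float64
			if err := json.Unmarshal(raw, &f); err != nil {
				return nil, fmt.Errorf("column %q: %w", c.Name, err)
			}
			vals = append(vals, f)
		}
		if len(vals) == 1 {
			return vals[0], nil // keeps `result < X` conditions working
		}
		return vals, nil
	}
	return nil, errors.New("no number column in response")
}

func main() {
	single, _ := numericResult(`[{"name":"q","type":"number","values":[0.02]}]`)
	grouped, _ := numericResult(`[{"name":"t","type":"group","values":["a","b"]},
	                              {"name":"q","type":"number","values":[0.01,0.09]}]`)
	fmt.Printf("%T %v\n", single, single)
	fmt.Printf("%T %v\n", grouped, grouped)
}
```

One consequence of this shape is that a grouped query which happens to return a single group yields a scalar, so a condition such as `max(result) < X` would need to tolerate both forms.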

Example

metrics:
- name: per-endpoint-error-rate
  interval: 30s
  successCondition: max(result) < 0.05
  failureLimit: 3
  provider:
    datadog:
      apiVersion: v2
      interval: 5m
      query: "sum:trace.http.request.errors{service:my-service} by {resource_name}.as_count()"

If any endpoint exceeds 5% errors, the canary fails. Any Expr function works (mean, all, any, sum, median, len, etc.) — full list in the Expr docs.

Per-group identity in metadata

For grouped queries, the measurement exposes a metadata.groups field with tag=value pairs so operators can map an outlier in result back to the producing entity:

Phase: Failed
Value: [0.012,0.087,0.041]
Metadata:
  groups: "POST /a=0.012,POST /b=0.087,POST /c=0.041"

This is purely additive — visible in kubectl describe analysisrun, the rollouts dashboard, and notification templates. If reviewers would prefer the metadata.groups piece split into a focused follow-up PR, happy to do that and keep this PR strictly the parse fix.

Backward compatibility

Ungrouped queries are byte-identical in behavior.
Legacy responses without a type field on columns (existing test fixtures and any older caller) are treated as numeric, preserving the previous "first column wins" behavior.
The one existing test asserting the old JSON-unmarshal error string needed its expectation updated because Values is now []json.RawMessage.

Files

metricproviders/datadog/datadog.go — response struct + Numeric/Strings/labelGroups/formatValueSlice helpers + parseResponseV2 refactor.
metricproviders/datadog/datadog_test.go, datadogV2_test.go — tests.
docs/analysis/datadog.md — documents grouped-query usage and the metadata.groups field.

… of crashing. Fixes argoproj#4276

Today, any Datadog v2 scalar query that uses `by {tag}` crashes in parseResponseV2. The v2 scalar API returns multiple columns for grouped queries - a string "group" column holding the tag values and a "number" column holding the per-group scalars - but the existing parser typed Values as []*float64, so the very first JSON unmarshal fails on the string column.

This change makes the parser locate the number column by Type, drop nulls, and return:

- A scalar (float64) when the response has a single value, preserving the existing successCondition: result < X style for ungrouped queries.
- A []float64 when the response has multiple values, mirroring how the Prometheus provider surfaces vector results.

Users compose conditions with standard Expr functions like max(result) < X, mean(result), all(result, # < X), or any(result, # >= X).

When the query is grouped, the measurement also exposes a metadata.groups field with comma-separated name=value pairs so operators can map an outlier in result back to the entity that produced it.

No new API field, no CRD/proto changes. Pre-existing ungrouped queries are byte-identical in behavior; pre-existing grouped queries crashed and now work.

Signed-off-by: Chris Greeno <cgreeno@gmail.com>

…error on bad numeric; JSON-encode metadata.groups

Address three issues found in code review of the grouped-query parse fix:

1. Null alignment bug. With separate Numeric() and Strings() helpers, null entries in the number column were skipped but the corresponding group entry was kept, shifting the labels off by one for any value after a null. Replaced with a single extractValues walk that drops the matching group entry alongside any null number and keeps the parallel slices in sync.
2. Silent drop on bad numeric. A non-null, non-numeric entry in a number column used to be quietly dropped, which would cause the analysis to evaluate against an incomplete result and potentially ship a bad rollout. Now returns an explicit error naming the offending column.
3. CSV-encoded metadata.groups is fragile. Datadog tag values can legally contain `,` and `=`, which would corrupt the previous `name=value,name=value` format. Switched to a JSON array of {name, value} objects so notification templates can fromJson it cleanly.

Also removes a vestigial reflect.ValueOf(...).IsZero() check that was made dead by the new column-length and numColIdx logic.

New tests cover: null-leading number column with correct pairing, non-numeric entries returning an error, group-only response (no number column), reversed column ordering, legacy responses without a type field.

Package coverage rises to 92.6%; new helpers at 100%.

Signed-off-by: Chris Greeno <cgreeno@gmail.com>
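
The three review fixes in this commit can be illustrated together. The sketch below is not the actual extractValues: `pairValues` and `group` are hypothetical names, and the two columns are passed as JSON array strings purely to keep the example self-contained.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// group is one entry of the JSON-encoded metadata.groups array.
type group struct {
	Name  string  `json:"name"`
	Value float64 `json:"value"`
}

// pairValues walks the number column once. A null number is skipped together
// with the label at the same index, so values and labels stay aligned.
// A non-null, non-numeric entry is an error rather than a silent drop.
func pairValues(numbersJSON, namesJSON string) ([]float64, string, error) {
	var numbers, names []json.RawMessage
	if err := json.Unmarshal([]byte(numbersJSON), &numbers); err != nil {
		return nil, "", err
	}
	if err := json.Unmarshal([]byte(namesJSON), &names); err != nil {
		return nil, "", err
	}
	values := []float64{}
	groups := []group{}
	for i, raw := range numbers {
		if len(raw) == 0 || string(raw) == "null" {
			continue // drops the value and, implicitly, names[i]
		}
		var f float64
		if err := json.Unmarshal(raw, &f); err != nil {
			return nil, "", fmt.Errorf("non-numeric entry at index %d: %w", i, err)
		}
		values = append(values, f)
		var name string
		if i < len(names) && json.Unmarshal(names[i], &name) == nil {
			groups = append(groups, group{Name: name, Value: f})
		}
	}
	// JSON encoding keeps tag values containing "," or "=" intact.
	encoded, err := json.Marshal(groups)
	if err != nil {
		return nil, "", err
	}
	return values, string(encoded), nil
}

func main() {
	vals, meta, err := pairValues(`[null,0.087,0.041]`, `["POST /a","POST /b","POST /c"]`)
	fmt.Println(vals, meta, err)
}
```

With the leading null in this input, the old two-helper approach would have paired 0.087 with "POST /a"; walking by index pairs it with "POST /b" instead.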

github-actions · 2026-05-27T16:15:11Z

Published E2E Test Results

4 files 4 suites 3h 51m 59s ⏱️
120 tests 106 ✅ 7 💤 7 ❌
496 runs 454 ✅ 28 💤 14 ❌

For more details on these failures, see this check.

Results for commit a1bdf16.

♻️ This comment has been updated with latest results.

github-actions · 2026-05-27T16:15:19Z

Published Unit Test Results

1 files 129 suites 3m 20s ⏱️
2 483 tests 2 483 ✅ 0 💤 0 ❌

Results for commit a1bdf16.

♻️ This comment has been updated with latest results.

codecov · 2026-05-27T16:16:08Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 85.12%. Comparing base (017676b) to head (a1bdf16).

Additional details and impacted files

@@            Coverage Diff             @@
##           master    #4746      +/-   ##
==========================================
+ Coverage   85.01%   85.12%   +0.10%     
==========================================
  Files         164      164              
  Lines       18989    19026      +37     
==========================================
+ Hits        16144    16195      +51     
+ Misses       1995     1988       -7     
+ Partials      850      843       -7


…ve complexity

SonarCloud flagged extractValues as cognitive complexity 20 (allowed 15, rule go:S3776) and noted an unnecessary named-variable in the inner json.Unmarshal block (rule godre:S8193). Both are addressed by extracting two small helpers:

- findColumns: locates the first numeric and first group column in one pass. Removes the dual-purpose loop from the top of extractValues.
- groupNameAt: returns the tag name at an index in a group column, or ("", false) if the column is missing/short/non-string. Removes the three-level nesting (nil check + bounds check + unmarshal-error check) inside extractValues.

extractValues itself is now a single linear walk over the number column. No behavior change; all existing tests pass and the new groupNameAt helper is covered by 4 focused subtests (nil column, index past end, valid string entry, non-string entry).

Package coverage 92.7%, all new helpers at 100%.

Signed-off-by: Chris Greeno <cgreeno@gmail.com>
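
For readers without the diff open, the two helpers might look roughly like this. The helper names come from the commit message, but the signatures, the `column` type, the `rawColumn` fixture builder, and the treatment of JSON null as a non-string entry are all assumptions of this sketch.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// column is a simplified stand-in for the provider's response column.
type column struct {
	Type   string
	Values []json.RawMessage
}

// rawColumn is a hypothetical fixture builder taking raw JSON literals.
func rawColumn(colType string, literals ...string) *column {
	c := &column{Type: colType}
	for _, lit := range literals {
		c.Values = append(c.Values, json.RawMessage(lit))
	}
	return c
}

// findColumns returns the indexes of the first numeric and first group
// column in one pass, or -1 for whichever is absent. Columns without a
// type count as numeric, matching the legacy behavior described earlier.
func findColumns(cols []*column) (numIdx, groupIdx int) {
	numIdx, groupIdx = -1, -1
	for i, c := range cols {
		switch {
		case numIdx == -1 && (c.Type == "number" || c.Type == ""):
			numIdx = i
		case groupIdx == -1 && c.Type == "group":
			groupIdx = i
		}
	}
	return numIdx, groupIdx
}

// groupNameAt returns the tag name at idx, or ("", false) when the column
// is nil, too short, or the entry is not a JSON string.
func groupNameAt(col *column, idx int) (string, bool) {
	if col == nil || idx < 0 || idx >= len(col.Values) {
		return "", false
	}
	raw := col.Values[idx]
	// Unmarshalling null into a string succeeds silently, so reject it here.
	if len(raw) == 0 || string(raw) == "null" {
		return "", false
	}
	var name string
	if err := json.Unmarshal(raw, &name); err != nil {
		return "", false
	}
	return name, true
}

func main() {
	cols := []*column{
		rawColumn("group", `"POST /a"`, `"POST /b"`),
		rawColumn("number", `0.012`, `0.087`),
	}
	numIdx, groupIdx := findColumns(cols)
	name, ok := groupNameAt(cols[groupIdx], 1)
	fmt.Println(numIdx, groupIdx, name, ok)
}
```

Folding the nil, bounds, and decode checks into one early-return helper is what lets the caller stay a flat loop, which is the property the complexity rule measures.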

sonarqubecloud · 2026-05-29T08:17:10Z

Quality Gate passed

Issues
0 New issues
0 Accepted issues

Measures
0 Security Hotspots
0.0% Coverage on New Code
0.0% Duplication on New Code

See analysis details on SonarQube Cloud

cgreeno · 2026-05-29T08:18:33Z

@kostis-codefresh - Thanks again for the previous review and guidance!

Hopefully this looks a bit better? If not, feedback is more than welcome. :)

cgreeno added 2 commits May 27, 2026 16:12

cgreeno added 2 commits May 28, 2026 14:27

Merge branch 'master' into fix-datadog-grouped-parse

a1bdf16

cgreeno marked this pull request as ready for review May 29, 2026 08:18

cgreeno requested a review from a team as a code owner May 29, 2026 08:18

cgreeno wants to merge 4 commits into argoproj:master from cgreeno:fix-datadog-grouped-parse