Skip to content

fix(metricprovider): parse grouped Datadog queries (by {tag}) instead of crashing. Fixes #4276#4746

Open
cgreeno wants to merge 4 commits into
argoproj:masterfrom
cgreeno:fix-datadog-grouped-parse
Open

fix(metricprovider): parse grouped Datadog queries (by {tag}) instead of crashing. Fixes #4276#4746
cgreeno wants to merge 4 commits into
argoproj:masterfrom
cgreeno:fix-datadog-grouped-parse

Conversation

@cgreeno
Copy link
Copy Markdown

@cgreeno cgreeno commented May 27, 2026

Checklist:

  • Either (a) I've created an enhancement proposal and discussed it with the community, (b) this is a bug fix, or (c) this is a chore.

    (b) bug fix. Today, any Datadog v2 query using by {tag} crashes in parseResponseV2 with json: cannot unmarshal string into Go struct field .Data.Attributes.Columns.Values of type []*float64. Issue Support evaluating grouped Datadog metrics (by {tag}) in AnalysisTemplates #4276 tracks the user-visible symptom.

  • The title of the PR is (a) conventional with a list of types and scopes found here, (b) states what changed, and (c) suffixes the related issues number. E.g. "fix(controller): Updates such and such. Fixes #1234".

    fix(metricprovider): parse grouped Datadog queries (by {tag}) instead of crashing. Fixes #4276.

  • I've signed my commits with DCO

  • My builds are green. Try syncing with master if they are not.

  • I have written unit and/or e2e tests for my change. PRs without these are unlikely to be merged.

    4 new table cases in datadogV2_test.go covering grouped queries with max(result) and all(result, # < X), a grouped query with a null entry, and a populated Datadog errors response. Focused unit tests in datadog_test.go for Numeric, Strings, labelGroups, and a parseResponseV2 body-read-error case. Package coverage: 90.2% → 93.0% with helpers at 100%.

  • I have run all tests locally (including the flaky ones) and they pass on my workstation

    go test ./... → 83 packages, 0 failures. golangci-lint run ./metricproviders/datadog/... → 0 issues.

  • I have used LLM/AI/Agent tools for this PR but I am responsible for all code of this PR

  • I understand what the code does and WHY/HOW it works in several scenarios

    1. Ungrouped query (1 value, no group column) → returns a scalar; successCondition: result < X works as before. 2. Empty / all-null response → empty branch, evaluates against nilFloat64 exactly like today. 3. Grouped query → finds the number column by Type=="number", drops nulls, returns []float64 to EvaluateResult for evaluation against Expr functions (max, mean, all, etc). 4. Mismatched response (more values than group names) → labelGroups breaks rather than indexing past the slice.
  • I know if my code is just adding new functionality or changing old functionality for existing users

    Purely additive for working users. Ungrouped queries are byte-identical. Grouped queries crashed before and now work. The one existing test asserting the old JSON-unmarshal error string needed its expectation updated (the response struct now uses []json.RawMessage instead of []*float64).

  • My organization is added to USERS.md.

    Will do as a separate PR per @kostis-codefresh's suggestion on the previous attempt.

Changes

Background: previous attempt was #4731. Closed after @kostis-codefresh's review — he correctly pointed out that a custom reducer field reinvents what Argo Rollouts already does in other providers using the Expr library. This PR is the refactor.

Why it currently crashes

Datadog v2's /api/v2/query/scalar endpoint returns multiple columns when a query uses by {tag} — a string group column holding the tag values and a number column holding the per-group scalars. The current parser types Values as []*float64, so the very first JSON unmarshal fails when it hits the group column's strings:

Could not parse JSON body: json: cannot unmarshal string into Go struct field
.Data.Attributes.Columns.Values of type []*float64

The fix

  • Response struct now keeps Values []json.RawMessage per column, with Numeric() and Strings() helpers that filter to the typed values caller cares about. Nulls (Datadog's "no point" marker) are dropped.
  • parseResponseV2 locates the Type=="number" column and returns a scalar for single-value responses (preserving today's successCondition: result < X semantics) or []float64 for grouped responses, matching how the Prometheus provider passes a []float64 to EvaluateResult for vector results (metricproviders/prometheus/prometheus.go:155-188). Note: unlike Prometheus, single-value Datadog queries still return a scalar so the existing successCondition: result < X configurations keep working unchanged.
  • No new API field. Users compose conditions in Expr just like with Prometheus/Wavefront vector results.

Example

metrics:
- name: per-endpoint-error-rate
  interval: 30s
  successCondition: max(result) < 0.05
  failureLimit: 3
  provider:
    datadog:
      apiVersion: v2
      interval: 5m
      query: "sum:trace.http.request.errors{service:my-service} by {resource_name}.as_count()"

If any endpoint exceeds 5% errors, the canary fails. Any Expr function works (mean, all, any, sum, median, len, etc.) — full list in the Expr docs.

Per-group identity in metadata

For grouped queries, the measurement exposes a metadata.groups field with tag=value pairs so operators can map an outlier in result back to the producing entity:

Phase: Failed
Value: [0.012,0.087,0.041]
Metadata:
  groups: "POST /a=0.012,POST /b=0.087,POST /c=0.041"

This is purely additive — visible in kubectl describe analysisrun, the rollouts dashboard, and notification templates. If reviewers would prefer the metadata.groups piece split into a focused follow-up PR, happy to do that and keep this PR strictly the parse fix.

Backward compatibility

  • Ungrouped queries are byte-identical in behavior.
  • Legacy responses without a type field on columns (existing test fixtures and any older caller) are treated as numeric, preserving the previous "first column wins" behavior.
  • The one existing test asserting the old JSON-unmarshal error string needed its expectation updated because Values is now []json.RawMessage.

Files

  • metricproviders/datadog/datadog.go — response struct + Numeric/Strings/labelGroups/formatValueSlice helpers + parseResponseV2 refactor.
  • metricproviders/datadog/datadog_test.go, datadogV2_test.go — tests.
  • docs/analysis/datadog.md — documents grouped-query usage and the metadata.groups field.

cgreeno added 2 commits May 27, 2026 16:12
… of crashing. Fixes argoproj#4276

Today, any Datadog v2 scalar query that uses `by {tag}` crashes in
parseResponseV2. The v2 scalar API returns multiple columns for grouped
queries - a string "group" column holding the tag values and a "number"
column holding the per-group scalars - but the existing parser typed
Values as []*float64, so the very first JSON unmarshal panics on the
string column.

This change makes the parser locate the number column by Type, drop
nulls, and return:

- A scalar (float64) when the response has a single value, preserving
  the existing successCondition: result < X style for ungrouped queries.
- A []float64 when the response has multiple values, mirroring how the
  Prometheus provider surfaces vector results. Users compose conditions
  with standard Expr functions like max(result) < X, mean(result),
  all(result, # < X), or any(result, # >= X).

When the query is grouped, the measurement also exposes a
metadata.groups field with comma-separated name=value pairs so
operators can map an outlier in result back to the entity that
produced it.

No new API field, no CRD/proto changes. Pre-existing ungrouped queries
are byte-identical in behavior; pre-existing grouped queries crashed
and now work.

Signed-off-by: Chris Greeno <cgreeno@gmail.com>
…error on bad numeric; JSON-encode metadata.groups

Address three issues found in code review of the grouped-query parse fix:

1. Null alignment bug. With separate Numeric() and Strings() helpers,
   null entries in the number column were skipped but the
   corresponding group entry was kept, shifting the labels off by one
   for any value after a null. Replaced with a single extractValues
   walk that drops the matching group entry alongside any null number
   and keeps the parallel slices in sync.

2. Silent drop on bad numeric. A non-null, non-numeric entry in a
   number column used to be quietly dropped, which would cause the
   analysis to evaluate against an incomplete result and potentially
   ship a bad rollout. Now returns an explicit error naming the
   offending column.

3. CSV-encoded metadata.groups is fragile. Datadog tag values can
   legally contain `,` and `=`, which would corrupt the previous
   `name=value,name=value` format. Switched to a JSON array of
   {name, value} objects so notification templates can fromJson it
   cleanly.

Also removes a vestigial reflect.ValueOf(...).IsZero() check that was
made dead by the new column-length and numColIdx logic.

New tests cover: null-leading number column with correct pairing,
non-numeric entries returning an error, group-only response (no
number column), reversed column ordering, legacy responses without a
type field. Package coverage rises to 92.6%; new helpers at 100%.

Signed-off-by: Chris Greeno <cgreeno@gmail.com>
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 27, 2026

Published E2E Test Results

  4 files    4 suites   3h 51m 59s ⏱️
120 tests 106 ✅  7 💤  7 ❌
496 runs  454 ✅ 28 💤 14 ❌

For more details on these failures, see this check.

Results for commit a1bdf16.

♻️ This comment has been updated with latest results.

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 27, 2026

Published Unit Test Results

2 483 tests   2 483 ✅  3m 20s ⏱️
  129 suites      0 💤
    1 files        0 ❌

Results for commit a1bdf16.

♻️ This comment has been updated with latest results.

@codecov
Copy link
Copy Markdown

codecov Bot commented May 27, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 85.12%. Comparing base (017676b) to head (a1bdf16).

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #4746      +/-   ##
==========================================
+ Coverage   85.01%   85.12%   +0.10%     
==========================================
  Files         164      164              
  Lines       18989    19026      +37     
==========================================
+ Hits        16144    16195      +51     
+ Misses       1995     1988       -7     
+ Partials      850      843       -7     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

cgreeno added 2 commits May 28, 2026 14:27
…ve complexity

SonarCloud flagged extractValues as cognitive complexity 20 (allowed
15, rule go:S3776) and noted an unnecessary named-variable in the
inner json.Unmarshal block (rule godre:S8193). Both are addressed by
extracting two small helpers:

- findColumns: locates the first numeric and first group column in
  one pass. Removes the dual-purpose loop from the top of
  extractValues.
- groupNameAt: returns the tag name at an index in a group column, or
  ("", false) if the column is missing/short/non-string. Removes the
  three-level nesting (nil check + bounds check + unmarshal-error
  check) inside extractValues.

extractValues itself is now a single linear walk over the number
column. No behavior change; all existing tests pass and the new
groupNameAt helper is covered by 4 focused subtests (nil column,
index past end, valid string entry, non-string entry).

Package coverage 92.7%, all new helpers at 100%.

Signed-off-by: Chris Greeno <cgreeno@gmail.com>
@sonarqubecloud
Copy link
Copy Markdown

@cgreeno
Copy link
Copy Markdown
Author

cgreeno commented May 29, 2026

@kostis-codefresh - Thanks again for the previous review and guidance!

Hopefully this looks a bit better? If not feedback more than welcome. :)

@cgreeno cgreeno marked this pull request as ready for review May 29, 2026 08:18
@cgreeno cgreeno requested a review from a team as a code owner May 29, 2026 08:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant