Retry transient network errors in HttpRetryPolicy (fix #260)#398
Open
msrathore-db wants to merge 1 commit into
Open
Retry transient network errors in HttpRetryPolicy (fix #260)#398msrathore-db wants to merge 1 commit into
msrathore-db wants to merge 1 commit into
Conversation
`HttpRetryPolicy.invokeWithRetry` only consulted `shouldRetry` after the fetch resolved, so when `fetch()` itself rejected with a network error (`socket hang up`, `ECONNRESET`, etc.) the rejection propagated on attempt #1 with no retry — the exact failure pattern reported in issue #260, including the cloud-storage variant haggholm and eranga-acerta reported. A related issue: `response.buffer()` (ThriftHttpConnection) and `response.arrayBuffer()` (CloudFetchResultHandler) sat outside the retried block, so node-fetch's "Premature close" body-stream errors were also not retried. Changes: * `HttpRetryPolicy.invokeWithRetry` catches operation rejections, classifies them via `isRetryableNetworkError` (FetchError `system` / `body-timeout`, OS errnos `ECONNRESET`/`ECONNREFUSED`/`ETIMEDOUT`/ `EHOSTUNREACH`/`ENETUNREACH`/`EPIPE`/`ENOTFOUND`/`EAI_AGAIN`, and message patterns `socket hang up`/`premature close`/`aborted`), and retries with the same attempt-budget/backoff used for HTTP retries. Aligns with what Python (urllib3 default OSError retry) and JDBC (Apache HttpClient retry handler) already do. * `ThriftHttpConnection.write` reads `response.buffer()` inside the retried block, so body-stream failures are retried too. * `CloudFetchResultHandler.fetch` reads `response.arrayBuffer()` inside the retried block, with a guarded fallback so existing test stubs that pre-bake `Response` objects keep working. Idempotency is preserved: `ExecuteStatement` and other write-y Thrift methods continue to use `NullRetryPolicy`. Network-error retry only applies where `HttpRetryPolicy` was already in use (idempotent Thrift methods + CloudFetch GETs). Verified end-to-end against the pecotesting Azure warehouse with TLS- layer RST injection on `*.blob.core.windows.net` mid-response: base crashes with `FetchError: socket hang up` (matches issue #260 exactly); with this fix the same fault injection passes transparently with correct row counts. Co-authored-by: Isaac
|
Thanks for your contribution! To satisfy the DCO policy in our contributing guide every commit message must include a sign-off message. One or more of your commits is missing this message. You can reword previous commit messages with an interactive rebase ( |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Fixes #260 — the
FetchError: socket hang up/Premature close/ECONNRESETfailures users have been reporting during long-running workloads, including the cloud-storage variant (*.blob.core.windows.netresult-link downloads) reported by @haggholm and @eranga-acerta on the issue.Root cause
HttpRetryPolicy.invokeWithRetryonly consultedshouldRetryafter the fetch resolved, so whenfetch()itself rejected with a network error the rejection propagated on attempt #1 with no retry. Additionally,response.buffer()(Thrift) andresponse.arrayBuffer()(CloudFetch) sat outside the retried block, so node-fetch's "Premature close" body-stream errors were not retried either.Sibling drivers don't have this gap: Python relies on urllib3's default
OSErrorretry, JDBC uses Apache HttpClient's stale-connection detection + retry handler.Changes
HttpRetryPolicy.invokeWithRetry— catches operation rejections, classifies them viaisRetryableNetworkError(FetchErrorsystem/body-timeout, OS errnosECONNRESET/ECONNREFUSED/ETIMEDOUT/EHOSTUNREACH/ENETUNREACH/EPIPE/ENOTFOUND/EAI_AGAIN, and message patternssocket hang up/premature close/aborted), and retries with the same attempt-budget / backoff used for HTTP retries.ThriftHttpConnection.write— readsresponse.buffer()inside the retried block so body-stream failures are retried too.CloudFetchResultHandler.fetch— readsresponse.arrayBuffer()inside the retried block, with a guarded fallback so existing test stubs that pre-bakeResponseobjects keep working.Idempotency
Unchanged.
ExecuteStatementand other write-y Thrift methods continue to useNullRetryPolicy. Network-error retry only applies whereHttpRetryPolicywas already in use (idempotent Thrift methods + CloudFetch GETs).Test plan
*.blob.core.windows.netmid-response. Base branch crashes with the exactFetchErrorfrom issue [DATABRICKS ERROR]: "FetchError: socket hang up" #260; with this fix the same fault injection passes transparently with correct row counts.Risk
ExecuteStatement(NullRetryPolicy) is unchanged — no risk of double-execution.This pull request and its description were written by Isaac.