Add retry mechanism to telemetry requests #617

Open: wants to merge 261 commits into base: main

Conversation

@saishreeeee (Collaborator) commented Jun 25, 2025

What type of PR is this?

  • Refactor
  • Feature
  • Bug Fix
  • Other

Description

Retry mechanism for telemetry requests

How is this tested?

  • Unit tests
  • E2E Tests
  • Manually
  • N/A

Related Tickets & Documents

PECOBLR-586

Jesse and others added 30 commits August 5, 2022 16:23
* Isolate delay bounding logic
* Move error details scope up one-level.
* Retry GetOperationStatus if an OSError was raised during execution. Add retry_delay_default to use in this case.
* Log when a request is retried due to an OSError. Emit warnings for unexpected OSError codes
* Update docstring for make_request
* Nit: unit tests show the .warn message is deprecated. DeprecationWarning: The 'warn' function is deprecated, use 'warning' instead
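
The retry behaviour described in these commit notes can be sketched roughly as follows. This is a minimal illustration, not the connector's code: `call_with_oserror_retry`, `bound_retry_delay`, and `EXPECTED_ERRNOS` are hypothetical names, and only `retry_delay_default` and the switch to `logger.warning` come from the notes above.

```python
import errno
import logging
import time

logger = logging.getLogger(__name__)

# Hypothetical set of OSError codes treated as expected/transient.
EXPECTED_ERRNOS = {errno.ETIMEDOUT, errno.ECONNRESET}


def bound_retry_delay(delay, delay_min, delay_max):
    """Clamp a proposed delay into the range [delay_min, delay_max]."""
    return max(delay_min, min(delay, delay_max))


def call_with_oserror_retry(func, attempts=3, retry_delay_default=0.01,
                            delay_min=0.0, delay_max=1.0, sleep=time.sleep):
    """Call func(), retrying with a bounded delay when it raises OSError."""
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except OSError as err:
            if err.errno not in EXPECTED_ERRNOS:
                # logger.warning, since logger.warn is deprecated.
                logger.warning("Unexpected OSError code: %s", err.errno)
            if attempt == attempts:
                raise  # out of attempts: surface the original error
            delay = bound_retry_delay(retry_delay_default, delay_min, delay_max)
            logger.info("Retrying after OSError (attempt %d of %d)",
                        attempt, attempts)
            sleep(delay)
```

Passing `sleep` as a parameter is only a convenience so the loop can be exercised in tests without real waiting.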

Signed-off-by: Jesse Whitehouse <[email protected]>
Signed-off-by: Jesse Whitehouse <[email protected]>
Signed-off-by: Jesse Whitehouse <[email protected]>
* Test with multiple python versions.
* Update pyarrow to version 9.0.0 to address issue in relation to python 3.10 & a specific version of numpy being pulled in by pyarrow.

Closes #26 

Signed-off-by: David Black <[email protected]>
* Update changelog and bump to v2.0.4
* Specifically thank @dbaxa for this change.

Signed-off-by: Jesse Whitehouse <[email protected]>
Signed-off-by: Jesse Whitehouse <[email protected]>
* Add test: cursors are closed when connection closes

Signed-off-by: Jesse Whitehouse <[email protected]>
Signed-off-by: Moe Derakhshani <[email protected]>
Signed-off-by: Moe Derakhshani <[email protected]>
Signed-off-by: Moe Derakhshani <[email protected]>

my [OAuth PR](https://github.com/databricks/databricks-sql-python/runs/8005844758?check_suite_focus=true) is blocked due to dco validation (following error):
<img width="1202" alt="Screen Shot 2022-08-25 at 12 05 40 PM" src="https://user-images.githubusercontent.com/22279672/186747897-c9d57586-366f-41f9-aa66-609f2bf3911f.png">



We should try to avoid running dco for internal databricks employees:
I am trying to relax the validation based on this guideline:
https://github.com/dcoapp/app/blob/main/README.md#skipping-sign-off-for-organization-members

and here:
https://stackoverflow.com/questions/62969381/is-it-in-line-with-the-dco-that-a-github-sign-off-needs-and-publishes-full-name
Signed-off-by: Moe Derakhshani <[email protected]>
Signed-off-by: Moe Derakhshani <[email protected]>
Signed-off-by: Moe Derakhshani <[email protected]>

This undoes #42 until we figure out how to fix DCO.
This PR:
* Adds the foundation for OAuth against Databricks account on AWS with BYOIDP.
* It copies one internal module that Steve Weis @sweisdb wrote for Databricks CLI (oauth.py). Once ecosystem-dev team (Serge, Pieter) build a python sdk core we will move this code to their repo as a dependency. 
* The PR provides authenticators in a visitor-pattern format for stamping the auth token; these are later intended to move to the repo owned by Serge @nfx and Pieter @pietern
Signed-off-by: Jesse Whitehouse <[email protected]>
Bump to v2.1.0 and update changelog

Signed-off-by: Jesse Whitehouse <[email protected]>
* Refactor so we can unit test `inject_parameters`
* Add unit tests for inject_parameters
* Remove inaccurate comment. Per #51, spark sql does not support escaping a single quote with a second single quote.
* Closes #51 and adds unit tests plus the integration test provided in #56
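
To illustrate the quoting issue behind #51: since doubling a single quote is not accepted, a string parameter has to be escaped some other way before it is substituted into the query text. The sketch below assumes backslash escaping; `escape_string` and this particular `inject_parameters` signature are hypothetical stand-ins, not the connector's actual implementation.

```python
def escape_string(value: str) -> str:
    """Quote a string literal, escaping backslashes and single quotes."""
    # Escape backslashes first so the ones added for quotes are not doubled.
    escaped = value.replace("\\", "\\\\").replace("'", "\\'")
    return "'{}'".format(escaped)


def inject_parameters(operation: str, parameters: dict) -> str:
    """Substitute %(name)s placeholders with escaped parameter values."""
    rendered = {
        key: escape_string(value) if isinstance(value, str) else str(value)
        for key, value in parameters.items()
    }
    return operation % rendered


query = inject_parameters(
    "SELECT * FROM people WHERE name = %(name)s AND age > %(age)s",
    {"name": "O'Brien", "age": 30},
)
```

Keeping the substitution in a small pure function like this is what makes it straightforward to unit test, which is the point of the refactor noted above.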

Signed-off-by: Jesse Whitehouse <[email protected]>
Co-authored-by: Courtney Holcomb (@courtneyholcomb)
Co-authored-by: @mcannamela
Signed-off-by: Jesse Whitehouse <[email protected]>
Signed-off-by: Jesse Whitehouse <[email protected]>
Signed-off-by: Jesse Whitehouse <[email protected]>
Add none check on _oauth_persistence in DatabricksOAuthProvider to avoid app crash when _oauth_persistence is None.

Signed-off-by: Jacky Hu <[email protected]>
* Support custom oauth client id and redirect port range

PySQL is used by other tools/CLIs which have their own oauth client id,
so we need to expose oauth_client_id and oauth_redirect_port_range
as connection parameters to support this customization.

Signed-off-by: Jacky Hu <[email protected]>

* Change oauth redirect port range to port

Signed-off-by: Jacky Hu <[email protected]>

* Fix type check issue

Signed-off-by: Jacky Hu <[email protected]>

Signed-off-by: Jacky Hu <[email protected]>
Signed-off-by: Jacky Hu <[email protected]>
Signed-off-by: Jesse <[email protected]>
Follow up to #67 and #64 

* Regenerate TCLIService using latest TCLIService.thrift from DBR (#64)
* SI: Implement GET, PUT, and REMOVE (#67)
* Re-lock dependencies after merging `main`

Signed-off-by: Jesse Whitehouse <[email protected]>
Signed-off-by: Sai Shree Pradhan <[email protected]>
@Copilot Copilot AI left a comment

Pull Request Overview

Adds a retry mechanism to telemetry requests by integrating a custom HTTP adapter and retry policy, updates client methods to use a session with retries, and adjusts tests accordingly.

  • Introduce TelemetryHTTPAdapter to apply DatabricksRetryPolicy before each request
  • Update TelemetryClient to use requests.Session with mounted retry adapter and replace direct requests.post
  • Adjust unit and E2E tests to mock Session.post and verify retry behavior
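
For readers unfamiliar with the pattern, here is a minimal sketch of a `requests.Session` with a retry-enabled adapter mounted on it. It does not reproduce `TelemetryHTTPAdapter` or `DatabricksRetryPolicy`; urllib3's stock `Retry` class stands in for the policy, and `build_retrying_session` plus all retry settings are assumptions chosen for illustration (the `allowed_methods` argument assumes urllib3 1.26 or newer).

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry


def build_retrying_session(total_retries: int = 3) -> requests.Session:
    """Return a Session whose HTTPS requests are retried on transient errors."""
    retry = Retry(
        total=total_retries,
        backoff_factor=0.5,            # grows the wait between attempts
        status_forcelist=(429, 503),   # retry on these response codes
        # POST is not retried by urllib3 by default, so opt in explicitly.
        allowed_methods=frozenset({"POST"}),
    )
    session = requests.Session()
    session.mount("https://", HTTPAdapter(max_retries=retry))
    return session
```

A caller would then use `session.post(url, ...)` in place of `requests.post(url, ...)` and call `session.close()` when the client shuts down.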

Reviewed Changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated no comments.

| File | Description |
| --- | --- |
| tests/unit/test_telemetry.py | Updated mocks and assertions to patch and verify Session.post calls |
| tests/e2e/test_telemetry_retry.py | New E2E tests for retry scenarios, mocking HTTPSConnectionPool |
| src/databricks/sql/telemetry/telemetry_client.py | Added TelemetryHTTPAdapter, retry policy setup, session usage, session close, and log level tweak for uninitialized client |
| src/databricks/sql/exc.py | Removed top-level telemetry import and added a lazy import inside the constructor |
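
The lazy-import change to `exc.py` follows a common pattern: moving an import from module level into the constructor defers it until the first instance is created, which is one way to avoid a circular import at module load time. A minimal sketch, in which the standard-library `json` module stands in for the telemetry module and the `Error` fields are hypothetical:

```python
class Error(Exception):
    """Base exception that records optional context for later reporting."""

    def __init__(self, message=None, context=None):
        super().__init__(message)
        self.message = message
        self.context = context or {}
        # Lazy import: resolved the first time an Error is constructed, not
        # when this module is imported. `json` is a stand-in for the
        # telemetry module used in the real file.
        import json

        self.serialized_context = json.dumps(self.context, sort_keys=True)
```

After the first call the module is cached in `sys.modules`, so repeated construction pays only a dictionary lookup.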
Comments suppressed due to low confidence (2)

src/databricks/sql/telemetry/telemetry_client.py:438

  • The fallback log when the client isn’t initialized is set to debug, which may suppress important errors; consider using warning level to surface misconfiguration.
                logger.debug(
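
The trade-off in this suggestion can be seen with the standard `logging` module: with a logger at the `WARNING` level (which is also the root logger's default), a `debug` record is dropped while a `warning` record is emitted. The logger name and messages below are made up for the illustration.

```python
import io
import logging

# Capture log output in memory so the effect of the level is visible.
stream = io.StringIO()
logger = logging.getLogger("telemetry_sketch")
logger.addHandler(logging.StreamHandler(stream))
logger.setLevel(logging.WARNING)
logger.propagate = False  # keep the sketch from writing to the root logger

logger.debug("telemetry client not initialized (debug)")      # dropped
logger.warning("telemetry client not initialized (warning)")  # emitted

output = stream.getvalue()
```

Whether the quieter level is preferable depends on whether a missing telemetry client should be visible to end users at all.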

src/databricks/sql/telemetry/telemetry_client.py:203

  • [nitpick] Currently the retry adapter is only mounted for HTTPS; if future tests or endpoints use HTTP, consider mounting on "http://" as well to ensure consistency.
        self._session.mount("https://", adapter)
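
A sketch of what the nitpick proposes, using a plain `HTTPAdapter` with an integer retry count as a placeholder for the PR's adapter:

```python
import requests
from requests.adapters import HTTPAdapter

session = requests.Session()
adapter = HTTPAdapter(max_retries=3)  # placeholder for the telemetry adapter

# Mount the same adapter for both schemes so plain-HTTP endpoints
# (for example a local test server) get the same retry behaviour.
for prefix in ("https://", "http://"):
    session.mount(prefix, adapter)
```

`Session.get_adapter(url)` returns the adapter whose mounted prefix matches the URL, so both schemes now resolve to the same object.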

Signed-off-by: Sai Shree Pradhan <[email protected]>
* added functionality for export of failure logs

Signed-off-by: Sai Shree Pradhan <[email protected]>

* changed logger.error to logger.debug in exc.py

Signed-off-by: Sai Shree Pradhan <[email protected]>

* Fix telemetry loss during Python shutdown

Signed-off-by: Sai Shree Pradhan <[email protected]>

* unit tests for export_failure_log

Signed-off-by: Sai Shree Pradhan <[email protected]>

* try-catch blocks to make telemetry failures non-blocking for connector operations

Signed-off-by: Sai Shree Pradhan <[email protected]>

* removed redundant try/catch blocks, added try/catch block to initialize and get telemetry client

Signed-off-by: Sai Shree Pradhan <[email protected]>

* skip null fields in telemetry request

Signed-off-by: Sai Shree Pradhan <[email protected]>

* removed dup import, renamed func, changed a filter_null_values to lambda

Signed-off-by: Sai Shree Pradhan <[email protected]>

* removed unnecessary class variable and a redundant try/except block

Signed-off-by: Sai Shree Pradhan <[email protected]>

* public functions defined at interface level

Signed-off-by: Sai Shree Pradhan <[email protected]>

* changed export_event and flush to private functions

Signed-off-by: Sai Shree Pradhan <[email protected]>

* formatting

Signed-off-by: Sai Shree Pradhan <[email protected]>

* changed connection_uuid to thread local in thrift backend

Signed-off-by: Sai Shree Pradhan <[email protected]>

* made errors more specific

Signed-off-by: Sai Shree Pradhan <[email protected]>

* revert change to connection_uuid

Signed-off-by: Sai Shree Pradhan <[email protected]>

* reverting change in close in telemetry client

Signed-off-by: Sai Shree Pradhan <[email protected]>

* JsonSerializableMixin

Signed-off-by: Sai Shree Pradhan <[email protected]>

* isdataclass check in JsonSerializableMixin

Signed-off-by: Sai Shree Pradhan <[email protected]>

* convert TelemetryClientFactory to module-level functions, replace NoopTelemetryClient class with NOOP_TELEMETRY_CLIENT singleton, updated tests accordingly

Signed-off-by: Sai Shree Pradhan <[email protected]>

* renamed connection_uuid as session_id_hex

Signed-off-by: Sai Shree Pradhan <[email protected]>

* added NotImplementedError to abstract class, added unit tests

Signed-off-by: Sai Shree Pradhan <[email protected]>

* formatting

Signed-off-by: Sai Shree Pradhan <[email protected]>

* added PEP-249 link, changed NoopTelemetryClient implementation

Signed-off-by: Sai Shree Pradhan <[email protected]>

* removed unused import

Signed-off-by: Sai Shree Pradhan <[email protected]>

* made telemetry client close a module-level function

Signed-off-by: Sai Shree Pradhan <[email protected]>

* unit tests verbose

Signed-off-by: Sai Shree Pradhan <[email protected]>

* debug logs in unit tests

Signed-off-by: Sai Shree Pradhan <[email protected]>

* debug logs in unit tests

Signed-off-by: Sai Shree Pradhan <[email protected]>

* removed ABC from mixin, added try/catch block around executor shutdown

Signed-off-by: Sai Shree Pradhan <[email protected]>

* checking stuff

Signed-off-by: Sai Shree Pradhan <[email protected]>

* finding out

* finding out more

* more more finding out more nice

* locks are useless anyways

* haha

* normal

* := looks like walrus horizontally

* one more

* walrus again

* old stuff without walrus seems to fail

* manually do the walrussing

* change 3.13t, v2

Signed-off-by: Sai Shree Pradhan <[email protected]>

* formatting, added walrus

Signed-off-by: Sai Shree Pradhan <[email protected]>

* formatting

Signed-off-by: Sai Shree Pradhan <[email protected]>

* removed walrus, removed test before stalling test

Signed-off-by: Sai Shree Pradhan <[email protected]>

* changed order of stalling test

Signed-off-by: Sai Shree Pradhan <[email protected]>

* removed debugging, added TelemetryClientFactory

Signed-off-by: Sai Shree Pradhan <[email protected]>

* remove more debugging

Signed-off-by: Sai Shree Pradhan <[email protected]>

* latency logs functionality

Signed-off-by: Sai Shree Pradhan <[email protected]>

* fixed type of return value in get_session_id_hex() in thrift backend

Signed-off-by: Sai Shree Pradhan <[email protected]>

* debug on TelemetryClientFactory lock

Signed-off-by: Sai Shree Pradhan <[email protected]>

* formatting

Signed-off-by: Sai Shree Pradhan <[email protected]>

* type notation for _waiters

Signed-off-by: Sai Shree Pradhan <[email protected]>

* called connection.close() in test_arraysize_buffer_size_passthrough

Signed-off-by: Sai Shree Pradhan <[email protected]>

* run all unit tests

Signed-off-by: Sai Shree Pradhan <[email protected]>

* more debugging

Signed-off-by: Sai Shree Pradhan <[email protected]>

* removed the connection.close() from that test, put debug statement before and after TelemetryClientFactory lock

Signed-off-by: Sai Shree Pradhan <[email protected]>

* more debug

Signed-off-by: Sai Shree Pradhan <[email protected]>

* more more more

Signed-off-by: Sai Shree Pradhan <[email protected]>

* why

Signed-off-by: Sai Shree Pradhan <[email protected]>

* whywhy

Signed-off-by: Sai Shree Pradhan <[email protected]>

* thread name

Signed-off-by: Sai Shree Pradhan <[email protected]>

* added teardown to all tests except finalizer test (gc collect)

Signed-off-by: Sai Shree Pradhan <[email protected]>

* added the get_attribute functions to the classes

Signed-off-by: Sai Shree Pradhan <[email protected]>

* removed tearDown, added connection.close() to first test

Signed-off-by: Sai Shree Pradhan <[email protected]>

* finally

Signed-off-by: Sai Shree Pradhan <[email protected]>

* remove debugging

Signed-off-by: Sai Shree Pradhan <[email protected]>

* added test for export_latency_log, made mock of thrift backend with retry policy

Signed-off-by: Sai Shree Pradhan <[email protected]>

* added multi threaded tests

Signed-off-by: Sai Shree Pradhan <[email protected]>

* formatting

Signed-off-by: Sai Shree Pradhan <[email protected]>

* added TelemetryExtractor, removed multithreaded tests

Signed-off-by: Sai Shree Pradhan <[email protected]>

* formatting

Signed-off-by: Sai Shree Pradhan <[email protected]>

* fixes in test

Signed-off-by: Sai Shree Pradhan <[email protected]>

* fix in telemetry extractor

Signed-off-by: Sai Shree Pradhan <[email protected]>

* added doc strings to latency_logger, abstracted export_telemetry_log

Signed-off-by: Sai Shree Pradhan <[email protected]>

* statement type, unit test fix

Signed-off-by: Sai Shree Pradhan <[email protected]>

* unit test fix

Signed-off-by: Sai Shree Pradhan <[email protected]>

* statement type changes

Signed-off-by: Sai Shree Pradhan <[email protected]>

* test_fetches fix

Signed-off-by: Sai Shree Pradhan <[email protected]>

* added mocks to resolve the errors caused by log_latency decorator in tests

Signed-off-by: Sai Shree Pradhan <[email protected]>

* removed function in test_fetches cuz it is only used once

Signed-off-by: Sai Shree Pradhan <[email protected]>

* added _safe_call which returns None in case of errors in the get functions

Signed-off-by: Sai Shree Pradhan <[email protected]>

* removed the changes in test_client and test_fetches

Signed-off-by: Sai Shree Pradhan <[email protected]>

* removed the changes in test_fetches

Signed-off-by: Sai Shree Pradhan <[email protected]>

* test_telemetry

Signed-off-by: Sai Shree Pradhan <[email protected]>

* removed test

Signed-off-by: Sai Shree Pradhan <[email protected]>

---------
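
Two of the changes listed above, skipping null fields in the telemetry request and the dataclass check in `JsonSerializableMixin`, can be sketched together. This is an assumption about the general shape only: the `to_json` method name and the `DriverInfo` class are hypothetical.

```python
import json
from dataclasses import asdict, dataclass, is_dataclass
from typing import Optional


class JsonSerializableMixin:
    """Serialize a dataclass to JSON, omitting fields whose value is None."""

    def to_json(self) -> str:
        if not is_dataclass(self):
            raise TypeError("JsonSerializableMixin requires a dataclass")
        # dict_factory receives (name, value) pairs for each dataclass,
        # including nested ones, so None values are dropped at every level.
        payload = asdict(
            self,
            dict_factory=lambda items: {k: v for k, v in items if v is not None},
        )
        return json.dumps(payload)


@dataclass
class DriverInfo(JsonSerializableMixin):  # hypothetical telemetry model
    driver_name: str
    driver_version: Optional[str] = None
```

With this shape, `DriverInfo("pysql").to_json()` leaves out `driver_version` rather than sending it as `null`.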

Signed-off-by: Sai Shree Pradhan <[email protected]>
Signed-off-by: Sai Shree Pradhan <[email protected]>
Signed-off-by: Sai Shree Pradhan <[email protected]>
Signed-off-by: Sai Shree Pradhan <[email protected]>
@saishreeeee saishreeeee changed the base branch from telemetry to main July 11, 2025 05:42
Signed-off-by: Sai Shree Pradhan <[email protected]>

Thanks for your contribution! To satisfy the DCO policy in our contributing guide every commit message must include a sign-off message. One or more of your commits is missing this message. You can reword previous commit messages with an interactive rebase (git rebase -i main).

Signed-off-by: Sai Shree Pradhan <[email protected]>
@jprakash-db (Contributor) commented

Can you try to incorporate the common HTTP client?

class DatabricksHttpClient:
