Skip to content

[3/7] Telemetry Client Management: TelemetryClient and Provider#326

Open
samikshya-db wants to merge 34 commits intomainfrom
telemetry-3-client-management
Open

[3/7] Telemetry Client Management: TelemetryClient and Provider#326
samikshya-db wants to merge 34 commits intomainfrom
telemetry-3-client-management

Conversation

@samikshya-db
Copy link
Copy Markdown
Collaborator

@samikshya-db samikshya-db commented Jan 28, 2026

[3/7] Telemetry Client Management

Part of the 7-part telemetry stack. Depends on #324, #325, and #362.

What this PR adds

TelemetryClient (lib/telemetry/TelemetryClient.ts): per-host client with idempotent close() and debug-level logging.

TelemetryClientProvider (lib/telemetry/TelemetryClientProvider.ts): reference-counted manager — one TelemetryClient per host, created on first use and closed when the last connection releases it.

urlUtils.ts (lib/telemetry/urlUtils.ts): builds a full URL from host + path, adding https:// when no protocol is present.

Sync with main

Merged onto main twice to pick up the full infrastructure from #325 (CircuitBreaker, FeatureFlagCache, DatabricksTelemetryExporter, MetricsAggregator, TelemetryEventEmitter, types) and the shared-connection-stack + close() flush-race fix from #362. Where both branches added the same file, main's version was preferred.

Tests

39 unit tests for TelemetryClient and TelemetryClientProvider; all 254 telemetry tests pass.

samikshya-db and others added 5 commits January 29, 2026 20:20
This is part 2 of 7 in the telemetry implementation stack.

Components:
- CircuitBreaker: Per-host endpoint protection with state management
- FeatureFlagCache: Per-host feature flag caching with reference counting
- CircuitBreakerRegistry: Manages circuit breakers per host

Circuit Breaker:
- States: CLOSED (normal), OPEN (failing), HALF_OPEN (testing recovery)
- Default: 5 failures trigger OPEN, 60s timeout, 2 successes to CLOSE
- Per-host isolation prevents cascade failures
- All state transitions logged at debug level

Feature Flag Cache:
- Per-host caching with 15-minute TTL
- Reference counting for connection lifecycle management
- Automatic cache expiration and refetch
- Context removed when refCount reaches zero

Testing:
- 32 comprehensive unit tests for CircuitBreaker
- 29 comprehensive unit tests for FeatureFlagCache
- 100% function coverage, >80% line/branch coverage
- CircuitBreakerStub for testing other components

Dependencies:
- Builds on [1/7] Types and Exception Classifier
Implements getAuthHeaders() method for authenticated REST API requests:
- Added getAuthHeaders() to IClientContext interface
- Implemented in DBSQLClient using authProvider.authenticate()
- Updated FeatureFlagCache to fetch from connector-service API with auth
- Added driver version support for version-specific feature flags
- Replaced placeholder implementation with actual REST API calls

Co-Authored-By: Claude Sonnet 4.5 <[email protected]>
- Change feature flag endpoint to use NODEJS client type
- Fix telemetry endpoints to /telemetry-ext and /telemetry-unauth
- Update payload to match proto with system_configuration
- Add shared buildUrl utility for protocol handling
- Change payload structure to match JDBC: uploadTime, items, protoLogs
- protoLogs contains JSON-stringified TelemetryFrontendLog objects
- Remove workspace_id (JDBC doesn't populate it)
- Remove debug logs added during testing
- Fix import order in FeatureFlagCache
- Replace require() with import for driverVersion
- Fix variable shadowing
- Disable prefer-default-export for urlUtils
@samikshya-db samikshya-db force-pushed the telemetry-2-infrastructure branch from 06b1f7a to 1cdb716 Compare January 29, 2026 20:20
samikshya-db and others added 6 commits January 29, 2026 20:20
This is part 2 of 7 in the telemetry implementation stack.

Components:
- CircuitBreaker: Per-host endpoint protection with state management
- FeatureFlagCache: Per-host feature flag caching with reference counting
- CircuitBreakerRegistry: Manages circuit breakers per host

Circuit Breaker:
- States: CLOSED (normal), OPEN (failing), HALF_OPEN (testing recovery)
- Default: 5 failures trigger OPEN, 60s timeout, 2 successes to CLOSE
- Per-host isolation prevents cascade failures
- All state transitions logged at debug level

Feature Flag Cache:
- Per-host caching with 15-minute TTL
- Reference counting for connection lifecycle management
- Automatic cache expiration and refetch
- Context removed when refCount reaches zero

Testing:
- 32 comprehensive unit tests for CircuitBreaker
- 29 comprehensive unit tests for FeatureFlagCache
- 100% function coverage, >80% line/branch coverage
- CircuitBreakerStub for testing other components

Dependencies:
- Builds on [1/7] Types and Exception Classifier
This is part 3 of 7 in the telemetry implementation stack.

Components:
- TelemetryClient: HTTP client for telemetry export per host
- TelemetryClientProvider: Manages per-host client lifecycle with reference counting

TelemetryClient:
- Placeholder HTTP client for telemetry export
- Per-host isolation for connection pooling
- Lifecycle management (open/close)
- Ready for future HTTP implementation

TelemetryClientProvider:
- Reference counting tracks connections per host
- Automatically creates clients on first connection
- Closes and removes clients when refCount reaches zero
- Thread-safe per-host management

Design Pattern:
- Follows JDBC driver pattern for resource management
- One client per host, shared across connections
- Efficient resource utilization
- Clean lifecycle management

Testing:
- 31 comprehensive unit tests for TelemetryClient
- 31 comprehensive unit tests for TelemetryClientProvider
- 100% function coverage, >80% line/branch coverage
- Tests verify reference counting and lifecycle

Dependencies:
- Builds on [1/7] Types and [2/7] Infrastructure
Implements getAuthHeaders() method for authenticated REST API requests:
- Added getAuthHeaders() to IClientContext interface
- Implemented in DBSQLClient using authProvider.authenticate()
- Updated FeatureFlagCache to fetch from connector-service API with auth
- Added driver version support for version-specific feature flags
- Replaced placeholder implementation with actual REST API calls

Co-Authored-By: Claude Sonnet 4.5 <[email protected]>
- Change feature flag endpoint to use NODEJS client type
- Fix telemetry endpoints to /telemetry-ext and /telemetry-unauth
- Update payload to match proto with system_configuration
- Add shared buildUrl utility for protocol handling
- Change payload structure to match JDBC: uploadTime, items, protoLogs
- protoLogs contains JSON-stringified TelemetryFrontendLog objects
- Remove workspace_id (JDBC doesn't populate it)
- Remove debug logs added during testing
- Fix import order in FeatureFlagCache
- Replace require() with import for driverVersion
- Fix variable shadowing
- Disable prefer-default-export for urlUtils
@samikshya-db samikshya-db force-pushed the telemetry-3-client-management branch from 87d1e85 to 32003e9 Compare January 29, 2026 20:21
Fix TypeScript compilation error by implementing getAuthHeaders
method required by IClientContext interface.
Fix TypeScript compilation error by implementing getAuthHeaders
method required by IClientContext interface.
@samikshya-db samikshya-db force-pushed the telemetry-3-client-management branch from 2893a89 to 4df6ce0 Compare January 29, 2026 20:27
Added osArch, runtimeVendor, localeName, charSetEncoding, and
processName fields to DriverConfiguration to match JDBC implementation.
Added osArch, runtimeVendor, localeName, charSetEncoding, and
processName fields to DriverConfiguration to match JDBC implementation.
samikshya-db added a commit that referenced this pull request Feb 5, 2026
samikshya-db and others added 5 commits February 5, 2026 11:53
Feature flag fetching was also missing proxy support like telemetry
exporter was. Applied the same fix:

- Get HTTP agent from connection provider
- Pass agent to fetch call for proxy support
- Follows same pattern as CloudFetchResultHandler and DBSQLSession
- Supports http/https/socks proxies with authentication

This completes proxy support for all HTTP operations in the telemetry
system (both telemetry export and feature flag fetching).

Co-Authored-By: Claude Sonnet 4.5 <[email protected]>
Base automatically changed from telemetry-2-infrastructure to main April 21, 2026 05:26
…or infrastructure

Takes main's versions of CircuitBreaker, FeatureFlagCache, DatabricksTelemetryExporter,
MetricsAggregator, TelemetryEventEmitter, and types — bringing in SSRF hardening,
overflow protection, async flush/close, errorStack redaction, and IAuthentication-based
auth headers. Removes IClientContext.getAuthHeaders() in favour of the direct
authProvider injection pattern from main.

Co-authored-by: samikshya-chand_data
@github-actions
Copy link
Copy Markdown

Thanks for your contribution! To satisfy the DCO policy in our contributing guide every commit message must include a sign-off message. One or more of your commits is missing this message. You can reword previous commit messages with an interactive rebase (git rebase -i main).

…e, MetricsAggregator

Picks up #362 (shared connection stack + close() flush race fix) merged to main.
Resolves add/add conflicts by preferring main's versions of the three infrastructure
files, consistent with the prior merge strategy.

Co-authored-by: samikshya-chand_data
@github-actions
Copy link
Copy Markdown

Thanks for your contribution! To satisfy the DCO policy in our contributing guide every commit message must include a sign-off message. One or more of your commits is missing this message. You can reword previous commit messages with an interactive rebase (git rebase -i main).

@github-actions
Copy link
Copy Markdown

Thanks for your contribution! To satisfy the DCO policy in our contributing guide every commit message must include a sign-off message. One or more of your commits is missing this message. You can reword previous commit messages with an interactive rebase (git rebase -i main).

…race

If getOrCreateClient ran after refCount hit 0 but before clients.delete(host),
it would receive a closing TelemetryClient. Deleting synchronously first means
any concurrent caller gets a fresh instance instead.

Co-authored-by: samikshya-chand_data
@github-actions
Copy link
Copy Markdown

Thanks for your contribution! To satisfy the DCO policy in our contributing guide every commit message must include a sign-off message. One or more of your commits is missing this message. You can reword previous commit messages with an interactive rebase (git rebase -i main).

- close() made synchronous (no I/O, no reason for async); uses finally to
  guarantee this.closed=true even when logger throws
- TelemetryClientProvider.releaseClient() made synchronous for the same reason
- Misleading ConcurrentHashMap reference removed from docstring
- getRefCount/getActiveClients marked @internal (test-only surface)
- Update tests to match: rejects→throws stubs, remove await on sync calls

Co-authored-by: samikshya-chand_data
@github-actions
Copy link
Copy Markdown

Thanks for your contribution! To satisfy the DCO policy in our contributing guide every commit message must include a sign-off message. One or more of your commits is missing this message. You can reword previous commit messages with an interactive rebase (git rebase -i main).

}

// Decrement reference count
holder.refCount -= 1;
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[F5] Refcount underflow is silent; double-release swaps a live client — Severity: High

holder.refCount -= 1 then if (<= 0) has no guard. A double-release (caller finally-path bug, mismatched get/release convention) drives the count to -1 and trips the close+delete branch while another legitimate caller still holds a reference. The next getOrCreateClient(host) creates a fresh instance; the original caller is now operating on a closed object. Once [5/7] adds real HTTP, writes will silently no-op or throw on the stale reference.

Also: header comment at lines 35-37 says "The map entry is deleted before awaiting close() so a concurrent getOrCreateClient call always gets a fresh instance." — but close() is sync (TelemetryClient.ts:51). Nothing is awaited. The comment describes a concurrency model the code isn't implementing.

Fix:

if (holder.refCount <= 0) {
  logger.log(LogLevel.warn, `Unbalanced release for ${host}`);
  return;
}
holder.refCount -= 1;

And either make close() return Promise<void> now and await it after the delete, or rewrite the "before awaiting close()" comment to match the current sync semantics.

Add tests:

  • should not throw or corrupt state when release called more times than get
  • getOrCreateClient after full release returns a fresh non-closed client

Posted by code-review-squad • flagged by devils-advocate, ops, security, architecture, language, test.

* @param host The host identifier (e.g., "workspace.cloud.databricks.com")
* @returns The telemetry client for the host
*/
getOrCreateClient(host: string): TelemetryClient {
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[F6] Host used as map key without normalization — alias confusion + unbounded growth — Severity: Medium

getOrCreateClient(host) and releaseClient(host) use the raw string as the Map key. example.com, https://example.com, Example.COM, example.com:443, example.com/, example.com. (trailing dot), and zero-width-padded variants all create distinct entries for one logical host. This:

  • Defeats the stated "share clients across connections to the same host" purpose when callers are inconsistent.
  • Leaks memory under host-variant input (long-running daemons iterating over per-env warehouses).
  • Orphans holders whose refcount never reaches zero (get under one form, release under another).

No size cap either — no upper bound on how big clients can grow.

Fix:

  1. Normalize on entry (lowercase, strip protocol, strip default port, strip trailing slash/dot, reject whitespace/zero-width/invalid). buildTelemetryUrl in telemetryUtils.ts already does the parse+validate — reuse it.
  2. Use the canonical form as the map key; keep raw input only for logging.
  3. Add a bounded-size guard (e.g., warn-log once at a soft threshold of 128).

Posted by code-review-squad • flagged by security, ops.

try {
holder.client.close();
logger.log(LogLevel.debug, `Closed and removed TelemetryClient for host: ${host}`);
} catch (error: any) {
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[F9] catch (error: any) + error.message crashes on non-Error throws — Severity: Medium

throw null / throw undefined / throw "string" cause TypeError on .message access, escaping the "Swallow all exceptions per requirement" contract claimed in the comment.

Also: catch (error: any) annotation is non-idiomatic in this codebase — every other catch is untyped (see DBSQLClient.ts:253, OAuthManager.ts:133, FederationProvider.ts:106, BaseCommand.ts:19,65). tsconfig doesn't enable useUnknownInCatchVariables so the default already makes error behave as any.

Fix:

} catch (error) {
  const msg = error instanceof Error ? error.message : String(error);
  logger.log(LogLevel.debug, `Error releasing TelemetryClient: ${msg}`);
}

Same fix applies to TelemetryClient.ts:55-61.

Posted by code-review-squad • flagged by devils-advocate, security, language.

*
* Reference counts are incremented synchronously so there are no async races
* on the count itself. The map entry is deleted before awaiting close() so a
* concurrent getOrCreateClient call always gets a fresh instance.
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[F10] "deleted before awaiting close()" comment is false — close() is sync — Severity: Medium

The header comment at lines 35-37 says:

Reference counts are incremented synchronously so there are no async races on the count itself. The map entry is deleted before awaiting close() so a concurrent getOrCreateClient call always gets a fresh instance.

And the inline comment at 96-98 says:

Delete from map before awaiting close so a concurrent getOrCreateClient creates a fresh client rather than receiving this closing one.

Both describe a concurrency model the code isn't implementing: TelemetryClient.close() is sync (TelemetryClient.ts:51), and holder.client.close() is called synchronously right after this.clients.delete(host). Nothing is awaited. When close() becomes async in [5/7] (HTTP flush / abort / drain), the code will still not await — and the race the comment claims is handled will actually materialize: new getOrCreateClient returns a fresh instance while the old one's async close is still flushing.

Fix: Either

  1. Make close() return Promise<void> now; change releaseClient to async; await holder.client.close() after the map delete. Callers (including [5/7]) can be written to the final shape today.
  2. Or correct the comments to describe the current sync semantics — drop the "awaiting" language, note the async-close invariant must be re-established when close() becomes async.

Posted by code-review-squad • flagged by devils-advocate, architecture, language, maintainability.


new TelemetryClientProvider(context);

expect(logSpy.calledWith(LogLevel.debug, 'Created TelemetryClientProvider')).to.be.true;
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[F12] Tests assert on exact log strings; tautological assertion — Severity: Medium

Nine assertions in this file match exact debug-log strings (e.g., "Created new TelemetryClient for host:", "TelemetryClient reference count for", "Error releasing TelemetryClient:", etc., at roughly lines 42, 94, 104, 183, 196, 207, 221, 356-385). A cosmetic reword of any of these log lines breaks unrelated tests without catching any real regression.

Also in TelemetryClient.test.ts:116-119 there's a tautology: expect(true).to.be.true — asserts nothing about the code under test.

Fix:

  • Relax log matchers to regex: sinon.match(/created.*TelemetryClient/i) instead of exact-string matching.
  • Replace the tautology with something that exercises behavior: expect(() => client.close()).to.not.throw().

Posted by code-review-squad • flagged by test, devils-advocate, maintainability.

@vikrantpuppala
Copy link
Copy Markdown
Collaborator

Code Review Squad Report

Merge Safety Score: 70/100 — MODERATE RISK

7 parallel reviewers (security, architecture, language, ops-safety, test, maintainability, devil's advocate) reviewed PR #326 (+770 LOC vs main after PR #325 landed).

Filter applied by author: dismissed findings about dead/unused/premature code that has no current runtime impact (F1 urlUtils.ts dismissed as out-of-scope — to be raised on the downstream PR that wires it in for the Bearer-token endpoint; F2/F3/F4/F13 dismissed as "premature for this layer").

Findings posted as inline comments

High severity (1)

  • [F5]TelemetryClientProvider.ts:93 — refcount underflow is silent; double-release swaps a live client; "awaiting close()" comment is false (close is sync)

Medium severity (4)

  • [F6]TelemetryClientProvider.ts:55 — host used as map key without normalization; example.com vs Example.COM vs https://example.com all map distinctly
  • [F9]TelemetryClientProvider.ts:104catch (error: any) + error.message crashes on non-Error throws, escaping the "swallow all" contract
  • [F10]TelemetryClientProvider.ts:37 — header docstring describes async-close semantics the code doesn't implement
  • [F12]TelemetryClientProvider.test.ts:42 — tests match exact debug-log strings; tautology at TelemetryClient.test.ts:116

Low severity (8, not posted inline)

F7 no closeAll() / shutdown path — once HTTP lands, keepalive agents will block Node process exit; F8 every signal at LogLevel.debug, invisible at default info logger; F11 circular context ref not nulled on close; F14 export default X; trailing-export + constructor(private ...) auto-property nowhere else in lib/ — sibling ExceptionClassifier.ts:30 uses canonical inline style; F15 @internal JSDoc tag doesn't enforce privacy, getRefCount/getActiveClients ship to .d.ts; F16 .prettierignore coverage/ is unrelated scope creep; F18 sync close(): void in a codebase where every other close is async (DBSQLSession, DBSQLClient, DBSQLOperation); F19 try/catch around logger.log() is overkill, swallow-comment styles inconsistent (// swallow vs // Swallow all exceptions per requirement).

Verified & dismissed

  • DCO bot comments — commits need signing; not a review finding.
  • "Delete-before-close is a concurrency race" — safe today under Node single-threaded event loop with sync close(). Will become a real race when close() goes async (see F10).

Ship recommendation

Minimum patch before merge: fix F5 (refcount guard) and F10 (fix the stale "awaiting close()" comment or make close async now). F6/F9/F12 should follow in the same PR — small diffs, real risks. Lows can be bundled into the [4/7] or [5/7] wire-up PR.


Feedback? Drop it in #code-review-squad-feedback.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants