[3/7] Telemetry Client Management: TelemetryClient and Provider#326
samikshya-db wants to merge 34 commits into main
Conversation
This is part 2 of 7 in the telemetry implementation stack.

Components:
- CircuitBreaker: Per-host endpoint protection with state management
- FeatureFlagCache: Per-host feature flag caching with reference counting
- CircuitBreakerRegistry: Manages circuit breakers per host

Circuit Breaker:
- States: CLOSED (normal), OPEN (failing), HALF_OPEN (testing recovery)
- Default: 5 failures trigger OPEN, 60s timeout, 2 successes to CLOSE
- Per-host isolation prevents cascade failures
- All state transitions logged at debug level

Feature Flag Cache:
- Per-host caching with 15-minute TTL
- Reference counting for connection lifecycle management
- Automatic cache expiration and refetch
- Context removed when refCount reaches zero

Testing:
- 32 comprehensive unit tests for CircuitBreaker
- 29 comprehensive unit tests for FeatureFlagCache
- 100% function coverage, >80% line/branch coverage
- CircuitBreakerStub for testing other components

Dependencies:
- Builds on [1/7] Types and Exception Classifier
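The state machine described above can be sketched as follows. This is a minimal illustration of the CLOSED/OPEN/HALF_OPEN transitions with the stated defaults (5 failures to open, 60s timeout, 2 successes to close), not the PR's actual CircuitBreaker implementation; the class shape and method names are assumptions.

```typescript
// Illustrative circuit-breaker sketch; thresholds mirror the PR defaults,
// everything else is simplified for clarity.
type State = 'CLOSED' | 'OPEN' | 'HALF_OPEN';

class CircuitBreaker {
  private state: State = 'CLOSED';
  private failures = 0;
  private successes = 0;
  private openedAt = 0;

  constructor(
    private readonly failureThreshold = 5,
    private readonly resetTimeoutMs = 60_000,
    private readonly successThreshold = 2,
  ) {}

  getState(): State {
    // OPEN transitions to HALF_OPEN once the reset timeout elapses.
    if (this.state === 'OPEN' && Date.now() - this.openedAt >= this.resetTimeoutMs) {
      this.state = 'HALF_OPEN';
      this.successes = 0;
    }
    return this.state;
  }

  recordFailure(): void {
    // Any failure in HALF_OPEN, or hitting the threshold in CLOSED, opens the breaker.
    if (this.getState() === 'HALF_OPEN' || ++this.failures >= this.failureThreshold) {
      this.state = 'OPEN';
      this.openedAt = Date.now();
      this.failures = 0;
    }
  }

  recordSuccess(): void {
    // Enough consecutive successes in HALF_OPEN close the breaker again.
    if (this.getState() === 'HALF_OPEN' && ++this.successes >= this.successThreshold) {
      this.state = 'CLOSED';
      this.failures = 0;
    }
  }
}
```

The per-host isolation the PR describes would then come from keeping one such instance per host in a registry map.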
Implements getAuthHeaders() method for authenticated REST API requests:
- Added getAuthHeaders() to IClientContext interface
- Implemented in DBSQLClient using authProvider.authenticate()
- Updated FeatureFlagCache to fetch from connector-service API with auth
- Added driver version support for version-specific feature flags
- Replaced placeholder implementation with actual REST API calls

Co-Authored-By: Claude Sonnet 4.5 <[email protected]>

- Change feature flag endpoint to use NODEJS client type
- Fix telemetry endpoints to /telemetry-ext and /telemetry-unauth
- Update payload to match proto with system_configuration
- Add shared buildUrl utility for protocol handling

- Change payload structure to match JDBC: uploadTime, items, protoLogs
- protoLogs contains JSON-stringified TelemetryFrontendLog objects
- Remove workspace_id (JDBC doesn't populate it)
- Remove debug logs added during testing

- Fix import order in FeatureFlagCache
- Replace require() with import for driverVersion
- Fix variable shadowing
- Disable prefer-default-export for urlUtils
This is part 3 of 7 in the telemetry implementation stack.

Components:
- TelemetryClient: HTTP client for telemetry export per host
- TelemetryClientProvider: Manages per-host client lifecycle with reference counting

TelemetryClient:
- Placeholder HTTP client for telemetry export
- Per-host isolation for connection pooling
- Lifecycle management (open/close)
- Ready for future HTTP implementation

TelemetryClientProvider:
- Reference counting tracks connections per host
- Automatically creates clients on first connection
- Closes and removes clients when refCount reaches zero
- Thread-safe per-host management

Design Pattern:
- Follows JDBC driver pattern for resource management
- One client per host, shared across connections
- Efficient resource utilization
- Clean lifecycle management

Testing:
- 31 comprehensive unit tests for TelemetryClient
- 31 comprehensive unit tests for TelemetryClientProvider
- 100% function coverage, >80% line/branch coverage
- Tests verify reference counting and lifecycle

Dependencies:
- Builds on [1/7] Types and [2/7] Infrastructure
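The reference-counting pattern above can be sketched like this. Class and method names mirror the ones the PR mentions, but the bodies are a simplified illustration of the lifecycle, not the driver's actual code.

```typescript
// Illustrative sketch of per-host client sharing with reference counting.
class TelemetryClient {
  closed = false;
  constructor(readonly host: string) {}
  close(): void {
    this.closed = true;
  }
}

class TelemetryClientProvider {
  private readonly clients = new Map<string, { client: TelemetryClient; refCount: number }>();

  getOrCreateClient(host: string): TelemetryClient {
    let holder = this.clients.get(host);
    if (!holder) {
      // First connection to this host: create the shared client.
      holder = { client: new TelemetryClient(host), refCount: 0 };
      this.clients.set(host, holder);
    }
    holder.refCount += 1;
    return holder.client;
  }

  releaseClient(host: string): void {
    const holder = this.clients.get(host);
    if (!holder) return;
    holder.refCount -= 1;
    if (holder.refCount <= 0) {
      // Last connection released: remove the entry, then close the client.
      this.clients.delete(host);
      holder.client.close();
    }
  }

  getRefCount(host: string): number {
    return this.clients.get(host)?.refCount ?? 0;
  }
}
```

Two connections to the same host share one client; the client is only closed when the second one releases it.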
Fix TypeScript compilation error by implementing getAuthHeaders method required by IClientContext interface.
Added osArch, runtimeVendor, localeName, charSetEncoding, and processName fields to DriverConfiguration to match JDBC implementation.
Feature flag fetching was also missing proxy support, like the telemetry exporter was. Applied the same fix:
- Get HTTP agent from connection provider
- Pass agent to fetch call for proxy support
- Follows same pattern as CloudFetchResultHandler and DBSQLSession
- Supports http/https/socks proxies with authentication

This completes proxy support for all HTTP operations in the telemetry system (both telemetry export and feature flag fetching).

Co-Authored-By: Claude Sonnet 4.5 <[email protected]>
…or infrastructure

Takes main's versions of CircuitBreaker, FeatureFlagCache, DatabricksTelemetryExporter, MetricsAggregator, TelemetryEventEmitter, and types — bringing in SSRF hardening, overflow protection, async flush/close, errorStack redaction, and IAuthentication-based auth headers. Removes IClientContext.getAuthHeaders() in favour of the direct authProvider injection pattern from main.

Co-authored-by: samikshya-chand_data
Thanks for your contribution! To satisfy the DCO policy in our contributing guide every commit message must include a sign-off message. One or more of your commits is missing this message. You can reword previous commit messages with an interactive rebase.
…e, MetricsAggregator

Picks up #362 (shared connection stack + close() flush race fix) merged to main. Resolves add/add conflicts by preferring main's versions of the three infrastructure files, consistent with the prior merge strategy.

Co-authored-by: samikshya-chand_data
Co-authored-by: samikshya-chand_data
…race

If getOrCreateClient ran after refCount hit 0 but before clients.delete(host), it would receive a closing TelemetryClient. Deleting synchronously first means any concurrent caller gets a fresh instance instead.

Co-authored-by: samikshya-chand_data
- close() made synchronous (no I/O, no reason for async); uses finally to guarantee this.closed=true even when logger throws
- TelemetryClientProvider.releaseClient() made synchronous for the same reason
- Misleading ConcurrentHashMap reference removed from docstring
- getRefCount/getActiveClients marked @internal (test-only surface)
- Update tests to match: rejects→throws stubs, remove await on sync calls

Co-authored-by: samikshya-chand_data
```typescript
// Decrement reference count
holder.refCount -= 1;
```
[F5] Refcount underflow is silent; double-release swaps a live client — Severity: High
holder.refCount -= 1 then if (<= 0) has no guard. A double-release (caller finally-path bug, mismatched get/release convention) drives the count to -1 and trips the close+delete branch while another legitimate caller still holds a reference. The next getOrCreateClient(host) creates a fresh instance; the original caller is now operating on a closed object. Once [5/7] adds real HTTP, writes will silently no-op or throw on the stale reference.
Also: header comment at lines 35-37 says "The map entry is deleted before awaiting close() so a concurrent getOrCreateClient call always gets a fresh instance." — but close() is sync (TelemetryClient.ts:51). Nothing is awaited. The comment describes a concurrency model the code isn't implementing.
Fix:

```typescript
if (holder.refCount <= 0) {
  logger.log(LogLevel.warn, `Unbalanced release for ${host}`);
  return;
}
holder.refCount -= 1;
```

And either make close() return Promise<void> now and await it after the delete, or rewrite the "before awaiting close()" comment to match the current sync semantics.
Add tests:
- should not throw or corrupt state when release is called more times than get
- getOrCreateClient after full release returns a fresh non-closed client
Posted by code-review-squad • flagged by devils-advocate, ops, security, architecture, language, test.
```typescript
 * @param host The host identifier (e.g., "workspace.cloud.databricks.com")
 * @returns The telemetry client for the host
 */
getOrCreateClient(host: string): TelemetryClient {
```
[F6] Host used as map key without normalization — alias confusion + unbounded growth — Severity: Medium
getOrCreateClient(host) and releaseClient(host) use the raw string as the Map key. example.com, https://example.com, Example.COM, example.com:443, example.com/, example.com. (trailing dot), and zero-width-padded variants all create distinct entries for one logical host. This:
- Defeats the stated "share clients across connections to the same host" purpose when callers are inconsistent.
- Leaks memory under host-variant input (long-running daemons iterating over per-env warehouses).
- Orphans holders whose refcount never reaches zero (get under one form, release under another).
No size cap either — there is no upper bound on how large the clients map can grow.
Fix:
- Normalize on entry (lowercase, strip protocol, strip default port, strip trailing slash/dot, reject whitespace/zero-width/invalid). `buildTelemetryUrl` in `telemetryUtils.ts` already does the parse+validate — reuse it.
- Use the canonical form as the map key; keep raw input only for logging.
- Add a bounded-size guard (e.g., warn-log once at a soft threshold of 128).
Posted by code-review-squad • flagged by security, ops.
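A hypothetical normalizer for this fix might look like the following. The rules (lowercase, strip protocol, drop default port, strip trailing slash/dot, reject whitespace) come from the finding; the function name and exact behavior are illustrative, and the finding itself suggests reusing the existing `buildTelemetryUrl` parse+validate instead of writing a new helper.

```typescript
// Canonicalize a host string before using it as a map key, so that
// "Example.COM", "https://example.com:443/", and "example.com." all
// collapse to "example.com". Sketch only; not the driver's actual code.
function normalizeHost(input: string): string {
  const trimmed = input.trim();
  if (trimmed.length === 0 || /\s/.test(trimmed)) {
    throw new Error(`Invalid host: ${JSON.stringify(input)}`);
  }
  // Prepend a protocol so the WHATWG URL parser accepts bare hosts; it
  // lowercases the hostname and drops the scheme-default port for us.
  const withProtocol = /^[a-z][a-z0-9+.-]*:\/\//i.test(trimmed) ? trimmed : `https://${trimmed}`;
  const url = new URL(withProtocol);
  // url.hostname is already lowercased; strip a trailing root-label dot.
  const hostname = url.hostname.replace(/\.$/, '');
  // Keep only non-default ports in the canonical key.
  const port = url.port && url.port !== '443' ? `:${url.port}` : '';
  return hostname + port;
}
```

Using the canonical form as the map key while logging the raw input preserves debuggability without fragmenting the cache.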
```typescript
try {
  holder.client.close();
  logger.log(LogLevel.debug, `Closed and removed TelemetryClient for host: ${host}`);
} catch (error: any) {
```
[F9] catch (error: any) + error.message crashes on non-Error throws — Severity: Medium
throw null / throw undefined / throw "string" cause TypeError on .message access, escaping the "Swallow all exceptions per requirement" contract claimed in the comment.
Also: catch (error: any) annotation is non-idiomatic in this codebase — every other catch is untyped (see DBSQLClient.ts:253, OAuthManager.ts:133, FederationProvider.ts:106, BaseCommand.ts:19,65). tsconfig doesn't enable useUnknownInCatchVariables so the default already makes error behave as any.
Fix:

```typescript
} catch (error) {
  const msg = error instanceof Error ? error.message : String(error);
  logger.log(LogLevel.debug, `Error releasing TelemetryClient: ${msg}`);
}
```

Same fix applies to TelemetryClient.ts:55-61.
Posted by code-review-squad • flagged by devils-advocate, security, language.
```typescript
 *
 * Reference counts are incremented synchronously so there are no async races
 * on the count itself. The map entry is deleted before awaiting close() so a
 * concurrent getOrCreateClient call always gets a fresh instance.
```
[F10] "deleted before awaiting close()" comment is false — close() is sync — Severity: Medium
The header comment at lines 35-37 says:

> Reference counts are incremented synchronously so there are no async races on the count itself. The map entry is deleted before awaiting close() so a concurrent getOrCreateClient call always gets a fresh instance.

And the inline comment at 96-98 says:

> Delete from map before awaiting close so a concurrent getOrCreateClient creates a fresh client rather than receiving this closing one.
Both describe a concurrency model the code isn't implementing: TelemetryClient.close() is sync (TelemetryClient.ts:51), and holder.client.close() is called synchronously right after this.clients.delete(host). Nothing is awaited. When close() becomes async in [5/7] (HTTP flush / abort / drain), the code will still not await — and the race the comment claims is handled will actually materialize: new getOrCreateClient returns a fresh instance while the old one's async close is still flushing.
Fix: Either
- Make `close()` return `Promise<void>` now; change `releaseClient` to `async`; `await holder.client.close()` after the map delete. Callers (including [5/7]) can be written to the final shape today.
- Or correct the comments to describe the current sync semantics — drop the "awaiting" language, and note the async-close invariant must be re-established when `close()` becomes async.
Posted by code-review-squad • flagged by devils-advocate, architecture, language, maintainability.
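The first option's shape can be sketched as follows. This is an illustration of the delete-before-await ordering the comment claims, under the assumption that a future `close()` does real async flushing; class names and bodies are hypothetical, not the PR's code.

```typescript
// Sketch: async close() awaited only after the map entry is removed, so a
// concurrent getOrCreateClient always builds a fresh client rather than
// receiving one that is mid-close.
class AsyncTelemetryClient {
  closed = false;

  async close(): Promise<void> {
    // Future [5/7] work would flush pending HTTP batches here.
    this.closed = true;
  }
}

class AsyncProvider {
  private readonly clients = new Map<string, AsyncTelemetryClient>();

  getOrCreateClient(host: string): AsyncTelemetryClient {
    let client = this.clients.get(host);
    if (!client) {
      client = new AsyncTelemetryClient();
      this.clients.set(host, client);
    }
    return client;
  }

  async releaseClient(host: string): Promise<void> {
    const client = this.clients.get(host);
    if (!client) return;
    // Delete synchronously first: any caller that runs between here and the
    // await below sees an empty slot and gets a fresh instance.
    this.clients.delete(host);
    await client.close();
  }
}
```

With this ordering the header comment becomes literally true, and the race the finding describes cannot return when close() gains real I/O.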
```typescript
new TelemetryClientProvider(context);
// ...
expect(logSpy.calledWith(LogLevel.debug, 'Created TelemetryClientProvider')).to.be.true;
```
[F12] Tests assert on exact log strings; tautological assertion — Severity: Medium
Nine assertions in this file match exact debug-log strings (e.g., "Created new TelemetryClient for host:", "TelemetryClient reference count for", "Error releasing TelemetryClient:", etc., at roughly lines 42, 94, 104, 183, 196, 207, 221, 356-385). A cosmetic reword of any of these log lines breaks unrelated tests without catching any real regression.
Also in TelemetryClient.test.ts:116-119 there's a tautology: expect(true).to.be.true — asserts nothing about the code under test.
Fix:
- Relax log matchers to regex: `sinon.match(/created.*TelemetryClient/i)` instead of exact-string matching.
- Replace the tautology with something that exercises behavior: `expect(() => client.close()).to.not.throw()`.
Posted by code-review-squad • flagged by test, devils-advocate, maintainability.
Code Review Squad Report

Merge Safety Score: 70/100 — MODERATE RISK

7 parallel reviewers (security, architecture, language, ops-safety, test, maintainability, devil's advocate) reviewed PR #326 (+770 LOC vs main after PR #325 landed). Filter applied by author: dismissed findings about dead/unused/premature code that has no current runtime impact (F1

Findings posted as inline comments

High severity (1)
Medium severity (4)
Low severity (8, not posted inline)
Verified & dismissed

Ship recommendation

Minimum patch before merge: fix F5 (refcount guard) and F10 (fix the stale "awaiting close()" comment or make close async now). F6/F9/F12 should follow in the same PR — small diffs, real risks. Lows can be bundled into the [4/7] or [5/7] wire-up PR.

Feedback? Drop it in #code-review-squad-feedback.
[3/7] Telemetry Client Management
Part of the 7-part telemetry stack. Depends on #324, #325, and #362.
What this PR adds
TelemetryClient (lib/telemetry/TelemetryClient.ts): per-host client with idempotent close() and debug-level logging.
TelemetryClientProvider (lib/telemetry/TelemetryClientProvider.ts): reference-counted manager — one TelemetryClient per host, created on first use and closed when the last connection releases it.
urlUtils.ts (lib/telemetry/urlUtils.ts): builds a full URL from host + path, adding https:// when no protocol is present.
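The urlUtils behavior described above can be sketched like this. The real helper lives in lib/telemetry/urlUtils.ts; this reproduction is illustrative, and the function name `buildUrl` plus the slash-normalization detail are assumptions beyond what the description states.

```typescript
// Sketch: build a full URL from host + path, prepending https:// only when
// the host carries no protocol of its own.
function buildUrl(host: string, path: string): string {
  const hasProtocol = /^[a-z][a-z0-9+.-]*:\/\//i.test(host);
  const base = hasProtocol ? host : `https://${host}`;
  // Normalize the slash boundary so "host/" + "/path" yields a single slash.
  return `${base.replace(/\/+$/, '')}/${path.replace(/^\/+/, '')}`;
}
```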
Sync with main
Merged onto main twice to pick up the full infrastructure from #325 (CircuitBreaker, FeatureFlagCache, DatabricksTelemetryExporter, MetricsAggregator, TelemetryEventEmitter, types) and the shared-connection-stack + close() flush-race fix from #362. Where both branches added the same file, main's version was preferred.
Tests
39 unit tests for TelemetryClient and TelemetryClientProvider; all 254 telemetry tests pass.