Prometheus reporter silently drops metrics with same name due to inconsistent label sets across scopes #7844

@divyam-netapp

Description

After upgrading to Cadence v1.3.6, we're seeing a large volume of warnings from the Prometheus reporter of the form:

a previously registered descriptor with the same fully-qualified name as <metric>
has different label names or a different help string

Metrics emitted from the scope that registers second are silently dropped, resulting in incomplete monitoring data.

Affected metrics include persistence_latency, persistence_requests, persistence_requests_per_shard, persistence_latency_per_shard, cache_count, cache_evict, cache_latency, and others.

Root cause: Cadence emits the same metric name from multiple scopes that carry different tag sets. For example, persistence_latency_per_shard is emitted from two scopes:

  • shardOperationsMetricsScope tagged with {operation, domain, shard, is_retry} (receives additionalTags).
  • shardOverallMetricsScope tagged with {operation, domain, shard} (does not receive additionalTags).

The additionalTags argument (e.g. is_retry) is passed in by callers but applied to only one of the two scopes.
The prometheus/client_golang library requires that all descriptors registered under the same metric name have the same label-name set. When the second scope tries to register with a different label set, registration fails, and the Tally Prometheus reporter then returns a noopMetric{}, silently discarding all data from that scope.
This pattern works with reporters that impose no registration or label-consistency requirement, but it is incompatible with the Prometheus reporter.
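The failure mode can be illustrated with a stdlib-only sketch (a stand-in for client_golang's registry, not its actual code): a registry that keys descriptors by metric name and rejects a second registration whose label-name set differs.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// registry mimics the constraint prometheus/client_golang enforces: every
// descriptor registered under one metric name must carry the same label names.
type registry struct {
	labelSets map[string]string // metric name -> canonical (sorted, joined) label names
}

func newRegistry() *registry {
	return &registry{labelSets: map[string]string{}}
}

func (r *registry) register(name string, labels []string) error {
	sorted := append([]string(nil), labels...)
	sort.Strings(sorted)
	key := strings.Join(sorted, ",")
	if prev, ok := r.labelSets[name]; ok && prev != key {
		return fmt.Errorf("a previously registered descriptor with the same fully-qualified name as %q has different label names", name)
	}
	r.labelSets[name] = key
	return nil
}

func main() {
	r := newRegistry()
	// shardOperationsMetricsScope registers first, with is_retry.
	fmt.Println(r.register("persistence_latency_per_shard",
		[]string{"operation", "domain", "shard", "is_retry"}))
	// shardOverallMetricsScope then fails: same name, but no is_retry label.
	fmt.Println(r.register("persistence_latency_per_shard",
		[]string{"operation", "domain", "shard"}))
}
```

In the real reporter the second registration's error is logged and the scope's metric becomes a no-op, which is why the data disappears rather than erroring at the call site.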

Steps to Reproduce / How to Trigger

  1. Deploy Cadence v1.3.6 with the Prometheus metrics reporter enabled.
  2. Run any workload that exercises persistence operations (e.g. workflow starts, activity completions, or timer firings).
  3. Observe warnings in the Cadence server logs matching "error in prometheus reporter" with "has different label names or a different help string".
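To gauge which metrics are affected, the warnings can be tallied by metric name. The printf lines below stand in for real server-log output (sample data, not from the issue); in practice, point grep at your Cadence server log file instead.

```shell
# Count dropped-metric warnings per metric name.
printf '%s\n' \
  '{"level":"warn","msg":"error in prometheus reporter","error":"... fqName: \"cache_count\" ... has different label names or a different help string ..."}' \
  '{"level":"warn","msg":"error in prometheus reporter","error":"... fqName: \"cache_count\" ... has different label names or a different help string ..."}' \
  '{"level":"warn","msg":"error in prometheus reporter","error":"... fqName: \"persistence_latency\" ... has different label names or a different help string ..."}' \
  | grep 'has different label names' \
  | sed -E 's/.*fqName: \\"([a-z_]+)\\".*/\1/' \
  | sort | uniq -c | sort -rn
```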

Expected Behavior

All metrics should be successfully registered and reported to Prometheus with complete data across all scopes and label sets.

Actual Behavior

  • The Prometheus registry rejects the second (and subsequent) registration(s) of a metric name when the label-name set differs from the first.
  • The Tally reporter returns a noopMetric{} for the rejected scope, silently dropping all data from it.
  • Operational dimensions like is_retry, specific operation values, or per-shard breakdowns become invisible in Prometheus.
  • High volume of warning-level log lines.

Logs / Screenshots

{ level":"warn","ts":"2026-03-05T04:01:24.714Z","msg":"error in prometheus reporter","error":"a previously registered descriptor with the same fully-qualified name as Desc{fqName: \"cache_count\", help: \"cache_count gauge\", constLabels: {}, variableLabels: [shard_id cadence_service operation cache_type]} has different label names or a different help string","logging-call-at":"metrics.go:151"}

{"level":"warn","ts":"2026-03-05T10:13:44.267Z","msg":"error in prometheus reporter","error":"a previously registered descriptor with the same fully-qualified name as Desc{fqName: \"cache_count\", help: \"cache_count gauge\", constLabels: {}, variableLabels: [cadence_service shard_id operation cache_type]} has different label names or a different help string","logging-call-at":"metrics.go:151"}

{"level":"warn","ts":"2026-03-05T10:10:45.013Z","msg":"error in prometheus reporter","error":"a previously registered descriptor with the same fully-qualified name as Desc{fqName: \"persistence_requests_per_shard\", help: \"persistence_requests_per_shard counter\", constLabels: {}, variableLabels: [cadence_service operation shard_id]} has different label names or a different help string","logging-call-at":"metrics.go:151"}

{"level":"warn","ts":"2026-03-05T05:21:49.657Z","msg":"error in prometheus reporter","error":"a previously registered descriptor with the same fully-qualified name as Desc{fqName: \"cache_evict\", help: \"cache_evict counter\", constLabels: {}, variableLabels: [cadence_service operation shard_id]} has different label names or a different help string","logging-call-at":"metrics.go:151"}

{"level":"warn","ts":"2026-03-05T05:21:49.657Z","msg":"error in prometheus reporter","error":"a previously registered descriptor with the same fully-qualified name as Desc{fqName: \"cache_evict\", help: \"cache_evict counter\", constLabels: {}, variableLabels: [cadence_service operation shard_id]} has different label names or a different help string","logging-call-at":"metrics.go:151"}

{"level":"warn","ts":"2026-03-05T02:45:42.982Z","msg":"error in prometheus reporter","error":"a previously registered descriptor with the same fully-qualified name as Desc{fqName: \"persistence_latency\", help: \"persistence_latency summary\", constLabels: {}, variableLabels: [cadence_service operation is_retry task_category]} has different label names or a different help string","logging-call-at":"metrics.go:151"}

Environment

  • Cadence server version: v1.3.6
  • Cadence SDK language and version (if applicable): N/A
  • Cadence web version (if applicable): N/A
  • DB & version: Apache Cassandra
  • Scale: Production

Suggested Fix

Ensure all emission paths for a given metric name use the same label-name set. Where additional context such as is_retry is needed, either:

  1. Always include the label, with a default value (e.g. "" or "false") on scopes where it doesn't apply, or
  2. Use a distinct metric name for the variant (e.g. persistence_latency_per_shard_with_retry).
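Option 1 can be sketched as a small tag-normalization helper (withDefaults is a hypothetical name, not existing Cadence code): every emission path passes its tags through it, so the label-name set is identical regardless of which caller supplied which tags.

```go
package main

import "fmt"

// withDefaults returns a copy of tags in which every key from defaults is
// present, using the default value when the caller did not supply one. Running
// all emission paths through this keeps the label-name set for a metric fixed.
func withDefaults(tags, defaults map[string]string) map[string]string {
	out := make(map[string]string, len(defaults))
	for k, v := range defaults {
		out[k] = v
	}
	for k, v := range tags {
		out[k] = v
	}
	return out
}

func main() {
	// Hypothetical fixed label set for persistence_latency_per_shard.
	defaults := map[string]string{
		"operation": "", "domain": "", "shard": "", "is_retry": "false",
	}

	// A caller that knows about retries:
	fmt.Println(withDefaults(map[string]string{"operation": "GetWorkflowExecution", "is_retry": "true"}, defaults))
	// A caller that doesn't still emits the same four label names:
	fmt.Println(withDefaults(map[string]string{"operation": "GetWorkflowExecution"}, defaults))
}
```

Both calls produce tag maps with the same four keys, so Prometheus sees one consistent descriptor for the metric name.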
