Migrating to 2.8.0 from 1.44.0 (mainly collector-related confusion) #7302
Replies: 1 comment
-
Hi @luckyycode! I'm Dosu and I'm helping the Jaeger team. There are known issues with trace sampling and visibility when migrating to Jaeger 2.x, especially with the new Go-based collector architecture and its reliance on OpenTelemetry Collector components. Your config looks structurally correct, but a few things stand out.
A few troubleshooting steps:
If you're using OpenTelemetry SDKs on the client side, be aware that adaptive/remote sampling may not work as expected due to missing attributes; you may need to use a constant sampler or adjust your collector's sampling configuration accordingly (discussion). Let me know if you see any relevant metrics or logs after making these changes, or if traces start flowing again with a simplified processor pipeline. To reply, just mention @dosu.
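For what it's worth, here is a minimal sketch of the constant-sampler fallback mentioned above, expressed as a Kubernetes pod-spec fragment using the standard OpenTelemetry SDK environment variables. The container name and collector endpoint are placeholders, not details from this thread.

```yaml
# Hypothetical pod-spec fragment: pin a parent-based always-on sampler in the
# OpenTelemetry SDK instead of relying on remote/adaptive sampling.
spec:
  containers:
    - name: my-service                          # placeholder container name
      env:
        - name: OTEL_TRACES_SAMPLER
          value: parentbased_always_on          # constant, parent-based sampling
        - name: OTEL_EXPORTER_OTLP_ENDPOINT
          value: http://jaeger-collector:4317   # placeholder OTLP/gRPC endpoint
```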
-
In 1.44 I had the default_strategy type set to 'const', and every trace was visible in my project, as I wanted.
After reading the new docs and upgrading to 2.8.0 I got weird behavior: for the first minute after starting the collector I get part of my traces, and then nothing; it just does not output any traces anymore for hours.
The client side is set up correctly: parent-based always-sample or plain always-sample samplers, over gRPC.
Here's my config. NOTE: I tried a probabilistic value of 1 (always sample), I tried queue size params and timeouts, and I tried a batch-only processor with no params. No luck; the behavior is the same as described above.
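Since the config block itself is not reproduced above, here is a minimal, hypothetical sketch of a Jaeger v2 collector pipeline of the shape described (OTLP over gRPC in, batch processor, Cassandra storage through the jaeger_storage extension). The backend name, keyspace, and exact Cassandra connection keys are assumptions and have shifted between 2.x releases, so compare against the config-cassandra.yaml shipped with your Jaeger version.

```yaml
# Hypothetical Jaeger v2 collector config sketch; not the poster's actual file.
service:
  extensions: [jaeger_storage]
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [jaeger_storage_exporter]
  telemetry:
    logs:
      level: debug                         # matches the debug-level logging mentioned below

extensions:
  jaeger_storage:
    backends:
      cassandra_main:                      # placeholder backend name
        cassandra:
          schema:
            keyspace: jaeger_v1_dc1        # assumed keyspace
          connection:
            servers: ["cassandra:9042"]    # placeholder host; key layout may differ per release

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  batch: {}                                # defaults only, as in the batch-only experiment above

exporters:
  jaeger_storage_exporter:
    trace_storage: cassandra_main          # must reference the backend defined above
```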
The jaeger-query extension runs by itself in another cluster; there are no problems with jaeger-query there.
These are the only logs I got from the jaeger process at debug level, with no errors:
Cassandra is accessible and there are no strict firewall rules. Rolling back to 1.44 makes everything work again.
The average trace rate is under 1k traces per second, and there are no memory issues.
Running in Kubernetes, with nginx-ingress for gRPC. Every port is exposed as per the docs.
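As an aside, routing OTLP/gRPC through nginx-ingress usually needs the backend-protocol annotation and TLS on the ingress. The host, secret, service name, and port below are placeholders, not details from the original post.

```yaml
# Hypothetical Ingress fragment for forwarding gRPC (OTLP) to the collector.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: jaeger-collector-grpc                          # placeholder
  annotations:
    nginx.ingress.kubernetes.io/backend-protocol: "GRPC"
spec:
  ingressClassName: nginx
  tls:
    - hosts: ["collector.example.com"]                 # placeholder host; gRPC via nginx needs TLS
      secretName: collector-tls                        # placeholder TLS secret
  rules:
    - host: collector.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: jaeger-collector                 # placeholder service
                port:
                  number: 4317                         # OTLP/gRPC port
```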
I am wondering if anybody has run into the same behavior.