Skip to content

.NET: [Breaking Change] Auto-wire ChatClient with OpenTelemetryChatClient in OpenTelemetryAgent#5750

Merged
rogerbarreto merged 9 commits into
mainfrom
copilot/fix-maf-telemetry-flow
May 13, 2026
Merged

.NET: [Breaking Change] Auto-wire ChatClient with OpenTelemetryChatClient in OpenTelemetryAgent#5750
rogerbarreto merged 9 commits into
mainfrom
copilot/fix-maf-telemetry-flow

Conversation

Copilot AI commented May 11, 2026

Copy link
Copy Markdown
Contributor

Motivation and Context

OpenTelemetryAgent instrumented only the agent-level invoke_agent span; the underlying IChatClient was untouched, so model-level chat spans, usage metrics, and Azure Monitor traces never flowed unless callers manually wrapped every chat client. End result: telemetry silently absent in Foundry/hosted agent samples even with an exporter configured.

Description

OpenTelemetryAgent now auto-wraps the inner ChatClientAgent's IChatClient with OpenTelemetryChatClient on each invocation, so chat-level telemetry flows alongside invoke_agent.

  • OpenTelemetryAgent ctors — the original OpenTelemetryAgent(AIAgent innerAgent, string? sourceName = null) constructor signature is preserved (binary-compatible) and delegates with auto-wiring enabled. A new overload OpenTelemetryAgent(AIAgent innerAgent, string? sourceName, bool autoWireChatClient) adds the opt-out. The sourceName is normalized once in the constructor (empty string is treated as OpenTelemetryConsts.DefaultSourceName) and reused by both the outer OpenTelemetryChatClient and the auto-wired inner OpenTelemetryChatClient, so agent-level and chat-level spans always land on the same ActivitySource. The opt-out flag lives only on the constructor (not on the AIAgentBuilder extension), keeping chat-client terminology out of the abstract builder surface.
  • Per-invocation wiringForwardingChatClient.GetResponseAsync / GetStreamingResponseAsync route the AgentRunOptions through a new GetRunOptionsWithChatClientWiring helper that:
    1. Short-circuits when auto-wire is disabled.
    2. Resolves the nested ChatClientAgent via InnerAgent.GetService<ChatClientAgent>() (no type assumption on InnerAgent; wrapping agents that surface a nested ChatClientAgent through GetService are supported). No-ops when the result is null.
    3. No-ops when chatClientAgent.GetService<ChatClientAgentOptions>()?.UseProvidedChatClientAsIs is true (respects the user's explicit opt-out on the chat client pipeline; ChatClientAgent.GetService already exposes ChatClientAgentOptions).
    4. No-ops when chatClientAgent.GetService<IChatClient>()?.GetService<OpenTelemetryChatClient>() is already non-null (avoids double-wrap).
    5. Otherwise clones the caller's ChatClientAgentRunOptions (or, when the caller passes a plain AgentRunOptions, creates one and copies the base properties: ContinuationToken, AllowBackgroundResponses, AdditionalProperties, ResponseFormat) and sets ChatClientFactory to chain onto any user-supplied factory rather than replacing it. The factory step also inspects the post-user-factory result via GetService(typeof(OpenTelemetryChatClient)) and skips wrapping when the chat client is already instrumented, so a user factory that itself calls UseOpenTelemetry(...) does not produce duplicate chat spans.
  • UseOpenTelemetry extension — unchanged public surface; the existing extension simply constructs an OpenTelemetryAgent (with auto-wiring on by default), so telemetry now flows automatically for existing call sites.
  • Tests — full coverage of every wiring branch in GetRunOptionsWithChatClientWiring and WrapIfNeeded: default-on (sync + streaming), explicit opt-out via the constructor (sync + streaming), UseProvidedChatClientAsIs opt-out, non-ChatClientAgent no-op, no-double-wrap when the underlying chat client is pre-instrumented, chaining of a user-supplied ChatClientFactory, no-double-wrap when the user factory itself returns an OpenTelemetry-instrumented client, plain AgentRunOptions propagation of AllowBackgroundResponses / AdditionalProperties / ResponseFormat and (separately) ContinuationToken, ChatClientAgentRunOptions clone path with no user factory (asserts ChatOptions are preserved and the caller's instance is not mutated), and null / empty sourceName normalization (Theory: both produce spans on OpenTelemetryConsts.DefaultSourceName).

Behavioral breaking change

This PR introduces a behavioral (not API) change. Existing call sites of new OpenTelemetryAgent(innerAgent) and AIAgentBuilder.UseOpenTelemetry(...) will now begin emitting an additional chat span per invocation when the inner agent is (or surfaces) a ChatClientAgent whose IChatClient is not already wrapped with OpenTelemetryChatClient.

Impact is limited to a specific subset of users:

  • Users who previously wrapped both the agent and the chat client themselves are unaffected: the chat-client OTel wrapper is detected and the auto-wire path no-ops.
  • Users who wrapped only the agent (the original silent-telemetry case this PR fixes) will start receiving the missing chat-level spans, metrics, and Azure Monitor traces. This is the intended new default.
  • Users who do not want the new chat-level spans can opt out either:
    • On the constructor: new OpenTelemetryAgent(innerAgent, sourceName: null, autoWireChatClient: false)
    • On the ChatClientAgent itself: new ChatClientAgent(chatClient, new ChatClientAgentOptions { UseProvidedChatClientAsIs = true })

Source compatibility, binary compatibility, and the UseOpenTelemetry extension surface are all preserved. Only the runtime telemetry shape changes, and only in the direction of emitting strictly more (and previously missing) signal.

Opt-out (constructor-only)

var otelAgent = new OpenTelemetryAgent(innerAgent, sourceName: null, autoWireChatClient: false);

Or via ChatClientAgentOptions:

new ChatClientAgent(chatClient, new ChatClientAgentOptions { UseProvidedChatClientAsIs = true });

Contribution Checklist

  • The code builds clean without any errors or warnings
  • The PR follows the Contribution Guidelines
  • All unit tests pass, and I have added new tests where possible
  • Is this a breaking change? Behavioral breaking change documented above. Not source/binary breaking.

Copilot AI self-assigned this May 11, 2026
Copilot AI review requested due to automatic review settings May 11, 2026 18:27
Copilot AI removed the request for review from Copilot May 11, 2026 18:27
Copilot AI requested review from Copilot and removed request for Copilot May 11, 2026 18:36
Copilot AI changed the title [WIP] Fix MAF telemetry flow by default .NET: Auto-wire ChatClient with OpenTelemetryChatClient in OpenTelemetryAgent May 11, 2026
Copilot AI requested a review from rogerbarreto May 11, 2026 18:38
Comment thread dotnet/src/Microsoft.Agents.AI/OpenTelemetryAgent.cs
Comment thread dotnet/src/Microsoft.Agents.AI/OpenTelemetryAgentBuilderExtensions.cs Outdated
Copilot AI requested review from Copilot and removed request for Copilot May 11, 2026 18:57
Copilot AI requested a review from rogerbarreto May 11, 2026 18:58
Comment thread dotnet/src/Microsoft.Agents.AI/OpenTelemetryAgent.cs Outdated
Copilot AI requested review from Copilot and removed request for Copilot May 12, 2026 07:36
Comment thread dotnet/src/Microsoft.Agents.AI/OpenTelemetryAgent.cs Outdated
Copilot AI requested a review from rogerbarreto May 12, 2026 07:37
Copilot AI requested review from Copilot and removed request for Copilot May 12, 2026 07:39
Copilot AI review requested due to automatic review settings May 12, 2026 07:42
wdhm added a commit to wdhm/hosted-triage-agent that referenced this pull request Jun 4, 2026
…plain DAC, drop dup OTel source) (#86)

Two cleanups, no behaviour additions:

1. Credential — back to `new DefaultAzureCredential()`.

   Every official sample in microsoft-foundry/foundry-samples/samples/csharp/
   hosted-agents/agent-framework/* (hello-world, simple-agent, local-tools,
   mcp-tools, etc.) does exactly this and only this:

       AIAgent agent = new AIProjectClient(projectEndpoint, new DefaultAzureCredential())
           .AsAIAgent(model: deployment, instructions: "...", ...);

   The Foundry runtime injects AZURE_CLIENT_ID + AZURE_TENANT_ID pointing at
   the per-agent Entra identity that `azd` auto-creates AND auto-grants the
   `Foundry User` role at project scope. Verified live:

       $ az role assignment list --assignee 22839df0-... --all
       Foundry User   /...accounts/ai-account-ivok5ban2cqiq/projects/ai-project-triage-agent-dev

   DAC picks up that identity via the env vars; no manual wiring needed.

   Removed:
   - The `[ModuleInitializer]` / top-level AZURE_TOKEN_CREDENTIALS=
     WorkloadIdentityCredential lock from PR #81 — we cargo-culted it
     from an AKS WIF pattern that Foundry does NOT use. App Insights
     confirms zero WIF activity in the process; the live identity is a
     classic user-assigned MI reachable via IMDS with our AZURE_CLIENT_ID.
   - The `new WorkloadIdentityCredential()` branch gated on
     AZURE_FEDERATED_TOKEN_FILE — that env var is never set on Foundry
     containers (confirmed by `azd env get-values` and by IMDS traces).

2. OpenTelemetry — drop `.AddSource("Experimental.Microsoft.Extensions.AI")`.

   Foundry's internal `Agent365Exporter` already subscribes to that source
   (Microsoft.Agents.AI.Foundry.Hosting ≥ 1.6.1, PR microsoft/agent-framework#5750).
   Registering it ourselves caused the formatter to walk the same span
   attribute set twice, hitting:

       System.ArgumentException: An item with the same key has already
       been added. Key: openai.api.type
         at Microsoft.Agents.A365.Observability.Runtime.Common.ExportFormatter.MapAttributes
         at Microsoft.Agents.A365.Observability.Runtime.Tracing.Exporters.Agent365Exporter.Export

   Once per export batch, in App Insights, on every invocation since v39.
   Async path so it didn't break requests, but it polluted the exception
   stream and risked future tighter coupling between exporter health and
   request handling.

   Kept the three sources we own end-to-end:
   - Experimental.Microsoft.Agents.AI  (invoke_agent spans from our
     .UseOpenTelemetry() wrap on each agent)
   - Azure.AI.AgentServer.Invocations  (invocations endpoint baggage)
   - TriageAgent.Tools                  (our own GitHub REST tool spans)

What this fix does NOT change:
- The 3-minute-first-attempt slowness pattern in CI. That is NOT auth-related
  (DAC matches the samples, RBAC is correct) and will be investigated next.
  Candidates: container cold-start budget, Invocations protocol path
  vs the simpler Responses path the samples use, or workflow-side retry
  cadence.

Co-authored-by: wdhm <wdhm@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

.NET Usage: [Issues, PRs], Target: .Net

Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

.NET: [Bug]: MAF telemetry not flowing by default / requires per-client manual enablement

6 participants