fix(controller): add timeout to RemoteMCPServer registration#1805

Merged
EItanya merged 6 commits into
kagent-dev:mainfrom
AnantKumar17:fix/remotemcpserver-registration-timeout
May 11, 2026
Conversation

@AnantKumar17

What

Fixes a bug where a single hung RemoteMCPServer registration blocks all subsequent RemoteMCPServer reconciliations for the lifetime of the controller process.

Closes #1785

Why

newHTTPClient() was returning http.DefaultClient (no timeout) when there were no custom headers, and listTools() was passing the context directly to client.Connect() and session.ListTools() with no deadline. A hung or unreachable endpoint — for example, an IPv6 address resolved in an IPv4-only cluster — would block the goroutine indefinitely, preventing any subsequent RemoteMCPServer from registering until the controller pod was restarted.
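The failure mode can be reproduced in isolation. The following is a minimal sketch, not the controller code: `hungServer`, `fetchWithTimeout`, and `demoTimeout` are hypothetical names, and a local `httptest` server that never answers stands in for the unreachable endpoint. With `http.DefaultClient` the call would block until the server gives up; with `Timeout` set, it returns an error once the deadline passes.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"time"
)

// demoTimeout is deliberately short so the example finishes quickly.
// The PR itself uses 30 seconds.
const demoTimeout = 200 * time.Millisecond

// fetchWithTimeout issues a GET with a client whose Timeout bounds the whole
// exchange (connection, headers, and body read).
func fetchWithTimeout(url string, timeout time.Duration) error {
	client := &http.Client{Timeout: timeout}
	resp, err := client.Get(url)
	if err != nil {
		return err
	}
	return resp.Body.Close()
}

// hungServer starts a local server whose handler writes nothing until the
// returned release function is called. It is a stand-in for a hung endpoint.
func hungServer() (url string, release func()) {
	done := make(chan struct{})
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		<-done // block until released
	}))
	return srv.URL, func() {
		close(done) // let the blocked handler return first
		srv.Close()
	}
}

func main() {
	url, release := hungServer()
	defer release()

	start := time.Now()
	err := fetchWithTimeout(url, demoTimeout)
	fmt.Printf("returned after ~%v with error: %v\n",
		time.Since(start).Round(100*time.Millisecond), err)
}
```
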

Changes

Single file changed: go/core/internal/controller/reconciler/reconciler.go

  • newHTTPClient(): always returns a *http.Client with Timeout: mcpRegistrationTimeout set, in both the no-headers and with-headers branches. Removes the use of http.DefaultClient.
  • upsertToolServerForRemoteMCPServer(): wraps the transport creation and tool listing in context.WithTimeout(ctx, mcpRegistrationTimeout) so the entire registration sequence is bounded. The database call (RefreshToolsForServer) intentionally uses the original ctx.
  • ReconcileKagentRemoteMCPServer(): logs registration start (url, protocol), success (url, tool count, duration), and failure (error, duration) using structured logr key-value pairs, matching existing codebase conventions.
  • mcpRegistrationTimeout: new unexported constant set to 30 * time.Second, matching DefaultTimeout used in discoverer.go.
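The bounding pattern described in the list above can be sketched as follows. This is an illustration under stated assumptions, not the reconciler itself: `listTools`, `persistTools`, `register`, and the durations are hypothetical stand-ins, and the timeout is shortened from 30s so the demo runs quickly. The remote step gets a derived context with a deadline, while the persistence step keeps the caller's original context.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

const (
	registrationTimeout = 100 * time.Millisecond // shortened for the demo
	fastRemote          = 10 * time.Millisecond  // a healthy endpoint
	hungRemote          = time.Hour              // an endpoint that never answers in time
)

// listTools stands in for the remote connect + tool listing call.
// It returns early if the context is cancelled or its deadline passes.
func listTools(ctx context.Context, delay time.Duration) ([]string, error) {
	select {
	case <-time.After(delay):
		return []string{"tool-a", "tool-b"}, nil
	case <-ctx.Done():
		return nil, ctx.Err()
	}
}

// persistTools stands in for the database write. It only checks the context.
func persistTools(ctx context.Context, tools []string) error {
	return ctx.Err()
}

// register bounds the remote step with tCtx and persists with the parent ctx.
func register(ctx context.Context, remoteDelay time.Duration) (int, error) {
	tCtx, cancel := context.WithTimeout(ctx, registrationTimeout)
	defer cancel()

	tools, err := listTools(tCtx, remoteDelay)
	if err != nil {
		return 0, fmt.Errorf("failed to list tools: %w", err)
	}
	if err := persistTools(ctx, tools); err != nil {
		return 0, fmt.Errorf("failed to persist tools: %w", err)
	}
	return len(tools), nil
}

// timedOut reports whether err wraps a context deadline error.
func timedOut(err error) bool { return errors.Is(err, context.DeadlineExceeded) }

// demo runs register against a background context.
func demo(delay time.Duration) (int, error) {
	return register(context.Background(), delay)
}

func main() {
	for _, d := range []time.Duration{fastRemote, hungRemote} {
		n, err := demo(d)
		fmt.Printf("delay=%v tools=%d timedOut=%v\n", d, n, timedOut(err))
	}
}
```

Keeping the write on the parent context means a slow remote call that finishes just inside the deadline does not leave the database step with almost no time budget.
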

Testing

go test -race -skip 'TestE2E.*' ./...

All 26 test packages pass. go vet clean. go build clean.

Copilot AI review requested due to automatic review settings May 6, 2026 11:49
@github-actions github-actions Bot added the bug Something isn't working label May 6, 2026

Copilot AI left a comment

Pull request overview

This PR fixes a controller reliability bug where a hung RemoteMCPServer registration could block all subsequent RemoteMCPServer reconciliations by ensuring the registration workflow is time-bounded and observable via structured logs.

Changes:

  • Adds a 30s registration timeout and applies it to MCP transport creation + tool listing.
  • Ensures newHTTPClient() always returns an *http.Client with a timeout (no longer returns http.DefaultClient).
  • Adds structured logr entries for registration start, success, and failure with duration.

Comment on lines 50 to 57
// mcpRegistrationTimeout is the deadline applied to each RemoteMCPServer
// registration attempt (header resolution + MCP connect + tool listing).
// A hung or unreachable endpoint is bounded to this duration, ensuring the
// reconciler goroutine is always released and does not block subsequent
// RemoteMCPServer reconciliations.
mcpRegistrationTimeout = 30 * time.Second
)

Author

Added a remoteMCPRegistrationTimeout() helper that returns spec.timeout when set and falls back to the 30s default when the field is nil. The constant is now just the fallback value.
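A minimal sketch of what such a helper might look like, assuming simplified stand-in types: the real CRD types are not shown in this thread, so `RemoteMCPServer`, `RemoteMCPServerSpec`, the `*time.Duration` field, `withTimeout`, and `customTimeout` here are illustrative only.

```go
package main

import (
	"fmt"
	"time"
)

// mcpRegistrationTimeout is the fallback used when no per-resource value is set.
const mcpRegistrationTimeout = 30 * time.Second

// customTimeout is an arbitrary per-resource value used in the demo.
const customTimeout = 5 * time.Second

// RemoteMCPServerSpec is a simplified stand-in for the CRD spec (assumption).
type RemoteMCPServerSpec struct {
	Timeout *time.Duration
}

// RemoteMCPServer is a simplified stand-in for the CRD type (assumption).
type RemoteMCPServer struct {
	Spec RemoteMCPServerSpec
}

// remoteMCPRegistrationTimeout returns the configured timeout, or the
// default when the server or its timeout field is nil.
func remoteMCPRegistrationTimeout(s *RemoteMCPServer) time.Duration {
	if s == nil || s.Spec.Timeout == nil {
		return mcpRegistrationTimeout
	}
	return *s.Spec.Timeout
}

// withTimeout builds a server with an explicit timeout (demo helper).
func withTimeout(d time.Duration) *RemoteMCPServer {
	return &RemoteMCPServer{Spec: RemoteMCPServerSpec{Timeout: &d}}
}

func main() {
	fmt.Println(remoteMCPRegistrationTimeout(nil))                        // 30s
	fmt.Println(remoteMCPRegistrationTimeout(&RemoteMCPServer{}))         // 30s
	fmt.Println(remoteMCPRegistrationTimeout(withTimeout(customTimeout))) // 5s
}
```
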

Comment on lines +983 to +990
// Bound the entire registration sequence (header resolution + MCP connect +
// tool listing) to mcpRegistrationTimeout so that a hung or unreachable
// endpoint cannot block this goroutine — and therefore all subsequent
// RemoteMCPServer reconciliations — indefinitely.
tCtx, cancel := context.WithTimeout(ctx, mcpRegistrationTimeout)
defer cancel()

tsp, err := a.createMcpTransport(tCtx, remoteMcpServer)

Author

Fixed - upsertToolServerForRemoteMCPServer now calls context.WithTimeout(ctx, remoteMCPRegistrationTimeout(remoteMcpServer)) so operators can tune the deadline per resource via .spec.timeout without any code changes.

Comment on lines 1037 to 1047
// go-sdk does not have a WithHeaders option when initializing transport
// so we need to create a custom HTTP client that adds headers to all requests.
func newHTTPClient(headers map[string]string) *http.Client {
if len(headers) == 0 {
- return http.DefaultClient
+ return &http.Client{
+ Timeout: mcpRegistrationTimeout,
+ }
}
return &http.Client{
Timeout: mcpRegistrationTimeout,
Transport: &headerTransport{

Author

Updated newHTTPClient to accept an explicit timeout time.Duration parameter. createMcpTransport derives it from remoteMCPRegistrationTimeout(s) and passes it in, so the HTTP client timeout matches the CRD configuration in both branches.
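As a rough illustration of the shape described in this reply, the sketch below passes the timeout in explicitly and applies it in both branches. The `headerTransport` fields, the `RoundTrip` body, the `X-Demo-Token` header, and `demoTimeout` are assumptions made for the example; only the `newHTTPClient` name and the idea of a header-injecting transport come from the thread.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"time"
)

// demoTimeout is an arbitrary value for the example.
const demoTimeout = 2 * time.Second

// headerTransport adds a fixed set of headers to every outgoing request.
// The field names are illustrative, not taken from the PR.
type headerTransport struct {
	headers map[string]string
	base    http.RoundTripper
}

// RoundTrip clones the request before mutating it, since a RoundTripper
// should not modify the caller's request.
func (t *headerTransport) RoundTrip(req *http.Request) (*http.Response, error) {
	clone := req.Clone(req.Context())
	for k, v := range t.headers {
		clone.Header.Set(k, v)
	}
	return t.base.RoundTrip(clone)
}

// newHTTPClient always sets Timeout, with or without custom headers.
func newHTTPClient(headers map[string]string, timeout time.Duration) *http.Client {
	if len(headers) == 0 {
		return &http.Client{Timeout: timeout}
	}
	return &http.Client{
		Timeout:   timeout,
		Transport: &headerTransport{headers: headers, base: http.DefaultTransport},
	}
}

func main() {
	// Local server that echoes back the header it received.
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprint(w, r.Header.Get("X-Demo-Token"))
	}))
	defer srv.Close()

	client := newHTTPClient(map[string]string{"X-Demo-Token": "secret"}, demoTimeout)
	resp, err := client.Get(srv.URL)
	if err != nil {
		fmt.Println("request failed:", err)
		return
	}
	defer resp.Body.Close()
	body, _ := io.ReadAll(resp.Body)
	fmt.Printf("server saw header %q; client timeout %v\n", body, client.Timeout)
}
```
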

Comment on lines +983 to +995
// Bound the entire registration sequence (header resolution + MCP connect +
// tool listing) to mcpRegistrationTimeout so that a hung or unreachable
// endpoint cannot block this goroutine — and therefore all subsequent
// RemoteMCPServer reconciliations — indefinitely.
tCtx, cancel := context.WithTimeout(ctx, mcpRegistrationTimeout)
defer cancel()

tsp, err := a.createMcpTransport(tCtx, remoteMcpServer)
if err != nil {
return nil, fmt.Errorf("failed to create client for toolServer %s: %w", toolServer.Name, err)
}

- tools, err := a.listTools(ctx, tsp, toolServer)
+ tools, err := a.listTools(tCtx, tsp, toolServer)

Author

Added TestRemoteMCPRegistrationTimeout (covers nil server, nil spec.timeout, and a custom value) and TestNewHTTPClient (covers nil/empty/with-headers - all assert the correct timeout and transport type). These are in reconciler_test.go in the same package.

@jmhbh jmhbh left a comment

Changes LGTM overall, just have one nit.

// .spec.timeout is not set. A hung or unreachable endpoint is bounded to this
// duration, ensuring the reconciler goroutine is always released and does not
// block subsequent RemoteMCPServer reconciliations.
mcpRegistrationTimeout = 30 * time.Second

nit: I would almost prefer setting this as a default value on the spec.Timeout field in remotemcpserver_types.go instead, so it's more explicit.

Author

Done - added // +kubebuilder:default="30s" on Timeout in remotemcpserver_types.go and regenerated both the base CRD and the Helm CRD template.
The Go fallback in remoteMCPRegistrationTimeout() is kept for existing resources that were created before this default takes effect (nil Timeout at admission time).
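One way to see why a Go-side fallback can still matter: the `+kubebuilder:default` marker is only a comment that the CRD generator turns into a schema default, so plain Go decoding never applies it. The sketch below assumes a local `Duration` type standing in for the Kubernetes duration wrapper, and an illustrative `url` field and `decodeSpec` helper. An object that arrives without `timeout` decodes to a nil pointer.

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// Duration is a local stand-in for a duration type that is serialized
// as a string such as "30s" (assumption for this sketch).
type Duration struct {
	time.Duration
}

// UnmarshalJSON parses a JSON string like "45s" into the duration.
func (d *Duration) UnmarshalJSON(b []byte) error {
	var s string
	if err := json.Unmarshal(b, &s); err != nil {
		return err
	}
	parsed, err := time.ParseDuration(s)
	if err != nil {
		return err
	}
	d.Duration = parsed
	return nil
}

// RemoteMCPServerSpec is a simplified, illustrative spec.
type RemoteMCPServerSpec struct {
	URL string `json:"url"`

	// The marker below affects only the generated CRD schema.
	// +kubebuilder:default="30s"
	// +optional
	Timeout *Duration `json:"timeout,omitempty"`
}

// decodeSpec decodes a raw JSON spec without any schema defaulting.
func decodeSpec(raw string) (RemoteMCPServerSpec, error) {
	var spec RemoteMCPServerSpec
	err := json.Unmarshal([]byte(raw), &spec)
	return spec, err
}

func main() {
	without, _ := decodeSpec(`{"url":"http://example.invalid/mcp"}`)
	with, _ := decodeSpec(`{"url":"http://example.invalid/mcp","timeout":"45s"}`)
	fmt.Println("timeout omitted, pointer is nil:", without.Timeout == nil)
	fmt.Println("timeout set:", with.Timeout.Duration)
}
```
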

Author

Hi @jmhbh ,

I've addressed your earlier comments and also got approval from EItanya.
Could you please take a look as well when you get a chance? Merging is currently blocked pending code owner approval.

Thanks!

Fixes the issue 1785

Signed-off-by: AnantKumar17 <anant3011k@gmail.com>
Signed-off-by: AnantKumar17 <anant3011k@gmail.com>
…ient

Signed-off-by: AnantKumar17 <anant3011k@gmail.com>
Signed-off-by: AnantKumar17 <anant3011k@gmail.com>
@AnantKumar17 AnantKumar17 force-pushed the fix/remotemcpserver-registration-timeout branch from b868f98 to f3f1282 Compare May 7, 2026 06:03
@AnantKumar17 AnantKumar17 requested a review from jmhbh May 8, 2026 04:13
@EItanya EItanya merged commit a056369 into kagent-dev:main May 11, 2026
23 checks passed
0xLeo258 pushed a commit to 0xLeo258/kagent that referenced this pull request May 12, 2026
…dev#1805)

Signed-off-by: AnantKumar17 <anant3011k@gmail.com>
Co-authored-by: Eitan Yarmush <eitan.yarmush@solo.io>
Linked issue: [BUG] Failure of one RemoteMCPServer to register prevents subsequent registrations