Skip to content

fix(runtime): drain post-success session updates before closing promp…#251

Merged
steipete merged 2 commits into
openclaw:mainfrom
logofet85-ai:fix/acp-post-success-drain-main
Apr 25, 2026
Merged

fix(runtime): drain post-success session updates before closing promp…#251
steipete merged 2 commits into
openclaw:mainfrom
logofet85-ai:fix/acp-post-success-drain-main

Conversation

@logofet85-ai

@logofet85-ai logofet85-ai commented Apr 17, 2026

Copy link
Copy Markdown
Contributor

Summary

  • Problem: runPromptTurn closed the prompt turn immediately after client.prompt() resolved on the success path. If the underlying adapter produced any late assistant_delta / tool_call / tool_call_update shortly after the RPC response, the wrapper had already synthesized the turn closure and those updates were dropped, never reaching .acp-stream.jsonl or downstream consumers.
  • Why it matters: real ACP agents (observed with a Claude Code ACP adapter) routinely stream late update events just after the prompt RPC resolves — the adapter announces intent text like "Let me write the changes" and then emits the actual tool_use on the next micro-turn. Losing those updates means tool calls and their results never reach the caller, which breaks multi-step agent runs without any visible error.
  • What changed: the success path now performs the same best-effort, bounded post-success drain that the existing timeout path already uses, via client.waitForSessionUpdatesIdle({ idleMs: 1000, timeoutMs: 5000 }). The drain is opt-in (no-op if the client does not implement it) and fully capped by the existing SESSION_REPLY_IDLE_MS / SESSION_REPLY_DRAIN_TIMEOUT_MS constants.
  • What did NOT change (scope boundary): the catch/timeout path, stopReason / source semantics, conversation-model behavior, cancel path, error propagation, and every other caller contract are untouched. No new exports, no signature changes, no new configuration.

Change Type (select all)

  • Bug fix
  • Feature
  • Refactor required for the fix
  • Docs
  • Security hardening
  • Chore/infra

Scope (select all touched areas)

  • Gateway / orchestration
  • Skills / tool execution
  • Auth / tokens
  • Memory / storage
  • API / contracts
  • UI / DX

Linked Issue/PR

  • This PR fixes a bug or regression

Root Cause (if applicable)

  • Root cause: the success path in src/runtime/engine/prompt-turn.ts returned immediately after withTimeout(promptPromise, ...) resolved, while the catch/timeout path already drained late session updates via waitForSessionUpdatesIdle. The asymmetry meant any late assistant_delta / tool_call / tool_call_update arriving after a successful RPC response was discarded because the turn was closed before it could be observed.
  • Missing detection / guardrail: no seam-level test asserted that the drain must run on the success path. The existing acp.v1.session.update.termination conformance case uses a mock agent which emits all updates synchronously before returning stopReason, so the mock path never exercised the race.
  • Contributing context (if known): observed against a Claude Code ACP adapter, which emits the final tool_use on a later tick than the one that resolves session/prompt.

Regression Test Plan (if applicable)

  • Coverage level that should have caught this:
    • Unit test
    • Seam / integration test
    • End-to-end test
    • Existing coverage already sufficient
  • Target test or file: test/integration.test.ts
  • Scenarios the tests lock in:
    • runPromptTurn: post-success drain runs before closing the turn — asserts the drain is invoked with idleMs=1000, timeoutMs=5000 before the return on the RPC-success path.
    • runPromptTurn: late session updates after successful prompt reach the drain — asserts a late async update emitted during the drain window is observed before the turn closes.
    • runPromptTurn: missing waitForSessionUpdatesIdle still returns cleanly on success — asserts backward compatibility with clients that do not implement the optional drain method.
  • Why this is the smallest reliable guardrail: the drain is a contract between runPromptTurn and the adapter client interface; asserting it at the seam avoids depending on end-to-end adapter behavior while still locking the exact invariant.
  • Existing test that already covers this (if any): none. Conformance case 004 (acp.v1.session.update.termination) does not exercise late post-success updates because the bundled mock agent emits updates synchronously before resolving session/prompt.
  • If no new test is added, why not: n/a.

User-visible / Behavior Changes

  • ACP prompt turns no longer close early when post-success session/update events arrive shortly after session/prompt resolves. Late assistant_delta, tool_call, and tool_call_update events now reach the stream before the turn is closed.
  • Worst-case added latency per successful turn: bounded by the existing constants (1 s idle window, 5 s hard cap). In practice the drain returns as soon as the adapter signals idle.

Diagram (if applicable)

Before:
  session/prompt resolved -> turn closed -> late session/update events lost

After:
  session/prompt resolved -> short idle drain (<=1s idle, <=5s hard cap)
                          -> late session/update events consumed
                          -> turn closed

Security Impact (required)

  • New permissions/capabilities? No.
  • Secrets/tokens handling changed? No.
  • New/changed network calls? No.
  • Command/tool execution surface changed? No.
  • Data access scope changed? No.

Repro + Verification

Environment

  • OS: macOS 15 (Darwin 25.3.0)
  • Runtime: Node v22.22.0 via pnpm 10.23.0
  • Harness / adapter under test: bundled tsx test/mock-agent.ts for conformance + unit-level mock client for the new seam tests
  • Integration/channel: ACP core profile (acp-core-v1)
  • Relevant config: defaults

Steps

  1. pnpm install
  2. pnpm build
  3. pnpm build:test
  4. node --test --test-name-pattern="post-success drain|late session updates after successful prompt|missing waitForSessionUpdatesIdle" dist-test/test/integration.test.js
  5. pnpm run conformance:run -- --format json --report ./conformance-artifacts/update-termination-after.json --case acp.v1.session.update.termination

Expected

  • All three runPromptTurn: tests pass.
  • Conformance case acp.v1.session.update.termination continues to pass ({ cases: 1, passed: 1, failed: 0 }).

Actual (v0.5.3 baseline, before patch)

  • runPromptTurn: post-success drain runs before closing the turn — FAIL (expected ['prompt','drain(1000/5000)'], actual ['prompt']).
  • runPromptTurn: late session updates after successful prompt reach the drain — FAIL (lateUpdateEmitted === false).
  • runPromptTurn: missing waitForSessionUpdatesIdle still returns cleanly on success — PASS (backward compat preserved).
  • Baseline conformance — PASS.

Actual (after patch)

  • All three targeted tests — PASS (3/3 in ~70 ms).
  • Baseline conformance — still PASS (cases: 1, passed: 1, failed: 0, ~250–280 ms).

Evidence

  • Failing test/log before + passing after
  • Trace/log snippets
  • Screenshot/recording
  • Perf numbers (if relevant)

Trace snippet (real Claude ACP stream, anonymized, before patch):

10:00:14  assistant_delta: " to harden the smoke_check with explicit, labeled assertions for all four acceptance"
10:00:14  assistant_delta: " conditions. Let me write the changes."
10:00:37  lifecycle:      phase=end
10:00:37  system_event:   "claude run completed."
# no tool_call was ever emitted — the turn closed before it could arrive

After patch (local run, copy-pasted from terminal):

# node --test --test-name-pattern="post-success drain|late session updates after successful prompt|missing waitForSessionUpdatesIdle" dist-test/test/integration.test.js
# tests 3
# pass 3
# fail 0
# duration_ms ~70

# pnpm run conformance:run -- --case acp.v1.session.update.termination
{ "totals": { "cases": 1, "passed": 1, "failed": 0 },
  "results": [{ "id": "acp.v1.session.update.termination", "passed": true, "durationMs": ~280 }] }

Human Verification (required)

  • Verified scenarios:
    • Successful prompt that includes a post-success late update path (unit-level mock).
    • Successful prompt when the client does not implement waitForSessionUpdatesIdle (backward compat).
    • Existing timeout/catch path was not regressed (existing acp.v1.session.update.termination conformance case).
  • Edge cases checked:
    • Drain rejecting / throwing is swallowed via .catch(() => {}) so it cannot turn a successful prompt into an error — confirmed manually by running the seam test with a rejecting waitForSessionUpdatesIdle stub.
    • Backward compatibility when the client does not implement waitForSessionUpdatesIdle — confirmed by the dedicated seam test runPromptTurn: missing waitForSessionUpdatesIdle still returns cleanly on success.
  • What was not verified:
    • No end-to-end run against a real non-mock ACP adapter inside CI; this PR relies on the seam-level tests plus the existing baseline conformance case.

Review Conversations

  • I replied to or resolved every bot review conversation I addressed in this PR.
  • I left unresolved only the conversations that still need reviewer or maintainer judgment.

Compatibility / Migration

  • Backward compatible? Yes.
  • Config/env changes? No.
  • Migration needed? No.
  • If yes, exact upgrade steps: n/a.

Risks and Mitigations

  • Risk: the short post-success drain delays the turn close by up to the existing 1 s idle / 5 s hard cap.
    • Mitigation: bounded by the same constants already used on the timeout path; no new configuration surface; can be observed via existing session update logs.
  • Risk: adapters that intentionally never signal idle could sit in the drain window up to the hard cap.
    • Mitigation: drain uses the existing SESSION_REPLY_DRAIN_TIMEOUT_MS (5 s) as an upper bound; beyond that the turn closes regardless.
  • Risk: clients without waitForSessionUpdatesIdle might be surprised by a new call on the success path.
    • Mitigation: the method is optional (?.) and all callers continue to compile; missing-method path is covered by a dedicated test.

AI Assistance

  • AI-assisted
  • Testing: fully tested locally — pnpm build, pnpm build:test, three new seam tests (before + after), and the baseline acp.v1.session.update.termination conformance case (before + after). No real-agent run was performed inside this PR.

@steipete steipete merged commit 3f0c05d into openclaw:main Apr 25, 2026
@steipete

Copy link
Copy Markdown
Contributor

Landed via maintainer fixup and rebase onto main.

  • Gate: pnpm run check passed locally (format, typecheck, lint, build, viewer type/build, coverage: 527 tests)
  • Targeted proof: post-success drain regression tests passed before full gate
  • Land commit: 3f0c05d
  • Source fixup commit before rebase: logofet85-ai@43629e2

Thanks @logofet85-ai!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants