fix(grok-build): bind fresh grok watchers + reap dead-grok orphans (#245)#253
Merged
Conversation
…owaway id Grok Build's `monitor` tool runs watch.sh with $GROK_SESSION_ID unset and detached from grok's process tree, so neither the env var nor the instance-id ppid walk resolves grok's session. #238 then minted a fresh throwaway id per launch -> a fresh watermark each time (replay/"start from now" gaps) and no liveness gating (bare id skips the #67 guard). Resolve grok's real session instead: find the live `grok --resume <id>` whose <id> lives in this project's grok session dir and key the watcher on <id>.<pid>. That id is stable across relaunches (same session -> same watermark/pidfile, no gap) and liveness-gated on the grok pid. Falls back to the throwaway id only when no matching grok process is found, so the watcher still starts. claude-code is unaffected (it always bakes a real CLAUDE_CODE_SESSION_ID).
) The empty-session-id grok watcher only bound a composite (liveness-gated) instance id for 'grok --resume <id>' sessions, where the id is in argv. A fresh 'grok' (no --resume) still fell back to a bare throwaway id, which skips the #67 liveness guard and so lingers forever after grok exits. - agmsg_grok_instance_id (was _resume_instance_id): also handles a fresh grok by binding to the ancestor grok process that launched the watcher, paired with the project's newest session id -> composite <id>.<grok_pid>, stable across relaunches and liveness-gated. - agmsg_reap_orphan_grok_watchers: a fresh grok watcher reaps leftover bare-id grok-build watchers whose grok has exited. Specific-PID kill only, with a strict invocation match (watch.sh is the executed program) so a process that merely mentions the strings is never killed. - grok-build monitor rule + template: forbid appending head/tail/pipe to the monitor command (a closed downstream pipe drops messages silently). - tests: composite resolution (resume + fresh), newest-session, ancestor, watcher-arg matching incl. false-positive exclusion, reaper safety, and an 8-message no-loss burst regression.
#245) Review follow-up: the resume path scanned all live `grok` via pgrep and took the first whose --resume id matched the project dir, not the watcher's own ancestor grok. With several grok sessions sharing a project, watcher B could bind to watcher A's <id>.<pid> — colliding pidfile/watermark and gating liveness on the wrong grok. Resolve via agmsg_grok_ancestor_pid first (extract --resume id from the ancestor's args, else newest session id); keep the pgrep scan only as a fallback when no ancestor is in the tree. Add a regression test for the multi-grok ancestor preference, and note the reaper matcher's whitespace-path fail-closed limitation.
watch.sh runs under `set -u`. The reaper scans `ps -eo pid=,args=`, which
includes kernel/system processes with empty args; agmsg_args_is_grok_watcher
then did ${1##*/} / ${2##*/} on unset positional params, tripping nounset and
exiting the watcher on startup before it ever armed. Guard the positional
accesses ($# checks, default-expansions) and skip empty-args rows in the
reaper loop. Add regression tests that exercise the matcher and reaper under
`bash -u` against the real process table.
Root cause of the silent 'messages never arrive' + piled-up watcher tasks was the LAUNCH METHOD, not the watcher: the grok rule said messages come from 'a background watcher you launch with the monitor tool', and 'background' lured the agent into starting watch.sh via run_terminal_command (background:true) — which logs stdout to a file and never injects it into the conversation. Only the `monitor` tool surfaces the stream. (The emit/watermark print logic is unchanged; #245's binding work did not cause this.) Harden the grok-build delivery rule + onboarding template: - drop the 'background watcher' framing; state messages are delivered by the `monitor` tool. - explicit prohibition: launch with the `monitor` tool ONLY — never run_terminal_command (with/without background:true), never a hand-rolled tail -f of a log. - a verify step: confirm a live `monitor` task named 'agmsg inbox stream' exists after launch; relaunch via the tool if not. - keep the no head/tail/pipe note.
Grok's monitor delivery is still stabilizing (the launch-method and session-binding issues this PR addresses), so flag it BETA at selection time, mirroring the codex monitor BETA label. turn mode stays the stable default.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Problem
Grok Build's
monitortool launches the agmsg watcher with an empty$GROK_SESSION_ID. Two failure modes followed:Lingering orphan watchers. An empty session id fell back to a bare
throwaway instance id. The watch.sh: self-exit when owning CC session is no longer alive #67 liveness guard only applies to composite
ids (
<sid>.<pid>), so a bare-id watcher never self-exits when its groksession dies — it lingers indefinitely (observed: a watcher still alive
3h+ after its grok had exited). A prior fix bound a composite id only for
grok --resume <id>sessions (id present in argv); a freshgrok(no--resume) still fell back to bare.Silent message loss. If the watcher's stdout consumer truncates the
stream (e.g. a
| head -5appended to the monitor command), writes pastthe consumer's close are dropped while the watermark advances past them,
so later messages are lost silently.
Fix
agmsg_grok_instance_id(generalizes the former resume-only helper):resolves a composite
<session_id>.<grok_pid>for bothgrok --resumeanda fresh
grok. For a fresh grok it binds to the grok process that is thewatcher's ancestor, paired with the project's newest session id. Composite
ids are stable across watcher relaunches (no watermark/replay gap) and
liveness-gated, so the watcher self-exits when grok dies.
agmsg_reap_orphan_grok_watchers: on startup a fresh grok watcher reapsleftover bare-id grok-build watchers whose grok has already exited.
Specific-PID kill only; a strict invocation match (watch.sh must be the
executed program, with grok-build and the project as positional args)
ensures a process that merely mentions those strings is never killed.
monitor command exactly, with no
| head/| tail/ pipe / redirectionappended (a closed downstream pipe drops messages silently).
Tests
New regression tests (bats):
ancestor resolution
mentions the strings is not matched)
All changed suites green locally (
test_instance_id.bats,test_watch.bats).Verification (local machine)
<id>.<pid>(liveness-gated on the grok pid).
whose grok has exited.
| head -5, only 5 of 8 messages weredelivered; without it, all 8 were delivered.
Closes #245.
Update: monitor-mode launch hardening
Live dogfooding surfaced that the silent "messages never arrive" reports (and
accumulating watcher tasks) were caused by the launch method, not the watcher:
the grok-build rule described delivery as "a background watcher you launch with
the
monitortool", and the word "background" led agents to startwatch.shvia
run_terminal_commandwithbackground: true. That logs the stream to afile and never injects it into the conversation — a silent failure. (The
emit/watermark print logic is unchanged; the session-binding work in this PR did
not cause it.)
This adds launch-method hardening to the grok-build delivery rule and onboarding
template:
monitortool (drops the"background watcher" framing)
monitortool only — neverrun_terminal_command(with or withoutbackground: true), and never ahand-rolled
tail -fof a logmonitortask namedagmsg inbox streamexistsafter launch, and relaunch via the tool if it is missing