
fix(grok-build): bind fresh grok watchers + reap dead-grok orphans (#245) #253

Merged. fujibee merged 6 commits into main from fix/grok-watch-session-bind on Jun 28, 2026.

fujibee (Owner) commented Jun 28, 2026

Problem

Grok Build's monitor tool launches the agmsg watcher with an empty
$GROK_SESSION_ID. Two failure modes followed:

  1. Lingering orphan watchers. An empty session id fell back to a bare
    throwaway instance id. The #67 liveness guard ("watch.sh: self-exit when
    owning CC session is no longer alive") only applies to composite
    ids (<sid>.<pid>), so a bare-id watcher never self-exits when its grok
    session dies; it lingers indefinitely (observed: a watcher still alive
    3h+ after its grok had exited). A prior fix bound a composite id only for
    grok --resume <id> sessions (id present in argv); a fresh grok (no
    --resume) still fell back to bare.

  2. Silent message loss. If the watcher's stdout consumer truncates the
    stream (e.g. a | head -5 appended to the monitor command), every write
    after the consumer closes is dropped while the watermark still advances
    past it, so the dropped messages are lost silently.
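
The loss path in item 2 can be reproduced with a small stand-in. This is a sketch only: `sketch_watcher` and the temp watermark file are hypothetical and are not the real watch.sh, which is assumed here to keep running after its reader closes.

```shell
#!/usr/bin/env bash
# Sketch only: a stand-in for a watcher loop, not the real watch.sh.
watermark_file=$(mktemp)

sketch_watcher() {
  trap '' PIPE                             # keep running after the reader closes
  local i
  for i in 1 2 3 4 5 6 7 8; do
    echo "msg $i" 2>/dev/null || true      # write may fail once head has exited
    echo "$i" > "$watermark_file"          # watermark advances regardless
  done
}

# head -5 closes the stream after five lines, like a truncating consumer.
delivered=$(sketch_watcher | head -5 | grep -c '^msg')
watermark=$(cat "$watermark_file")
echo "delivered=$delivered watermark=$watermark"   # delivered=5 watermark=8
rm -f "$watermark_file"
```

Five messages reach the consumer while the watermark records eight, which matches the 5-of-8 result reported under Verification.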

Fix

  • agmsg_grok_instance_id (generalizes the former resume-only helper):
    resolves a composite <session_id>.<grok_pid> for both grok --resume and
    a fresh grok. For a fresh grok it binds to the grok process that is the
    watcher's ancestor, paired with the project's newest session id. Composite
    ids are stable across watcher relaunches (no watermark/replay gap) and
    liveness-gated, so the watcher self-exits when grok dies.
  • agmsg_reap_orphan_grok_watchers: on startup a fresh grok watcher reaps
    leftover bare-id grok-build watchers whose grok has already exited.
    Specific-PID kill only; a strict invocation match (watch.sh must be the
    executed program, with grok-build and the project as positional args)
    ensures a process that merely mentions those strings is never killed.
  • grok-build monitor rule + onboarding template: instruct agents to pass the
    monitor command exactly, with no | head / | tail / pipe / redirection
    appended (a closed downstream pipe drops messages silently).
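
Why a composite id can be liveness-gated while a bare id cannot is easier to see in code. The helpers below are hypothetical (they are not agmsg_grok_instance_id or the #67 guard) and assume that a bare id contains no dot and that the pid is the last dot-separated field.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for illustration; not the repo's implementation.

# Print the pid part of a composite "<sid>.<pid>" id; fail for a bare id.
sketch_id_pid() {
  local id=${1-} pid
  case "$id" in
    *.*) pid=${id##*.} ;;
    *)   return 1 ;;
  esac
  case "$pid" in
    ''|*[!0-9]*) return 1 ;;               # pid part must be all digits
  esac
  printf '%s\n' "$pid"
}

# 0 = owner alive, 1 = owner gone (watcher should exit),
# 2 = bare id, so a liveness guard has no pid to check.
sketch_owner_state() {
  local pid
  pid=$(sketch_id_pid "${1-}") || return 2
  kill -0 "$pid" 2>/dev/null               # signal 0 only probes the process
}

rc=0; sketch_owner_state "abc123.$$" || rc=$?
echo "composite id bound to this shell: state $rc"
rc=0; sketch_owner_state "abc123" || rc=$?
echo "bare id: state $rc"
```

A bare id always lands in state 2, which is the case where the watcher has no owner pid to gate on and therefore never self-exits.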

Tests

New regression tests (bats):

  • composite resolution for resume and fresh grok; newest-session-dir; grok
    ancestor resolution
  • watcher-arg matching incl. false-positive exclusion (a shell that merely
    mentions the strings is not matched)
  • reaper safety (no-op / self-safe when nothing matches)
  • an 8-message burst delivered with no loss

All changed suites green locally (test_instance_id.bats, test_watch.bats).
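
For reviewers, the strict invocation match with false-positive exclusion can be sketched as follows. `sketch_is_grok_watcher` is a hypothetical stand-in for the real matcher; the argument order (`watch.sh grok-build <project>`) and the optional `bash`/`sh` interpreter word are assumptions, and the function is written to be safe under `set -u`.

```shell
#!/usr/bin/env bash
set -u

# Hypothetical matcher, not the repo's agmsg_args_is_grok_watcher.
# $1 is the project to match; the remaining words are one process's argv.
sketch_is_grok_watcher() {
  [ "$#" -ge 1 ] || return 1
  local project=$1
  shift
  [ "$#" -ge 1 ] || return 1               # empty args row: never a match
  case "${1##*/}" in
    bash|sh) shift ;;                      # allow an interpreter word first
  esac
  [ "$#" -ge 3 ] || return 1               # guard before touching $1..$3
  [ "${1##*/}" = "watch.sh" ] || return 1  # watch.sh must be the program run
  [ "$2" = "grok-build" ] || return 1
  [ "$3" = "$project" ] || return 1
}

if sketch_is_grok_watcher /proj bash /opt/agmsg/watch.sh grok-build /proj; then
  echo "real watcher invocation: matched"
fi
if ! sketch_is_grok_watcher /proj bash -c "echo watch.sh grok-build /proj"; then
  echo "shell that merely mentions the strings: not matched"
fi
```

Because the words are compared positionally, a project path containing whitespace would be split and fail to match, so the sketch fails closed rather than killing the wrong process.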

Verification (local machine)

  • Composite resolution against a live grok session yields <id>.<pid>
    (liveness-gated on the grok pid).
  • Reaper dry-run: keeps a watcher whose grok is alive, flags only watchers
    whose grok has exited.
  • Before/after for the loss path: with | head -5, only 5 of 8 messages were
    delivered; without it, all 8 were delivered.

Closes #245.

Update: monitor-mode launch hardening

Live dogfooding surfaced that the silent "messages never arrive" reports (and
accumulating watcher tasks) were caused by the launch method, not the watcher:
the grok-build rule described delivery as "a background watcher you launch with
the monitor tool", and the word "background" led agents to start watch.sh
via run_terminal_command with background: true. That logs the stream to a
file and never injects it into the conversation — a silent failure. (The
emit/watermark print logic is unchanged; the session-binding work in this PR did
not cause it.)

This adds launch-method hardening to the grok-build delivery rule and onboarding
template:

  • delivery is described as coming from the monitor tool (drops the
    "background watcher" framing)
  • explicit prohibition: launch with the monitor tool only — never
    run_terminal_command (with or without background: true), and never a
    hand-rolled tail -f of a log
  • a verify step: confirm a live monitor task named agmsg inbox stream exists
    after launch, and relaunch via the tool if it is missing

fujibee added 6 commits June 27, 2026 12:13
…owaway id

Grok Build's `monitor` tool runs watch.sh with $GROK_SESSION_ID unset and
detached from grok's process tree, so neither the env var nor the instance-id
ppid walk resolves grok's session. #238 then minted a fresh throwaway id per
launch -> a fresh watermark each time (replay/"start from now" gaps) and no
liveness gating (bare id skips the #67 guard).

Resolve grok's real session instead: find the live `grok --resume <id>` whose
<id> lives in this project's grok session dir and key the watcher on <id>.<pid>.
That id is stable across relaunches (same session -> same watermark/pidfile, no
gap) and liveness-gated on the grok pid. Falls back to the throwaway id only when
no matching grok process is found, so the watcher still starts. claude-code is
unaffected (it always bakes a real CLAUDE_CODE_SESSION_ID).
)

The empty-session-id grok watcher only bound a composite (liveness-gated)
instance id for 'grok --resume <id>' sessions, where the id is in argv. A
fresh 'grok' (no --resume) still fell back to a bare throwaway id, which
skips the #67 liveness guard and so lingers forever after grok exits.

- agmsg_grok_instance_id (was _resume_instance_id): also handles a fresh grok
  by binding to the ancestor grok process that launched the watcher, paired
  with the project's newest session id -> composite <id>.<grok_pid>, stable
  across relaunches and liveness-gated.
- agmsg_reap_orphan_grok_watchers: a fresh grok watcher reaps leftover bare-id
  grok-build watchers whose grok has exited. Specific-PID kill only, with a
  strict invocation match (watch.sh is the executed program) so a process that
  merely mentions the strings is never killed.
- grok-build monitor rule + template: forbid appending head/tail/pipe to the
  monitor command (a closed downstream pipe drops messages silently).
- tests: composite resolution (resume + fresh), newest-session, ancestor,
  watcher-arg matching incl. false-positive exclusion, reaper safety, and an
  8-message no-loss burst regression.
#245)

Review follow-up: the resume path scanned all live `grok` via pgrep and took
the first whose --resume id matched the project dir, not the watcher's own
ancestor grok. With several grok sessions sharing a project, watcher B could
bind to watcher A's <id>.<pid> — colliding pidfile/watermark and gating
liveness on the wrong grok. Resolve via agmsg_grok_ancestor_pid first (extract
--resume id from the ancestor's args, else newest session id); keep the pgrep
scan only as a fallback when no ancestor is in the tree. Add a regression test
for the multi-grok ancestor preference, and note the reaper matcher's
whitespace-path fail-closed limitation.
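
The ancestor-first resolution described in this follow-up can be sketched like this. `sketch_ancestor_pid` is hypothetical (it is not agmsg_grok_ancestor_pid) and assumes the grok process reports the command name `grok` in `ps -o comm=`.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not the repo's agmsg_grok_ancestor_pid.
# Walks the parent chain from a starting pid and prints the first
# ancestor (or the starting pid itself) whose command name matches.
sketch_ancestor_pid() {
  local want=$1 pid=${2:-$$} comm
  while [ -n "$pid" ] && [ "$pid" -gt 1 ] 2>/dev/null; do
    comm=$(ps -o comm= -p "$pid" 2>/dev/null) || return 1
    comm=${comm##*/}                       # some platforms print a full path
    if [ "$comm" = "$want" ]; then
      printf '%s\n' "$pid"
      return 0
    fi
    pid=$(ps -o ppid= -p "$pid" 2>/dev/null | tr -d ' ')
  done
  return 1
}

# Prefer the watcher's own ancestor; only then consider a wider scan.
if grok_pid=$(sketch_ancestor_pid grok); then
  echo "bind to ancestor grok pid $grok_pid"
else
  echo "no grok ancestor in the tree; fall back to a process scan"
fi
```

Binding to the ancestor rather than the first pgrep hit is what keeps two watchers in the same project from sharing one <id>.<pid>.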

watch.sh runs under `set -u`. The reaper scans `ps -eo pid=,args=`, which
includes kernel/system processes with empty args; agmsg_args_is_grok_watcher
then did ${1##*/} / ${2##*/} on unset positional params, tripping nounset and
exiting the watcher on startup before it ever armed. Guard the positional
accesses ($# checks, default-expansions) and skip empty-args rows in the
reaper loop. Add regression tests that exercise the matcher and reaper under
`bash -u` against the real process table.

Root cause of the silent 'messages never arrive' + piled-up watcher tasks was
the LAUNCH METHOD, not the watcher: the grok rule said messages come from 'a
background watcher you launch with the monitor tool', and 'background' lured the
agent into starting watch.sh via run_terminal_command (background:true) — which
logs stdout to a file and never injects it into the conversation. Only the
`monitor` tool surfaces the stream. (The emit/watermark print logic is
unchanged; #245's binding work did not cause this.)

Harden the grok-build delivery rule + onboarding template:
- drop the 'background watcher' framing; state messages are delivered by the
  `monitor` tool.
- explicit prohibition: launch with the `monitor` tool ONLY — never
  run_terminal_command (with/without background:true), never a hand-rolled
  tail -f of a log.
- a verify step: confirm a live `monitor` task named 'agmsg inbox stream'
  exists after launch; relaunch via the tool if not.
- keep the no head/tail/pipe note.

Grok's monitor delivery is still stabilizing (the launch-method and
session-binding issues this PR addresses), so flag it BETA at selection time,
mirroring the codex monitor BETA label. turn mode stays the stable default.

fujibee merged commit 8647f72 into main on Jun 28, 2026
5 checks passed
Development

Successfully merging this pull request may close these issues.

grok-build monitor: silent message loss when the watcher's stdout consumer dies (head/tail truncation + watermark advance)
