feat(spawn): readiness handshake by default (status=ready / --no-wait / --ready-timeout) #113

Merged: fujibee merged 2 commits into main from feat/spawn-handshake on Jun 14, 2026

Conversation

fujibee (Owner) commented Jun 14, 2026

Closes #108.

Stacked on #111 (#107 watermark fix). Base is fix/watch-watermark so this diff shows only the handshake changes; GitHub will retarget it to main once #111 merges.

Problem

spawn returned as soon as the terminal/pane launched, but the new agent wasn't receiving yet — it still had to boot the CLI and run actas before its watcher (watch.sh) attached. A job the leader sent in that window was lost, and the only way to tell an agent was ready was to scrape its pane. (Reported: 6 of 9 jobs dropped in a spawned-worker crew.)

Fix — readiness handshake, on by default

  • watch.sh in exclusive (actas) mode touches a readiness sentinel (run/ready.<team>__<name>) once its subscription and watermark are set — i.e. the moment it will deliver anything that arrives from here on — and removes it on exit. The sentinel is present iff a live watcher is receiving for that role. (agmsg_ready_path added to lib/actas-lock.sh, same encoding as the lock path, so spawn and watch agree with no env plumbing.)
  • spawn clears the sentinel, launches the agent, then polls for the sentinel and returns status=ready once it appears. --ready-timeout <secs> (default 90) bounds the wait; on timeout it prints status=timeout and exits 3 so the caller can re-spawn. --no-wait opts out. Codex skips the wait (no Monitor).
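
The two halves of the handshake can be sketched as follows. This is a minimal illustration, not the code in this PR: `ready_path`, `fake_watcher`, `wait_ready`, and `spawn_demo` are hypothetical names, a temporary directory stands in for `run/`, and `sleep` calls stand in for the agent's boot and watch loop. Only the sentinel naming (`ready.<team>__<name>`), the `status=ready` / `status=timeout` output, and the exit code 3 come from the description above.

```sh
#!/bin/sh
# Sketch only: every name below is illustrative, not the project's real code.
RUN_DIR="$(mktemp -d)"            # stand-in for the project's run/ directory

# Hypothetical analogue of agmsg_ready_path: one shared path derivation,
# so the spawner and the watcher agree without passing anything around.
ready_path() {                    # usage: ready_path <team> <name>
  printf '%s/ready.%s__%s\n' "$RUN_DIR" "$1" "$2"
}

# Watcher side: create the sentinel once listening, remove it on exit.
fake_watcher() {                  # usage: fake_watcher <team> <name>
  sentinel="$(ready_path "$1" "$2")"
  trap "rm -f '$sentinel'" EXIT
  sleep 1                         # stands in for CLI boot, actas, subscription setup
  touch "$sentinel"               # from here on, arriving messages are delivered
  sleep 2                         # stands in for the watch loop
}

# Spawn side: poll for the sentinel until a deadline.
wait_ready() {                    # usage: wait_ready <team> <name> <timeout-secs>
  target="$(ready_path "$1" "$2")"
  deadline=$(( $(date +%s) + $3 ))
  while [ ! -e "$target" ]; do
    if [ "$(date +%s)" -ge "$deadline" ]; then
      echo "status=timeout"
      return 3
    fi
    sleep 0.2
  done
  echo "status=ready"
}

spawn_demo() {                    # usage: spawn_demo <team> <name> <timeout-secs>
  rm -f "$(ready_path "$1" "$2")" # clear any stale sentinel first
  fake_watcher "$1" "$2" >/dev/null 2>&1 &
  wait_ready "$1" "$2" "$3"
}

spawn_demo crew worker1 10        # prints status=ready after about a second
wait                              # let the stand-in watcher finish and clean up
```

Clearing the sentinel before launch matters in this sketch: without it, a file left behind by an earlier watcher would make the poll succeed before the new watcher has attached.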

A spawned agent always starts its watcher via actas, so there is no boot-prompt or cmd-template change — readiness is simply "the watcher attached". This complements #107: #107 stops in-gap loss for restarts generally; this makes spawn readiness a positive signal instead of a race.

Tests

  • handshake: status=ready when the watcher attaches; status=timeout/exit 3 when nothing attaches; --no-wait returns immediately; codex skips the wait.
  • watch.sh: creates and removes the sentinel in actas mode; does not create one in broad mode.
  • Existing launch tests pass --no-wait (the stub env has no real watcher).

Full suite green (214).

fujibee added a commit that referenced this pull request Jun 14, 2026
Addresses aggie-co's review of #113:

- Ready sentinel now records the owner session_id, and watch.sh cleanup only
  removes it when the content is still ours. This keeps "sentinel present iff a
  live watcher is receiving" honest across a quick actas restart, where the old
  watcher's EXIT could otherwise delete a successor's freshly-written sentinel.
- session-start.sh now garbage-collects stale stream watermarks (#107) and
  readiness sentinels (#108) whose owner session_id is no longer alive — the
  SIGKILL / terminal-crash case that bypasses the EXIT trap. Both files are
  advisory (a live watcher rewrites them; spawn clears the sentinel before
  use), so this is hygiene.
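
A rough sketch of the owner-aware cleanup and the stale-file sweep described in the two points above. The helper names (`mark_ready`, `clear_ready_if_owner`, `gc_ready`) are hypothetical, the one-line file format is an assumption, and the liveness check is passed in as a command because the PR does not show how session liveness is determined.

```sh
#!/bin/sh
# Sketch only: helper names and the one-line file format are assumptions.

# Record the owning session in the sentinel instead of leaving it empty.
mark_ready() {                    # usage: mark_ready <sentinel> <session_id>
  printf '%s\n' "$2" > "$1"
}

# Remove the sentinel only if it still names this session as the owner.
clear_ready_if_owner() {          # usage: clear_ready_if_owner <sentinel> <session_id>
  [ -f "$1" ] || return 0
  if [ "$(cat "$1")" = "$2" ]; then
    rm -f "$1"
  fi
}

# Sweep sentinels whose owner is gone. <alive_cmd> is any command that
# exits 0 when the session id given as its argument is still alive.
gc_ready() {                      # usage: gc_ready <run_dir> <alive_cmd>
  for f in "$1"/ready.*; do
    [ -f "$f" ] || continue       # also skips the unexpanded glob when empty
    if ! "$2" "$(cat "$f")"; then
      rm -f "$f"
    fi
  done
}

# Demo: a quick restart where the old watcher exits after its successor wrote.
DEMO_DIR="$(mktemp -d)"
s="$DEMO_DIR/ready.crew__worker1"
mark_ready "$s" "session-old"
mark_ready "$s" "session-new"            # successor overwrites the sentinel
clear_ready_if_owner "$s" "session-old"  # old watcher's exit: leaves the file alone
cat "$s"                                 # prints session-new
```

The read-then-remove step in `clear_ready_if_owner` is not atomic, so a successor could still write between the two commands; the sketch narrows the window rather than closing it, which fits the description of both files as advisory.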

Tests: ready sentinel records the owner; cleanup leaves a successor-owned
sentinel; session-start GCs stale watermark/ready but keeps live ones.
Full suite green (217).
fujibee added 2 commits June 14, 2026 16:17
spawn returned as soon as the terminal launched, but the new agent wasn't
receiving yet — it still had to boot the CLI and run actas before its watcher
attached. A job sent in that window was lost. The only readiness signal was
scraping the pane.

spawn now BLOCKS by default until the new agent is actually listening:

- watch.sh, when it attaches in exclusive (actas) mode, touches a readiness
  sentinel (run/ready.<team>__<name>) once its subscription and watermark are
  set — i.e. once it will deliver anything that arrives from here on — and
  removes it on exit. So the sentinel is present iff a live watcher is
  receiving for that role. (agmsg_ready_path added to lib/actas-lock.sh; same
  encoding as the lock path, so spawn and watch agree with no env plumbing.)
- spawn clears the sentinel, launches, then polls for it and returns
  `status=ready`. `--ready-timeout <secs>` (default 90) bounds the wait;
  on timeout it prints `status=timeout` and exits 3 so the caller can
  re-spawn. `--no-wait` opts out. Codex skips the wait (no Monitor).
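
On the caller's side, the exit code makes a bounded retry straightforward. The sketch below is an assumption about how a leader script might use it, not part of this PR: `spawn_with_retry` and `fake_spawn` are hypothetical, `fake_spawn` stands in for the real spawn invocation, and treating exit 0 as the ready case is assumed rather than stated in the source.

```sh
#!/bin/sh
# Sketch only: spawn_with_retry and fake_spawn are hypothetical names.

# Run a spawn command, re-running it while it reports a readiness timeout.
spawn_with_retry() {              # usage: spawn_with_retry <max_attempts> <command...>
  attempts=$1; shift
  n=1
  while [ "$n" -le "$attempts" ]; do
    "$@"; rc=$?
    case $rc in
      0) return 0 ;;                                     # assumed: ready
      3) echo "attempt $n timed out; re-spawning" >&2 ;; # status=timeout
      *) return "$rc" ;;                                 # other failures: do not retry
    esac
    n=$((n + 1))
  done
  return 3
}

# Stand-in for the real spawn: times out twice, then reports ready.
COUNT_FILE="$(mktemp)"
echo 0 > "$COUNT_FILE"
fake_spawn() {
  c=$(( $(cat "$COUNT_FILE") + 1 ))
  echo "$c" > "$COUNT_FILE"
  if [ "$c" -lt 3 ]; then
    echo "status=timeout"
    return 3
  fi
  echo "status=ready"
}

spawn_with_retry 5 fake_spawn
```

Only exit code 3 triggers a retry here; any other non-zero status is passed straight back so that a genuine launch failure is not masked by repeated attempts.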

A spawned agent always starts its watcher via actas, so no boot-prompt or
cmd-template change is needed — readiness is just the watcher attaching.
Complements #107: that stops in-gap loss for restarts generally; this makes
spawn readiness a positive signal instead of a guess.

Tests: handshake ready/timeout/--no-wait/codex-skip; watch.sh creates and
removes the sentinel in actas mode and not in broad mode. Existing launch
tests pass --no-wait (no real watcher in the stub env). Full suite green.
@fujibee fujibee force-pushed the feat/spawn-handshake branch from 86b2f29 to 03b1a18 Compare June 14, 2026 23:23
@fujibee fujibee changed the base branch from fix/watch-watermark to main June 14, 2026 23:23
@fujibee fujibee merged commit 4828c9b into main Jun 14, 2026