feat(despawn): tear down spawned crew members (#109) #129

Merged
fujibee merged 2 commits into main from feat/despawn on Jun 16, 2026

fujibee (Owner) commented on Jun 15, 2026

What

Add despawn — the inverse of spawn — to tear down a spawned crew member.

  • graceful (default): send a ctrl:despawn control message to the member. Its watcher (watch.sh) sees it, drops its own role (releasing the actas lock + registration) and closes its own tmux pane, which ends the member CLI. The leader blocks until the lock is released, up to --timeout (default 30s).
  • --force: tear the member down from the placement recorded at spawn time — kill its tmux pane/window and drop its registration — for when the member's own watcher can't respond (dead watcher, or a codex member with no Monitor).
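
The leader-side wait in the graceful path can be sketched as a bounded polling loop. This is a minimal sketch, not the code in despawn.sh: the function name `wait_for_release` is hypothetical, and the actas lock is modeled as a plain file whose disappearance means "released", which may not match the real lock format. The return code 3 on timeout follows the exit code named in the Tests section.

```shell
# wait_for_release LOCK_PATH [TIMEOUT_SECS]
# Hypothetical helper: the lock is modeled as a file (an assumption).
# Returns 0 once LOCK_PATH no longer exists, 3 if it is still there
# after roughly TIMEOUT_SECS seconds (default 30).
wait_for_release() {
  local lock_path="$1" timeout="${2:-30}" waited=0
  while [ -e "$lock_path" ]; do
    # Give up once the budget is spent; the member never released its lock.
    [ "$waited" -lt "$timeout" ] || return 3
    sleep 1
    waited=$((waited + 1))
  done
  return 0
}
```

A caller could treat a return of 3 as the cue to retry with --force.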

How

  • scripts/despawn.sh: despawn.sh <team> <from> <name> [--force] [--timeout <secs>].
  • scripts/spawn.sh: records the member's tmux target id + project + type via a new agmsg_spawn_path helper at launch, so --force has a placement to act on.
  • scripts/lib/actas-lock.sh: agmsg_spawn_path placement-record path helper.
  • scripts/watch.sh: handle ctrl:despawn. Only an exclusive (actas) watcher dedicated to exactly the despawned role tears itself down. A broad-subscription watcher must skip the control message — its $TMUX_PANE is the leader's own pane, so acting on a despawn for any subscribed role would kill the leader's session. This is the core safety fix.
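
The watcher-side guard described in the last bullet comes down to one predicate: tear down only if this watcher is exclusive and bound to exactly the despawned role. The sketch below illustrates that rule under assumed names (`should_self_despawn`, `handle_ctrl_despawn`, and the mode strings "exclusive" and "broad" are all hypothetical); it prints the decision instead of touching tmux.

```shell
# should_self_despawn MODE WATCHER_ROLE TARGET_ROLE
# MODE is "exclusive" for an actas watcher bound to one role, anything
# else for a broad-subscription watcher (names are assumptions).
should_self_despawn() {
  local mode="$1" watcher_role="$2" target_role="$3"
  [ "$mode" = "exclusive" ] && [ "$watcher_role" = "$target_role" ]
}

# handle_ctrl_despawn MODE WATCHER_ROLE TARGET_ROLE
# Prints the decision only; a real watcher would act on it.
handle_ctrl_despawn() {
  if should_self_despawn "$1" "$2" "$3"; then
    echo "teardown"   # drop own role, then close own pane
  else
    echo "skip"       # broad watcher or other role: leave $TMUX_PANE alone
  fi
}
```

Keeping the decision in a pure function makes the broad-watcher regression case testable without a live tmux session.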

Command surface

despawn is wired as a first-class command, not just a script:

  • templates/cmd.claude-code.md + cmd.codex.md: a despawn handler mapping /<cmd> despawn <name> [--force] [--timeout N] to despawn.sh, plus the help line shown at join.
  • README.md: a "Tear down a spawned agent (despawn)" section next to spawn, and a cheatsheet line.
  • SKILL.md: the despawn invocation documented alongside spawn.
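
For illustration, the argument shape `despawn <name> [--force] [--timeout N]` could be parsed along the following lines. This is a sketch under assumptions, not the template's or despawn.sh's actual parser: the function name, the printed summary format, and the usage-error code 2 are invented here; only the flags and the 30s default come from the description above.

```shell
# parse_despawn_args NAME [--force] [--timeout N]
# Hypothetical parser: prints "name=<name> force=<0|1> timeout=<secs>",
# returns 2 on a usage error (the code 2 is an assumption).
parse_despawn_args() {
  local name="" force=0 timeout=30
  while [ $# -gt 0 ]; do
    case "$1" in
      --force) force=1 ;;
      --timeout)
        [ $# -ge 2 ] || { echo "missing value for --timeout" >&2; return 2; }
        timeout="$2"
        shift
        ;;
      -*) echo "unknown option: $1" >&2; return 2 ;;
      *)
        [ -z "$name" ] || { echo "unexpected argument: $1" >&2; return 2; }
        name="$1"
        ;;
    esac
    shift
  done
  [ -n "$name" ] || { echo "usage: despawn <name> [--force] [--timeout N]" >&2; return 2; }
  echo "name=$name force=$force timeout=$timeout"
}
```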

Tests

  • tests/test_despawn.bats: graceful teardown, --force placement teardown, --force error with no record, graceful timeout (exit 3), graceful no-op when no live lock, and a regression test asserting a broad watcher ignores ctrl:despawn and does not self-destruct.
  • The test harness unsets TMUX_PANE when launching the watcher under test, so the ctrl:despawn handler never targets a real pane.
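
One way to get the isolation described in the last bullet is to launch the watcher in a subshell with TMUX_PANE removed. The helper name `run_without_tmux_pane` is hypothetical and the real harness may do this differently; the point of the sketch is that the unset is scoped to the child, so the developer's own pane variable is untouched.

```shell
# run_without_tmux_pane CMD [ARGS...]
# Hypothetical helper: runs CMD in a subshell with TMUX_PANE removed
# from its environment; the caller's TMUX_PANE is left as it was.
run_without_tmux_pane() {
  ( unset TMUX_PANE; "$@" )
}
```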

Validation

  • Full suite: 249/249 green.
  • Packaging smoke (isolated HOME, clean install): despawn.sh is packaged and runs end-to-end (graceful no-op + --force no-record error paths).
  • Local end-to-end run confirmed both paths on real tmux members: graceful self-teardown (~1s, window closes, role/registration dropped) and --force (pane kill + registration drop) — the leader session survives in both cases (the bug this fixes).
  • A nested run (leader → spawn A → A spawns B → A despawns B → leader despawns A) confirmed self-teardown at every level with no orphaned panes/watchers/records and the despawning session always surviving.
  • Packaging smoke also asserts the installed command template carries the despawn handler and despawn.sh is shipped.

Closes #109.

fujibee added 2 commits June 15, 2026 11:01
Add despawn.sh, the inverse of spawn.sh:
- graceful: send ctrl:despawn to the member; its watcher drops its own
  role (releasing the actas lock) and closes its own tmux pane. Block
  until the lock is released, up to --timeout.
- --force: tear the member down from the placement recorded at spawn
  time — kill its tmux pane/window and drop its registration — for when
  the member's watcher can't respond.

spawn.sh records the member's tmux target id + project + type via the new
agmsg_spawn_path helper so --force has a placement to act on.

watch.sh: only an EXCLUSIVE (actas) watcher dedicated to exactly the
despawned role tears itself down on ctrl:despawn. A broad-subscription
watcher must skip it — its $TMUX_PANE is the leader's own pane, so acting
on a despawn for any subscribed role would kill the leader's session.

tests: unset TMUX_PANE when launching the watcher under test so the
ctrl:despawn handler never targets the developer's real pane; add a
regression test asserting a broad watcher ignores ctrl:despawn.

despawn.sh existed but was only reachable by calling the script directly.
Expose it as a first-class command:
- templates/cmd.claude-code.md, cmd.codex.md: add a 'despawn' handler that
  maps '/<cmd> despawn <name> [--force] [--timeout N]' to despawn.sh, plus the
  help line in the join summary.
- README.md: add a 'Tear down a spawned agent (despawn)' section next to spawn,
  and a cheatsheet line.
- SKILL.md: document the despawn invocation alongside spawn.
fujibee merged commit 647688a into main Jun 16, 2026
3 checks passed
fujibee deleted the feat/despawn branch June 16, 2026 00:28
This was referenced Jun 17, 2026
fujibee added a commit that referenced this pull request Jun 18, 2026
… + fresh-session launch (#41) (#148)

* feat(codex-bridge): re-arm watch-once without depending on turn/completed (#41)

Vertical slice of the Codex app-server bridge, rebuilt clean on current main
(despawn + instance-id already landed). Carries the working WebSocket-over-UDS
transport and watch-once oracle, and fixes the delivery bug.

Root cause: the bridge re-armed watch-once only from onTurnCompleted
(turn/completed | turn/failed). The real Codex app-server does not reliably
deliver those, so after the first wake the bridge started a turn and never
re-armed: it handled one message, then slept. The fake app-server in tests
sends turn/completed explicitly, which hid the bug.

Fix: route turn teardown through a single onTurnEnded() reachable from
turn/completed, thread/status idle, OR a turn watchdog (--turn-timeout, default
60s). Detection re-arms when the turn ends, so a missing completion event no
longer parks the bridge. The stale max_id guard is retained.

Tests: two regressions where the fake never sends turn/completed — one driven
by the watchdog, one by an idle status — asserting a second wakeup occurs.
Full codex-bridge + watch-once suites green (12 + 6).

* feat(codex-bridge): wire up the Codex monitor stack on main (#41)

Build the full Codex pseudo-monitor on top of the re-arm fix, ported clean onto
current main (despawn + instance-id already landed).

New scripts:
- codex-shim.sh: PATH-front entrypoint that routes interactive codex / codex
  resume launches in monitor-mode projects through codex-monitor.sh; everything
  else (--version, exec, non-monitor projects) passes through to real Codex.
- codex-monitor.sh: starts/reuses the shared app-server, launches the
  out-of-sandbox bridge launcher, then execs the TUI on the same unix socket.
- codex-bridge-launcher.sh: polls the request file and starts codex-bridge.js
  outside the Codex sandbox (a hook-launched bridge cannot connect to the
  socket from inside the sandbox).
- codex-shim-install.sh: installs/removes the ~/.agents/bin/codex shim.

Wiring grafted onto main's versions (preserving #129/#132/#128):
- session-start.sh: codex branch writes a request file (launcher mode) or
  nohup-launches the bridge. Resolves the thread id via CODEX_THREAD_ID, then
  falls back to the newest rollout whose session_meta cwd matches the project —
  fresh / "codex exec" sessions never export CODEX_THREAD_ID, so without this
  the bridge only auto-launched on the interactive resume path (launch-gap #41).
- delivery.sh: "set monitor" for codex installs the shim + prints guidance;
  rejects "both".
- install.sh: configure_codex_sandbox adds db/teams/run to Codex
  sandbox_workspace_write.writable_roots.

Tests: codex shim, session-start codex (incl. rollout fallback), delivery
monitor/both codex, install writable_roots. Full suite 314 green.

* docs(codex-bridge): document the Codex monitor beta (#41)

- templates/cmd.codex.md: offer monitor as a delivery mode for codex (the
  app-server bridge beta), default to it on first join, and accept "mode
  monitor"; keep rejecting "both".
- README.md: describe the Codex monitor beta in the Codex section and the
  intro, pointing at docs/codex-monitor-beta.md for setup and internals.

* test(codex): sandbox HOME in setup_test_env so the suite never touches real state (#41)

Running the bats suite was clobbering the developer's real shim: a couple of
tests install the codex shim (delivery set monitor codex -> codex-shim-install
writes $HOME/.agents/bin/codex) and configure $HOME/.codex/config.toml, but
setup_test_env did not sandbox HOME. The shim ended up pointing at the test's
temp scripts dir, which dangled once the temp dir was torn down. CI never caught
it (clean containers have no real ~/.agents to break).

Sandbox HOME under the per-test temp skill dir in setup_test_env. bats runs each
test in its own subshell, so the export is scoped per test and needs no restore.
test_install.bats keeps its own FAKE_HOME and is unaffected.
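
A minimal sketch of the sandboxing idea, assuming a simplified stand-in for setup_test_env: the variable name TEST_SKILL_DIR and the `home` subdirectory are hypothetical, and the real helper sets up considerably more than this.

```shell
# Simplified stand-in for the suite's setup_test_env (names are assumptions).
setup_test_env() {
  TEST_SKILL_DIR="$(mktemp -d)"         # hypothetical per-test temp skill dir
  export HOME="$TEST_SKILL_DIR/home"    # $HOME writes now land in the sandbox
  mkdir -p "$HOME"
}
```

Because the export happens inside the per-test subshell, the caller's HOME is unchanged once the test ends.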

* feat(codex): flag missing PATH / Node when enabling monitor mode (#41)

The two silent first-run failures for the Codex monitor beta were a shim that
isn't on PATH and a missing Node — in both cases the bridge just never starts.
`delivery set monitor codex` now surfaces them at enable time:

- If ~/.agents/bin is not on PATH, print a loud WARNING that the shim won't
  intercept `codex` and the bridge will not engage (this is the #1 cause of
  "I turned monitor on and nothing happens").
- If Node is missing, print a WARNING that the bridge needs Node. Presence only,
  no version gate: the bridge uses old/stable APIs (one optional-chaining call,
  Node 14+) and any Node new enough to run Codex runs it. AGMSG_CODEX_NODE
  overrides the checked binary (and makes the path testable).

Tests for both warnings.
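
The two enable-time checks can be sketched as follows. The function names and warning wording are illustrative rather than what delivery.sh prints; AGMSG_CODEX_NODE is the override named above, and the shim directory is passed in as an argument to keep the sketch testable.

```shell
# path_contains DIR: succeeds if DIR is an exact entry of $PATH.
path_contains() {
  case ":$PATH:" in
    *":$1:"*) return 0 ;;
    *) return 1 ;;
  esac
}

# warn_monitor_prereqs SHIM_DIR
# Hypothetical helper: prints one WARNING line per missing prerequisite.
# Presence checks only; no Node version gate.
warn_monitor_prereqs() {
  local shim_dir="$1" node_bin="${AGMSG_CODEX_NODE:-node}"
  if ! path_contains "$shim_dir"; then
    echo "WARNING: $shim_dir is not on PATH; the shim will not intercept codex"
  fi
  if ! command -v "$node_bin" >/dev/null 2>&1; then
    echo "WARNING: $node_bin not found; the bridge needs Node"
  fi
}
```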

* docs(codex-bridge): fix the flow diagram — hook writes a request file, launcher starts the bridge (#41)

The diagram and prose claimed the SessionStart hook starts codex-bridge.js
directly. That is the broken in-sandbox path (socket connect EPERM). Correct it:
the hook resolves the thread (CODEX_THREAD_ID or newest matching rollout) and
writes a request file; codex-monitor.sh's out-of-sandbox launcher reads it and
starts the bridge, which connects over WebSocket-over-UDS. Add the re-arm loop
and turn serialization.

* feat(codex): emphasize the monitor beta and clean up on mode off (#41)

Make the experimental nature and the PATH cost of the Codex monitor beta hard to
miss, and make turning it off actually tear things down:

- cmd.codex.md: move monitor to the LAST delivery option (3, "BETA, advanced"),
  default to turn, and spell out that it installs a PATH shim that intercepts
  the `codex` command — opt in only if you understand PATH precedence.
- README + docs/codex-monitor-beta.md: prominent warning blocks (changes how
  Codex starts, ~/.agents/bin must be first on PATH, known rough edges #149/#150).
- delivery.sh: `set off codex` now stops the project's bridge process(es) and
  removes their run files (the manual counterpart to the not-yet-wired auto
  teardown, #149), and tells the user the shim is global — how to remove it and
  restore PATH if no other project uses monitor. Leaves the shared app-server
  and shim in place. Test included.

* feat(codex): tell users to restart Codex for monitor to take effect (#151)

Enabling mode monitor only launches the bridge on the next Codex SessionStart, so
an already-running session (or an off->on toggle) stays unmonitored until restart.
delivery.sh, the cmd.codex.md prompt guidance, and docs now say so explicitly.
The actual in-session (re)launch is tracked in #151.

* docs(codex): clarify bridge starts on first turn, not at launch

Codex fires the SessionStart hook on the session's first turn (first
message), not the moment the TUI opens, so the monitor bridge is absent
until the user interacts once after a restart. Update the delivery.sh
enable message, the cmd.codex.md prompts, and the beta doc to say
"restart and send your first message". Note the launcher/request-file
redundancy review (#153) in the bridge mechanics.

* feat(codex): lower monitor poll interval default from 5s to 2s

The watch-once poll interval governs how fast an idle, armed bridge
detects a new unread message. Drop the default from 5s to 2s for
snappier delivery. Safe against double-injection: watch-once is one-shot
(exits on first detection), re-arm only happens after the turn ends, and
the stale-max_id guard backstops any re-detection — none of which the
interval affects. Update both the bridge default and the watch-once.sh
standalone fallback to keep them in sync.

* fix(codex): deliver deferred wake on turn end; de-dup Codex writable_roots

Two blocking issues from review:

1) codex-bridge.js: a wake observed while the resumed thread was still active
   (SessionStart fires on the first user turn) was deferred by tryStartTurn(),
   but onTurnEnded() re-armed instead of delivering it. The next watch-once
   re-observed the same unread max_id and the stale-wake guard stopped the
   bridge with exit 1 before the message was delivered. onTurnEnded() now
   delivers a pending wake via tryStartTurn() rather than re-arming.

2) install.sh: Codex writable_roots was configured by TWO copies: a legacy
   inline block (db/teams only, old closing-bracket awk) and
   configure_codex_sandbox() (db/teams/run). On a fresh install both ran, and
   the inline copy's empty-array handling emitted a leading comma, which
   combined with the function's insert to produce invalid TOML
   (`["run", , "db", "teams"]`). Removed the legacy inline duplicate and
   rewrote the function to insert after the opening '[', which is valid for
   empty/single-line/multiline arrays.

Adds regression tests for both (the bridge test fails without the fix).
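
The insert-after-'[' approach from item 2 works because TOML permits a trailing comma in arrays, so prepending `"run", ` is valid even when the array is empty. The sed-based sketch below illustrates the idea only: `add_writable_root` is a hypothetical name, it is not the code in configure_codex_sandbox(), and it does not check for an existing entry or for the enclosing table.

```shell
# add_writable_root ROOT
# Hypothetical filter: reads TOML on stdin and writes it to stdout with
# "ROOT", inserted right after the opening '[' of the writable_roots array.
# ROOT must not contain '|', '&', or backslashes (sed replacement syntax).
add_writable_root() {
  local root="$1"
  sed -E "s|^([[:space:]]*writable_roots[[:space:]]*=[[:space:]]*\[)|\1\"$root\", |"
}
```

For example, `writable_roots = ["db", "teams"]` becomes `writable_roots = ["run", "db", "teams"]`, and an empty `[]` becomes `["run", ]`, which is still valid TOML.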

Development

Successfully merging this pull request may close these issues.

feat(spawn): despawn — graceful crew teardown (ctrl:despawn) with --force fallback
