Skip to content

fix(swarm): block downstream tasks when upstream fails#145

Merged
warren618 merged 1 commit into
HKUDS:mainfrom
omcdecor-cyber:fix/swarm-dag-gating-on-failed-upstream
May 28, 2026
Merged

fix(swarm): block downstream tasks when upstream fails#145
warren618 merged 1 commit into
HKUDS:mainfrom
omcdecor-cyber:fix/swarm-dag-gating-on-failed-upstream

Conversation

@omcdecor-cyber

Copy link
Copy Markdown
Contributor

Summary

When a task fails in the DAG, downstream tasks that declare it as a
dependency are dispatched anyway, with an empty upstream context. This is
most visible in the investment_committee preset, where a failed
risk_officer lets portfolio_manager produce a "decision" with no risk
input — clearly unsafe for any production use.

The layer-boundary _sync_run_tasks_snapshot added in #132 surfaces task
failure to polling clients sooner, but it does not gate dispatch — PM
still runs the moment the layer containing the failed risk task finishes.

Root cause

  • _execute_run flips all_succeeded = False on task failure but never
    breaks out of the layer loop or gates the next layer.
  • The downstream worker's upstream-summary loop in _execute_layer reads
    from task_summaries via if source_task_id in task_summaries — a
    failed upstream never populates that dict, so the lookup silently
    yields no upstream context. The worker then runs with whatever its
    prompt template makes of an empty {upstream_context}.

Fix

_execute_layer: before submitting each task, walk task.depends_on
and load each upstream from TaskStore. If any upstream has
status != completed (failed / blocked / cancelled / missing), mark this
task TaskStatus.blocked, record blocked_by + reason, emit
task_blocked, and skip executor.submit. Same-layer peers with no
shared upstream are unaffected — the loop continues.

_execute_run: blocked tasks are absent from layer_results. After the
result-processing loop, any layer_task_id missing from layer_results
was blocked → set all_succeeded = False so the run is finalized as
RunStatus.failed.

Blocked state cascades naturally through deeper layers: layer-3 task Z
depending on a blocked layer-2 task Y will see Y.status="blocked" != completed and block in turn.

Test plan

  • New file agent/tests/test_swarm_dag_gating.py with 3 regression
    tests, all TDD red-green verified against this diff:
    • test_failed_upstream_blocks_downstream — PM is blocked, not
      failed or completed, when risk fails
    • test_blocked_downstream_emits_task_blocked_event — events.jsonl
      contains task_blocked for PM, NOT task_started
    • test_run_marked_failed_when_downstream_blocked — run finalizes as
      RunStatus.failed
  • Full swarm test suite passes (115 tests in -k swarm, 0 regressions)
  • Manual replay against the canonical 2-task DAG (risk → pm) confirms
    events.jsonl post-fix shows task_blocked not task_started for PM,
    and PM's started_at remains None.

Diff is +181 LOC additive (no behavior change for previously-passing DAGs).

Bug
---
A failed task in layer N did NOT prevent layer N+1 tasks that declare it
as a dependency from being dispatched. The orchestration loop in
_execute_run set all_succeeded=False on failure but never gated the next
layer. The downstream worker's upstream-summary loop in _execute_layer
silently skipped the missing key (the `if source_task_id in task_summaries`
check), so the worker ran with no upstream context.

This is most visible in the investment_committee preset, where
portfolio_manager depends_on=["task-risk"]: a failed risk_officer
let PM produce a "decision" with no risk input. Reproducer attached as
tests/test_swarm_dag_gating.py shows the same pattern in 3 tasks.

The layer-boundary _sync_run_tasks_snapshot added in HKUDS#132 surfaces the
failure to polling clients sooner, but it does not gate dispatch — PM
still runs the moment the layer with the failed risk task finishes.

Fix
---
_execute_layer: before dispatch, walk task.depends_on and load each
  upstream from TaskStore. If any upstream has status != completed
  (failed / blocked / cancelled / missing), mark this task
  TaskStatus.blocked, record blocked_by + reason, emit task_blocked,
  and skip executor.submit. Same-layer peers with no shared upstream
  are unaffected (the loop continues).

_execute_run: tasks blocked in _execute_layer are absent from
  layer_results. After the result loop, any layer_task_id missing from
  layer_results was blocked -> set all_succeeded = False so the run is
  marked RunStatus.failed at finalization.

Tests
-----
3 new regression tests in tests/test_swarm_dag_gating.py (TDD red-green
verified before-and-after on this PR's diff):
  - test_failed_upstream_blocks_downstream
  - test_blocked_downstream_emits_task_blocked_event
  - test_run_marked_failed_when_downstream_blocked
Full swarm test suite: 115/115 pass (112 existing + 3 new).
@warren618 warren618 merged commit cd817f4 into HKUDS:main May 28, 2026
warren618 added a commit that referenced this pull request May 28, 2026
Follow-up to #145. The new task_blocked event was emitted without
agent_id, so the CLI live panel (cli/_legacy.py) could not attach it
to the per-agent row — blocked agents stayed visually "pending" until
the run finalized. Three small fixes:

- runtime.py: pass agent_id=task.agent_id on the task_blocked emit
- cli/_legacy.py: handle task_blocked in the live event loop
- models.py: extend SwarmEvent docstring to mention task_blocked
- test_swarm_dag_gating.py: assert agent_id is present on the event
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants