fix(swarm): block downstream tasks when upstream fails by omcdecor-cyber · Pull Request #145 · HKUDS/Vibe-Trading

omcdecor-cyber · 2026-05-28T04:07:00Z

Summary

When a task fails in the DAG, downstream tasks that declare it as a
dependency are dispatched anyway, with an empty upstream context. This is
most visible in the investment_committee preset, where a failed
risk_officer lets portfolio_manager produce a "decision" with no risk
input — clearly unsafe for any production use.

The layer-boundary _sync_run_tasks_snapshot added in #132 surfaces task
failure to polling clients sooner, but it does not gate dispatch — PM
still runs the moment the layer containing the failed risk task finishes.

Root cause

- _execute_run flips all_succeeded = False on task failure but never
  breaks out of the layer loop or gates the next layer.
- The downstream worker's upstream-summary loop in _execute_layer reads
  from task_summaries via if source_task_id in task_summaries. A
  failed upstream never populates that dict, so the lookup silently
  yields no upstream context. The worker then runs with whatever its
  prompt template makes of an empty {upstream_context}.
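
A minimal sketch of the silent skip described above. Only task_summaries, source_task_id and the membership check come from this PR; the helper name build_upstream_context and the prompt text are hypothetical and do not reflect the real _execute_layer body.

```python
def build_upstream_context(depends_on, task_summaries):
    """Mimic the pre-fix lookup: upstream ids with no summary are skipped silently."""
    parts = []
    for source_task_id in depends_on:
        # A failed upstream never wrote a summary, so this branch is never taken
        # and no error is raised either.
        if source_task_id in task_summaries:
            parts.append(f"[{source_task_id}] {task_summaries[source_task_id]}")
    return "\n".join(parts)


# The risk task failed, so task_summaries has no entry for it.
task_summaries = {}
upstream_context = build_upstream_context(["task-risk"], task_summaries)

# Hypothetical prompt template: the worker still gets a prompt, just with no risk input.
prompt = f"Decide the allocation.\nUpstream findings:\n{upstream_context}"
```

The point of the sketch is that nothing fails loudly: the downstream prompt is built and dispatched with an empty upstream section.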

Fix

_execute_layer: before submitting each task, walk task.depends_on
and load each upstream from TaskStore. If any upstream has
status != completed (failed / blocked / cancelled / missing), mark this
task TaskStatus.blocked, record blocked_by + reason, emit
task_blocked, and skip executor.submit. Same-layer peers with no
shared upstream are unaffected — the loop continues.
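
The gating step can be sketched roughly as follows. This is an illustration under assumptions, not the code in this diff: the Task dataclass, the dict standing in for TaskStore, the gate_layer helper and the event dict shape are all hypothetical, and blocked_by is modelled as a single id for brevity.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional


class TaskStatus(str, Enum):
    pending = "pending"
    completed = "completed"
    failed = "failed"
    blocked = "blocked"
    cancelled = "cancelled"


@dataclass
class Task:
    id: str
    depends_on: list = field(default_factory=list)
    status: TaskStatus = TaskStatus.pending
    blocked_by: Optional[str] = None
    reason: Optional[str] = None


def gate_layer(layer, store, events):
    """Return the tasks that may be submitted; mark the others blocked."""
    runnable = []
    for task in layer:
        blocker = None
        for upstream_id in task.depends_on:
            upstream = store.get(upstream_id)  # None models a missing upstream
            if upstream is None or upstream.status != TaskStatus.completed:
                blocker = upstream_id
                break
        if blocker is not None:
            task.status = TaskStatus.blocked
            task.blocked_by = blocker
            task.reason = f"upstream {blocker} did not complete"
            events.append({"type": "task_blocked", "task_id": task.id})
            continue  # skip submission, but keep evaluating same-layer peers
        runnable.append(task)
    return runnable


store = {
    "task-risk": Task("task-risk", status=TaskStatus.failed),
    "task-macro": Task("task-macro", status=TaskStatus.completed),
}
pm = Task("task-pm", depends_on=["task-risk"])
analyst = Task("task-analyst", depends_on=["task-macro"])
events = []
runnable = gate_layer([pm, analyst], store, events)
```

In this example only task-analyst is returned for submission; task-pm is marked blocked and never reaches the executor.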

_execute_run: blocked tasks are absent from layer_results. After the
result-processing loop, any layer_task_id missing from layer_results
was blocked → set all_succeeded = False so the run is finalized as
RunStatus.failed.
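
A small sketch of the missing-result check, assuming layer_results maps task id to a success flag. The real structure of layer_results is not shown in this PR, and the helper name layer_succeeded is hypothetical.

```python
def layer_succeeded(layer_task_ids, layer_results):
    """Return (all_succeeded, blocked_ids) for one layer."""
    all_succeeded = True
    for task_id, ok in layer_results.items():
        if not ok:
            all_succeeded = False
    # Blocked tasks were never submitted, so they have no entry in layer_results.
    blocked = [tid for tid in layer_task_ids if tid not in layer_results]
    if blocked:
        all_succeeded = False
    return all_succeeded, blocked


ok, blocked = layer_succeeded(
    ["task-pm", "task-analyst"],
    {"task-analyst": True},
)
```

Here the only submitted task succeeded, yet the layer is still reported as unsuccessful because task-pm has no result.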

Blocked state cascades naturally through deeper layers: layer-3 task Z
depending on a blocked layer-2 task Y will see Y.status="blocked" != completed and block in turn.
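
The cascade can be illustrated with a toy three-layer chain. The function propagate_blocks and the plain string statuses are hypothetical stand-ins; marking a task "completed" here replaces actually running it.

```python
def propagate_blocks(layers, depends_on, statuses):
    """Walk layers in order; block any pending task with a non-completed upstream."""
    for layer in layers:
        for task_id in layer:
            if statuses.get(task_id) != "pending":
                continue  # already finished, failed, or blocked
            upstream_ids = depends_on.get(task_id, [])
            if any(statuses.get(up) != "completed" for up in upstream_ids):
                statuses[task_id] = "blocked"
            else:
                statuses[task_id] = "completed"  # stand-in for running the task
    return statuses


layers = [["X"], ["Y"], ["Z"]]
depends_on = {"Y": ["X"], "Z": ["Y"]}
statuses = propagate_blocks(
    layers,
    depends_on,
    {"X": "failed", "Y": "pending", "Z": "pending"},
)
```

No special cascade logic is needed: Z is blocked for the same reason Y is, because "blocked" is simply another status that is not completed.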

Test plan

New file agent/tests/test_swarm_dag_gating.py with 3 regression
tests, all TDD red-green verified against this diff:
- test_failed_upstream_blocks_downstream — PM is blocked, not
  failed or completed, when risk fails
- test_blocked_downstream_emits_task_blocked_event — events.jsonl
  contains task_blocked for PM, NOT task_started
- test_run_marked_failed_when_downstream_blocked — run finalizes as
  RunStatus.failed
Full swarm test suite passes (115 tests in -k swarm, 0 regressions)
Manual replay against the canonical 2-task DAG (risk → pm) confirms
events.jsonl post-fix shows task_blocked not task_started for PM,
and PM's started_at remains None.
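
The events.jsonl check from the manual replay could be scripted along these lines. The field names type and task_id and the task_failed event name are assumptions about the event schema, which this PR does not spell out.

```python
import json


def event_types_for(lines, task_id):
    """Collect the event types recorded for one task from JSONL lines."""
    types = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        event = json.loads(line)
        if event.get("task_id") == task_id:
            types.append(event.get("type"))
    return types


# Hypothetical post-fix log for the 2-task DAG (risk -> pm).
sample = [
    '{"type": "task_started", "task_id": "task-risk"}',
    '{"type": "task_failed", "task_id": "task-risk"}',
    '{"type": "task_blocked", "task_id": "task-pm"}',
]
pm_types = event_types_for(sample, "task-pm")
```

Against a real run, the lines would come from iterating over the opened events.jsonl file instead of the inline sample.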

Diff is +181 LOC additive (no behavior change for previously-passing DAGs).

Bug
---
A failed task in layer N did NOT prevent layer N+1 tasks that declare it as a dependency from being dispatched. The orchestration loop in _execute_run set all_succeeded=False on failure but never gated the next layer. The downstream worker's upstream-summary loop in _execute_layer silently skipped the missing key (the `if source_task_id in task_summaries` check), so the worker ran with no upstream context.

This is most visible in the investment_committee preset, where portfolio_manager depends_on=["task-risk"]: a failed risk_officer let PM produce a "decision" with no risk input. Reproducer attached as tests/test_swarm_dag_gating.py shows the same pattern in 3 tasks.

The layer-boundary _sync_run_tasks_snapshot added in HKUDS#132 surfaces the failure to polling clients sooner, but it does not gate dispatch; PM still runs the moment the layer with the failed risk task finishes.

Fix
---
_execute_layer: before dispatch, walk task.depends_on and load each upstream from TaskStore. If any upstream has status != completed (failed / blocked / cancelled / missing), mark this task TaskStatus.blocked, record blocked_by + reason, emit task_blocked, and skip executor.submit. Same-layer peers with no shared upstream are unaffected (the loop continues).

_execute_run: tasks blocked in _execute_layer are absent from layer_results. After the result loop, any layer_task_id missing from layer_results was blocked -> set all_succeeded = False so the run is marked RunStatus.failed at finalization.

Tests
-----
3 new regression tests in tests/test_swarm_dag_gating.py (TDD red-green verified before-and-after on this PR's diff):
- test_failed_upstream_blocks_downstream
- test_blocked_downstream_emits_task_blocked_event
- test_run_marked_failed_when_downstream_blocked

Full swarm test suite: 115/115 pass (112 existing + 3 new).

Follow-up to #145. The new task_blocked event was emitted without agent_id, so the CLI live panel (cli/_legacy.py) could not attach it to the per-agent row; blocked agents stayed visually "pending" until the run finalized. Three small fixes plus a test assertion:
- runtime.py: pass agent_id=task.agent_id on the task_blocked emit
- cli/_legacy.py: handle task_blocked in the live event loop
- models.py: extend SwarmEvent docstring to mention task_blocked
- test_swarm_dag_gating.py: assert agent_id is present on the event
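
Why the missing agent_id mattered can be sketched with a toy event handler. The apply_event function, the rows dict and the display strings are hypothetical; the real live loop in cli/_legacy.py is not shown in this PR.

```python
def apply_event(rows, event):
    """Update per-agent display state for one event; skip events with no agent_id."""
    agent_id = event.get("agent_id")
    if agent_id is None:
        return rows  # no row to attach the event to
    if event["type"] == "task_started":
        rows[agent_id] = "running"
    elif event["type"] == "task_blocked":
        rows[agent_id] = "blocked"
    return rows


rows = {"risk_officer": "pending", "portfolio_manager": "pending"}

# Event shape before the follow-up: no agent_id, so the row cannot be updated.
apply_event(rows, {"type": "task_blocked", "task_id": "task-pm"})
before = dict(rows)

# Event shape after the follow-up: agent_id is present and the row flips.
apply_event(
    rows,
    {"type": "task_blocked", "task_id": "task-pm", "agent_id": "portfolio_manager"},
)
```

With the first event the portfolio_manager row stays "pending"; only the second event, which carries agent_id, moves it to "blocked".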

warren618 merged commit cd817f4 into HKUDS:main May 28, 2026

warren618 merged 1 commit into HKUDS:main from omcdecor-cyber:fix/swarm-dag-gating-on-failed-upstream
