Add Self-Hosted Runner Doctor workflow for ARC/DinD and enterprise AWF triage #5460
Add a scheduled agentic workflow that reviews self-hosted, ARC/DinD, GHEC, and GHES issues and PRs updated since the previous run and proposes knowledge-base updates for the Self-Hosted Runner Doctor as a single new issue. Includes a compiled lock file and a config test. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Pull request overview
Adds a reusable “Self-Hosted Runner Doctor” slash-command agentic workflow to triage ARC/DinD + enterprise/self-hosted runner failure reports, backed by a shared failure-mode catalog and expanded troubleshooting documentation.
Changes:
- Introduces the `/runner-doctor` agentic workflow source and a compiled lock workflow for immediate use.
- Adds a shared knowledge base of self-hosted/enterprise failure modes and an error-string lookup for faster matching.
- Extends `docs/troubleshooting.md` with self-hosted runner diagnostic guidance and adds a workflow config test to keep source/import/lock aligned.
Show a summary per file
| File | Description |
|---|---|
| scripts/ci/self-hosted-runner-doctor-workflow.test.ts | Adds Jest assertions to ensure the new workflow, shared import, and lock artifact stay aligned. |
| docs/troubleshooting.md | Adds new self-hosted runner troubleshooting sections (ARC/DinD, runner home, IPv6, proxies, GHES/GHEC routing). |
| .github/workflows/shared/self-hosted-failure-modes.md | Introduces a structured failure-mode catalog + quick lookup table for triage mapping. |
| .github/workflows/self-hosted-runner-doctor.md | Defines the new /runner-doctor slash-command workflow and triage playbook. |
| .github/workflows/self-hosted-runner-doctor.lock.yml | Adds the compiled, pinned, ready-to-run lock workflow artifact. |
Copilot's findings
- Files reviewed: 8/8 changed files
- Comments generated: 3
| B1 | `/home/runner/...` paths are wrong on a custom runner home | The runner uses a non-standard `HOME` | Use the real `HOME`; when configuring stdin, set `chroot.identity.home` | `echo "$HOME"` and inspect mounted home paths | #2109, #2290 |
| B2 | All outbound traffic fails behind a mandatory corporate proxy | AWF must chain Squid through the upstream proxy | Set `https_proxy` / `http_proxy` on the host or use `--upstream-proxy` | `env | grep -i proxy`; inspect Squid config for `cache_peer` | #1975 |
| B3 | Squid exits with `FATAL: http_port: IPv6 is not available` | Docker IPv6 is disabled but Squid tries to bind an IPv6 listener | Enable Docker IPv6 or upgrade to a build that emits the listener conditionally | `docker info | grep -i ipv6`; inspect `/proc/sys/net/ipv6/conf/all/disable_ipv6` | #2139 |
| B4 | `node: command not found` after `actions/setup-node` on self-hosted | Node was installed in `$HOME/work/_tool` and that toolcache is not visible | Mount / expose the runner toolcache; use `AWF_EXTRA_TOOLCACHE_DIRS` if needed | `which node`; inspect `$HOME/work/_tool/node` | #3544, #3545 |
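The error-string lookup idea behind these rows can be illustrated with a minimal Python sketch. This is not the workflow's actual matching logic, which is not shown in this PR excerpt; `FailureMode`, `CATALOG`, and `match_failure_modes` are hypothetical names, and the substrings are copied from rows B1, B3, and B4 above (B2 has no quoted error string, so it is omitted). Note that a `/home/runner/` match only flags B1 as a candidate, since that path is normal on a standard runner.

```python
from typing import NamedTuple


class FailureMode(NamedTuple):
    mode_id: str      # catalog ID, e.g. "B3"
    needle: str       # substring to look for in the reported log text
    diagnostic: str   # first diagnostic command from the catalog row


# Hypothetical in-memory copy of three catalog rows (assumption: the real
# catalog lives in a Markdown table and may be matched differently).
CATALOG = [
    FailureMode("B1", "/home/runner/", 'echo "$HOME"'),
    FailureMode("B3", "FATAL: http_port: IPv6 is not available",
                "docker info | grep -i ipv6"),
    FailureMode("B4", "node: command not found", "which node"),
]


def match_failure_modes(log_text: str) -> list[str]:
    """Return the IDs of every catalog row whose needle appears in the log."""
    return [mode.mode_id for mode in CATALOG if mode.needle in log_text]


def suggest_diagnostics(log_text: str) -> dict[str, str]:
    """Map each matched ID to the diagnostic command to run next."""
    return {
        mode.mode_id: mode.diagnostic
        for mode in CATALOG
        if mode.needle in log_text
    }
```

For example, passing a log excerpt that contains the Squid IPv6 error to `suggest_diagnostics` yields the B3 entry with its `docker info | grep -i ipv6` check, which a triager would then run on the affected host.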
✅ Coverage Check Passed
📁 Per-file Coverage Changes (1 file)
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
✅ Copilot review passed with no inline comments. @copilot Add the
🚀 Security Guard has started processing this pull request
✅ Smoke Copilot BYOK AOAI (Entra) completed. Copilot AOAI BYOK (Entra) mode operational. 🔓
✅ Build Test Suite completed successfully!
📡 Smoke OTel Tracing completed. All tracing scenarios validated. ✅
❌ Smoke Copilot BYOK AOAI (api-key) reports failed. AOAI BYOK (api-key) mode investigation needed... Failed safeoutputs calls for add_comment; no action taken
📰 VERDICT: Smoke Copilot has concluded. All systems operational. This is a developing story. 🎤
🔌 Smoke Services: All services reachable! ✅
✅ Smoke Copilot BYOK completed. Copilot BYOK mode operational. 🔓
✨ The prophecy is fulfilled... Smoke Codex has completed its mystical journey. The stars align. 🌟
🔑 Smoke Copilot PAT: PAT auth validated. All systems operational. ✅
✅ Smoke Gemini completed. All facets verified. 💎 Testing safeoutputs
Chroot tests passed! Smoke Chroot - All security and functionality tests succeeded.
❌ Contribution Check failed. Please review the logs for details.
✅ Smoke Claude passed
🔬 Smoke Test: Copilot PAT Auth
Result: PASS
Auth mode: PAT (COPILOT_GITHUB_TOKEN)

Smoke Test: Claude Engine
Overall result: PASS

Copilot BYOK Smoke Test Results
✅ GitHub MCP Testing: list_pull_requests verified
Status: PASS

🔬 Smoke Test Results
PR: Add Self-Hosted Runner Doctor workflow for ARC/DinD and enterprise AWF triage
Overall: FAIL (pre-agent step outputs were not passed to agent; template variables unexpanded)

Smoke Test: API Proxy OpenTelemetry Tracing
All scenarios pass. OTEL tracing integration is fully operational.

Smoke Test Results
PR Titles:
Overall status: FAIL
Warning: Firewall blocked 1 domain
The following domain was blocked by the firewall during workflow execution:
network:
  allowed:
    - defaults
    - "localhost"
See Network Configuration for more information.

Add Self-Hosted Runner Doctor workflow for ARC/DinD and enterprise AWF triage ✅ GitHub reads

Chroot Version Comparison Results
Overall: FAILED. Python and Node.js versions differ between host and chroot environments. The

GitHub MCP Testing: ✅
Running in direct BYOK mode (AWF_AUTH_TYPE=github-oidc + AWF_AUTH_AZURE_* + COPILOT_PROVIDER_BASE_URL) via api-proxy → Azure OpenAI (Foundry, o4-mini-aw), authenticated via Microsoft Entra
Overall: PASS

🏗️ Build Test Suite Results
Overall: 8/8 ecosystems passed, ✅ PASS

Smoke Test: GitHub Actions Services Connectivity
Overall: FAIL
AWF firewall explicitly blocks ports 6379 (Redis) and 5432 (PostgreSQL) as "dangerous ports" in
AWF’s self-hosted failure modes were documented across many issues but not captured in a reusable diagnostic workflow. This change adds a dedicated Runner Doctor agent for user-reported ARC/DinD, GHES/GHEC, and self-hosted runner problems, plus targeted troubleshooting guidance.
New agentic workflow
- Adds `self-hosted-runner-doctor.md` as a `/runner-doctor` issue-comment workflow for self-hosted and enterprise runner diagnostics.
- Inspects `DOCKER_HOST`, ARC fingerprints, `GITHUB_SERVER_URL`, runner home, libc/runtime, and Docker IPv6 state.

Shared failure-mode knowledge base
- Adds `shared/self-hosted-failure-modes.md` with a structured catalog covering:
  - `GH_HOST` failures

Compiled workflow artifact

Troubleshooting docs
- Extends `docs/troubleshooting.md` with practical sections for:

Workflow coverage
Example diagnostic pattern now embedded in the workflow: