
Add Self-Hosted Runner Doctor workflow for ARC/DinD and enterprise AWF triage#5460

Merged
lpcox merged 6 commits into main from copilot/self-hosted-runner-doctor on Jun 23, 2026

Copilot AI commented Jun 23, 2026

AWF’s self-hosted failure modes were documented across many issues but not captured in a reusable diagnostic workflow. This change adds a dedicated Runner Doctor agent for user-reported ARC/DinD, GHES/GHEC, and self-hosted runner problems, plus targeted troubleshooting guidance.

  • New agentic workflow

    • Adds self-hosted-runner-doctor.md as a /runner-doctor issue-comment workflow for self-hosted and enterprise runner diagnostics.
    • Uses read-only permissions with safe outputs for triage comments/issues.
    • Bakes in a platform-first diagnostic flow: DOCKER_HOST, ARC fingerprints, GITHUB_SERVER_URL, runner home, libc/runtime, and Docker IPv6 state.
  • Shared failure-mode knowledge base

    • Adds shared/self-hosted-failure-modes.md with a structured catalog covering:
      • ARC/DinD split-filesystem and chroot failures
      • self-hosted runner home/toolcache/proxy issues
      • GHES/GHEC/data-residency routing and GH_HOST failures
      • known unresolved runtime gaps
    • Includes an error-string lookup table so the workflow can map symptoms to likely failure modes quickly.
  • Compiled workflow artifact

    • Adds the generated lock file so the new workflow is ready to run under the existing gh-aw conventions.
  • Troubleshooting docs

    • Extends docs/troubleshooting.md with practical sections for:
      • ARC/DinD split filesystems
      • non-standard runner homes
      • IPv6-disabled Docker/Squid failures
      • corporate upstream proxies
      • GHES/GHEC enterprise routing issues
  • Workflow coverage

    • Adds a focused workflow config test to keep the source, shared import, and compiled lock file aligned.

Example diagnostic pattern now embedded in the workflow:

printenv DOCKER_HOST ACTIONS_RUNNER_POD_NAME ACTIONS_RUNNER_CONTAINER_HOOKS GITHUB_SERVER_URL HOME GH_HOST

docker info 2>/dev/null | grep -Ei 'Runtimes|Default Runtime|IPv6|Docker Root Dir'

mkdir -p /tmp/gh-aw/agent
SENTINEL="/tmp/gh-aw/agent/awf-runner-doctor-$$"
echo ok > "$SENTINEL"
docker run --rm -v /tmp:/tmp alpine sh -lc "ls -l $SENTINEL" 2>/dev/null
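
One limitation of the pattern above is that `printenv` with several names prints only the values of the variables that are set, so the output does not show which names were missing. A minimal sketch of a labeled variant follows. The helper names `report_env` and `check_tmp_visibility` are hypothetical and not part of this PR, and reading a missing sentinel as a split filesystem is an assumption based on the ARC/DinD failure mode described above.

```shell
# Hypothetical helpers; names are illustrative and not taken from the PR.

# Print NAME=value for each argument, or NAME=<unset> when the variable
# is not present in the environment.
report_env() {
  local name value
  for name in "$@"; do
    if value=$(printenv "$name"); then
      printf '%s=%s\n' "$name" "$value"
    else
      printf '%s=<unset>\n' "$name"
    fi
  done
}

# Write a sentinel under /tmp on the runner, then ask a container that
# bind-mounts /tmp whether it can see the same file.
check_tmp_visibility() {
  local dir=/tmp/gh-aw/agent sentinel
  if ! docker info >/dev/null 2>&1; then
    echo "docker daemon not reachable; cannot test /tmp visibility"
    return 2
  fi
  mkdir -p "$dir"
  sentinel="$dir/awf-runner-doctor-$$"
  echo ok > "$sentinel"
  if docker run --rm -v /tmp:/tmp alpine sh -c "test -f '$sentinel'" 2>/dev/null; then
    echo "sentinel visible in container: runner and daemon appear to share /tmp"
  else
    # Assumption: a miss suggests a split filesystem, but a failed image
    # pull would produce the same result in this sketch.
    echo "sentinel NOT visible in container: possible ARC/DinD split filesystem"
  fi
  rm -f "$sentinel"
}
```

A typical call would be `report_env DOCKER_HOST ACTIONS_RUNNER_POD_NAME ACTIONS_RUNNER_CONTAINER_HOOKS GITHUB_SERVER_URL HOME GH_HOST` followed by `check_tmp_visibility`.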

Copilot AI changed the title from "[WIP] Propose Self-Hosted Runner Doctor agent for AWF diagnostics" to "Add Self-Hosted Runner Doctor workflow for ARC/DinD and enterprise AWF triage" on Jun 23, 2026
Copilot finished work on behalf of lpcox June 23, 2026 23:18
Copilot AI requested a review from lpcox June 23, 2026 23:18
@lpcox lpcox marked this pull request as ready for review June 23, 2026 23:31
Copilot AI review requested due to automatic review settings June 23, 2026 23:31
Add a scheduled agentic workflow that reviews self-hosted, ARC/DinD,
GHEC, and GHES issues and PRs updated since the previous run and proposes
knowledge-base updates for the Self-Hosted Runner Doctor as a single new
issue. Includes a compiled lock file and a config test.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@github-actions

github-actions Bot commented Jun 23, 2026

Documentation Preview

Documentation build failed for this PR. View logs.

Built from commit e5923b7

Copilot AI left a comment

Pull request overview

Adds a reusable “Self-Hosted Runner Doctor” slash-command agentic workflow to triage ARC/DinD + enterprise/self-hosted runner failure reports, backed by a shared failure-mode catalog and expanded troubleshooting documentation.

Changes:

  • Introduces /runner-doctor agentic workflow source + compiled lock workflow for immediate use.
  • Adds a shared knowledge base of self-hosted/enterprise failure modes and an error-string lookup for faster matching.
  • Extends docs/troubleshooting.md with self-hosted runner diagnostic guidance and adds a workflow config test to keep source/import/lock aligned.
Show a summary per file

| File | Description |
| --- | --- |
| scripts/ci/self-hosted-runner-doctor-workflow.test.ts | Adds Jest assertions to ensure the new workflow, shared import, and lock artifact stay aligned. |
| docs/troubleshooting.md | Adds new self-hosted runner troubleshooting sections (ARC/DinD, runner home, IPv6, proxies, GHES/GHEC routing). |
| .github/workflows/shared/self-hosted-failure-modes.md | Introduces a structured failure-mode catalog + quick lookup table for triage mapping. |
| .github/workflows/self-hosted-runner-doctor.md | Defines the new /runner-doctor slash-command workflow and triage playbook. |
| .github/workflows/self-hosted-runner-doctor.lock.yml | Adds the compiled, pinned, ready-to-run lock workflow artifact. |

Copilot's findings

  • Files reviewed: 8/8 changed files
  • Comments generated: 3

Comment thread .github/workflows/shared/self-hosted-failure-modes.md Outdated
| B1 | `/home/runner/...` paths are wrong on a custom runner home | The runner uses a non-standard `HOME` | Use the real `HOME`; when configuring stdin, set `chroot.identity.home` | `echo "$HOME"` and inspect mounted home paths | #2109, #2290 |
| B2 | All outbound traffic fails behind a mandatory corporate proxy | AWF must chain Squid through the upstream proxy | Set `https_proxy` / `http_proxy` on the host or use `--upstream-proxy` | `env | grep -i proxy`; inspect Squid config for `cache_peer` | #1975 |
| B3 | Squid exits with `FATAL: http_port: IPv6 is not available` | Docker IPv6 is disabled but Squid tries to bind an IPv6 listener | Enable Docker IPv6 or upgrade to a build that emits the listener conditionally | `docker info | grep -i ipv6`; inspect `/proc/sys/net/ipv6/conf/all/disable_ipv6` | #2139 |
| B4 | `node: command not found` after `actions/setup-node` on self-hosted | Node was installed in `$HOME/work/_tool` and that toolcache is not visible | Mount / expose the runner toolcache; use `AWF_EXTRA_TOOLCACHE_DIRS` if needed | `which node`; inspect `$HOME/work/_tool/node` | #3544, #3545 |
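
The rows above pair an observable symptom with a catalog ID. As a rough illustration of how an error-string lookup could work, the sketch below maps the two literal error strings quoted in these rows to their IDs. The function name `classify_log_line` is hypothetical, and the actual lookup table in shared/self-hosted-failure-modes.md is not reproduced here, so the patterns should be read as assumptions.

```shell
# Hypothetical helper; the real lookup table in the PR is not shown here.
# Map a single log line to a failure-mode ID from the rows above.
classify_log_line() {
  case "$1" in
    *"FATAL: http_port: IPv6 is not available"*)
      # B3: Squid tries to bind an IPv6 listener while Docker IPv6 is disabled.
      echo "B3"
      ;;
    *"node: command not found"*)
      # B4: only meaningful on self-hosted runners after actions/setup-node.
      echo "B4"
      ;;
    *)
      echo "unknown"
      ;;
  esac
}
```

For example, `classify_log_line "$(tail -n 1 squid.log)"` would print `B3` if the last line of a (hypothetical) `squid.log` contained the Squid IPv6 error.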
Comment thread docs/troubleshooting.md Outdated
@github-actions

✅ Coverage Check Passed

Overall Coverage

| Metric | Base | PR | Delta |
| --- | --- | --- | --- |
| Lines | 98.06% | 98.10% | 📈 +0.04% |
| Statements | 98.00% | 98.03% | 📈 +0.03% |
| Functions | 99.52% | 99.52% | ➡️ +0.00% |
| Branches | 93.75% | 93.78% | 📈 +0.03% |

📁 Per-file Coverage Changes (1 files)

| File | Lines (Before → After) | Statements (Before → After) |
| --- | --- | --- |
| src/workdir-setup.ts | 92.7% → 94.5% (+1.82%) | 92.7% → 94.5% (+1.82%) |

Coverage comparison generated by scripts/ci/compare-coverage.ts

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
@github-actions

✅ Copilot review passed with no inline comments.

@copilot Add the ready-for-aw label to this PR to trigger agentic CI smoke tests.

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
@github-actions

🚀 Security Guard has started processing this pull request

@github-actions

github-actions Bot commented Jun 23, 2026

Smoke Copilot BYOK AOAI (Entra) completed. Copilot AOAI BYOK (Entra) mode operational. 🔓

@github-actions

github-actions Bot commented Jun 23, 2026

Build Test Suite completed successfully!

@github-actions

github-actions Bot commented Jun 23, 2026

📡 Smoke OTel Tracing completed. All tracing scenarios validated. ✅

@github-actions

github-actions Bot commented Jun 23, 2026

Smoke Copilot BYOK AOAI (api-key) reports failed. AOAI BYOK (api-key) mode investigation needed...

Failed safeoutputs calls for add_comment; no action taken

@github-actions

github-actions Bot commented Jun 23, 2026

📰 VERDICT: Smoke Copilot has concluded. All systems operational. This is a developing story. 🎤

@github-actions

github-actions Bot commented Jun 23, 2026

🔌 Smoke Services — All services reachable! ✅

@github-actions

github-actions Bot commented Jun 23, 2026

Smoke Copilot BYOK completed. Copilot BYOK mode operational. 🔓

@github-actions

github-actions Bot commented Jun 23, 2026

✨ The prophecy is fulfilled... Smoke Codex has completed its mystical journey. The stars align. 🌟

@github-actions

github-actions Bot commented Jun 23, 2026

🔑 Smoke Copilot PAT: PAT auth validated. All systems operational. ✅

@github-actions

github-actions Bot commented Jun 23, 2026

Smoke Gemini completed. All facets verified. 💎

Testing safeoutputs

@github-actions

github-actions Bot commented Jun 23, 2026

Chroot tests passed! Smoke Chroot - All security and functionality tests succeeded.

@github-actions

github-actions Bot commented Jun 23, 2026

Contribution Check failed. Please review the logs for details.

@github-actions

github-actions Bot commented Jun 23, 2026

Smoke Claude passed

@github-actions

🔬 Smoke Test: Copilot PAT Auth — PASS

| Test | Result |
| --- | --- |
| GitHub MCP connectivity | |
| GitHub.com HTTP | ✅ 200 |
| File write/read | |

Auth mode: PAT (COPILOT_GITHUB_TOKEN)

cc @Copilot @lpcox

🔑 PAT report filed by Smoke Copilot PAT

@github-actions

Smoke Test: Claude Engine

  • API status: ✅ PASS
  • gh check: ✅ PASS
  • File status: ✅ PASS

Overall result: PASS

Generated by Smoke Claude for issue #5460 · 61.1 AIC · ⊞ 3.1K ·

@github-actions

Copilot BYOK Smoke Test Results

✅ GitHub MCP Testing — list_pull_requests verified
✅ GitHub.com Connectivity — HTTP 200
✅ File Write/Read Test — smoke-test-copilot-byok.txt found
✅ BYOK Inference Test — api-proxy sidecar → api.githubcopilot.com working

Status: PASS
Running in direct BYOK mode (COPILOT_PROVIDER_API_KEY) via api-proxy → api.githubcopilot.com

cc/ @lpcox @Copilot

🔑 BYOK report filed by Smoke Copilot BYOK

@github-actions

🔬 Smoke Test Results

PR: Add Self-Hosted Runner Doctor workflow for ARC/DinD and enterprise AWF triage
Author: @Copilot | Assignees: @lpcox, @Copilot

| Test | Result |
| --- | --- |
| GitHub MCP connectivity | |
| GitHub.com HTTP connectivity | ✅ 200 |
| File write/read | ❌ template vars unexpanded |
| Pre-fetched PR data | ❌ template vars unexpanded |

Overall: FAIL — pre-agent step outputs were not passed to agent (template variables unexpanded)

📰 BREAKING: Report filed by Smoke Copilot

@github-actions

Smoke Test: API Proxy OpenTelemetry Tracing

| Scenario | Result | Details |
| --- | --- | --- |
| 1. Module Loading | | otel.js loads; exports startRequestSpan, setTokenAttributes, setBudgetAttributes, endSpan, endSpanError, shutdown, isEnabled; isEnabled() = true |
| 2. Test Suite | | 59/59 tests passed (2 suites: otel.test.js, otel-fanout.test.js) — spans, attributes, exporters, serialization, fan-out, shutdown all covered |
| 3. Env Var Forwarding | | api-proxy-env-config.ts forwards GH_AW_OTLP_ENDPOINTS, OTEL_EXPORTER_OTLP_ENDPOINT, OTEL_EXPORTER_OTLP_HEADERS, GITHUB_AW_OTEL_TRACE_ID, GITHUB_AW_OTEL_PARENT_SPAN_ID, OTEL_SERVICE_NAME |
| 4. Token Tracker Integration | | onUsage callback present in token-tracker-http.js (line 283/324); proxy-request.js calls startRequestSpan/setTokenAttributes/endSpan/endSpanError |
| 5. OTEL Diagnostics | ✅ (N/A) | No live container — FileSpanExporter fallback to /var/log/api-proxy/otel.jsonl confirmed via test; no spans exported in static analysis run (expected) |

All scenarios pass. OTEL tracing integration is fully operational.

📡 OTel tracing validated by Smoke OTel Tracing

@github-actions

Smoke Test Results

  • GitHub MCP: ❌ (Tools not found)
  • GitHub Connectivity: ❌ (SSL Error 35)
  • File Writing: ✅
  • Bash Tool: ✅

PR Titles:

  1. Refactor api-proxy isolation (Refactor api-proxy credential isolation into per-provider env builders #5442)
  2. (Unknown)

Overall status: FAIL

Warning

Firewall blocked 1 domain

The following domain was blocked by the firewall during workflow execution:

  • localhost

To allow these domains, add them to the network.allowed list in your workflow frontmatter:

network:
  allowed:
    - defaults
    - "localhost"

See Network Configuration for more information.

💎 Faceted by Smoke Gemini

@github-actions

Add Self-Hosted Runner Doctor workflow for ARC/DinD and enterprise AWF triage
Refactor api-proxy credential isolation into per-provider env builders
Refactor api-proxy service config into env and lifecycle builders

✅ GitHub reads
✅ Playwright title
✅ File write/read
✅ Build
✅ Discussion comment
Overall status: PASS

🔮 The oracle has spoken through Smoke Codex

@github-actions

Chroot Version Comparison Results

| Runtime | Host Version | Chroot Version | Match? |
| --- | --- | --- | --- |
| Python | Python 3.12.13 | Python 3.12.3 | ❌ NO |
| Node.js | v24.16.0 | v22.22.3 | ❌ NO |
| Go | go1.22.12 | go1.22.12 | ✅ YES |

Overall: FAILED — Python and Node.js versions differ between host and chroot environments. The smoke-chroot label was not applied.

Tested by Smoke Chroot

@github-actions

@lpcox @Copilot
Pre-fetched PR Data titles: Refactor host-access validation tests to remove duplicated exit/assert scaffolding; refactor(test): extract runAuditFilter helper to deduplicate audit filter tests

GitHub MCP Testing: ✅
GitHub.com Connectivity: ✅
File Write/Read Test: ✅
BYOK Inference Test: ✅

Running in direct BYOK mode (AWF_AUTH_TYPE=github-oidc + AWF_AUTH_AZURE_* + COPILOT_PROVIDER_BASE_URL) via api-proxy → Azure OpenAI (Foundry, o4-mini-aw) authenticated via Microsoft Entra

Overall: PASS

🪪 BYOK (AOAI Entra) report filed by Smoke Copilot BYOK AOAI (Entra)

@github-actions

🏗️ Build Test Suite Results

| Ecosystem | Project | Build/Install | Tests | Status |
| --- | --- | --- | --- | --- |
| Bun | elysia | | 1/1 passed | ✅ PASS |
| Bun | hono | | 1/1 passed | ✅ PASS |
| C++ | fmt | | N/A | ✅ PASS |
| C++ | json | | N/A | ✅ PASS |
| Deno | oak | N/A | 1/1 passed | ✅ PASS |
| Deno | std | N/A | 1/1 passed | ✅ PASS |
| .NET | hello-world | | N/A | ✅ PASS |
| .NET | json-parse | | N/A | ✅ PASS |
| Go | color | | passed | ✅ PASS |
| Go | env | | passed | ✅ PASS |
| Go | uuid | | passed | ✅ PASS |
| Java | gson | | 1/1 passed | ✅ PASS |
| Java | caffeine | | 1/1 passed | ✅ PASS |
| Node.js | clsx | | all passed | ✅ PASS |
| Node.js | execa | | all passed | ✅ PASS |
| Node.js | p-limit | | all passed | ✅ PASS |
| Rust | fd | | 1/1 passed | ✅ PASS |
| Rust | zoxide | | 1/1 passed | ✅ PASS |

Overall: 8/8 ecosystems passed — ✅ PASS

Generated by Build Test Suite for issue #5460 · 48.3 AIC · ⊞ 7.7K ·

@github-actions

Smoke Test: GitHub Actions Services Connectivity

| Check | Result | Detail |
| --- | --- | --- |
| Redis PING | | Connection timed out (port 6379 blocked by AWF iptables) |
| PostgreSQL pg_isready | | No response (port 5432 blocked by AWF iptables) |
| PostgreSQL SELECT 1 | | Failed (port 5432 blocked by AWF iptables) |

Overall: FAIL

AWF firewall explicitly blocks ports 6379 (Redis) and 5432 (PostgreSQL) as "dangerous ports" in setup-iptables.sh. Services on host.docker.internal are unreachable from inside the sandbox.
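
To reproduce this kind of result without a Redis or PostgreSQL client installed, a plain TCP probe is usually enough. The sketch below is an assumption-laden illustration, not the check used by the smoke test: `probe_port` is a hypothetical helper, it relies on bash's `/dev/tcp` redirection and the coreutils `timeout` command, and it cannot tell a firewall drop from a service that is simply not listening.

```shell
# Hypothetical helper; not the check used by the smoke test itself.
# Try to open a TCP connection via bash's /dev/tcp redirection and
# give up after 3 seconds.
probe_port() {
  local host=$1 port=$2
  # Host and port are passed as $0 and $1 of the inner bash script.
  if timeout 3 bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$host" "$port" 2>/dev/null; then
    echo "$host:$port reachable"
  else
    echo "$host:$port unreachable"
    return 1
  fi
}
```

Inside the sandbox, `probe_port host.docker.internal 6379` and `probe_port host.docker.internal 5432` would be expected to report the ports as unreachable if the blocking described above is in effect.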

🔌 Service connectivity validated by Smoke Services

@lpcox lpcox merged commit 6d26488 into main Jun 23, 2026
87 of 90 checks passed
@lpcox lpcox deleted the copilot/self-hosted-runner-doctor branch June 23, 2026 23:57