
Agent Learning Kit v1 + unified benchmark harness #49

Open
nik13 wants to merge 5 commits into dev from release/v1-agent-learning-kit-dev



nik13 (Collaborator) commented Jun 11, 2026

Agent Learning Kit v1 + unified benchmark harness

This PR brings the complete, gate-green Agent Learning Kit v1 to dev, plus a
new unified benchmark harness. The package is shippable as agent-learning-kit
(import agent_learning, CLI agent-learn), with the engine vendored under fi.

What's in it

  • v1 kit: simulation, evaluation, optimization, red-teaming, and the live
    framework loops (LiveKit / Pipecat / LangChain / MCP / A2A) — all credential-free
    in the release gates, live only behind opt-in env flags.
  • Dashboard run-telemetry (W&B / promptfoo-style): keyed runs deep-link to the
    dashboard, otherwise a local log; no hosted-service requirement in any gate.
  • Unified benchmark harness (agent-learn bench / agent_learning.bench):
    one Task <-> Verifier contract and one unified Result across modalities.
    • Control modes: push (harness drives the agent over a task dataset),
      artifact_in (submit-and-score against a held-out oracle), and pull
      (the agent drives a live env via reset/step — the Gym/OpenEnv shape).
    • Coding lane, hardened: command/artifact-graded. The candidate produces
      files/output; a held-out grader runs after the candidate is killed, and the
      verdict is the grader's exit code + a grader-owned reward file (never candidate
      stdout). Structurally resists output-forging and oracle-read. Multi-language.
      Two sandboxes: credential-free subprocess (the gate tier) and an opt-in,
      network-isolated, capability-dropped, non-root, read-only Docker lane.
    • Voice lane: transcript verifier (latency / turn-taking / barge-in / content).
    • Pull/RL lane: in-process, deterministic env registry (reference + noop
      policies prove solvability); a live external env server plugs in unchanged.
    • Honesty primitives: every row carries execution_class / evidence_class;
      infra failures are recorded as honest void rows (excluded from pass-rate),
      never as agent failures; an overclaim tripwire + reward-hack detector.
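The verdict and void-row rules in the list above can be sketched roughly as follows. This is an illustration only, not the kit's code: `Row`, `judge`, `pass_rate`, the outcome labels, and the rule that a pass needs exit code 0 plus a positive reward are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Row:
    """One result row (hypothetical shape, not the kit's Result type)."""
    task_id: str
    outcome: str  # "pass", "fail", or "void" (labels assumed for this sketch)
    reward: Optional[float] = None


def judge(task_id: str, grader_exit: Optional[int], reward_text: Optional[str],
          candidate_stdout: str = "") -> Row:
    """Build a row from grader-owned evidence only."""
    del candidate_stdout  # accepted only to show it is never consulted
    if grader_exit is None or reward_text is None:
        # The grader never ran or left no reward file: infra failure, so void.
        return Row(task_id, "void")
    try:
        reward = float(reward_text.strip())
    except ValueError:
        # Treating an unreadable reward file as infra failure is an assumption.
        return Row(task_id, "void")
    outcome = "pass" if grader_exit == 0 and reward > 0 else "fail"
    return Row(task_id, outcome, reward)


def pass_rate(rows: Iterable[Row]) -> Optional[float]:
    """Pass-rate over scored rows; void rows are excluded from the denominator."""
    scored = [r for r in rows if r.outcome != "void"]
    if not scored:
        return None
    return sum(r.outcome == "pass" for r in scored) / len(scored)


rows = [
    judge("t1", 0, "1.0\n"),
    judge("t2", 1, "0.0\n", candidate_stdout="ALL TESTS PASSED"),  # forged stdout
    judge("t3", None, None),  # sandbox never started
]
print(pass_rate(rows))  # 0.5
```

The forged `ALL TESTS PASSED` line has no effect because `judge` reads only the grader's exit code and reward text, and the void row for `t3` leaves the pass-rate at one pass out of two scored rows.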

Quality / validation

  • Release-check 81/81 green (gate #81 bench_contract_readiness audits the full
    suite: reference-pass, discrimination, determinism, held-out oracle, guards,
    command-graded, pull, voice).
  • Full test suite green; ruff, mypy, and bandit clean on the bench code
    (Bandit's subprocess-family findings are configured out with documented rationale —
    this kit runs candidate code by design; isolation lives in the Docker lane).
  • No internal planning docs and no competitor names in the shipped tree.
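For the pull lane, the reference-pass, discrimination, and determinism checks named under gate #81 can be pictured with a toy reset/step env. Everything below is a hypothetical sketch: `CounterEnv`, `rollout`, and both policies are invented for illustration and are not taken from the kit's env registry.

```python
from typing import Callable, Tuple


class CounterEnv:
    """Toy deterministic env: reach `target` by adding actions to a counter."""

    def __init__(self, target: int = 3, max_steps: int = 5) -> None:
        self.target = target
        self.max_steps = max_steps
        self.state = 0
        self.steps = 0

    def reset(self) -> int:
        self.state = 0
        self.steps = 0
        return self.state

    def step(self, action: int) -> Tuple[int, float, bool]:
        self.steps += 1
        self.state += action
        solved = self.state == self.target
        done = solved or self.steps >= self.max_steps
        return self.state, (1.0 if solved else 0.0), done


def rollout(env: CounterEnv, policy: Callable[[int], int]) -> float:
    """The agent drives the env (pull mode): reset once, then step until done."""
    obs = env.reset()
    done = False
    total = 0.0
    while not done:
        obs, reward, done = env.step(policy(obs))
        total += reward
    return total


def reference_policy(obs: int) -> int:
    return 1  # always increments, so the target is reached


def noop_policy(obs: int) -> int:
    return 0  # never moves, so the episode times out


env = CounterEnv()
ref_score = rollout(env, reference_policy)
noop_score = rollout(env, noop_policy)
print(ref_score, noop_score)  # 1.0 0.0
```

A reference score of 1.0 shows the task is solvable, the gap to the noop score shows the env discriminates between policies, and repeating a rollout with the same policy returns the same total because the env has no randomness.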

🤖 Generated with Claude Code


Documentation & cookbooks (added)

The bench harness is now fully documented in the kit's gate-enforced docs system —
8 new pages, each with an executable twin under examples/, admitted by the
docs_executability release gate:

  • eval/benchmark-overview · eval/benchmark-coding · eval/benchmark-command-graded
    · eval/benchmark-sandboxes · eval/benchmark-voice · eval/benchmark-pull-rl
    · eval/benchmark-write-a-suite · prove/benchmark-in-ci

Plus cookbook-matrix entries, a regenerated byte-exact docs/llms.txt, and the six
previously-missing CLI rows in reference/cli.md (bench / persona / scenario
/ simulation / practice / runs). An adversarial review pass corrected two
accuracy defects before merge. Re-verified: release-check 81/81, ruff clean,
125 docs/bench tests pass.


Import namespace: agent_learning → fi.alk

The public SDK now imports as fi.alk, under the existing fi Future AGI
namespace package — alongside the vendored engine (fi.evals / fi.simulate /
fi.opt), so the whole kit composes under one fi namespace.

Unchanged public surface: distribution name agent-learning-kit, CLI
agent-learn, artifact-kind strings (agent-learning.*.v1), and the
AGENT_LEARNING_* environment variables. Only the Python import path moves
(import agent_learning → import fi.alk).

Mechanics: git mv src/agent_learning → src/fi/alk (history preserved);
443-file deterministic whole-word rewrite; release gates updated for the nested
layout (single src/fi wheel root, deduped scan-roots). Validation:
release-check 81/81, full suite 1139 tests green, ruff clean,
agent-learn doctor passes.
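A whole-word rewrite of this kind can be sketched with a regex word boundary. This is a minimal illustration under assumptions, not the script used for the 443-file rewrite: `rewrite_imports` is a made-up helper, and the env-var and artifact-kind strings in the samples are placeholders that only follow the `AGENT_LEARNING_*` and `agent-learning.*.v1` shapes.

```python
import re

# \b matches only at word boundaries, and "_" counts as a word character,
# so identifiers that merely contain the old name are left alone.
_OLD_NAME = re.compile(r"\bagent_learning\b")


def rewrite_imports(source: str) -> str:
    """Replace the bare module name `agent_learning` with `fi.alk`."""
    return _OLD_NAME.sub("fi.alk", source)


samples = [
    "import agent_learning",
    "from agent_learning.bench import run",
    "AGENT_LEARNING_EXAMPLE=1",       # placeholder env var: different case, untouched
    "agent-learning.example.v1",      # placeholder artifact kind: hyphenated, untouched
    "my_agent_learning_tool",         # longer identifier, untouched
]
for line in samples:
    print(rewrite_imports(line))
```

Because the pattern is case-sensitive and anchored on word boundaries, the distribution name, CLI name, artifact-kind strings, and environment variables pass through unchanged. Filesystem paths such as `src/agent_learning` would need separate handling (the PR uses `git mv` for that), since a textual substitution would turn them into `src/fi.alk`.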

nik13 and others added 2 commits June 10, 2026 17:10
Snapshots the current release/v1-agent-learning-kit tree (the complete, gate-green
v1 kit) onto this dev-based branch, in place. Over the prior snapshot this adds:

- Dashboard run-telemetry (W&B/promptfoo-style) + the live framework-loop examples.
- The UNIFIED BENCHMARK HARNESS (agent-learn bench / agent_learning.bench): one
  Task<->Verifier contract; push / artifact-in / pull control modes; a unified
  Result. Coding lane hardened (command/artifact-graded: held-out grader runs
  after the candidate, verdict = grader exit + reward file, forge + oracle-read
  resistant; multi-language; subprocess + opt-in network-isolated Docker). Pull/RL
  env registry. Voice transcript verifier. Gate #81 bench_contract_readiness.
- PR-review hardening pass + coding-standards pass (ruff + mypy + bandit clean).

Validation: release-check 81/81; full test suite green; ruff/mypy/bandit clean.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
nik13 changed the title from "Prepare Agent Learning Kit v1 release branch" to "Agent Learning Kit v1 + unified benchmark harness" on Jun 23, 2026
nik13 and others added 3 commits June 23, 2026 00:55
…k 81/81)

Snapshots the bench documentation buildout onto PR #49: 8 pages + 8 fresh-lane
example twins, cookbook matrix entries, regenerated docs/llms.txt, and the six
missing CLI reference rows. release-check 81/81; ruff clean; 125 tests pass.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Prose-only fixes: benchmark-coding overclaim, benchmark-in-ci error-message
fidelity, and the simulation CLI row (lift/validate/run). docs gate re-verified.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Public import is now fi.alk (under the fi namespace package), alongside the
vendored engine (fi.evals/fi.simulate/fi.opt). Distribution name
agent-learning-kit, CLI agent-learn, artifact kinds (agent-learning.*.v1) and
AGENT_LEARNING_* env vars are unchanged; only the import path moves.

git mv src/agent_learning -> src/fi/alk; 443-file deterministic rewrite; release
gates updated for the nested layout. Validation: release-check 81/81; full suite
1139 tests green; ruff clean; agent-learn doctor passes.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>