Agent Learning Kit v1 + unified benchmark harness by nik13 · Pull Request #49 · future-agi/agent-learning-kit

nik13 · 2026-06-11T00:11:23Z

Agent Learning Kit v1 + unified benchmark harness

This PR brings the complete, gate-green Agent Learning Kit v1 to dev, plus a
new unified benchmark harness. The package is shippable as agent-learning-kit
(import agent_learning, CLI agent-learn), with the engine vendored under fi.

What's in it

v1 kit: simulation, evaluation, optimization, red-teaming, and the live
framework loops (LiveKit / Pipecat / LangChain / MCP / A2A) — all credential-free
in the release gates, live only behind opt-in env flags.
Dashboard run-telemetry (W&B / promptfoo-style): keyed runs deep-link to the
dashboard, otherwise a local log; no hosted-service requirement in any gate.
Unified benchmark harness (agent-learn bench / agent_learning.bench):
one Task <-> Verifier contract and one unified Result across modalities.
- Control modes: push (harness drives the agent over a task dataset),
  artifact_in (submit-and-score against a held-out oracle), and pull
  (the agent drives a live env via reset/step — the Gym/OpenEnv shape).
- Coding lane, hardened: command/artifact-graded. The candidate produces
  files/output; a held-out grader runs after the candidate is killed, and the
  verdict is the grader's exit code + a grader-owned reward file (never candidate
  stdout). Structurally resists output-forging and oracle-read. Multi-language.
  Two sandboxes: credential-free subprocess (the gate tier) and an opt-in,
  network-isolated, capability-dropped, non-root, read-only Docker lane.
- Voice lane: transcript verifier (latency / turn-taking / barge-in / content).
- Pull/RL lane: in-process, deterministic env registry (reference + noop
  policies prove solvability); a live external env server plugs in unchanged.
- Honesty primitives: every row carries execution_class / evidence_class;
  infra failures are recorded as honest void rows (excluded from pass-rate),
  never as agent failures; an overclaim tripwire + reward-hack detector.

Quality / validation

Release-check 81/81 green (gate #81 bench_contract_readiness audits the full
suite: reference-pass, discrimination, determinism, held-out oracle, guards,
command-graded, pull, voice).
Full test suite green; ruff, mypy, and bandit clean on the bench code
(Bandit's subprocess-family findings are configured out with documented rationale —
this kit runs candidate code by design; isolation lives in the Docker lane).
No internal planning docs and no competitor names in the shipped tree.

🤖 Generated with Claude Code

Documentation & cookbooks (added)

The bench harness is now fully documented in the kit's gate-enforced docs system —
8 new pages, each with an executable twin under examples/, admitted by the
docs_executability release gate:

eval/benchmark-overview · eval/benchmark-coding · eval/benchmark-command-graded
· eval/benchmark-sandboxes · eval/benchmark-voice · eval/benchmark-pull-rl
· eval/benchmark-write-a-suite · prove/benchmark-in-ci

Plus cookbook-matrix entries, a regenerated byte-exact docs/llms.txt, and the six
previously-missing CLI rows in reference/cli.md (bench / persona / scenario
/ simulation / practice / runs). An adversarial review pass corrected two
accuracy defects before merge. Re-verified: release-check 81/81, ruff clean,
125 docs/bench tests pass.

Import namespace: `agent_learning` → `fi.alk`

The public SDK now imports as fi.alk, under the existing fi Future AGI
namespace package — alongside the vendored engine (fi.evals / fi.simulate /
fi.opt), so the whole kit composes under one fi namespace.

Unchanged public surface: distribution name agent-learning-kit, CLI
agent-learn, artifact-kind strings (agent-learning.*.v1), and the
AGENT_LEARNING_* environment variables. Only the Python import path moves
(import agent_learning → import fi.alk).

Mechanics: git mv src/agent_learning → src/fi/alk (history preserved);
443-file deterministic whole-word rewrite; release gates updated for the nested
layout (single src/fi wheel root, deduped scan-roots). Validation:
release-check 81/81, full suite 1139 tests green, ruff clean,
agent-learn doctor passes.

Snapshots the current release/v1-agent-learning-kit tree (the complete, gate-green v1 kit) onto this dev-based branch, in place. Over the prior snapshot this adds: - Dashboard run-telemetry (W&B/promptfoo-style) + the live framework-loop examples. - The UNIFIED BENCHMARK HARNESS (agent-learn bench / agent_learning.bench): one Task<->Verifier contract; push / artifact-in / pull control modes; a unified Result. Coding lane hardened (command/artifact-graded: held-out grader runs after the candidate, verdict = grader exit + reward file, forge + oracle-read resistant; multi-language; subprocess + opt-in network-isolated Docker). Pull/RL env registry. Voice transcript verifier. Gate #81 bench_contract_readiness. - PR-review hardening pass + coding-standards pass (ruff + mypy + bandit clean). Validation: release-check 81/81; full test suite green; ruff/mypy/bandit clean. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…k 81/81) Snapshots the bench documentation buildout onto PR #49: 8 pages + 8 fresh-lane example twins, cookbook matrix entries, regenerated docs/llms.txt, and the six missing CLI reference rows. release-check 81/81; ruff clean; 125 tests pass. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Prose-only fixes: benchmark-coding overclaim, benchmark-in-ci error-message fidelity, and the simulation CLI row (lift/validate/run). docs gate re-verified. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Public import is now fi.alk (under the fi namespace package), alongside the vendored engine (fi.evals/fi.simulate/fi.opt). Distribution name agent-learning-kit, CLI agent-learn, artifact kinds (agent-learning.*.v1) and AGENT_LEARNING_* env vars are unchanged; only the import path moves. git mv src/agent_learning -> src/fi/alk; 443-file deterministic rewrite; release gates updated for the nested layout. Validation: release-check 81/81; full suite 1139 tests green; ruff clean; agent-learn doctor passes. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

nik13 and others added 2 commits June 10, 2026 17:10

Prepare Agent Learning Kit v1 release

daea7ef

nik13 changed the title ~~Prepare Agent Learning Kit v1 release branch~~ Agent Learning Kit v1 + unified benchmark harness Jun 23, 2026

nik13 and others added 3 commits June 23, 2026 00:55

docs: fix accuracy faults in bench docs (adversarial sweep)

d267606

Prose-only fixes: benchmark-coding overclaim, benchmark-in-ci error-message fidelity, and the simulation CLI row (lift/validate/run). docs gate re-verified. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Agent Learning Kit v1 + unified benchmark harness#49

Agent Learning Kit v1 + unified benchmark harness#49
nik13 wants to merge 5 commits into
devfrom
release/v1-agent-learning-kit-dev

nik13 commented Jun 11, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

nik13 commented Jun 11, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!