Agent Learning Kit v1 + unified benchmark harness #49
Open
nik13 wants to merge 5 commits into
Snapshots the current release/v1-agent-learning-kit tree (the complete, gate-green v1 kit) onto this dev-based branch, in place. Over the prior snapshot this adds:
- Dashboard run-telemetry (W&B/promptfoo-style) + the live framework-loop examples.
- The UNIFIED BENCHMARK HARNESS (agent-learn bench / agent_learning.bench): one Task<->Verifier contract; push / artifact-in / pull control modes; a unified Result. Coding lane hardened (command/artifact-graded: held-out grader runs after the candidate, verdict = grader exit + reward file, forge + oracle-read resistant; multi-language; subprocess + opt-in network-isolated Docker). Pull/RL env registry. Voice transcript verifier. Gate #81 bench_contract_readiness.
- PR-review hardening pass + coding-standards pass (ruff + mypy + bandit clean).
Validation: release-check 81/81; full test suite green; ruff/mypy/bandit clean.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…k 81/81) Snapshots the bench documentation buildout onto PR #49: 8 pages + 8 fresh-lane example twins, cookbook matrix entries, regenerated docs/llms.txt, and the six missing CLI reference rows. release-check 81/81; ruff clean; 125 tests pass. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Prose-only fixes: benchmark-coding overclaim, benchmark-in-ci error-message fidelity, and the simulation CLI row (lift/validate/run). docs gate re-verified. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Public import is now fi.alk (under the fi namespace package), alongside the vendored engine (fi.evals/fi.simulate/fi.opt). Distribution name agent-learning-kit, CLI agent-learn, artifact kinds (agent-learning.*.v1) and AGENT_LEARNING_* env vars are unchanged; only the import path moves. git mv src/agent_learning -> src/fi/alk; 443-file deterministic rewrite; release gates updated for the nested layout. Validation: release-check 81/81; full suite 1139 tests green; ruff clean; agent-learn doctor passes. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Agent Learning Kit v1 + unified benchmark harness
This PR brings the complete, gate-green Agent Learning Kit v1 to dev, plus a new unified benchmark harness. The package is shippable as agent-learning-kit (import agent_learning, CLI agent-learn), with the engine vendored under fi.
What's in it
framework loops (LiveKit / Pipecat / LangChain / MCP / A2A) — all credential-free
in the release gates, live only behind opt-in env flags.
dashboard, otherwise a local log; no hosted-service requirement in any gate.
agent-learn bench / agent_learning.bench): one Task <-> Verifier contract and one unified Result across modalities.
push (harness drives the agent over a task dataset), artifact_in (submit-and-score against a held-out oracle), and pull (the agent drives a live env via reset/step, the Gym/OpenEnv shape).
files/output; a held-out grader runs after the candidate is killed, and the
verdict is the grader's exit code + a grader-owned reward file (never candidate
stdout). Structurally resists output-forging and oracle-read. Multi-language.
Two sandboxes: credential-free subprocess (the gate tier) and an opt-in, network-isolated, capability-dropped, non-root, read-only Docker lane.
policies prove solvability); a live external env server plugs in unchanged.
execution_class/evidence_class; infra failures are recorded as honest void rows (excluded from pass-rate), never as agent failures; an overclaim tripwire + reward-hack detector.
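The void-row rule above can be illustrated with a minimal sketch. The Row type, its field names, and pass_rate are hypothetical names chosen for this example, not the kit's actual Result schema; the only behaviour taken from the description is that infra failures become void rows and are excluded from the pass-rate denominator.

```python
from dataclasses import dataclass
from typing import Literal, Optional

# Hypothetical row shape for illustration only; the real unified Result may differ.
Status = Literal["pass", "fail", "void"]


@dataclass(frozen=True)
class Row:
    task_id: str
    status: Status
    reason: str = ""


def pass_rate(rows: list[Row]) -> Optional[float]:
    """Pass rate over scored rows only; void rows never count against the agent."""
    scored = [r for r in rows if r.status != "void"]
    if not scored:
        return None  # nothing was actually measured, so report no rate at all
    return sum(r.status == "pass" for r in scored) / len(scored)


rows = [
    Row("t1", "pass"),
    Row("t2", "fail", "wrong answer"),
    Row("t3", "void", "sandbox failed to start"),  # infra failure, not an agent failure
    Row("t4", "pass"),
]
rate = pass_rate(rows)
void_count = sum(r.status == "void" for r in rows)
print(f"pass rate {rate:.3f} over {len(rows) - void_count} scored rows, {void_count} void")
```

Returning None when every row is void keeps an all-infra-failure run from being reported as either 0% or 100%.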
Quality / validation
bench_contract_readiness audits the full suite: reference-pass, discrimination, determinism, held-out oracle, guards,
command-graded, pull, voice).
ruff, mypy, and bandit clean on the bench code (Bandit's subprocess-family findings are configured out with documented rationale: this kit runs candidate code by design; isolation lives in the Docker lane).
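Because the kit runs candidate code through subprocesses by design, the command-graded verdict described earlier (grader exit code plus a grader-owned reward file, never candidate stdout) can be sketched roughly as follows. This is an assumption-laden illustration, not the kit's implementation: run_command_graded, the reward.txt file name, and the answer.txt task are all hypothetical, and a plain subprocess like this provides no isolation.

```python
import subprocess
import sys
import tempfile
from pathlib import Path

# Hypothetical held-out grader: checks the candidate's files, writes its own reward file.
GRADER = """
import sys
from pathlib import Path
work, reward_file = Path(sys.argv[1]), Path(sys.argv[2])
answer = work / "answer.txt"
ok = answer.exists() and answer.read_text().strip() == "42"
reward_file.write_text("1.0" if ok else "0.0")
sys.exit(0 if ok else 1)
"""


def run_command_graded(candidate_src: str, grader_src: str, timeout: float = 10.0) -> dict:
    with tempfile.TemporaryDirectory() as work, tempfile.TemporaryDirectory() as grader_dir:
        try:
            # Candidate stdout is captured but never consulted for the verdict.
            subprocess.run([sys.executable, "-c", candidate_src], cwd=work,
                           capture_output=True, timeout=timeout, check=False)
        except subprocess.TimeoutExpired:
            pass  # subprocess.run kills the child when the timeout expires
        # The grader starts only after the candidate process has exited.
        reward_file = Path(grader_dir) / "reward.txt"
        grader = subprocess.run([sys.executable, "-c", grader_src, work, str(reward_file)],
                                capture_output=True, timeout=timeout, check=False)
        reward = float(reward_file.read_text()) if reward_file.exists() else 0.0
        return {"passed": grader.returncode == 0 and reward > 0.0, "reward": reward}


honest = run_command_graded(
    'from pathlib import Path; Path("answer.txt").write_text("42")', GRADER)
forged = run_command_graded('print("PASS reward=1.0")', GRADER)  # forges stdout only
print(honest, forged)
```

The forging candidate prints a success message but produces no artifact, so the grader exits non-zero and writes a zero reward.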
🤖 Generated with Claude Code
Documentation & cookbooks (added)
The bench harness is now fully documented in the kit's gate-enforced docs system: 8 new pages, each with an executable twin under examples/, admitted by the docs_executability release gate:
eval/benchmark-overview · eval/benchmark-coding · eval/benchmark-command-graded · eval/benchmark-sandboxes · eval/benchmark-voice · eval/benchmark-pull-rl · eval/benchmark-write-a-suite · prove/benchmark-in-ci
Plus cookbook-matrix entries, a regenerated byte-exact docs/llms.txt, and the six previously-missing CLI rows in reference/cli.md (bench / persona / scenario / simulation / practice / runs). An adversarial review pass corrected two accuracy defects before merge. Re-verified: release-check 81/81, ruff clean, 125 docs/bench tests pass.
Import namespace: agent_learning → fi.alk
The public SDK now imports as fi.alk, under the existing fi Future AGI namespace package, alongside the vendored engine (fi.evals / fi.simulate / fi.opt), so the whole kit composes under one fi namespace.
Unchanged public surface: distribution name agent-learning-kit, CLI agent-learn, artifact-kind strings (agent-learning.*.v1), and the AGENT_LEARNING_* environment variables. Only the Python import path moves (import agent_learning → import fi.alk).
Mechanics: git mv src/agent_learning → src/fi/alk (history preserved); 443-file deterministic whole-word rewrite; release gates updated for the nested layout (single src/fi wheel root, deduped scan-roots). Validation: release-check 81/81, full suite 1139 tests green, ruff clean, agent-learn doctor passes.
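To show why a whole-word rewrite leaves the unchanged public surface alone, here is a minimal sketch of the idea. It is not the script used in this PR: rewrite_imports and the sample lines are made up for illustration, and the specific artifact-kind string and environment variable name in the sample are hypothetical instances of the agent-learning.*.v1 and AGENT_LEARNING_* patterns.

```python
import re

# Whole-word, case-sensitive match: hyphenated names (agent-learning-kit),
# upper-case env vars (AGENT_LEARNING_*), and longer identifiers are untouched.
_OLD_IMPORT = re.compile(r"\bagent_learning\b")


def rewrite_imports(source: str) -> str:
    """Rewrite the old import path to the new one; same input always gives same output."""
    return _OLD_IMPORT.sub("fi.alk", source)


before = "\n".join([
    "import agent_learning",
    "from agent_learning import bench",
    'KIND = "agent-learning.result.v1"',              # hypothetical artifact kind
    'HOME = os.environ.get("AGENT_LEARNING_HOME")',   # hypothetical env var name
    "my_agent_learning_helper = 1",                   # not a whole-word match
])
after = rewrite_imports(before)
print(after)
```

Only the first two lines change; the artifact-kind string, the environment variable, and the longer identifier pass through byte-for-byte.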