vercel-labs / agent-eval Public

Fork 25
Star 204

Code
Issues 6
Pull requests 17
Actions
Projects
Security and quality
Insights

Pull requests: vercel-labs/agent-eval

Labels 10 Milestones 0

17 Open 130 Closed

Pull requests list

[judge] Agentic LLM judge: judge codebases and transcripts

#161 opened Jun 26, 2026 by gaojude Collaborator

[agents] Plugin model: per-agent definition + in-sandbox runner

#160 opened Jun 26, 2026 by gaojude Collaborator

fix(agent-eval): delegate demultiplexing to docker-modem's docker demuxStream()

#158 opened Jun 26, 2026 by huang-julien

fix: reuse cached results for single-experiment runs

#153 opened Jun 11, 2026 by benjamincanac

7 tasks done

[Agents] Add Mistral Vibe CLI agent with direct API and streaming-json transcript support

#137 opened May 21, 2026 by joeyspagnoli

feat: add max turns run budget

#135 opened May 15, 2026 by EfeDurmaz16

fix(results): discover nested eval result dirs

#133 opened May 15, 2026 by EfeDurmaz16

feat: add experiment trend analyzer

#132 opened May 14, 2026 by EfeDurmaz16

fix: discover nested eval result directories

#131 opened May 14, 2026 by EfeDurmaz16

feat: improve names of Docker sandbox containers

#128 opened May 8, 2026 by jonahsnider

fix: clean up Docker sandbox containers on exit

#127 opened May 8, 2026 by jonahsnider

Skip npm install when package.json is absent

#125 opened May 6, 2026 by allenzhou101 Contributor

[Kimi] Add Kimi CLI agent harness

#117 opened Apr 20, 2026 by gaojude Collaborator • Draft

2 of 4 tasks

Skip missing validation scripts

#92 opened Mar 17, 2026 by gaojude Collaborator

[wip] add bub agent support

#91 opened Mar 7, 2026 by CorrectRoadH • Draft

Add timings for phases

#88 opened Feb 25, 2026 by jeffsee55

Add ability to choose which eval --smoke runs

#84 opened Feb 20, 2026 by jeffsee55
