agent-evaluation

Here are 13 public repositories matching this topic...

Giskard-AI / giskard

🐢 Open-Source Evaluation & Testing for AI & LLM systems

ai-security mlops fairness-ai responsible-ai ml-validation red-team-tools trustworthy-ai ml-testing llm ai-red-team ai-testing llmops llm-security llm-eval llm-evaluation rag-evaluation agent-evaluation

Updated Aug 1, 2025
Python

coze-dev / coze-loop

Next-generation AI Agent Optimization Platform: Cozeloop addresses challenges in AI agent development by providing full-lifecycle management capabilities from development, debugging, and evaluation to monitoring.

agent open-source playground ai monitoring evaluation openai observability agentops coze langchain llmops prompt-management llm-observability agent-evaluation eino agent-observability

Updated Aug 15, 2025
Go

truera / trulens

Evaluation and Tracking for LLM Experiments and AI Agents

machine-learning neural-networks ai-agents explainable-ml agentops ai-monitoring ai-observability llms llmops llm-eval evals llm-evaluation agent-evaluation

Updated Aug 15, 2025
Python

mozilla-ai / any-agent

A single interface to use and evaluate different agent frameworks

ai mcp agents a2a agent-evaluation

Updated Aug 15, 2025
Python

rungalileo / agent-leaderboard

Ranking LLMs on agentic tasks

ai evaluation ai-agents synthetic-data ai-evaluation llms ai-benchmark agent-evaluation

Updated Aug 13, 2025
Jupyter Notebook

SparkBeyond / agentune

Tune your AI agent to best meet its KPI through a cycle of analysis, improvement, and simulation

customer-support customer-service conversational-agents ai-agents chatbot-evaluation agent-simulator kpi-analysis agent-evaluation agent-optimization sales-agents customer-facing-agents kpi-optimization

Updated Jul 25, 2025
Python

Cre4T3Tiv3 / ai-agents-reality-check

Mathematical benchmark exposing the massive performance gap between real agents and LLM wrappers. Rigorous multi-dimensional evaluation: stress testing, network resilience, ensemble coordination, failure analysis. Features statistical validation and reproducible methodology for separating architectural theater from real systems.

python open-source benchmarking reproducible-research statistical-analysis performance-testing network-resilience llm-agent llm-tools agent-architecture agentic-workflow agentic-ai agent-performance agent-evaluation ai-benchmarking agent-benchmark reality-check-ai-agent architectural-evaluation ensemble-coordination

Updated Aug 8, 2025
Python

shiragannavar / Testing-RAG

evaluation ground-truth llm generative-ai agent-evaluation

Updated May 12, 2025
Python

chaosync-org / awesome-ai-agent-testing

🤖 A curated list of resources for testing AI agents - frameworks, methodologies, benchmarks, tools, and best practices for ensuring reliable, safe, and effective autonomous AI systems

testing qa benchmark machine-learning evaluation chaos artificial-intelligence chaos-monkey testing-tools awesome-list quality-assurance ai-safety ai-agents chaos-engineering llm llm-evaluation agentic-ai ai-benchmark agent-evaluation

Updated May 28, 2025

lml2468 / ContextOptimizer

Intelligent Context Engineering Assistant for Multi-Agent Systems. Analyze, optimize, and enhance your AI agent configurations with AI-powered insights

multi-agent-systems prompt-engineering agent-evaluation context-engineering agent-optimizer

Updated Jul 5, 2025
Python

smuddana-7 / Cart-Pole-Gymnasium-Environment

Train a reinforcement learning agent using PPO to balance a pole on a cart in the CartPole-v0 environment using Gymnasium and Stable-Baselines3. Includes model training, evaluation, and rendering using Python and Jupyter Notebook.

reinforcement-learning python3 agent-evaluation openai-gymnasium cartpole-v0-environment

Updated Jun 2, 2025
Jupyter Notebook

ahsanblock / NVIDIA-AgentIQ-Agents-Evaluator

Visual dashboard to evaluate multi-agent & RAG-based AI apps. Compare models on accuracy, latency, token usage, and trust metrics - powered by NVIDIA AgentIQ

nvidia multi-agent-systems model-comparison production-ai rag streamlit trustworthy-ai llmops genai enterprise-ai llm-evaluation open-source-ai agent-evaluation agentiq pipeline-evaluation

Updated Apr 10, 2025
Python

Sai-Santhan-Dodda / ai-navigation-automation

Browser automation agent for Bunnings website using the browser-use library, orchestrated via the laminar framework, managed with uv for Python environments, and running in Brave Browser for stealth and CAPTCHA bypass.

python automation gemini openai brave browser-automation laminar uv llms ollama stealth-browsing browser-use agent-evaluation

Updated Aug 10, 2025
Python
