Outcome-driven agent development framework that evolves
UQLM (Uncertainty Quantification for Language Models) is a Python package for UQ-based LLM hallucination detection.
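As an illustration of the general idea behind UQ-based hallucination detection (this is not the UQLM API; the sampled responses and threshold below are made up), one common approach scores a response by how consistently repeated samples for the same prompt agree with each other:

```python
# Minimal sketch of consistency-based uncertainty scoring for hallucination
# detection. Illustrates the general idea only; it is NOT the UQLM API.
# `sample_responses` stands in for repeated LLM calls (hypothetical data).

from itertools import combinations

def jaccard(a: str, b: str) -> float:
    """Token-level Jaccard similarity between two responses."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 1.0

def consistency_score(responses: list[str]) -> float:
    """Mean pairwise similarity; low values suggest the model is guessing."""
    pairs = list(combinations(responses, 2))
    if not pairs:
        return 1.0
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)

# Hypothetical samples for the same prompt at nonzero temperature.
sample_responses = [
    "The Eiffel Tower was completed in 1889.",
    "It was finished in 1889 for the World's Fair.",
    "Construction ended in 1887.",
]

score = consistency_score(sample_responses)
print(f"consistency = {score:.2f}")
if score < 0.5:  # threshold is an arbitrary example value
    print("Low agreement across samples: possible hallucination.")
```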
Hallucinations (Confabulations) Document-Based Benchmark for RAG. Includes human-verified questions and answers.
Build and benchmark deep research
Ranking LLMs on agentic tasks
Vivaria is METR's tool for running evaluations and conducting agent elicitation research.
One click to open multiple AI sites and view AI results
Code scanner to check for issues in prompts and LLM calls
☑️ A curated list of tools, methods & platforms for evaluating AI reliability in real applications.
Benchmark evaluating LLMs on their ability to create and resist disinformation. Includes comprehensive testing across major models (Claude, GPT-4, Gemini, Llama, etc.) with standardized evaluation metrics.
⚡️ The "1-Minute RAG Audit" — Generate QA datasets & evaluate RAG systems in Colab, Jupyter, or CLI. Privacy-first, async, visual reports.
Example projects integrated with the Future AGI tech stack for easy AI development
FinMTM: A Multi-Turn Multimodal Benchmark for Financial Reasoning and Agent Evaluation
Running UK AISI's Inspect in the Cloud
AI Red Team Operations Console
Comprehensive AI model evaluation framework with advanced techniques, including temperature-controlled verdict aggregation via a generalized power mean. Supports multiple LLM providers and 15+ evaluation metrics for RAG systems and AI agents.
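The description suggests verdicts from multiple LLM judges are combined with a generalized power mean whose exponent acts like a temperature. A minimal sketch of that aggregation follows; the function names and the choice of exponents are assumptions, not the framework's API:

```python
# Minimal sketch of verdict aggregation via a generalized power mean.
# The exponent p plays the role of a "temperature": large negative p
# approaches min (strict: any low verdict dominates), p = 1 is the
# arithmetic mean, large positive p approaches max (lenient).
# Assumes verdict scores are strictly positive and in (0, 1].

import math

def power_mean(scores: list[float], p: float) -> float:
    """Generalized power mean M_p(x) = ((1/n) * sum(x_i^p))^(1/p)."""
    if not scores:
        raise ValueError("need at least one verdict score")
    if p == 0:  # limit case: geometric mean
        return math.exp(sum(math.log(s) for s in scores) / len(scores))
    return (sum(s ** p for s in scores) / len(scores)) ** (1.0 / p)

# Verdicts from three hypothetical LLM judges.
verdicts = [0.9, 0.7, 0.2]

for p in (-10, 0, 1, 10):
    print(f"p={p:>3}: aggregated verdict = {power_mean(verdicts, p):.3f}")
```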
A comprehensive, implementation-focused guide to evaluating Large Language Models, RAG systems, and Agentic AI in production environments.
Cost-of-Pass: An Economic Framework for Evaluating Language Models
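Cost-of-pass is, roughly, the expected inference cost of obtaining one correct solution, i.e. cost per attempt divided by pass rate. A minimal worked sketch under that reading; the model names, prices, and pass rates are illustrative, not figures from the paper:

```python
# Minimal sketch of a cost-of-pass style metric: expected inference cost
# to obtain one correct solution (cost per attempt / pass rate).
# All numbers below are made-up illustrative values.

def cost_of_pass(cost_per_attempt_usd: float, pass_rate: float) -> float:
    """Expected dollars spent per correct answer (infinite if pass_rate == 0)."""
    if pass_rate <= 0:
        return float("inf")
    return cost_per_attempt_usd / pass_rate

models = {
    "small-cheap-model": (0.002, 0.40),     # (cost per attempt, pass rate)
    "large-expensive-model": (0.030, 0.85),
}

for name, (cost, rate) in models.items():
    print(f"{name}: cost-of-pass = ${cost_of_pass(cost, rate):.4f}")
# A cheaper model can win on cost-of-pass despite a lower pass rate;
# the metric trades accuracy against price directly.
```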
🛡️ Safe AI Agents through Action Classifier
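The idea of gating agent actions through a classifier before execution can be sketched very simply; the action schema, rules, and blocked patterns below are illustrative assumptions, not the repository's actual policy:

```python
# Minimal sketch of an action classifier that allows or denies proposed
# agent actions before they are executed. Rules here are examples only.

import re
from dataclasses import dataclass

@dataclass
class Action:
    tool: str       # e.g. "shell", "browser", "email"
    argument: str   # raw argument the agent wants to pass

BLOCKED_PATTERNS = {
    "shell": [r"rm\s+-rf", r"curl .*\|\s*sh"],   # destructive / pipe-to-shell
    "email": [r"@external\.example$"],           # untrusted recipients
}

def classify(action: Action) -> str:
    """Return 'allow' or 'deny' for a proposed agent action."""
    for pattern in BLOCKED_PATTERNS.get(action.tool, []):
        if re.search(pattern, action.argument):
            return "deny"
    return "allow"

proposed = Action(tool="shell", argument="rm -rf /tmp/workdir")
print(f"{proposed.tool}: {proposed.argument!r} -> {classify(proposed)}")
```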