Outcome-driven agent development framework that evolves
UQLM (Uncertainty Quantification for Language Models) is a Python package for UQ-based LLM hallucination detection.
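As an illustration of the general idea behind UQ-based hallucination detection (this is not the UQLM API; the sampled responses and threshold below are made up), one common approach scores a response by how consistently repeated samples for the same prompt agree with each other:

```python
# Minimal sketch of consistency-based uncertainty scoring for hallucination
# detection. Illustrates the general idea only; it is NOT the UQLM API.
# `sample_responses` stands in for repeated LLM calls (hypothetical data).

from itertools import combinations

def jaccard(a: str, b: str) -> float:
    """Token-level Jaccard similarity between two responses."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 1.0

def consistency_score(responses: list[str]) -> float:
    """Mean pairwise similarity; low values suggest the model is guessing."""
    pairs = list(combinations(responses, 2))
    if not pairs:
        return 1.0
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)

# Hypothetical samples for the same prompt at nonzero temperature.
sample_responses = [
    "The Eiffel Tower was completed in 1889.",
    "It was finished in 1889 for the World's Fair.",
    "Construction ended in 1887.",
]

score = consistency_score(sample_responses)
print(f"consistency = {score:.2f}")
if score < 0.5:  # threshold is an arbitrary example value
    print("Low agreement across samples: possible hallucination.")
```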
Hallucinations (Confabulations) Document-Based Benchmark for RAG. Includes human-verified questions and answers.
Build and benchmark deep research
Ranking LLMs on agentic tasks
Vivaria is METR's tool for running evaluations and conducting agent elicitation research.
One click to open multiple AI sites and view AI results
Code scanner to check for issues in prompts and LLM calls
☑️ A curated list of tools, methods & platforms for evaluating AI reliability in real applications.
Benchmark evaluating LLMs on their ability to create and resist disinformation. Includes comprehensive testing across major models (Claude, GPT-4, Gemini, Llama, etc.) with standardized evaluation metrics.
⚡️ The "1-Minute RAG Audit" — Generate QA datasets & evaluate RAG systems in Colab, Jupyter, or CLI. Privacy-first, async, visual reports.
Example projects integrated with the Future AGI tech stack for easy AI development
FinMTM: A Multi-Turn Multimodal Benchmark for Financial Reasoning and Agent Evaluation
Running UK AISI's Inspect in the Cloud
AI Red Team Operations Console
Comprehensive AI model evaluation framework with advanced techniques, including temperature-controlled verdict aggregation via a generalized power mean. Supports multiple LLM providers and 15+ evaluation metrics for RAG systems and AI agents.
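The description suggests verdicts from multiple LLM judges are combined with a generalized power mean whose exponent acts like a temperature. A minimal sketch of that aggregation follows; the function names and the choice of exponents are assumptions, not the framework's API:

```python
# Minimal sketch of verdict aggregation via a generalized power mean.
# The exponent p plays the role of a "temperature": large negative p
# approaches min (strict: any low verdict dominates), p = 1 is the
# arithmetic mean, large positive p approaches max (lenient).
# Assumes verdict scores are strictly positive and in (0, 1].

import math

def power_mean(scores: list[float], p: float) -> float:
    """Generalized power mean M_p(x) = ((1/n) * sum(x_i^p))^(1/p)."""
    if not scores:
        raise ValueError("need at least one verdict score")
    if p == 0:  # limit case: geometric mean
        return math.exp(sum(math.log(s) for s in scores) / len(scores))
    return (sum(s ** p for s in scores) / len(scores)) ** (1.0 / p)

# Verdicts from three hypothetical LLM judges.
verdicts = [0.9, 0.7, 0.2]

for p in (-10, 0, 1, 10):
    print(f"p={p:>3}: aggregated verdict = {power_mean(verdicts, p):.3f}")
```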
A comprehensive, implementation-focused guide to evaluating Large Language Models, RAG systems, and Agentic AI in production environments.
Cost-of-Pass: An Economic Framework for Evaluating Language Models
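Cost-of-pass is, roughly, the expected inference cost of obtaining one correct solution, i.e. cost per attempt divided by pass rate. A minimal worked sketch under that reading; the model names, prices, and pass rates are illustrative, not figures from the paper:

```python
# Minimal sketch of a cost-of-pass style metric: expected inference cost
# to obtain one correct solution (cost per attempt / pass rate).
# All numbers below are made-up illustrative values.

def cost_of_pass(cost_per_attempt_usd: float, pass_rate: float) -> float:
    """Expected dollars spent per correct answer (infinite if pass_rate == 0)."""
    if pass_rate <= 0:
        return float("inf")
    return cost_per_attempt_usd / pass_rate

models = {
    "small-cheap-model": (0.002, 0.40),     # (cost per attempt, pass rate)
    "large-expensive-model": (0.030, 0.85),
}

for name, (cost, rate) in models.items():
    print(f"{name}: cost-of-pass = ${cost_of_pass(cost, rate):.4f}")
# A cheaper model can win on cost-of-pass despite a lower pass rate;
# the metric trades accuracy against price directly.
```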
🛡️ Safe AI Agents through Action Classifier
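The idea of gating agent actions through a classifier before execution can be sketched very simply; the action schema, rules, and blocked patterns below are illustrative assumptions, not the repository's actual policy:

```python
# Minimal sketch of an action classifier that allows or denies proposed
# agent actions before they are executed. Rules here are examples only.

import re
from dataclasses import dataclass

@dataclass
class Action:
    tool: str       # e.g. "shell", "browser", "email"
    argument: str   # raw argument the agent wants to pass

BLOCKED_PATTERNS = {
    "shell": [r"rm\s+-rf", r"curl .*\|\s*sh"],   # destructive / pipe-to-shell
    "email": [r"@external\.example$"],           # untrusted recipients
}

def classify(action: Action) -> str:
    """Return 'allow' or 'deny' for a proposed agent action."""
    for pattern in BLOCKED_PATTERNS.get(action.tool, []):
        if re.search(pattern, action.argument):
            return "deny"
    return "allow"

proposed = Action(tool="shell", argument="rm -rf /tmp/workdir")
print(f"{proposed.tool}: {proposed.argument!r} -> {classify(proposed)}")
```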