
Enhance evaluation functionality with support for multiple runs and d… #4495

Draft

jayy-77 wants to merge 1 commit into google:main from jayy-77:consistency-improvement

Conversation

@jayy-77 commented on Feb 14, 2026

Please ensure you have read the contribution guide before creating a pull request.

Link to Issue or Description of Change

1. Link to an existing issue (if applicable):

2. Or, if no issue exists, describe the change:

If applicable, please follow the issue templates to provide as much detail as
possible.

Problem:
A clear and concise description of what the problem is.

Solution:
A clear and concise description of what you want to happen and why you chose
this solution.

Testing Plan

Please describe the tests that you ran to verify your changes. This is required
for all PRs that are not small documentation or typo fixes.

Unit Tests:

  • I have added or updated unit tests for my change.
  • All unit tests pass locally.

Please include a summary of passed pytest results.

Manual End-to-End (E2E) Tests:

Please provide instructions on how to manually test your changes, including any
necessary setup or configuration. Please provide logs or screenshots to help
reviewers better understand the fix.

Checklist

  • I have read the CONTRIBUTING.md document.
  • I have performed a self-review of my own code.
  • I have commented my code, particularly in hard-to-understand areas.
  • I have added tests that prove my fix is effective or that my feature works.
  • New and existing unit tests pass locally with my changes.
  • I have manually tested my changes end-to-end.
  • Any dependent changes have been merged and published in downstream modules.

Additional context

Add any other context or screenshots about the feature request here.

@gemini-code-assist
Contributor

Summary of Changes

Hello @jayy-77, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the evaluation framework by introducing the ability to run evaluation cases multiple times, providing more statistically sound results. It also improves the flexibility of evaluation setup by allowing dynamic discovery of configuration files alongside test data, reducing the need for explicit global configuration. Furthermore, the changes broaden the compatibility of agent loading, supporting both synchronous and asynchronous agent initialization patterns.

Highlights

  • Multiple Evaluation Runs: Introduced a new --num_runs CLI option, allowing users to specify how many times each evaluation case should be executed, enhancing the robustness of evaluation results.
  • Dynamic Evaluation Config Discovery: Implemented a mechanism to dynamically discover test_config.json files adjacent to evaluation set files, enabling more flexible and localized evaluation criteria without requiring a global config file.
  • Flexible Agent Loading: Enhanced the get_root_agent function to support both direct root_agent exports and asynchronous get_agent_async methods from agent modules, improving compatibility with different agent implementations.
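
For illustration, here is a minimal sketch of the flexible agent-loading pattern described above. The names get_root_agent, root_agent, and get_agent_async come from this PR; the import paths, module-loading approach, and error handling are assumptions rather than the actual implementation.

import asyncio
import importlib

from google.adk.agents import BaseAgent  # import path assumed, not taken from this PR


def get_root_agent(agent_module_name: str) -> BaseAgent:
  """Loads the root agent from a module that exports either root_agent or get_agent_async."""
  agent_module = importlib.import_module(agent_module_name)

  # Preferred path: the module exports a root_agent instance directly.
  if hasattr(agent_module, "root_agent"):
    return agent_module.root_agent

  # Fallback: the module exposes an async factory such as get_agent_async.
  if hasattr(agent_module, "get_agent_async"):
    return asyncio.run(agent_module.get_agent_async())

  raise ValueError(
      f"Module {agent_module_name!r} defines neither root_agent nor get_agent_async."
  )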


Changelog
  • src/google/adk/cli/cli_eval.py
    • Imported asyncio for asynchronous operations.
    • Updated agent import from Agent to BaseAgent.
    • Modified get_root_agent to support get_agent_async for asynchronous agent loading and changed its return type to BaseAgent.
    • Refactored evaluation metric loading to support global and per-eval-set configurations.
    • Implemented dynamic discovery of evaluation configurations for individual eval sets.
    • Adjusted the evaluation loop to accommodate multiple runs and apply specific evaluation metrics per eval set.
  • src/google/adk/cli/cli_tools_click.py
    • Added a --num_runs option to the cli_eval command, defaulting to 1.
    • Included num_runs as a parameter in the cli_eval function signature.
    • Imported discover_eval_config_for_test_file for dynamic configuration.
    • Revised the logic for loading evaluation configurations, supporting global and dynamically discovered per-eval-set metrics.
    • Integrated num_runs into the inference collection process, repeating inference requests as specified.
    • Updated the evaluation results collection to process results based on dynamically assigned evaluation metrics per eval set.
  • src/google/adk/evaluation/agent_evaluator.py
    • Imported discover_eval_config_for_test_file and EvalSetResultsManager.
    • Refactored find_config_for_test_file to utilize the new discover_eval_config_for_test_file utility.
    • Added an optional eval_set_results_manager parameter to evaluate_eval_set, evaluate, and _get_eval_results_by_eval_id functions to support persisting evaluation outputs.
  • src/google/adk/evaluation/eval_config.py
    • Added a new function discover_eval_config_for_test_file to locate test_config.json files in the same directory as a given test file.
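
A rough sketch of what the new discovery helper could look like; it assumes the helper returns a file path (or None), whereas the real discover_eval_config_for_test_file in eval_config.py may instead return a parsed config object, and the constant name below is hypothetical.

import os
from typing import Optional

_TEST_CONFIG_FILE_NAME = "test_config.json"  # hypothetical constant name


def discover_eval_config_for_test_file(test_file_path: str) -> Optional[str]:
  """Returns the path to a test_config.json next to the given test file, or None if absent."""
  config_path = os.path.join(
      os.path.dirname(test_file_path), _TEST_CONFIG_FILE_NAME
  )
  return config_path if os.path.exists(config_path) else None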

@adk-bot added the "eval [Component] This issue is related to evaluation" label on Feb 14, 2026
@adk-bot (Collaborator) commented on Feb 14, 2026

Response from ADK Triaging Agent

Hello @jayy-77, thank you for your contribution!

To help us review this pull request, could you please fill out the template in the PR description? Specifically, we need:

  • A link to an existing issue or a description of the change.
  • A testing plan that describes how you've verified your changes.

This information is important for our review process as outlined in our contribution guidelines. Thanks!

@gemini-code-assist bot (Contributor) left a comment

Code Review

This pull request enhances the evaluation functionality by adding support for multiple runs and dynamic configuration discovery. The changes are well-structured, but I've identified a critical bug in src/google/adk/cli/cli_tools_click.py that could cause an UnboundLocalError. I have provided a suggestion to fix this. Additionally, I've included a suggestion to optimize a loop for better performance when handling a large number of evaluation sets. The other changes, such as refactoring and adding new parameters, look good.

Comment on lines +820 to +825
eval_metrics_by_eval_set_id = {}
global_eval_metrics = None
if config_file_path:
  eval_config = get_evaluation_criteria_or_default(config_file_path)
  print(f"Using evaluation criteria: {eval_config}")
  global_eval_metrics = get_eval_metrics_from_config(eval_config)
Severity: critical

There's a potential UnboundLocalError for eval_config. It is used on line 922, but it's only defined within this if config_file_path: block, or later when handling eval set IDs. If config_file_path is not provided and the code proceeds to handle eval set file paths, eval_config will not be defined when it's needed for UserSimulatorProvider.

To fix this, you should initialize eval_config unconditionally at the beginning of the function. Applying this fix will also allow you to simplify the code in a couple of other places:

  1. On lines 869-873, you can reuse the eval_config variable instead of calling get_evaluation_criteria_or_default again.
  2. On lines 900-903, you can remove the redundant call to get_evaluation_criteria_or_default.
Suggested change

-eval_metrics_by_eval_set_id = {}
-global_eval_metrics = None
-if config_file_path:
-  eval_config = get_evaluation_criteria_or_default(config_file_path)
-  print(f"Using evaluation criteria: {eval_config}")
-  global_eval_metrics = get_eval_metrics_from_config(eval_config)
+eval_config = get_evaluation_criteria_or_default(config_file_path)
+eval_metrics_by_eval_set_id = {}
+global_eval_metrics = None
+if config_file_path:
+  print(f"Using evaluation criteria: {eval_config}")
+  global_eval_metrics = get_eval_metrics_from_config(eval_config)
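
For readers unfamiliar with the failure mode, here is a minimal standalone reproduction of this class of bug (hypothetical code, not the ADK source):

def load_config(config_file_path):
  if config_file_path:
    eval_config = {"criteria": "default"}
  # Later code assumes eval_config is always bound.
  return eval_config


load_config("test_config.json")  # works: eval_config was assigned

try:
  load_config(None)
except UnboundLocalError as e:
  # Raised because eval_config is a local that was never assigned on this path.
  print(f"Reproduced: {e}")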

Comment on lines +959 to +976
eval_results = []
for eval_set_id, eval_metrics in eval_metrics_by_eval_set_id.items():
  inference_results_for_eval_set = [
      inference_result
      for inference_result in inference_results
      if inference_result.eval_set_id == eval_set_id
  ]
  if not inference_results_for_eval_set:
    continue
  eval_results.extend(
      asyncio.run(
          _collect_eval_results(
              inference_results=inference_results_for_eval_set,
              eval_service=eval_service,
              eval_metrics=eval_metrics,
          )
      )
  )
Severity: medium

The current implementation filters inference_results inside the loop for each eval_set_id. This can be inefficient if you have a large number of eval sets and inference results, as it iterates over all results for each set (O(num_eval_sets * num_inference_results)).

You can improve performance by grouping the inference results by eval_set_id once before the loop.

Suggested change

-eval_results = []
-for eval_set_id, eval_metrics in eval_metrics_by_eval_set_id.items():
-  inference_results_for_eval_set = [
-      inference_result
-      for inference_result in inference_results
-      if inference_result.eval_set_id == eval_set_id
-  ]
-  if not inference_results_for_eval_set:
-    continue
-  eval_results.extend(
-      asyncio.run(
-          _collect_eval_results(
-              inference_results=inference_results_for_eval_set,
-              eval_service=eval_service,
-              eval_metrics=eval_metrics,
-          )
-      )
-  )
+inference_results_by_eval_set_id = {}
+for res in inference_results:
+  inference_results_by_eval_set_id.setdefault(res.eval_set_id, []).append(res)
+eval_results = []
+for eval_set_id, eval_metrics in eval_metrics_by_eval_set_id.items():
+  inference_results_for_eval_set = inference_results_by_eval_set_id.get(eval_set_id)
+  if not inference_results_for_eval_set:
+    continue
+  eval_results.extend(
+      asyncio.run(
+          _collect_eval_results(
+              inference_results=inference_results_for_eval_set,
+              eval_service=eval_service,
+              eval_metrics=eval_metrics,
+          )
+      )
+  )
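
As an aside on the grouping step, collections.defaultdict is an equivalent alternative to the setdefault pattern in the suggestion above. A small self-contained illustration, using a hypothetical stand-in for the ADK inference result type:

from collections import defaultdict
from dataclasses import dataclass


@dataclass
class InferenceResult:  # hypothetical stand-in, not the ADK class
  eval_set_id: str
  payload: str


inference_results = [
    InferenceResult("set_a", "r1"),
    InferenceResult("set_b", "r2"),
    InferenceResult("set_a", "r3"),
]

# Group once in O(n); lookups inside the per-eval-set loop are then O(1).
inference_results_by_eval_set_id = defaultdict(list)
for res in inference_results:
  inference_results_by_eval_set_id[res.eval_set_id].append(res)

print(inference_results_by_eval_set_id["set_a"])  # the two set_a results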


Labels

eval [Component] This issue is related to evaluation

Development

Successfully merging this pull request may close these issues.

Consistency improvements between adk eval and AgentEvaluator

2 participants