
Enhance evaluation functionality with support for multiple runs and d… #4495

Draft

jayy-77 wants to merge 1 commit into google:main from jayy-77:consistency-improvement

Conversation

@jayy-77 commented on Feb 14, 2026

Please ensure you have read the contribution guide before creating a pull request.

Link to Issue or Description of Change

1. Link to an existing issue (if applicable):

2. Or, if no issue exists, describe the change:

If applicable, please follow the issue templates to provide as much detail as
possible.

Problem:
A clear and concise description of what the problem is.

Solution:
A clear and concise description of what you want to happen and why you chose
this solution.

Testing Plan

Please describe the tests that you ran to verify your changes. This is required
for all PRs that are not small documentation or typo fixes.

Unit Tests:

  • I have added or updated unit tests for my change.
  • All unit tests pass locally.

Please include a summary of passed pytest results.

Manual End-to-End (E2E) Tests:

Please provide instructions on how to manually test your changes, including any
necessary setup or configuration. Please provide logs or screenshots to help
reviewers better understand the fix.

Checklist

  • I have read the CONTRIBUTING.md document.
  • I have performed a self-review of my own code.
  • I have commented my code, particularly in hard-to-understand areas.
  • I have added tests that prove my fix is effective or that my feature works.
  • New and existing unit tests pass locally with my changes.
  • I have manually tested my changes end-to-end.
  • Any dependent changes have been merged and published in downstream modules.

Additional context

Add any other context or screenshots about the feature request here.

@gemini-code-assist
Contributor

Summary of Changes

Hello @jayy-77, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the evaluation framework by introducing the ability to run evaluation cases multiple times, providing more statistically sound results. It also improves the flexibility of evaluation setup by allowing dynamic discovery of configuration files alongside test data, reducing the need for explicit global configuration. Furthermore, the changes broaden the compatibility of agent loading, supporting both synchronous and asynchronous agent initialization patterns.

Highlights

  • Multiple Evaluation Runs: Introduced a new --num_runs CLI option, allowing users to specify how many times each evaluation case should be executed, enhancing the robustness of evaluation results.
  • Dynamic Evaluation Config Discovery: Implemented a mechanism to dynamically discover test_config.json files adjacent to evaluation set files, enabling more flexible and localized evaluation criteria without requiring a global config file.
  • Flexible Agent Loading: Enhanced the get_root_agent function to support both direct root_agent exports and asynchronous get_agent_async methods from agent modules, improving compatibility with different agent implementations.
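
For illustration, here is a minimal sketch of the flexible agent-loading pattern described above. The names get_root_agent, root_agent, and get_agent_async come from this PR; the import paths, module-loading approach, and error handling are assumptions rather than the actual implementation.

import asyncio
import importlib

from google.adk.agents import BaseAgent  # import path assumed, not taken from this PR


def get_root_agent(agent_module_name: str) -> BaseAgent:
  """Loads the root agent from a module that exports either root_agent or get_agent_async."""
  agent_module = importlib.import_module(agent_module_name)

  # Preferred path: the module exports a root_agent instance directly.
  if hasattr(agent_module, "root_agent"):
    return agent_module.root_agent

  # Fallback: the module exposes an async factory such as get_agent_async.
  if hasattr(agent_module, "get_agent_async"):
    return asyncio.run(agent_module.get_agent_async())

  raise ValueError(
      f"Module {agent_module_name!r} defines neither root_agent nor get_agent_async."
  )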


Changelog
  • src/google/adk/cli/cli_eval.py
    • Imported asyncio for asynchronous operations.
    • Updated agent import from Agent to BaseAgent.
    • Modified get_root_agent to support get_agent_async for asynchronous agent loading and changed its return type to BaseAgent.
    • Refactored evaluation metric loading to support global and per-eval-set configurations.
    • Implemented dynamic discovery of evaluation configurations for individual eval sets.
    • Adjusted the evaluation loop to accommodate multiple runs and apply specific evaluation metrics per eval set.
  • src/google/adk/cli/cli_tools_click.py
    • Added a --num_runs option to the cli_eval command, defaulting to 1.
    • Included num_runs as a parameter in the cli_eval function signature.
    • Imported discover_eval_config_for_test_file for dynamic configuration.
    • Revised the logic for loading evaluation configurations, supporting global and dynamically discovered per-eval-set metrics.
    • Integrated num_runs into the inference collection process, repeating inference requests as specified.
    • Updated the evaluation results collection to process results based on dynamically assigned evaluation metrics per eval set.
  • src/google/adk/evaluation/agent_evaluator.py
    • Imported discover_eval_config_for_test_file and EvalSetResultsManager.
    • Refactored find_config_for_test_file to utilize the new discover_eval_config_for_test_file utility.
    • Added an optional eval_set_results_manager parameter to evaluate_eval_set, evaluate, and _get_eval_results_by_eval_id functions to support persisting evaluation outputs.
  • src/google/adk/evaluation/eval_config.py
    • Added a new function discover_eval_config_for_test_file to locate test_config.json files in the same directory as a given test file.
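
A rough sketch of what the new discovery helper could look like; it assumes the helper returns a file path (or None), whereas the real discover_eval_config_for_test_file in eval_config.py may instead return a parsed config object, and the constant name below is hypothetical.

import os
from typing import Optional

_TEST_CONFIG_FILE_NAME = "test_config.json"  # hypothetical constant name


def discover_eval_config_for_test_file(test_file_path: str) -> Optional[str]:
  """Returns the path to a test_config.json next to the given test file, or None if absent."""
  config_path = os.path.join(
      os.path.dirname(test_file_path), _TEST_CONFIG_FILE_NAME
  )
  return config_path if os.path.exists(config_path) else None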

@adk-bot added the "eval [Component] This issue is related to evaluation" label on Feb 14, 2026
@adk-bot (Collaborator) commented on Feb 14, 2026

Response from ADK Triaging Agent

Hello @jayy-77, thank you for your contribution!

To help us review this pull request, could you please fill out the template in the PR description? Specifically, we need:

  • A link to an existing issue or a description of the change.
  • A testing plan that describes how you've verified your changes.

This information is important for our review process as outlined in our contribution guidelines. Thanks!

@gemini-code-assist bot (Contributor) left a comment

Code Review

This pull request enhances the evaluation functionality by adding support for multiple runs and dynamic configuration discovery. The changes are well-structured, but I've identified a critical bug in src/google/adk/cli/cli_tools_click.py that could cause an UnboundLocalError. I have provided a suggestion to fix this. Additionally, I've included a suggestion to optimize a loop for better performance when handling a large number of evaluation sets. The other changes, such as refactoring and adding new parameters, look good.

Comment on lines +820 to +825
eval_metrics_by_eval_set_id = {}
global_eval_metrics = None
if config_file_path:
  eval_config = get_evaluation_criteria_or_default(config_file_path)
  print(f"Using evaluation criteria: {eval_config}")
  global_eval_metrics = get_eval_metrics_from_config(eval_config)
Severity: critical

There's a potential UnboundLocalError for eval_config. It is used on line 922, but it's only defined within this if config_file_path: block, or later when handling eval set IDs. If config_file_path is not provided and the code proceeds to handle eval set file paths, eval_config will not be defined when it's needed for UserSimulatorProvider.

To fix this, you should initialize eval_config unconditionally at the beginning of the function. Applying this fix will also allow you to simplify the code in a couple of other places:

  1. On lines 869-873, you can reuse the eval_config variable instead of calling get_evaluation_criteria_or_default again.
  2. On lines 900-903, you can remove the redundant call to get_evaluation_criteria_or_default.
Suggested change

-eval_metrics_by_eval_set_id = {}
-global_eval_metrics = None
-if config_file_path:
-  eval_config = get_evaluation_criteria_or_default(config_file_path)
-  print(f"Using evaluation criteria: {eval_config}")
-  global_eval_metrics = get_eval_metrics_from_config(eval_config)
+eval_config = get_evaluation_criteria_or_default(config_file_path)
+eval_metrics_by_eval_set_id = {}
+global_eval_metrics = None
+if config_file_path:
+  print(f"Using evaluation criteria: {eval_config}")
+  global_eval_metrics = get_eval_metrics_from_config(eval_config)
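
For readers unfamiliar with the failure mode, here is a minimal standalone reproduction of this class of bug (hypothetical code, not the ADK source):

def load_config(config_file_path):
  if config_file_path:
    eval_config = {"criteria": "default"}
  # Later code assumes eval_config is always bound.
  return eval_config


load_config("test_config.json")  # works: eval_config was assigned

try:
  load_config(None)
except UnboundLocalError as e:
  # Raised because eval_config is a local that was never assigned on this path.
  print(f"Reproduced: {e}")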

Comment on lines +959 to +976
eval_results = []
for eval_set_id, eval_metrics in eval_metrics_by_eval_set_id.items():
  inference_results_for_eval_set = [
      inference_result
      for inference_result in inference_results
      if inference_result.eval_set_id == eval_set_id
  ]
  if not inference_results_for_eval_set:
    continue
  eval_results.extend(
      asyncio.run(
          _collect_eval_results(
              inference_results=inference_results_for_eval_set,
              eval_service=eval_service,
              eval_metrics=eval_metrics,
          )
      )
  )
Severity: medium

The current implementation filters inference_results inside the loop for each eval_set_id. This can be inefficient if you have a large number of eval sets and inference results, as it iterates over all results for each set (O(num_eval_sets * num_inference_results)).

You can improve performance by grouping the inference results by eval_set_id once before the loop.

Suggested change

-eval_results = []
-for eval_set_id, eval_metrics in eval_metrics_by_eval_set_id.items():
-  inference_results_for_eval_set = [
-      inference_result
-      for inference_result in inference_results
-      if inference_result.eval_set_id == eval_set_id
-  ]
-  if not inference_results_for_eval_set:
-    continue
-  eval_results.extend(
-      asyncio.run(
-          _collect_eval_results(
-              inference_results=inference_results_for_eval_set,
-              eval_service=eval_service,
-              eval_metrics=eval_metrics,
-          )
-      )
-  )
+inference_results_by_eval_set_id = {}
+for res in inference_results:
+  inference_results_by_eval_set_id.setdefault(res.eval_set_id, []).append(res)
+eval_results = []
+for eval_set_id, eval_metrics in eval_metrics_by_eval_set_id.items():
+  inference_results_for_eval_set = inference_results_by_eval_set_id.get(eval_set_id)
+  if not inference_results_for_eval_set:
+    continue
+  eval_results.extend(
+      asyncio.run(
+          _collect_eval_results(
+              inference_results=inference_results_for_eval_set,
+              eval_service=eval_service,
+              eval_metrics=eval_metrics,
+          )
+      )
+  )
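
As an aside on the grouping step, collections.defaultdict is an equivalent alternative to the setdefault pattern in the suggestion above. A small self-contained illustration, using a hypothetical stand-in for the ADK inference result type:

from collections import defaultdict
from dataclasses import dataclass


@dataclass
class InferenceResult:  # hypothetical stand-in, not the ADK class
  eval_set_id: str
  payload: str


inference_results = [
    InferenceResult("set_a", "r1"),
    InferenceResult("set_b", "r2"),
    InferenceResult("set_a", "r3"),
]

# Group once in O(n); lookups inside the per-eval-set loop are then O(1).
inference_results_by_eval_set_id = defaultdict(list)
for res in inference_results:
  inference_results_by_eval_set_id[res.eval_set_id].append(res)

print(inference_results_by_eval_set_id["set_a"])  # the two set_a results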


Labels

eval [Component] This issue is related to evaluation

Development

Successfully merging this pull request may close these issues.

Consistency improvements between adk eval and AgentEvaluator

2 participants