⚡️ Speed up function _get_aggregate_results by 216%
#174
📄 216% (2.16x) speedup for `_get_aggregate_results` in `mlflow/metrics/genai/genai_metric.py`

⏱️ Runtime: 40.4 milliseconds → 12.8 milliseconds (best of 89 runs)

📝 Explanation and details
The optimization achieves a 216% speedup by eliminating redundant operations and leveraging NumPy's performance advantages. Here are the key changes that drive this improvement:
Primary Performance Gains:
- `import numpy as np` is moved to module scope, eliminating import overhead on every function call (saves ~2.3% of the original runtime based on the line profiler)
- The `AGGREGATE_FUNCTIONS` dictionary is created once at the top level instead of being reconstructed inside the nested function on every call
- `scores_for_aggregation` is converted to a NumPy array once upfront, rather than having each aggregation function work with Python lists and potentially convert repeatedly
- The p90 lambda uses `x.size > 0` instead of a truthiness check on the array, which is more efficient for NumPy arrays (see the sketch after this list)
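Put together, the optimized shape might look like the minimal sketch below. This is illustrative only: the function signature, the exact set of aggregation names, and the non-numeric filtering are assumptions, not a verbatim copy of MLflow's implementation.

```python
import numpy as np

# Built once at module level, so the dict is not reconstructed on every call.
# The names and entries here are assumptions for illustration.
AGGREGATE_FUNCTIONS = {
    "min": np.min,
    "max": np.max,
    "mean": np.mean,
    "median": np.median,
    "variance": np.var,
    # `x.size > 0` sidesteps the ambiguous truthiness of NumPy arrays.
    "p90": lambda x: np.percentile(x, 90) if x.size > 0 else None,
}


def _get_aggregate_results(scores, aggregations):
    # Drop non-numeric entries (e.g. None for failed rows), then convert to a
    # NumPy array once so every aggregation operates on the same array.
    scores_for_aggregation = np.array(
        [s for s in scores if isinstance(s, (int, float))]
    )
    return {
        agg: AGGREGATE_FUNCTIONS[agg](scores_for_aggregation)
        for agg in aggregations
        if agg in AGGREGATE_FUNCTIONS
    }
```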
Performance Impact by Test Case:

- `test_large_scores_and_large_aggregations`: 60%+ faster on 1000+ element arrays

Context Impact:
Based on the function references, `_get_aggregate_results` is called from `eval_fn` in both `make_genai_metric` and `make_genai_metric_from_prompt`, which are core LLM evaluation functions. Since LLM evaluation often involves processing many predictions with multiple aggregation metrics, this optimization significantly improves evaluation pipeline performance, especially when evaluating large datasets or using multiple aggregation options simultaneously. The optimization is particularly effective because it shifts the bottleneck from Python object overhead (97.6% of the original runtime was spent in the dictionary comprehension) to efficient NumPy vectorized operations.
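For illustration, a hypothetical call with multiple aggregation options over a small, made-up score list (using the sketch above):

```python
# Hypothetical usage mirroring how an eval_fn might aggregate per-row scores.
scores = [3, 4, 5, None, 4, 5]
print(_get_aggregate_results(scores, ["mean", "p90", "variance"]))
# → mean ≈ 4.2, p90 = 5.0, variance ≈ 0.56 (exact repr depends on NumPy version)
```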
✅ Correctness verification report:
🌀 Generated Regression Tests and Runtime
To edit these changes, run `git checkout codeflash/optimize-_get_aggregate_results-mhx4qr7f` and push.