
Conversation

@codeflash-ai codeflash-ai bot commented Nov 13, 2025

📄 216% (2.16x) speedup for _get_aggregate_results in mlflow/metrics/genai/genai_metric.py

⏱️ Runtime : 40.4 milliseconds → 12.8 milliseconds (best of 89 runs)

📝 Explanation and details

The optimization achieves a 216% speedup by eliminating redundant operations and leveraging NumPy's performance advantages. Here are the key changes that drive this improvement:

Primary Performance Gains:

  1. Move the NumPy import to module level - Eliminates the repeated `import numpy as np` overhead on every function call (saves ~2.3% of the original runtime according to the line profiler)
  2. Pre-define the aggregation functions dictionary - The `AGGREGATE_FUNCTIONS` dictionary is created once at the top level instead of being reconstructed inside the nested function on every call
  3. Single NumPy array conversion - Convert `scores_for_aggregation` to a NumPy array once upfront, rather than having each aggregation function work with Python lists and potentially convert repeatedly
  4. Optimized empty-array check - Use `x.size > 0` for the p90 lambda instead of a truthiness check on the array, which is more efficient for NumPy arrays (see the sketch after this list)
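
Taken together, the optimized function looks roughly like the minimal sketch below. This is an illustrative reconstruction of the four changes described above, not the exact diff applied to mlflow/metrics/genai/genai_metric.py; the handling of empty inputs in the real code may differ slightly.

```python
# Illustrative sketch of the optimized structure (not the exact mlflow diff);
# names mirror the description above.
import numpy as np  # module-level import: no per-call import cost

from mlflow.exceptions import MlflowException
from mlflow.protos.databricks_pb2 import INVALID_PARAMETER_VALUE

# Built once at import time instead of inside the function on every call.
AGGREGATE_FUNCTIONS = {
    "min": np.min,
    "max": np.max,
    "mean": np.mean,
    "median": np.median,
    "variance": np.var,
    # x.size > 0 avoids the slower (and, for multi-element arrays, ambiguous)
    # truthiness check on an ndarray.
    "p90": lambda x: np.percentile(x, 90) if x.size > 0 else None,
}


def _get_aggregate_results(scores, aggregations):
    # Drop None scores and convert to a NumPy array exactly once; every
    # requested aggregation then operates on the same array.
    scores_for_aggregation = np.array([s for s in scores if s is not None])
    results = {}
    for option in aggregations or []:
        if option not in AGGREGATE_FUNCTIONS:
            raise MlflowException(
                f"Invalid aggregate option {option}.",
                error_code=INVALID_PARAMETER_VALUE,
            )
        results[option] = AGGREGATE_FUNCTIONS[option](scores_for_aggregation)
    return results
```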

Performance Impact by Test Case:

  • Best gains on large-scale tests: 815% faster for test_large_scores_and_large_aggregations, 60%+ faster on 1000+ element arrays
  • Moderate gains on typical use cases: 4-10% faster for basic aggregation operations with small to medium datasets
  • Some regressions on edge cases with no aggregations (empty lists, None aggregations) due to upfront NumPy array creation overhead, but these represent uncommon usage patterns

Context Impact:
Based on the function references, _get_aggregate_results is called from eval_fn in both make_genai_metric and make_genai_metric_from_prompt, which are core LLM evaluation functions. Since LLM evaluation often involves processing many predictions with multiple aggregation metrics, this optimization significantly improves evaluation pipeline performance, especially when evaluating large datasets or using multiple aggregation options simultaneously.
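
For orientation, a typical way this code path gets exercised is via the aggregations argument of a GenAI metric. The example below is a hedged sketch with an abbreviated parameter list and placeholder prompt strings; check the make_genai_metric signature of the mlflow version in use.

```python
from mlflow.metrics.genai import make_genai_metric

# Each aggregation listed here is eventually computed by _get_aggregate_results
# once the per-row scores have been collected. The definition and grading
# prompt below are placeholders, not mlflow's built-in ones.
correctness = make_genai_metric(
    name="answer_correctness",
    definition="How factually correct the answer is.",
    grading_prompt="Score the answer from 1 to 5 for factual correctness.",
    model="openai:/gpt-4",
    aggregations=["mean", "median", "variance", "p90"],
    greater_is_better=True,
)
# The returned metric is then passed to mlflow.evaluate(..., extra_metrics=[correctness]).
```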

The optimization is particularly effective because it transforms the bottleneck from Python object overhead (97.6% of original runtime was in the dictionary comprehension) to efficient NumPy vectorized operations.
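
A quick, self-contained way to see that effect (independent of mlflow; numbers vary by machine) is to time repeated aggregations over a plain Python list versus a pre-converted array:

```python
import timeit

import numpy as np

scores = list(range(1000))
n_aggregations = 1000  # mirrors test_large_scores_and_large_aggregations below

# Each call converts the Python list to an array internally.
list_time = timeit.timeit(
    lambda: [np.mean(scores) for _ in range(n_aggregations)], number=1
)

# Convert once up front, then reuse the array for every aggregation.
arr = np.array(scores)
array_time = timeit.timeit(
    lambda: [np.mean(arr) for _ in range(n_aggregations)], number=1
)

print(f"list input : {list_time * 1e3:.1f} ms")
print(f"array input: {array_time * 1e3:.1f} ms")
```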

Correctness verification report:

| Test | Status |
|------|--------|
| ⚙️ Existing Unit Tests | 🔘 None Found |
| 🌀 Generated Regression Tests | 40 Passed |
| ⏪ Replay Tests | 🔘 None Found |
| 🔎 Concolic Coverage Tests | 🔘 None Found |
| 📊 Tests Coverage | 100.0% |
🌀 Generated Regression Tests and Runtime
import pytest  # used for our unit tests
# function to test
# (copied from the question prompt)
from mlflow.exceptions import MlflowException
from mlflow.metrics.genai.genai_metric import _get_aggregate_results
from mlflow.protos.databricks_pb2 import INVALID_PARAMETER_VALUE

# unit tests

# ---- BASIC TEST CASES ----

def test_single_score_all_aggregations():
    # Single score: all aggregations should return the value or 0 for variance
    scores = [42]
    aggs = ["min", "max", "mean", "median", "variance", "p90"]
    codeflash_output = _get_aggregate_results(scores, aggs); result = codeflash_output # 170μs -> 163μs (4.16% faster)

def test_multiple_scores_basic_aggregations():
    # Basic aggregation with multiple scores
    scores = [1, 2, 3, 4, 5]
    aggs = ["min", "max", "mean", "median", "variance", "p90"]
    codeflash_output = _get_aggregate_results(scores, aggs); result = codeflash_output # 171μs -> 163μs (4.97% faster)

def test_scores_with_floats():
    # Scores with floats
    scores = [1.5, 2.5, 3.5]
    aggs = ["mean", "median", "variance"]
    codeflash_output = _get_aggregate_results(scores, aggs); result = codeflash_output # 95.1μs -> 90.3μs (5.30% faster)

def test_aggregations_subset():
    # Only a subset of aggregations requested
    scores = [10, 20, 30]
    aggs = ["min", "max"]
    codeflash_output = _get_aggregate_results(scores, aggs); result = codeflash_output # 28.3μs -> 27.3μs (3.50% faster)

def test_aggregations_empty_list():
    # No aggregations requested
    scores = [1, 2, 3]
    aggs = []
    codeflash_output = _get_aggregate_results(scores, aggs); result = codeflash_output # 1.60μs -> 8.65μs (81.5% slower)

def test_aggregations_none():
    # Aggregations is None: should return empty dict
    scores = [1, 2, 3]
    codeflash_output = _get_aggregate_results(scores, None); result = codeflash_output # 1.30μs -> 2.55μs (48.9% slower)

# ---- EDGE TEST CASES ----

def test_scores_empty():
    # Empty scores list: all aggregations should raise ValueError
    scores = []
    aggs = ["min", "max", "mean", "median", "variance", "p90"]
    with pytest.raises(ValueError):
        _get_aggregate_results(scores, aggs) # 19.6μs -> 19.3μs (1.73% faster)


def test_scores_some_none():
    # Some scores are None: should ignore None and compute on remaining
    scores = [None, 5, None, 15]
    aggs = ["min", "max", "mean"]
    codeflash_output = _get_aggregate_results(scores, aggs); result = codeflash_output # 48.6μs -> 44.5μs (9.18% faster)

def test_invalid_aggregation_option():
    # Invalid aggregation option should raise MlflowException
    scores = [1, 2, 3]
    aggs = ["foo"]
    with pytest.raises(MlflowException) as excinfo:
        _get_aggregate_results(scores, aggs) # 11.0μs -> 15.6μs (29.3% slower)

def test_p90_with_single_score():
    # p90 with a single score should return that score
    scores = [7]
    aggs = ["p90"]
    codeflash_output = _get_aggregate_results(scores, aggs); result = codeflash_output # 111μs -> 112μs (0.092% slower)


def test_variance_with_identical_scores():
    # Variance of identical scores should be 0
    scores = [2, 2, 2, 2]
    aggs = ["variance"]
    codeflash_output = _get_aggregate_results(scores, aggs); result = codeflash_output # 53.3μs -> 52.2μs (2.15% faster)

def test_non_numeric_scores():
    # Non-numeric scores should raise TypeError
    scores = ["a", "b", "c"]
    aggs = ["mean"]
    with pytest.raises(TypeError):
        _get_aggregate_results(scores, aggs) # 42.0μs -> 42.1μs (0.240% slower)

def test_mixed_numeric_and_non_numeric_scores():
    # Mixed numeric and non-numeric scores should raise TypeError after filtering
    scores = [1, "a", None]
    aggs = ["mean"]
    with pytest.raises(TypeError):
        _get_aggregate_results(scores, aggs) # 43.1μs -> 42.7μs (0.831% faster)

# ---- LARGE SCALE TEST CASES ----

def test_large_number_of_scores():
    # Large number of scores (up to 1000), verify mean and variance
    scores = list(range(1000))  # 0 to 999
    aggs = ["mean", "variance"]
    codeflash_output = _get_aggregate_results(scores, aggs); result = codeflash_output # 129μs -> 102μs (25.8% faster)
    expected_mean = (999) / 2  # (sum 0..999)/1000 = 499.5
    expected_variance = sum((x - expected_mean) ** 2 for x in scores) / 1000

def test_large_scores_with_none():
    # Large list with many None values
    scores = [None]*500 + list(range(500))  # 500 None, 0-499
    aggs = ["min", "max", "mean"]
    codeflash_output = _get_aggregate_results(scores, aggs); result = codeflash_output # 98.8μs -> 69.3μs (42.6% faster)

def test_large_number_of_aggregations():
    # Large number of aggregation requests (all valid, repeated)
    scores = [1, 2, 3, 4, 5]
    aggs = ["min", "max", "mean", "median", "variance", "p90"] * 100  # 600 aggregations
    codeflash_output = _get_aggregate_results(scores, aggs); result = codeflash_output # 6.97ms -> 6.25ms (11.5% faster)

def test_large_scores_and_large_aggregations():
    # Both large scores and large aggregations (stress test)
    scores = list(range(1000))
    aggs = ["mean"] * 1000
    codeflash_output = _get_aggregate_results(scores, aggs); result = codeflash_output # 29.5ms -> 3.22ms (815% faster)

def test_performance_with_large_input(monkeypatch):
    # Performance test: should not take excessive time or memory
    import time
    scores = list(range(1000))
    aggs = ["mean", "variance", "p90"]
    start = time.time()
    codeflash_output = _get_aggregate_results(scores, aggs); result = codeflash_output # 237μs -> 186μs (27.6% faster)
    duration = time.time() - start
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
import pytest  # used for our unit tests
# function to test
from mlflow.exceptions import MlflowException
from mlflow.metrics.genai.genai_metric import _get_aggregate_results
from mlflow.protos.databricks_pb2 import INVALID_PARAMETER_VALUE

# unit tests

# ------------------------
# Basic Test Cases
# ------------------------

def test_basic_min_max_mean():
    # Test with simple integer scores and basic aggregations
    scores = [1, 2, 3, 4, 5]
    aggregations = ["min", "max", "mean"]
    codeflash_output = _get_aggregate_results(scores, aggregations); result = codeflash_output # 48.4μs -> 43.9μs (10.2% faster)

def test_basic_median_variance():
    # Test with an odd number of scores for median and variance
    scores = [1, 2, 3, 4, 5]
    aggregations = ["median", "variance"]
    codeflash_output = _get_aggregate_results(scores, aggregations); result = codeflash_output # 79.0μs -> 75.8μs (4.18% faster)

def test_basic_p90():
    # Test with p90 aggregation
    scores = [10, 20, 30, 40, 50]
    aggregations = ["p90"]
    codeflash_output = _get_aggregate_results(scores, aggregations); result = codeflash_output # 108μs -> 109μs (1.08% slower)

def test_basic_multiple_aggregations():
    # Test with all aggregations together
    scores = [1, 2, 3, 4, 5]
    aggregations = ["min", "max", "mean", "median", "variance", "p90"]
    codeflash_output = _get_aggregate_results(scores, aggregations); result = codeflash_output # 169μs -> 162μs (4.06% faster)

# ------------------------
# Edge Test Cases
# ------------------------

def test_empty_scores():
    # Test with empty scores list: all aggregations should raise or return None for p90
    scores = []
    aggregations = ["min", "max", "mean", "median", "variance", "p90"]
    # min, max, mean, median, variance should raise ValueError (from numpy)
    # p90 should return None
    for agg in ["min", "max", "mean", "median", "variance"]:
        with pytest.raises(ValueError):
            _get_aggregate_results(scores, [agg])
    # p90 returns None for empty list
    codeflash_output = _get_aggregate_results(scores, ["p90"]); result = codeflash_output

def test_none_scores_filtered():
    # Test with scores containing None values, which should be filtered out
    scores = [None, 2, None, 4, 6]
    aggregations = ["min", "max", "mean", "median", "variance", "p90"]
    codeflash_output = _get_aggregate_results(scores, aggregations); result = codeflash_output # 171μs -> 162μs (5.60% faster)

def test_all_none_scores():
    # If all scores are None, scores_for_aggregation should be empty
    scores = [None, None]
    aggregations = ["min", "max", "mean", "median", "variance", "p90"]
    for agg in ["min", "max", "mean", "median", "variance"]:
        with pytest.raises(ValueError):
            _get_aggregate_results(scores, [agg])
    codeflash_output = _get_aggregate_results(scores, ["p90"]); result = codeflash_output

def test_single_score():
    # Test with a single score
    scores = [42]
    aggregations = ["min", "max", "mean", "median", "variance", "p90"]
    codeflash_output = _get_aggregate_results(scores, aggregations); result = codeflash_output # 170μs -> 162μs (4.48% faster)

def test_invalid_aggregation_option():
    # Test with an invalid aggregation option
    scores = [1, 2, 3]
    aggregations = ["min", "max", "invalid_option"]
    with pytest.raises(MlflowException) as excinfo:
        _get_aggregate_results(scores, aggregations) # 37.2μs -> 34.2μs (8.59% faster)

def test_aggregations_is_none():
    # If aggregations is None, should return empty dict
    scores = [1, 2, 3]
    aggregations = None
    codeflash_output = _get_aggregate_results(scores, aggregations); result = codeflash_output # 1.31μs -> 2.68μs (51.3% slower)

def test_scores_with_negative_and_zero():
    # Test with negative, zero and positive scores
    scores = [-10, 0, 10, 20]
    aggregations = ["min", "max", "mean", "median", "variance", "p90"]
    codeflash_output = _get_aggregate_results(scores, aggregations); result = codeflash_output # 171μs -> 161μs (5.78% faster)

def test_scores_with_floats():
    # Test with float scores
    scores = [1.1, 2.2, 3.3, 4.4]
    aggregations = ["min", "max", "mean", "median", "variance", "p90"]
    codeflash_output = _get_aggregate_results(scores, aggregations); result = codeflash_output # 176μs -> 168μs (4.71% faster)

def test_scores_with_duplicates():
    # Test with duplicate values in scores
    scores = [5, 5, 5, 5, 5]
    aggregations = ["min", "max", "mean", "median", "variance", "p90"]
    codeflash_output = _get_aggregate_results(scores, aggregations); result = codeflash_output # 170μs -> 163μs (4.07% faster)

# ------------------------
# Large Scale Test Cases
# ------------------------

def test_large_scale_all_aggregations():
    # Test with a large list of scores
    scores = list(range(1, 1001))  # 1 to 1000
    aggregations = ["min", "max", "mean", "median", "variance", "p90"]
    codeflash_output = _get_aggregate_results(scores, aggregations); result = codeflash_output # 340μs -> 212μs (60.2% faster)
    # Variance of 1..1000 is ((n^2 - 1)/12), n=1000
    expected_var = ((1000**2 - 1) / 12)

def test_large_scale_with_none_values():
    # Test with a large list of scores, some None values
    scores = [None] * 100 + list(range(1, 901))  # 100 None, 1..900
    aggregations = ["min", "max", "mean", "median", "variance", "p90"]
    codeflash_output = _get_aggregate_results(scores, aggregations); result = codeflash_output # 327μs -> 208μs (56.9% faster)
    expected_var = ((900**2 - 1) / 12)

def test_large_scale_all_same_value():
    # Test with a large list of the same value
    scores = [7] * 1000
    aggregations = ["min", "max", "mean", "median", "variance", "p90"]
    codeflash_output = _get_aggregate_results(scores, aggregations); result = codeflash_output # 340μs -> 211μs (61.1% faster)

def test_large_scale_with_floats():
    # Test with a large list of floats
    scores = [float(i) / 10 for i in range(1000)]  # 0.0, 0.1, ..., 99.9
    aggregations = ["min", "max", "mean", "median", "variance", "p90"]
    codeflash_output = _get_aggregate_results(scores, aggregations); result = codeflash_output # 333μs -> 213μs (56.0% faster)
    # Variance of evenly spaced floats
    expected_var = ((1000**2 - 1) / 12) / 100  # variance of range(1000), scaled down by 100

def test_large_scale_empty_aggregations():
    # Test with large scores but no aggregations
    scores = list(range(1, 1001))
    aggregations = []
    codeflash_output = _get_aggregate_results(scores, aggregations); result = codeflash_output # 17.4μs -> 48.3μs (64.0% slower)
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.

To edit these changes, `git checkout codeflash/optimize-_get_aggregate_results-mhx4qr7f` and push.


@codeflash-ai codeflash-ai bot requested a review from mashraf-222 November 13, 2025 07:51
@codeflash-ai codeflash-ai bot added the ⚡️ codeflash (Optimization PR opened by Codeflash AI) and 🎯 Quality: High (Optimization Quality according to Codeflash) labels Nov 13, 2025