@codeflash-ai codeflash-ai bot commented Nov 13, 2025

📄 6% (0.06x) speedup for _format_args_string in mlflow/metrics/genai/genai_metric.py

⏱️ Runtime : 1.29 milliseconds → 1.22 milliseconds (best of 76 runs)

📝 Explanation and details

The optimized code achieves a 6% speedup through several key micro-optimizations that reduce Python interpreter overhead:

What specific optimizations were applied:

  1. Eliminated redundant dictionary lookups - Replaced the if arg in eval_values: check followed by eval_values[arg] access with a single try/except KeyError pattern, avoiding the double lookup cost (see the sketch after this list).

  2. Cached attribute access - Stored pd.Series as pd_Series to avoid repeated module attribute lookups in the type checking loop.

  3. Reduced variable access overhead - Created local references (columns, values) to function parameters to speed up variable resolution in the loop.

  4. Simplified empty dictionary check - Replaced args_dict is None or len(args_dict) == 0 with the more efficient not args_dict (the None check was redundant since args_dict is always initialized as {}).

  5. Streamlined return logic - Eliminated unnecessary nested conditionals and parentheses in the final return statement.
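
For reference, here is a minimal sketch of the pattern those points describe, written against the argument names used by the generated tests below (grading_context_columns, eval_values, indx). It illustrates the technique only; it is not the exact MLflow source, and details such as the exception message are assumptions.

```python
import pandas as pd
from mlflow.exceptions import MlflowException

def _format_args_string_sketch(grading_context_columns, eval_values, indx):
    pd_Series = pd.Series  # (2) cache the attribute lookup once, outside the loop
    args_dict = {}
    for arg in grading_context_columns:
        try:
            value = eval_values[arg]  # (1) one hash lookup instead of `in` + `[]`
        except KeyError:
            # Illustrative message only; the real function raises MlflowException here
            raise MlflowException(f"{arg} does not exist in the eval values")
        args_dict[arg] = value.iloc[indx] if isinstance(value, pd_Series) else value[indx]
    if not args_dict:  # (4) falsy check covers both None and an empty dict
        return ""
    # Output shape matches the expected strings in the tests below
    return "Additional information used by the model:\n" + "\n".join(
        f"key: {arg}\nvalue:\n{val}" for arg, val in args_dict.items()
    )
```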

Why these optimizations lead to speedup:

In Python, dictionary key lookups (in operator + [] access) and attribute resolution (pd.Series) are relatively expensive operations. The line profiler shows the biggest time saver comes from reducing the eval_values[arg].iloc[indx] and isinstance(eval_values[arg], pd.Series) overhead (52.6% → 50.7% of total time). The try/except pattern is faster than in checks because it avoids the double hash table lookup when keys exist (the common case).
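
A quick, self-contained way to see the in-then-index versus try/except difference yourself (an illustrative micro-benchmark; absolute numbers depend on the machine and Python version):

```python
import timeit

d = {"col": [1, 2, 3]}

def with_membership_check():
    # two hash lookups when the key exists: `in` + subscript
    if "col" in d:
        return d["col"][0]
    return None

def with_try_except():
    # one hash lookup in the common (key present) case
    try:
        value = d["col"]
    except KeyError:
        return None
    return value[0]

print(timeit.timeit(with_membership_check, number=1_000_000))
print(timeit.timeit(with_try_except, number=1_000_000))
```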

How this impacts existing workloads:

Based on the function references, _format_args_string is called within a loop in eval_fn for each prediction being evaluated (for indx, (input, output) in enumerate(zip(inputs, outputs))). This makes it a hot path function where even small optimizations compound significantly. The 6% improvement per call translates to meaningful speedup when processing large batches of LLM evaluations.
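
A rough illustration of that hot path (a hypothetical driver loop, not the actual eval_fn, using the same private import the generated tests rely on):

```python
from mlflow.metrics.genai.genai_metric import _format_args_string

inputs = ["question 1", "question 2"]
outputs = ["answer 1", "answer 2"]
eval_values = {"context": ["ctx 1", "ctx 2"]}

# The formatter runs once per prediction, so any per-call saving is
# multiplied by the number of rows in the evaluation batch.
for indx, (input, output) in enumerate(zip(inputs, outputs)):
    args_string = _format_args_string(["context"], eval_values, indx)
    # ...args_string would then be embedded in the judge prompt for this row
```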

Test case performance patterns:

The optimizations show best results on large-scale test cases:

  • Large column counts: 16.1% faster with 100 columns, 18.7% faster with 999 columns
  • Mixed data types: Consistent 1-3% improvements across Series/list combinations
  • Basic cases: 8-11% improvements on simple scenarios

The performance gains scale with the number of columns being processed, making this optimization particularly valuable for comprehensive LLM evaluations with many grading context columns.

Correctness verification report:

| Test | Status |
| --- | --- |
| ⚙️ Existing Unit Tests | 🔘 None Found |
| 🌀 Generated Regression Tests | 42 Passed |
| ⏪ Replay Tests | 🔘 None Found |
| 🔎 Concolic Coverage Tests | 🔘 None Found |
| 📊 Tests Coverage | 100.0% |
🌀 Generated Regression Tests and Runtime
import pandas as pd
# imports
import pytest  # used for our unit tests
from mlflow.exceptions import MlflowException
from mlflow.metrics.genai.genai_metric import _format_args_string

# unit tests

# ----------------
# Basic Test Cases
# ----------------

def test_basic_single_column_with_list():
    # Single column, value as list
    grading_context_columns = ["col1"]
    eval_values = {"col1": ["a", "b", "c"]}
    indx = 1
    codeflash_output = _format_args_string(grading_context_columns, eval_values, indx); result = codeflash_output # 3.87μs -> 3.50μs (10.6% faster)
    expected = "Additional information used by the model:\nkey: col1\nvalue:\nb"

def test_basic_single_column_with_series():
    # Single column, value as pandas Series
    grading_context_columns = ["col1"]
    eval_values = {"col1": pd.Series([10, 20, 30])}
    indx = 2
    codeflash_output = _format_args_string(grading_context_columns, eval_values, indx); result = codeflash_output # 18.9μs -> 18.5μs (2.49% faster)
    expected = "Additional information used by the model:\nkey: col1\nvalue:\n30"

def test_basic_multiple_columns_list_and_series():
    # Multiple columns, mix of list and Series
    grading_context_columns = ["col1", "col2"]
    eval_values = {
        "col1": ["x", "y", "z"],
        "col2": pd.Series([100, 200, 300])
    }
    indx = 0
    codeflash_output = _format_args_string(grading_context_columns, eval_values, indx); result = codeflash_output # 19.3μs -> 19.0μs (1.81% faster)
    expected = (
        "Additional information used by the model:\n"
        "key: col1\nvalue:\nx\n"
        "key: col2\nvalue:\n100"
    )
    assert result == expected

def test_basic_order_preservation():
    # Order of columns in grading_context_columns is preserved
    grading_context_columns = ["col2", "col1"]
    eval_values = {
        "col1": ["a", "b", "c"],
        "col2": pd.Series([1, 2, 3])
    }
    indx = 2
    codeflash_output = _format_args_string(grading_context_columns, eval_values, indx); result = codeflash_output # 19.4μs -> 19.0μs (2.10% faster)
    expected = (
        "Additional information used by the model:\n"
        "key: col2\nvalue:\n3\n"
        "key: col1\nvalue:\nc"
    )
    assert result == expected

def test_basic_empty_grading_context_columns():
    # Empty grading_context_columns should return empty string
    grading_context_columns = []
    eval_values = {"col1": [1, 2, 3]}
    indx = 1
    codeflash_output = _format_args_string(grading_context_columns, eval_values, indx); result = codeflash_output # 1.78μs -> 1.74μs (2.47% faster)
    assert result == ""

# ----------------
# Edge Test Cases
# ----------------


def test_edge_missing_column_raises_exception():
    # Column not in eval_values should raise MlflowException
    grading_context_columns = ["col1", "col_missing"]
    eval_values = {"col1": [1, 2, 3]}
    indx = 0
    with pytest.raises(MlflowException) as excinfo:
        _format_args_string(grading_context_columns, eval_values, indx) # 10.9μs -> 11.2μs (2.24% slower)

def test_edge_empty_eval_values_with_column():
    # grading_context_columns not empty, eval_values empty, should raise
    grading_context_columns = ["col1"]
    eval_values = {}
    indx = 0
    with pytest.raises(MlflowException) as excinfo:
        _format_args_string(grading_context_columns, eval_values, indx) # 9.18μs -> 9.78μs (6.15% slower)

def test_edge_index_out_of_range_list():
    # Index out of range in list should raise IndexError
    grading_context_columns = ["col1"]
    eval_values = {"col1": [1, 2, 3]}
    indx = 5
    with pytest.raises(IndexError):
        _format_args_string(grading_context_columns, eval_values, indx) # 2.49μs -> 2.44μs (2.30% faster)

def test_edge_index_out_of_range_series():
    # Index out of range in Series should raise IndexError
    grading_context_columns = ["col1"]
    eval_values = {"col1": pd.Series([1, 2, 3])}
    indx = 10
    with pytest.raises(IndexError):
        _format_args_string(grading_context_columns, eval_values, indx) # 12.9μs -> 13.2μs (1.83% slower)

def test_edge_non_sequence_value_type():
    # Value in eval_values is not a list or Series, should raise TypeError
    grading_context_columns = ["col1"]
    eval_values = {"col1": 123}
    indx = 0
    with pytest.raises(TypeError):
        _format_args_string(grading_context_columns, eval_values, indx) # 2.64μs -> 2.88μs (8.36% slower)

def test_edge_empty_eval_values_and_empty_columns():
    # Both grading_context_columns and eval_values are empty
    grading_context_columns = []
    eval_values = {}
    indx = 0
    codeflash_output = _format_args_string(grading_context_columns, eval_values, indx); result = codeflash_output # 1.96μs -> 1.76μs (11.6% faster)
    assert result == ""

def test_edge_column_with_empty_list():
    # Column in eval_values is an empty list, should raise IndexError
    grading_context_columns = ["col1"]
    eval_values = {"col1": []}
    indx = 0
    with pytest.raises(IndexError):
        _format_args_string(grading_context_columns, eval_values, indx) # 2.33μs -> 2.38μs (1.93% slower)

def test_edge_column_with_empty_series():
    # Column in eval_values is an empty Series, should raise IndexError
    grading_context_columns = ["col1"]
    eval_values = {"col1": pd.Series([], dtype=int)}
    indx = 0
    with pytest.raises(IndexError):
        _format_args_string(grading_context_columns, eval_values, indx) # 12.5μs -> 12.6μs (0.677% slower)

def test_edge_column_with_none_value():
    # Column in eval_values is None, should raise TypeError
    grading_context_columns = ["col1"]
    eval_values = {"col1": None}
    indx = 0
    with pytest.raises(TypeError):
        _format_args_string(grading_context_columns, eval_values, indx) # 2.61μs -> 2.63μs (0.533% slower)


def test_large_scale_many_columns_and_rows():
    # Test with 100 columns and 100 rows
    num_cols = 100
    num_rows = 100
    grading_context_columns = [f"col{i}" for i in range(num_cols)]
    eval_values = {f"col{i}": list(range(i, i + num_rows)) for i in range(num_cols)}
    indx = 50
    codeflash_output = _format_args_string(grading_context_columns, eval_values, indx); result = codeflash_output # 27.2μs -> 23.4μs (16.1% faster)
    # Check that all keys and expected values are present in output
    for i in range(num_cols):
        expected_key = f"key: col{i}\nvalue:\n{i + indx}"

def test_large_scale_many_columns_and_series():
    # Test with 200 columns, each as pandas Series, 10 rows
    num_cols = 200
    num_rows = 10
    grading_context_columns = [f"col{i}" for i in range(num_cols)]
    eval_values = {f"col{i}": pd.Series(range(i, i + num_rows)) for i in range(num_cols)}
    indx = 9
    codeflash_output = _format_args_string(grading_context_columns, eval_values, indx); result = codeflash_output # 581μs -> 567μs (2.48% faster)
    for i in range(num_cols):
        expected_key = f"key: col{i}\nvalue:\n{i + indx}"

def test_large_scale_long_strings():
    # Test with columns containing long string values
    grading_context_columns = ["col1", "col2"]
    long_str1 = "x" * 500
    long_str2 = "y" * 800
    eval_values = {
        "col1": [long_str1, long_str2],
        "col2": pd.Series([long_str2, long_str1])
    }
    indx = 1
    codeflash_output = _format_args_string(grading_context_columns, eval_values, indx); result = codeflash_output # 18.4μs -> 18.0μs (2.19% faster)

def test_large_scale_max_elements():
    # Test with max allowed elements (999 columns, each with 2 values)
    num_cols = 999
    grading_context_columns = [f"col{i}" for i in range(num_cols)]
    eval_values = {f"col{i}": [i, i+1] for i in range(num_cols)}
    indx = 1
    codeflash_output = _format_args_string(grading_context_columns, eval_values, indx); result = codeflash_output # 206μs -> 173μs (18.7% faster)
    for i in range(num_cols):
        expected_key = f"key: col{i}\nvalue:\n{i+1}"

def test_large_scale_performance():
    # Performance: should not take excessive time for 500 columns and 500 rows
    import time
    num_cols = 500
    num_rows = 500
    grading_context_columns = [f"col{i}" for i in range(num_cols)]
    eval_values = {f"col{i}": list(range(num_rows)) for i in range(num_cols)}
    indx = 499
    start = time.time()
    codeflash_output = _format_args_string(grading_context_columns, eval_values, indx); result = codeflash_output # 150μs -> 133μs (13.1% faster)
    duration = time.time() - start
    assert duration < 1.0  # generous upper bound; a single call measured well under a millisecond above
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
import pandas as pd
# imports
import pytest
from mlflow.exceptions import MlflowException
from mlflow.metrics.genai.genai_metric import _format_args_string

# unit tests

# -------------------------------
# Basic Test Cases
# -------------------------------

def test_basic_single_arg_with_list():
    # Single argument, value is a list
    grading_context_columns = ["foo"]
    eval_values = {"foo": ["bar", "baz"]}
    indx = 1
    codeflash_output = _format_args_string(grading_context_columns, eval_values, indx); result = codeflash_output # 3.92μs -> 3.61μs (8.67% faster)

def test_basic_single_arg_with_series():
    # Single argument, value is a pandas Series
    grading_context_columns = ["foo"]
    eval_values = {"foo": pd.Series(["bar", "baz"])}
    indx = 0
    codeflash_output = _format_args_string(grading_context_columns, eval_values, indx); result = codeflash_output # 17.0μs -> 16.8μs (1.50% faster)

def test_basic_multiple_args_list_and_series():
    # Multiple arguments, mixed list and Series
    grading_context_columns = ["foo", "bar"]
    eval_values = {
        "foo": ["alpha", "beta"],
        "bar": pd.Series(["gamma", "delta"]),
    }
    indx = 1
    codeflash_output = _format_args_string(grading_context_columns, eval_values, indx); result = codeflash_output # 17.7μs -> 17.4μs (1.32% faster)
    expected = (
        "Additional information used by the model:\n"
        "key: foo\nvalue:\nbeta\n"
        "key: bar\nvalue:\ndelta"
    )
    assert result == expected

def test_basic_empty_grading_context_columns():
    # No arguments to format
    grading_context_columns = []
    eval_values = {"foo": [1, 2], "bar": [3, 4]}
    indx = 0
    codeflash_output = _format_args_string(grading_context_columns, eval_values, indx); result = codeflash_output # 1.79μs -> 1.51μs (18.3% faster)

def test_basic_none_grading_context_columns():
    # Empty grading_context_columns (stand-in for the None case) should behave as if no columns were requested
    grading_context_columns = []
    eval_values = {"foo": [1, 2]}
    indx = 0
    codeflash_output = _format_args_string(grading_context_columns, eval_values, indx); result = codeflash_output # 1.83μs -> 1.64μs (11.8% faster)

# -------------------------------
# Edge Test Cases
# -------------------------------

def test_edge_missing_arg_raises():
    # Argument missing from eval_values should raise MlflowException
    grading_context_columns = ["foo", "missing"]
    eval_values = {"foo": ["bar", "baz"]}
    indx = 0
    with pytest.raises(MlflowException) as excinfo:
        _format_args_string(grading_context_columns, eval_values, indx) # 11.0μs -> 11.1μs (1.10% slower)

def test_edge_index_out_of_range_list():
    # Index out of range for list should raise IndexError
    grading_context_columns = ["foo"]
    eval_values = {"foo": ["bar"]}
    indx = 2
    with pytest.raises(IndexError):
        _format_args_string(grading_context_columns, eval_values, indx) # 2.36μs -> 2.42μs (2.52% slower)

def test_edge_index_out_of_range_series():
    # Index out of range for Series should raise IndexError
    grading_context_columns = ["foo"]
    eval_values = {"foo": pd.Series(["bar"])}
    indx = 5
    with pytest.raises(IndexError):
        _format_args_string(grading_context_columns, eval_values, indx) # 12.7μs -> 12.6μs (0.485% faster)

def test_edge_empty_eval_values():
    # Empty eval_values dict, should raise for any non-empty grading_context_columns
    grading_context_columns = ["foo"]
    eval_values = {}
    indx = 0
    with pytest.raises(MlflowException):
        _format_args_string(grading_context_columns, eval_values, indx) # 9.33μs -> 9.74μs (4.25% slower)

def test_edge_non_list_non_series_value():
    # Value is not list or Series, should try indexing and raise TypeError
    grading_context_columns = ["foo"]
    eval_values = {"foo": 123}
    indx = 0
    with pytest.raises(TypeError):
        _format_args_string(grading_context_columns, eval_values, indx) # 2.59μs -> 2.63μs (1.78% slower)

def test_edge_none_in_eval_values():
    # Value is None, should try indexing and raise TypeError
    grading_context_columns = ["foo"]
    eval_values = {"foo": None}
    indx = 0
    with pytest.raises(TypeError):
        _format_args_string(grading_context_columns, eval_values, indx) # 2.62μs -> 2.59μs (1.31% faster)

def test_edge_empty_string_in_eval_values():
    # Value is empty string, should raise IndexError
    grading_context_columns = ["foo"]
    eval_values = {"foo": ""}
    indx = 0
    with pytest.raises(IndexError):
        _format_args_string(grading_context_columns, eval_values, indx) # 2.50μs -> 2.50μs (0.200% faster)

def test_edge_non_string_column_names():
    # Non-string column names should work if present in eval_values
    grading_context_columns = [42]
    eval_values = {42: ["answer", "question"]}
    indx = 1
    codeflash_output = _format_args_string(grading_context_columns, eval_values, indx); result = codeflash_output # 3.96μs -> 3.71μs (6.68% faster)

def test_edge_column_name_is_empty_string():
    # Column name is empty string
    grading_context_columns = [""]
    eval_values = {"": ["empty", "blank"]}
    indx = 0
    codeflash_output = _format_args_string(grading_context_columns, eval_values, indx); result = codeflash_output # 3.63μs -> 3.58μs (1.59% faster)

def test_edge_eval_values_with_extra_keys():
    # eval_values has extra keys not in grading_context_columns
    grading_context_columns = ["foo"]
    eval_values = {"foo": ["bar"], "extra": ["should", "not", "appear"]}
    indx = 0
    codeflash_output = _format_args_string(grading_context_columns, eval_values, indx); result = codeflash_output # 3.58μs -> 3.50μs (2.31% faster)

def test_edge_column_order_preserved():
    # The order of keys in grading_context_columns is preserved in output
    grading_context_columns = ["first", "second"]
    eval_values = {"second": ["2"], "first": ["1"]}
    indx = 0
    codeflash_output = _format_args_string(grading_context_columns, eval_values, indx); result = codeflash_output # 4.11μs -> 3.82μs (7.51% faster)
    expected = (
        "Additional information used by the model:\n"
        "key: first\nvalue:\n1\n"
        "key: second\nvalue:\n2"
    )
    assert result == expected

def test_edge_multiple_types_in_eval_values():
    # eval_values contains Series, list, and tuple
    grading_context_columns = ["a", "b", "c"]
    eval_values = {
        "a": pd.Series([10, 20]),
        "b": [30, 40],
        "c": (50, 60),
    }
    indx = 1
    codeflash_output = _format_args_string(grading_context_columns, eval_values, indx); result = codeflash_output # 20.0μs -> 19.4μs (3.13% faster)
    expected = (
        "Additional information used by the model:\n"
        "key: a\nvalue:\n20\n"
        "key: b\nvalue:\n40\n"
        "key: c\nvalue:\n60"
    )
    assert result == expected

# -------------------------------
# Large Scale Test Cases
# -------------------------------

def test_large_scale_many_columns_and_rows():
    # Test with 100 columns and 100 rows
    num_cols = 100
    num_rows = 100
    grading_context_columns = [f"col{i}" for i in range(num_cols)]
    eval_values = {f"col{i}": [f"val{i}_{j}" for j in range(num_rows)] for i in range(num_cols)}
    indx = 99
    codeflash_output = _format_args_string(grading_context_columns, eval_values, indx); result = codeflash_output # 26.0μs -> 22.7μs (14.7% faster)
    # Check that all columns are present and values are correct
    for i in range(num_cols):
        assert f"key: col{i}\nvalue:\nval{i}_{indx}" in result

def test_large_scale_long_strings():
    # Test with long string values
    grading_context_columns = ["long"]
    long_string = "x" * 500
    eval_values = {"long": [long_string, long_string[::-1]]}
    indx = 1
    codeflash_output = _format_args_string(grading_context_columns, eval_values, indx); result = codeflash_output # 4.05μs -> 3.74μs (8.26% faster)

def test_large_scale_large_series():
    # Test with a large pandas Series
    grading_context_columns = ["foo"]
    large_series = pd.Series(range(1000))
    eval_values = {"foo": large_series}
    indx = 999
    codeflash_output = _format_args_string(grading_context_columns, eval_values, indx); result = codeflash_output # 19.7μs -> 19.0μs (3.77% faster)

def test_large_scale_large_tuple():
    # Test with a large tuple
    grading_context_columns = ["bar"]
    large_tuple = tuple(range(1000))
    eval_values = {"bar": large_tuple}
    indx = 500
    codeflash_output = _format_args_string(grading_context_columns, eval_values, indx); result = codeflash_output # 4.08μs -> 3.71μs (10.1% faster)

def test_large_scale_many_columns_some_missing():
    # Test with some columns missing in eval_values (should raise)
    grading_context_columns = [f"col{i}" for i in range(10)]
    eval_values = {f"col{i}": [i] for i in range(9)}  # col9 missing
    indx = 0
    with pytest.raises(MlflowException) as excinfo:
        _format_args_string(grading_context_columns, eval_values, indx) # 13.2μs -> 12.9μs (2.28% faster)

def test_large_scale_empty_values():
    # All columns present but values are empty lists
    grading_context_columns = ["foo", "bar"]
    eval_values = {"foo": [], "bar": []}
    indx = 0
    with pytest.raises(IndexError):
        _format_args_string(grading_context_columns, eval_values, indx) # 2.27μs -> 2.31μs (1.73% slower)
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.

To edit these changes, run `git checkout codeflash/optimize-_format_args_string-mhx4fh7i` and push.


@codeflash-ai codeflash-ai bot requested a review from mashraf-222 November 13, 2025 07:42
@codeflash-ai codeflash-ai bot added ⚡️ codeflash Optimization PR opened by Codeflash AI 🎯 Quality: High Optimization Quality according to Codeflash labels Nov 13, 2025