@codeflash-ai codeflash-ai bot commented Nov 13, 2025

📄 6% (0.06x) speedup for _format_args_string in mlflow/metrics/genai/genai_metric.py

⏱️ Runtime : 1.29 milliseconds → 1.22 milliseconds (best of 76 runs)

📝 Explanation and details

The optimized code achieves a 6% speedup through several key micro-optimizations that reduce Python interpreter overhead:

What specific optimizations were applied:

  1. Eliminated redundant dictionary lookups - Replaced the if arg in eval_values: check followed by eval_values[arg] access with a single try/except KeyError pattern, avoiding the double lookup cost (see the sketch after this list).

  2. Cached attribute access - Stored pd.Series as pd_Series to avoid repeated module attribute lookups in the type checking loop.

  3. Reduced variable access overhead - Created local references (columns, values) to function parameters to speed up variable resolution in the loop.

  4. Simplified empty dictionary check - Replaced args_dict is None or len(args_dict) == 0 with the more efficient not args_dict (the None check was redundant since args_dict is always initialized as {}).

  5. Streamlined return logic - Eliminated unnecessary nested conditionals and parentheses in the final return statement.
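
For reference, here is a minimal sketch of the pattern those points describe, written against the argument names used by the generated tests below (grading_context_columns, eval_values, indx). It illustrates the technique only; it is not the exact MLflow source, and details such as the exception message are assumptions.

```python
import pandas as pd
from mlflow.exceptions import MlflowException

def _format_args_string_sketch(grading_context_columns, eval_values, indx):
    pd_Series = pd.Series  # (2) cache the attribute lookup once, outside the loop
    args_dict = {}
    for arg in grading_context_columns:
        try:
            value = eval_values[arg]  # (1) one hash lookup instead of `in` + `[]`
        except KeyError:
            # Illustrative message only; the real function raises MlflowException here
            raise MlflowException(f"{arg} does not exist in the eval values")
        args_dict[arg] = value.iloc[indx] if isinstance(value, pd_Series) else value[indx]
    if not args_dict:  # (4) falsy check covers both None and an empty dict
        return ""
    # Output shape matches the expected strings in the tests below
    return "Additional information used by the model:\n" + "\n".join(
        f"key: {arg}\nvalue:\n{val}" for arg, val in args_dict.items()
    )
```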

Why these optimizations lead to speedup:

In Python, dictionary key lookups (in operator + [] access) and attribute resolution (pd.Series) are relatively expensive operations. The line profiler shows the biggest time saver comes from reducing the eval_values[arg].iloc[indx] and isinstance(eval_values[arg], pd.Series) overhead (52.6% → 50.7% of total time). The try/except pattern is faster than in checks because it avoids the double hash table lookup when keys exist (the common case).
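
A quick, self-contained way to see the in-then-index versus try/except difference yourself (an illustrative micro-benchmark; absolute numbers depend on the machine and Python version):

```python
import timeit

d = {"col": [1, 2, 3]}

def with_membership_check():
    # two hash lookups when the key exists: `in` + subscript
    if "col" in d:
        return d["col"][0]
    return None

def with_try_except():
    # one hash lookup in the common (key present) case
    try:
        value = d["col"]
    except KeyError:
        return None
    return value[0]

print(timeit.timeit(with_membership_check, number=1_000_000))
print(timeit.timeit(with_try_except, number=1_000_000))
```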

How this impacts existing workloads:

Based on the function references, _format_args_string is called within a loop in eval_fn for each prediction being evaluated (for indx, (input, output) in enumerate(zip(inputs, outputs))). This makes it a hot path function where even small optimizations compound significantly. The 6% improvement per call translates to meaningful speedup when processing large batches of LLM evaluations.
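
A rough illustration of that hot path (a hypothetical driver loop, not the actual eval_fn, using the same private import the generated tests rely on):

```python
from mlflow.metrics.genai.genai_metric import _format_args_string

inputs = ["question 1", "question 2"]
outputs = ["answer 1", "answer 2"]
eval_values = {"context": ["ctx 1", "ctx 2"]}

# The formatter runs once per prediction, so any per-call saving is
# multiplied by the number of rows in the evaluation batch.
for indx, (input, output) in enumerate(zip(inputs, outputs)):
    args_string = _format_args_string(["context"], eval_values, indx)
    # ...args_string would then be embedded in the judge prompt for this row
```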

Test case performance patterns:

The optimizations show best results on large-scale test cases:

  • Large column counts: 16.1% faster with 100 columns, 18.7% faster with 999 columns
  • Mixed data types: Consistent 1-3% improvements across Series/list combinations
  • Basic cases: 8-11% improvements on simple scenarios

The performance gains scale with the number of columns being processed, making this optimization particularly valuable for comprehensive LLM evaluations with many grading context columns.

Correctness verification report:

| Test | Status |
| --- | --- |
| ⚙️ Existing Unit Tests | 🔘 None Found |
| 🌀 Generated Regression Tests | 42 Passed |
| ⏪ Replay Tests | 🔘 None Found |
| 🔎 Concolic Coverage Tests | 🔘 None Found |
| 📊 Tests Coverage | 100.0% |
🌀 Generated Regression Tests and Runtime
import pandas as pd
# imports
import pytest  # used for our unit tests
from mlflow.exceptions import MlflowException
from mlflow.metrics.genai.genai_metric import _format_args_string

# unit tests

# ----------------
# Basic Test Cases
# ----------------

def test_basic_single_column_with_list():
    # Single column, value as list
    grading_context_columns = ["col1"]
    eval_values = {"col1": ["a", "b", "c"]}
    indx = 1
    codeflash_output = _format_args_string(grading_context_columns, eval_values, indx); result = codeflash_output # 3.87μs -> 3.50μs (10.6% faster)
    expected = "Additional information used by the model:\nkey: col1\nvalue:\nb"

def test_basic_single_column_with_series():
    # Single column, value as pandas Series
    grading_context_columns = ["col1"]
    eval_values = {"col1": pd.Series([10, 20, 30])}
    indx = 2
    codeflash_output = _format_args_string(grading_context_columns, eval_values, indx); result = codeflash_output # 18.9μs -> 18.5μs (2.49% faster)
    expected = "Additional information used by the model:\nkey: col1\nvalue:\n30"

def test_basic_multiple_columns_list_and_series():
    # Multiple columns, mix of list and Series
    grading_context_columns = ["col1", "col2"]
    eval_values = {
        "col1": ["x", "y", "z"],
        "col2": pd.Series([100, 200, 300])
    }
    indx = 0
    codeflash_output = _format_args_string(grading_context_columns, eval_values, indx); result = codeflash_output # 19.3μs -> 19.0μs (1.81% faster)
    expected = (
        "Additional information used by the model:\n"
        "key: col1\nvalue:\nx\n"
        "key: col2\nvalue:\n100"
    )
    assert result == expected

def test_basic_order_preservation():
    # Order of columns in grading_context_columns is preserved
    grading_context_columns = ["col2", "col1"]
    eval_values = {
        "col1": ["a", "b", "c"],
        "col2": pd.Series([1, 2, 3])
    }
    indx = 2
    codeflash_output = _format_args_string(grading_context_columns, eval_values, indx); result = codeflash_output # 19.4μs -> 19.0μs (2.10% faster)
    expected = (
        "Additional information used by the model:\n"
        "key: col2\nvalue:\n3\n"
        "key: col1\nvalue:\nc"
    )
    assert result == expected

def test_basic_empty_grading_context_columns():
    # Empty grading_context_columns should return empty string
    grading_context_columns = []
    eval_values = {"col1": [1, 2, 3]}
    indx = 1
    codeflash_output = _format_args_string(grading_context_columns, eval_values, indx); result = codeflash_output # 1.78μs -> 1.74μs (2.47% faster)
    assert result == ""

# ----------------
# Edge Test Cases
# ----------------


def test_edge_missing_column_raises_exception():
    # Column not in eval_values should raise MlflowException
    grading_context_columns = ["col1", "col_missing"]
    eval_values = {"col1": [1, 2, 3]}
    indx = 0
    with pytest.raises(MlflowException) as excinfo:
        _format_args_string(grading_context_columns, eval_values, indx) # 10.9μs -> 11.2μs (2.24% slower)

def test_edge_empty_eval_values_with_column():
    # grading_context_columns not empty, eval_values empty, should raise
    grading_context_columns = ["col1"]
    eval_values = {}
    indx = 0
    with pytest.raises(MlflowException) as excinfo:
        _format_args_string(grading_context_columns, eval_values, indx) # 9.18μs -> 9.78μs (6.15% slower)

def test_edge_index_out_of_range_list():
    # Index out of range in list should raise IndexError
    grading_context_columns = ["col1"]
    eval_values = {"col1": [1, 2, 3]}
    indx = 5
    with pytest.raises(IndexError):
        _format_args_string(grading_context_columns, eval_values, indx) # 2.49μs -> 2.44μs (2.30% faster)

def test_edge_index_out_of_range_series():
    # Index out of range in Series should raise IndexError
    grading_context_columns = ["col1"]
    eval_values = {"col1": pd.Series([1, 2, 3])}
    indx = 10
    with pytest.raises(IndexError):
        _format_args_string(grading_context_columns, eval_values, indx) # 12.9μs -> 13.2μs (1.83% slower)

def test_edge_non_sequence_value_type():
    # Value in eval_values is not a list or Series, should raise TypeError
    grading_context_columns = ["col1"]
    eval_values = {"col1": 123}
    indx = 0
    with pytest.raises(TypeError):
        _format_args_string(grading_context_columns, eval_values, indx) # 2.64μs -> 2.88μs (8.36% slower)

def test_edge_empty_eval_values_and_empty_columns():
    # Both grading_context_columns and eval_values are empty
    grading_context_columns = []
    eval_values = {}
    indx = 0
    codeflash_output = _format_args_string(grading_context_columns, eval_values, indx); result = codeflash_output # 1.96μs -> 1.76μs (11.6% faster)
    assert result == ""

def test_edge_column_with_empty_list():
    # Column in eval_values is an empty list, should raise IndexError
    grading_context_columns = ["col1"]
    eval_values = {"col1": []}
    indx = 0
    with pytest.raises(IndexError):
        _format_args_string(grading_context_columns, eval_values, indx) # 2.33μs -> 2.38μs (1.93% slower)

def test_edge_column_with_empty_series():
    # Column in eval_values is an empty Series, should raise IndexError
    grading_context_columns = ["col1"]
    eval_values = {"col1": pd.Series([], dtype=int)}
    indx = 0
    with pytest.raises(IndexError):
        _format_args_string(grading_context_columns, eval_values, indx) # 12.5μs -> 12.6μs (0.677% slower)

def test_edge_column_with_none_value():
    # Column in eval_values is None, should raise TypeError
    grading_context_columns = ["col1"]
    eval_values = {"col1": None}
    indx = 0
    with pytest.raises(TypeError):
        _format_args_string(grading_context_columns, eval_values, indx) # 2.61μs -> 2.63μs (0.533% slower)


def test_large_scale_many_columns_and_rows():
    # Test with 100 columns and 100 rows
    num_cols = 100
    num_rows = 100
    grading_context_columns = [f"col{i}" for i in range(num_cols)]
    eval_values = {f"col{i}": list(range(i, i + num_rows)) for i in range(num_cols)}
    indx = 50
    codeflash_output = _format_args_string(grading_context_columns, eval_values, indx); result = codeflash_output # 27.2μs -> 23.4μs (16.1% faster)
    # Check that all keys and expected values are present in output
    for i in range(num_cols):
        expected_key = f"key: col{i}\nvalue:\n{i + indx}"

def test_large_scale_many_columns_and_series():
    # Test with 200 columns, each as pandas Series, 10 rows
    num_cols = 200
    num_rows = 10
    grading_context_columns = [f"col{i}" for i in range(num_cols)]
    eval_values = {f"col{i}": pd.Series(range(i, i + num_rows)) for i in range(num_cols)}
    indx = 9
    codeflash_output = _format_args_string(grading_context_columns, eval_values, indx); result = codeflash_output # 581μs -> 567μs (2.48% faster)
    for i in range(num_cols):
        expected_key = f"key: col{i}\nvalue:\n{i + indx}"

def test_large_scale_long_strings():
    # Test with columns containing long string values
    grading_context_columns = ["col1", "col2"]
    long_str1 = "x" * 500
    long_str2 = "y" * 800
    eval_values = {
        "col1": [long_str1, long_str2],
        "col2": pd.Series([long_str2, long_str1])
    }
    indx = 1
    codeflash_output = _format_args_string(grading_context_columns, eval_values, indx); result = codeflash_output # 18.4μs -> 18.0μs (2.19% faster)

def test_large_scale_max_elements():
    # Test with max allowed elements (999 columns, each with 2 values)
    num_cols = 999
    grading_context_columns = [f"col{i}" for i in range(num_cols)]
    eval_values = {f"col{i}": [i, i+1] for i in range(num_cols)}
    indx = 1
    codeflash_output = _format_args_string(grading_context_columns, eval_values, indx); result = codeflash_output # 206μs -> 173μs (18.7% faster)
    for i in range(num_cols):
        expected_key = f"key: col{i}\nvalue:\n{i+1}"

def test_large_scale_performance():
    # Performance: should not take excessive time for 500 columns and 500 rows
    import time
    num_cols = 500
    num_rows = 500
    grading_context_columns = [f"col{i}" for i in range(num_cols)]
    eval_values = {f"col{i}": list(range(num_rows)) for i in range(num_cols)}
    indx = 499
    start = time.time()
    codeflash_output = _format_args_string(grading_context_columns, eval_values, indx); result = codeflash_output # 150μs -> 133μs (13.1% faster)
    duration = time.time() - start
    assert duration < 1.0  # generous upper bound; a single call measured well under a millisecond above
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
import pandas as pd
# imports
import pytest
from mlflow.exceptions import MlflowException
from mlflow.metrics.genai.genai_metric import _format_args_string

# unit tests

# -------------------------------
# Basic Test Cases
# -------------------------------

def test_basic_single_arg_with_list():
    # Single argument, value is a list
    grading_context_columns = ["foo"]
    eval_values = {"foo": ["bar", "baz"]}
    indx = 1
    codeflash_output = _format_args_string(grading_context_columns, eval_values, indx); result = codeflash_output # 3.92μs -> 3.61μs (8.67% faster)

def test_basic_single_arg_with_series():
    # Single argument, value is a pandas Series
    grading_context_columns = ["foo"]
    eval_values = {"foo": pd.Series(["bar", "baz"])}
    indx = 0
    codeflash_output = _format_args_string(grading_context_columns, eval_values, indx); result = codeflash_output # 17.0μs -> 16.8μs (1.50% faster)

def test_basic_multiple_args_list_and_series():
    # Multiple arguments, mixed list and Series
    grading_context_columns = ["foo", "bar"]
    eval_values = {
        "foo": ["alpha", "beta"],
        "bar": pd.Series(["gamma", "delta"]),
    }
    indx = 1
    codeflash_output = _format_args_string(grading_context_columns, eval_values, indx); result = codeflash_output # 17.7μs -> 17.4μs (1.32% faster)
    expected = (
        "Additional information used by the model:\n"
        "key: foo\nvalue:\nbeta\n"
        "key: bar\nvalue:\ndelta"
    )
    assert result == expected

def test_basic_empty_grading_context_columns():
    # No arguments to format
    grading_context_columns = []
    eval_values = {"foo": [1, 2], "bar": [3, 4]}
    indx = 0
    codeflash_output = _format_args_string(grading_context_columns, eval_values, indx); result = codeflash_output # 1.79μs -> 1.51μs (18.3% faster)

def test_basic_none_grading_context_columns():
    # Empty grading_context_columns (stand-in for the None case) should behave as if no columns were requested
    grading_context_columns = []
    eval_values = {"foo": [1, 2]}
    indx = 0
    codeflash_output = _format_args_string(grading_context_columns, eval_values, indx); result = codeflash_output # 1.83μs -> 1.64μs (11.8% faster)

# -------------------------------
# Edge Test Cases
# -------------------------------

def test_edge_missing_arg_raises():
    # Argument missing from eval_values should raise MlflowException
    grading_context_columns = ["foo", "missing"]
    eval_values = {"foo": ["bar", "baz"]}
    indx = 0
    with pytest.raises(MlflowException) as excinfo:
        _format_args_string(grading_context_columns, eval_values, indx) # 11.0μs -> 11.1μs (1.10% slower)

def test_edge_index_out_of_range_list():
    # Index out of range for list should raise IndexError
    grading_context_columns = ["foo"]
    eval_values = {"foo": ["bar"]}
    indx = 2
    with pytest.raises(IndexError):
        _format_args_string(grading_context_columns, eval_values, indx) # 2.36μs -> 2.42μs (2.52% slower)

def test_edge_index_out_of_range_series():
    # Index out of range for Series should raise IndexError
    grading_context_columns = ["foo"]
    eval_values = {"foo": pd.Series(["bar"])}
    indx = 5
    with pytest.raises(IndexError):
        _format_args_string(grading_context_columns, eval_values, indx) # 12.7μs -> 12.6μs (0.485% faster)

def test_edge_empty_eval_values():
    # Empty eval_values dict, should raise for any non-empty grading_context_columns
    grading_context_columns = ["foo"]
    eval_values = {}
    indx = 0
    with pytest.raises(MlflowException):
        _format_args_string(grading_context_columns, eval_values, indx) # 9.33μs -> 9.74μs (4.25% slower)

def test_edge_non_list_non_series_value():
    # Value is not list or Series, should try indexing and raise TypeError
    grading_context_columns = ["foo"]
    eval_values = {"foo": 123}
    indx = 0
    with pytest.raises(TypeError):
        _format_args_string(grading_context_columns, eval_values, indx) # 2.59μs -> 2.63μs (1.78% slower)

def test_edge_none_in_eval_values():
    # Value is None, should try indexing and raise TypeError
    grading_context_columns = ["foo"]
    eval_values = {"foo": None}
    indx = 0
    with pytest.raises(TypeError):
        _format_args_string(grading_context_columns, eval_values, indx) # 2.62μs -> 2.59μs (1.31% faster)

def test_edge_empty_string_in_eval_values():
    # Value is empty string, should raise IndexError
    grading_context_columns = ["foo"]
    eval_values = {"foo": ""}
    indx = 0
    with pytest.raises(IndexError):
        _format_args_string(grading_context_columns, eval_values, indx) # 2.50μs -> 2.50μs (0.200% faster)

def test_edge_non_string_column_names():
    # Non-string column names should work if present in eval_values
    grading_context_columns = [42]
    eval_values = {42: ["answer", "question"]}
    indx = 1
    codeflash_output = _format_args_string(grading_context_columns, eval_values, indx); result = codeflash_output # 3.96μs -> 3.71μs (6.68% faster)

def test_edge_column_name_is_empty_string():
    # Column name is empty string
    grading_context_columns = [""]
    eval_values = {"": ["empty", "blank"]}
    indx = 0
    codeflash_output = _format_args_string(grading_context_columns, eval_values, indx); result = codeflash_output # 3.63μs -> 3.58μs (1.59% faster)

def test_edge_eval_values_with_extra_keys():
    # eval_values has extra keys not in grading_context_columns
    grading_context_columns = ["foo"]
    eval_values = {"foo": ["bar"], "extra": ["should", "not", "appear"]}
    indx = 0
    codeflash_output = _format_args_string(grading_context_columns, eval_values, indx); result = codeflash_output # 3.58μs -> 3.50μs (2.31% faster)

def test_edge_column_order_preserved():
    # The order of keys in grading_context_columns is preserved in output
    grading_context_columns = ["first", "second"]
    eval_values = {"second": ["2"], "first": ["1"]}
    indx = 0
    codeflash_output = _format_args_string(grading_context_columns, eval_values, indx); result = codeflash_output # 4.11μs -> 3.82μs (7.51% faster)
    expected = (
        "Additional information used by the model:\n"
        "key: first\nvalue:\n1\n"
        "key: second\nvalue:\n2"
    )
    assert result == expected

def test_edge_multiple_types_in_eval_values():
    # eval_values contains Series, list, and tuple
    grading_context_columns = ["a", "b", "c"]
    eval_values = {
        "a": pd.Series([10, 20]),
        "b": [30, 40],
        "c": (50, 60),
    }
    indx = 1
    codeflash_output = _format_args_string(grading_context_columns, eval_values, indx); result = codeflash_output # 20.0μs -> 19.4μs (3.13% faster)
    expected = (
        "Additional information used by the model:\n"
        "key: a\nvalue:\n20\n"
        "key: b\nvalue:\n40\n"
        "key: c\nvalue:\n60"
    )
    assert result == expected

# -------------------------------
# Large Scale Test Cases
# -------------------------------

def test_large_scale_many_columns_and_rows():
    # Test with 100 columns and 100 rows
    num_cols = 100
    num_rows = 100
    grading_context_columns = [f"col{i}" for i in range(num_cols)]
    eval_values = {f"col{i}": [f"val{i}_{j}" for j in range(num_rows)] for i in range(num_cols)}
    indx = 99
    codeflash_output = _format_args_string(grading_context_columns, eval_values, indx); result = codeflash_output # 26.0μs -> 22.7μs (14.7% faster)
    # Check that all columns are present and values are correct
    for i in range(num_cols):
        assert f"key: col{i}\nvalue:\nval{i}_{indx}" in result

def test_large_scale_long_strings():
    # Test with long string values
    grading_context_columns = ["long"]
    long_string = "x" * 500
    eval_values = {"long": [long_string, long_string[::-1]]}
    indx = 1
    codeflash_output = _format_args_string(grading_context_columns, eval_values, indx); result = codeflash_output # 4.05μs -> 3.74μs (8.26% faster)

def test_large_scale_large_series():
    # Test with a large pandas Series
    grading_context_columns = ["foo"]
    large_series = pd.Series(range(1000))
    eval_values = {"foo": large_series}
    indx = 999
    codeflash_output = _format_args_string(grading_context_columns, eval_values, indx); result = codeflash_output # 19.7μs -> 19.0μs (3.77% faster)

def test_large_scale_large_tuple():
    # Test with a large tuple
    grading_context_columns = ["bar"]
    large_tuple = tuple(range(1000))
    eval_values = {"bar": large_tuple}
    indx = 500
    codeflash_output = _format_args_string(grading_context_columns, eval_values, indx); result = codeflash_output # 4.08μs -> 3.71μs (10.1% faster)

def test_large_scale_many_columns_some_missing():
    # Test with some columns missing in eval_values (should raise)
    grading_context_columns = [f"col{i}" for i in range(10)]
    eval_values = {f"col{i}": [i] for i in range(9)}  # col9 missing
    indx = 0
    with pytest.raises(MlflowException) as excinfo:
        _format_args_string(grading_context_columns, eval_values, indx) # 13.2μs -> 12.9μs (2.28% faster)

def test_large_scale_empty_values():
    # All columns present but values are empty lists
    grading_context_columns = ["foo", "bar"]
    eval_values = {"foo": [], "bar": []}
    indx = 0
    with pytest.raises(IndexError):
        _format_args_string(grading_context_columns, eval_values, indx) # 2.27μs -> 2.31μs (1.73% slower)
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.

To edit these changes, run `git checkout codeflash/optimize-_format_args_string-mhx4fh7i` and push.


@codeflash-ai codeflash-ai bot requested a review from mashraf-222 November 13, 2025 07:42
@codeflash-ai codeflash-ai bot added ⚡️ codeflash Optimization PR opened by Codeflash AI 🎯 Quality: High Optimization Quality according to Codeflash labels Nov 13, 2025