@codeflash-ai codeflash-ai bot commented Nov 13, 2025

📄 19% (0.19x) speedup for _calculate_npmi_core in mlflow/store/analytics/trace_correlation.py

⏱️ Runtime : 142 microseconds → 120 microseconds (best of 90 runs)

📝 Explanation and details

The optimized code achieves an 18% speedup through three key micro-optimizations that reduce Python's overhead:

What was optimized:

  1. Eliminated redundant max(-1.0, min(1.0, npmi)) call - Replaced the nested function call (which creates intermediate values) with a faster if/elif/else branch that directly returns the clamped value
  2. Simplified denominator calculation - Changed -(log_n11 - log_N) to log_N - log_n11, eliminating the unary negation operation
  3. Added intermediate variables for log sums - Pre-computed log_n11_plus_log_N and log_n1_plus_log_n2 to reduce repeated arithmetic operations in the PMI calculation

Why it's faster:
The original `max(-1.0, min(1.0, npmi))` creates two function call frames and an intermediate value, while the optimized branching logic performs direct comparisons. Python's function call overhead is significant relative to such simple operations; the line profiler attributes roughly 30 nanoseconds of savings per call to the clamping change alone.
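The clamp rewrite can be checked in isolation. The snippet below is an illustrative micro-benchmark (absolute timings will vary by machine and interpreter), verifying first that both styles agree on all inputs before comparing them with `timeit`:

```python
import timeit

def clamp_minmax(x):
    # original style: two builtin calls plus an intermediate value
    return max(-1.0, min(1.0, x))

def clamp_branch(x):
    # optimized style: direct comparisons, no call overhead
    if x > 1.0:
        return 1.0
    elif x < -1.0:
        return -1.0
    return x

# both variants must agree everywhere before timing matters
for v in (-2.0, -1.0, -0.3, 0.0, 0.7, 1.0, 5.0):
    assert clamp_minmax(v) == clamp_branch(v)

t_minmax = timeit.timeit(lambda: clamp_minmax(0.42), number=100_000)
t_branch = timeit.timeit(lambda: clamp_branch(0.42), number=100_000)
print(f"min/max: {t_minmax:.4f}s  branch: {t_branch:.4f}s")
```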

Performance characteristics from tests:

  • Best gains (20-30% faster): Tests with normal NPMI calculations that reach the clamping logic, like test_basic_independent_events (28.9% faster)
  • Minimal impact: Edge cases that return early (NaN, perfect co-occurrence) show <5% difference since they bypass the optimized sections
  • Consistent improvement: Large-scale tests consistently show 20-25% speedups

Impact on workloads:
Given that _calculate_npmi_core is called from calculate_npmi_from_counts and calculate_smoothed_npmi in trace correlation analysis, this optimization will provide meaningful speedups for MLflow's analytics pipeline, especially when processing large numbers of trace correlations where this function is called repeatedly.
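For reference, the quantity being computed is the standard normalized pointwise mutual information (this is the textbook definition, not taken from the PR diff):

$$\mathrm{NPMI}(x, y) = \frac{\mathrm{PMI}(x, y)}{-\log p(x, y)}, \qquad \mathrm{PMI}(x, y) = \log\frac{p(x, y)}{p(x)\,p(y)}$$

With count estimates $p(x,y) = n_{11}/N$, $p(x) = n_1/N$, $p(y) = n_2/N$, the log-space forms are $\mathrm{PMI} = \log n_{11} + \log N - \log n_1 - \log n_2$ and denominator $\log N - \log n_{11}$, which is exactly the sign-flipped denominator described in optimization (2) above.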

Correctness verification report:

| Test | Status |
| --- | --- |
| ⚙️ Existing Unit Tests | 🔘 None Found |
| 🌀 Generated Regression Tests | 56 Passed |
| ⏪ Replay Tests | 🔘 None Found |
| 🔎 Concolic Coverage Tests | 🔘 None Found |
| 📊 Tests Coverage | 100.0% |
🌀 Generated Regression Tests and Runtime
import math

# imports
import pytest
from mlflow.store.analytics.trace_correlation import _calculate_npmi_core

# unit tests

# --- Basic Test Cases ---

def test_basic_perfect_cooccurrence():
    # Both events always occur together, no smoothing
    codeflash_output = _calculate_npmi_core(10, 0, 0, 0) # 993ns -> 995ns (0.201% slower)

def test_basic_independent_events():
    # Events are independent: P(x,y) = P(x)*P(y)
    # n11=25, n10=25, n01=25, n00=25: uniform contingency table
    codeflash_output = _calculate_npmi_core(25, 25, 25, 25); result = codeflash_output # 3.57μs -> 2.77μs (28.9% faster)

def test_basic_negative_association():
    # Events never co-occur: n11=0, n10=10, n01=10, n00=10
    codeflash_output = _calculate_npmi_core(0, 10, 10, 10); result = codeflash_output # 1.61μs -> 1.68μs (3.87% slower)

def test_basic_positive_association():
    # Events co-occur more than expected by chance
    codeflash_output = _calculate_npmi_core(20, 5, 5, 20); result = codeflash_output # 3.56μs -> 2.77μs (28.5% faster)

def test_basic_negative_npmi():
    # Events co-occur less than expected by chance
    codeflash_output = _calculate_npmi_core(2, 18, 18, 2); result = codeflash_output # 3.35μs -> 2.64μs (26.9% faster)

def test_basic_with_smoothing():
    # Smoothing should allow calculation even when n11=0
    codeflash_output = _calculate_npmi_core(0, 10, 10, 10, smoothing=1); result = codeflash_output # 3.61μs -> 2.93μs (23.2% faster)

def test_basic_symmetric_table():
    # Symmetric table, n11=n00, n10=n01
    codeflash_output = _calculate_npmi_core(15, 5, 5, 15); result = codeflash_output # 3.36μs -> 2.56μs (31.1% faster)

# --- Edge Test Cases ---

def test_edge_all_zeros():
    # All counts zero, undefined
    codeflash_output = _calculate_npmi_core(0, 0, 0, 0); result = codeflash_output # 1.51μs -> 1.58μs (3.93% slower)

def test_edge_zero_marginals():
    # Marginals zero, e.g. n11=0, n10=0, n01=0, n00=10
    codeflash_output = _calculate_npmi_core(0, 0, 0, 10); result = codeflash_output # 1.54μs -> 1.57μs (1.59% slower)

def test_edge_only_one_cell_nonzero():
    # Only n10 is nonzero
    codeflash_output = _calculate_npmi_core(0, 10, 0, 0); result = codeflash_output # 1.67μs -> 1.62μs (3.21% faster)
    # Only n01 is nonzero
    codeflash_output = _calculate_npmi_core(0, 0, 10, 0); result = codeflash_output # 554ns -> 541ns (2.40% faster)
    # Only n00 is nonzero
    codeflash_output = _calculate_npmi_core(0, 0, 0, 10); result = codeflash_output # 408ns -> 378ns (7.94% faster)

def test_edge_negative_counts():
    # Negative counts should be treated as invalid (should return NaN)
    codeflash_output = _calculate_npmi_core(-1, 10, 10, 10); result = codeflash_output # 1.67μs -> 1.61μs (4.05% faster)
    codeflash_output = _calculate_npmi_core(10, -1, 10, 10); result = codeflash_output # 3.02μs -> 2.34μs (28.8% faster)
    codeflash_output = _calculate_npmi_core(10, 10, -1, 10); result = codeflash_output # 1.21μs -> 1.04μs (16.5% faster)
    codeflash_output = _calculate_npmi_core(10, 10, 10, -1); result = codeflash_output # 898ns -> 823ns (9.11% faster)

def test_edge_smoothing_all_zeros():
    # Smoothing all zeros should allow calculation
    codeflash_output = _calculate_npmi_core(0, 0, 0, 0, smoothing=1); result = codeflash_output # 1.22μs -> 1.20μs (1.42% faster)

def test_edge_smoothing_perfect_cooccurrence():
    # Smoothing breaks perfect co-occurrence
    codeflash_output = _calculate_npmi_core(10, 0, 0, 0, smoothing=1); result = codeflash_output # 1.19μs -> 1.20μs (0.835% slower)

def test_edge_large_smoothing():
    # Large smoothing dominates counts
    codeflash_output = _calculate_npmi_core(1, 1, 1, 1, smoothing=100); result = codeflash_output # 3.92μs -> 3.14μs (24.6% faster)

def test_edge_float_inputs():
    # Float inputs should work
    codeflash_output = _calculate_npmi_core(10.5, 5.5, 5.5, 10.5); result = codeflash_output # 3.50μs -> 2.73μs (28.3% faster)

def test_edge_extremely_small_counts():
    # Very small counts, but positive
    codeflash_output = _calculate_npmi_core(1e-10, 1e-10, 1e-10, 1e-10); result = codeflash_output # 3.41μs -> 2.68μs (27.2% faster)

# --- Large Scale Test Cases ---

def test_large_scale_balanced():
    # Balanced large table
    codeflash_output = _calculate_npmi_core(250, 250, 250, 250); result = codeflash_output # 3.45μs -> 2.72μs (26.8% faster)

def test_large_scale_perfect_cooccurrence():
    # Large perfect co-occurrence
    codeflash_output = _calculate_npmi_core(999, 0, 0, 0) # 1.09μs -> 1.14μs (4.31% slower)

def test_large_scale_no_cooccurrence():
    # Large table, no co-occurrence
    codeflash_output = _calculate_npmi_core(0, 500, 500, 0); result = codeflash_output # 1.90μs -> 1.85μs (2.87% faster)

def test_large_scale_high_positive_association():
    # Large table, strong positive association
    codeflash_output = _calculate_npmi_core(800, 100, 100, 0); result = codeflash_output # 3.61μs -> 2.96μs (22.0% faster)

def test_large_scale_high_negative_association():
    # Large table, strong negative association
    codeflash_output = _calculate_npmi_core(10, 490, 490, 10); result = codeflash_output # 3.63μs -> 2.89μs (25.8% faster)

def test_large_scale_with_smoothing():
    # Large table, with smoothing
    codeflash_output = _calculate_npmi_core(0, 500, 500, 0, smoothing=1); result = codeflash_output # 3.89μs -> 3.20μs (21.4% faster)

def test_large_scale_extreme_counts():
    # Extreme counts, all cells filled
    codeflash_output = _calculate_npmi_core(999, 999, 999, 999); result = codeflash_output # 3.50μs -> 2.81μs (24.2% faster)

def test_large_scale_float_counts():
    # Float counts in large table
    codeflash_output = _calculate_npmi_core(500.5, 499.5, 499.5, 500.5); result = codeflash_output # 3.53μs -> 2.74μs (28.6% faster)

def test_large_scale_edge_smoothing():
    # Large table, large smoothing
    codeflash_output = _calculate_npmi_core(10, 10, 10, 10, smoothing=500); result = codeflash_output # 3.74μs -> 3.01μs (24.5% faster)

def test_large_scale_performance():
    # Performance: ensure function runs quickly for large input
    import time
    start = time.time()
    _calculate_npmi_core(500, 250, 250, 500) # 3.65μs -> 2.88μs (26.5% faster)
    duration = time.time() - start
    assert duration < 1.0  # generous bound; a single call should take microseconds
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
import math

# imports
import pytest
from mlflow.store.analytics.trace_correlation import _calculate_npmi_core

# unit tests

# --- BASIC TEST CASES ---

def test_perfect_cooccurrence():
    # Both events always occur together, no smoothing
    codeflash_output = _calculate_npmi_core(10, 0, 0, 0) # 972ns -> 1.00μs (2.80% slower)

def test_perfect_independence():
    # Events are independent: P(x,y) = P(x)*P(y)
    # For example, n11=25, n10=25, n01=25, n00=25 (all combinations equally likely)
    codeflash_output = _calculate_npmi_core(25, 25, 25, 25); result = codeflash_output # 3.58μs -> 2.80μs (27.9% faster)

def test_perfect_negative_association():
    # Events never co-occur: n11 = 0, but both marginals > 0
    # n10=10, n01=10, n00=10
    codeflash_output = _calculate_npmi_core(0, 10, 10, 10); result = codeflash_output # 1.66μs -> 1.71μs (3.09% slower)

def test_typical_case():
    # Typical case: some co-occurrence, some independent
    # n11=30, n10=20, n01=10, n00=40
    codeflash_output = _calculate_npmi_core(30, 20, 10, 40); result = codeflash_output # 3.47μs -> 2.80μs (24.0% faster)

def test_smoothing_basic():
    # Smoothing should allow calculation even if n11=0
    codeflash_output = _calculate_npmi_core(0, 10, 10, 10, smoothing=1); result = codeflash_output # 3.69μs -> 3.06μs (20.7% faster)

# --- EDGE TEST CASES ---

def test_all_zeros():
    # All counts zero: undefined, should return NaN
    codeflash_output = _calculate_npmi_core(0, 0, 0, 0); result = codeflash_output # 1.43μs -> 1.50μs (4.46% slower)

def test_single_event_only():
    # Only one event ever occurs: n11=0, n10=10, n01=0, n00=0
    # Marginals: n1 = 10, n2 = 0, should return NaN
    codeflash_output = _calculate_npmi_core(0, 10, 0, 0); result = codeflash_output # 1.60μs -> 1.56μs (2.50% faster)

def test_negative_counts():
    # Negative counts are invalid, should return NaN
    codeflash_output = _calculate_npmi_core(-1, 10, 10, 10); result = codeflash_output # 1.60μs -> 1.65μs (2.43% slower)
    codeflash_output = _calculate_npmi_core(10, -1, 10, 10); result = codeflash_output # 2.98μs -> 2.30μs (29.3% faster)
    codeflash_output = _calculate_npmi_core(10, 10, -1, 10); result = codeflash_output # 1.18μs -> 1.10μs (7.57% faster)
    codeflash_output = _calculate_npmi_core(10, 10, 10, -1); result = codeflash_output # 860ns -> 779ns (10.4% faster)

def test_zero_marginals_with_smoothing():
    # All zeros, but with smoothing, should be computable
    codeflash_output = _calculate_npmi_core(0, 0, 0, 0, smoothing=1); result = codeflash_output # 1.23μs -> 1.30μs (5.69% slower)

def test_extreme_smoothing():
    # Large smoothing, should still clamp to [-1, 1]
    codeflash_output = _calculate_npmi_core(1, 1, 1, 1, smoothing=100); result = codeflash_output # 3.79μs -> 3.17μs (19.3% faster)

def test_extreme_values():
    # Very large counts, check for numerical stability
    codeflash_output = _calculate_npmi_core(1e6, 1e6, 1e6, 1e6); result = codeflash_output # 3.42μs -> 2.70μs (26.7% faster)

def test_float_counts():
    # Floating point counts
    codeflash_output = _calculate_npmi_core(2.5, 3.5, 1.5, 4.5); result = codeflash_output # 3.49μs -> 2.69μs (29.8% faster)

# --- LARGE SCALE TEST CASES ---

def test_large_scale_balanced():
    # Large, balanced table
    n = 500
    codeflash_output = _calculate_npmi_core(n, n, n, n); result = codeflash_output # 3.41μs -> 2.77μs (23.0% faster)

def test_large_scale_skewed():
    # Large, skewed table: strong co-occurrence
    codeflash_output = _calculate_npmi_core(900, 50, 50, 0); result = codeflash_output # 3.64μs -> 2.89μs (26.1% faster)

def test_large_scale_no_cooccurrence():
    # Large, but no co-occurrence
    codeflash_output = _calculate_npmi_core(0, 500, 500, 0); result = codeflash_output # 1.79μs -> 1.85μs (2.76% slower)

def test_large_scale_with_smoothing():
    # Large table, smoothing applied
    codeflash_output = _calculate_npmi_core(0, 500, 500, 0, smoothing=1); result = codeflash_output # 4.01μs -> 3.21μs (25.1% faster)

def test_large_scale_extreme_counts():
    # All counts near upper limit
    codeflash_output = _calculate_npmi_core(999, 1, 1, 1); result = codeflash_output # 3.50μs -> 2.85μs (22.7% faster)

def test_large_scale_extreme_negative():
    # All counts except n11 are large
    codeflash_output = _calculate_npmi_core(1, 999, 999, 1); result = codeflash_output # 3.56μs -> 2.94μs (21.3% faster)

# --- DETERMINISM TESTS ---

def test_repeatability():
    # The function should be deterministic for same input
    codeflash_output = _calculate_npmi_core(10, 20, 30, 40); out1 = codeflash_output # 3.41μs -> 2.75μs (23.8% faster)
    codeflash_output = _calculate_npmi_core(10, 20, 30, 40); out2 = codeflash_output # 1.22μs -> 1.12μs (9.23% faster)

def test_repeatability_with_smoothing():
    # Determinism with smoothing
    codeflash_output = _calculate_npmi_core(10, 20, 30, 40, smoothing=2); out1 = codeflash_output # 3.66μs -> 3.00μs (22.0% faster)
    codeflash_output = _calculate_npmi_core(10, 20, 30, 40, smoothing=2); out2 = codeflash_output # 1.41μs -> 1.25μs (13.2% faster)
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
from mlflow.store.analytics.trace_correlation import _calculate_npmi_core

def test__calculate_npmi_core():
    _calculate_npmi_core(float('inf'), 0.0, -1.0, 0.0, smoothing=0.0)

def test__calculate_npmi_core_2():
    _calculate_npmi_core(float('nan'), 0.0, 0.0, 0.0, smoothing=float('inf'))

def test__calculate_npmi_core_3():
    _calculate_npmi_core(0.0, 0.5, 0.0, 0.0, smoothing=0.0)

To edit these changes, run `git checkout codeflash/optimize-_calculate_npmi_core-mhx1tj3f` and push.

@codeflash-ai codeflash-ai bot requested a review from mashraf-222 on Nov 13, 2025 at 06:29
@codeflash-ai codeflash-ai bot added the labels ⚡️ codeflash (Optimization PR opened by Codeflash AI) and 🎯 Quality: High (Optimization Quality according to Codeflash) on Nov 13, 2025