⚡️ Speed up method EvaluationDataset._to_pyfunc_dataset by 15%
#171
📄 15% (0.15x) speedup for `EvaluationDataset._to_pyfunc_dataset` in `mlflow/genai/datasets/evaluation_dataset.py`
⏱️ Runtime: 16.7 milliseconds → 14.5 milliseconds (best of 20 runs)
📝 Explanation and details
The optimized code achieves a 15% speedup through two key optimizations that reduce expensive repeated operations:
1. Attribute Lookup Caching
The original code calls `self.name` and `self.digest` every time `to_evaluation_dataset()` is invoked. Based on the read-only dependency code, these trigger the `__getattr__` method, which performs dynamic attribute delegation, checking both `_mlflow_dataset` and `_databricks_dataset` with `hasattr()` and `getattr()` calls. The optimization pre-fetches and caches these values as `_cached_name` and `_cached_digest` during initialization, eliminating the ~8.3% of runtime spent on attribute access (7.82 ms → 4.4 ns in the profiler).
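For reference, a minimal sketch of this delegation-plus-caching pattern, assuming a simplified wrapper: the field names mirror those mentioned above (`_mlflow_dataset`, `_databricks_dataset`, `_cached_name`, `_cached_digest`), but the class body is illustrative rather than the actual mlflow implementation.

```python
class EvaluationDataset:
    """Illustrative wrapper that delegates attribute access to a backing dataset."""

    def __init__(self, mlflow_dataset=None, databricks_dataset=None):
        self._mlflow_dataset = mlflow_dataset
        self._databricks_dataset = databricks_dataset
        # Resolve the delegated attributes once up front so the hot path
        # reads plain instance attributes instead of going through
        # __getattr__ on every call.
        self._cached_name = self.name
        self._cached_digest = self.digest

    def __getattr__(self, item):
        # Dynamic delegation: only runs when normal attribute lookup fails.
        for backing in (self._mlflow_dataset, self._databricks_dataset):
            if backing is not None and hasattr(backing, item):
                return getattr(backing, item)
        raise AttributeError(item)


class _BackingDataset:
    """Stands in for the real MLflow/Databricks dataset object."""
    name = "eval-data"
    digest = "abc123"


ds = EvaluationDataset(mlflow_dataset=_BackingDataset())
print(ds._cached_name, ds._cached_digest)  # no __getattr__ round trip here
```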
2. Import Statement Caching
The original code imports `LegacyEvaluationDataset` on every method call. While the import itself is fast, the optimization caches the imported class as `self._legacy_eval_cls` after the first use, avoiding repeated import overhead. This is particularly beneficial when the method is called multiple times.
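The same caching idea for the deferred import, as a self-contained sketch: `decimal.Decimal` stands in for `LegacyEvaluationDataset` so the example runs without mlflow, and the attribute name `_legacy_eval_cls` matches the one described above.

```python
class EvaluationDataset:
    """Illustrative: cache a lazily imported class after the first call."""

    def __init__(self):
        self._legacy_eval_cls = None  # populated on first use

    def to_evaluation_dataset(self, *args, **kwargs):
        if self._legacy_eval_cls is None:
            # The real method imports LegacyEvaluationDataset here; Decimal is
            # a stand-in so this sketch stays runnable on its own.
            from decimal import Decimal as LegacyEvaluationDataset
            self._legacy_eval_cls = LegacyEvaluationDataset
        return self._legacy_eval_cls(*args, **kwargs)


ds = EvaluationDataset()
first = ds.to_evaluation_dataset("1.0")   # pays the import lookup once
second = ds.to_evaluation_dataset("2.0")  # reuses the cached class
```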
Performance Impact by Test Case
The optimizations show the greatest benefit when `to_evaluation_dataset()` is called frequently (common in evaluation loops) or when the underlying dataset's attribute access is expensive due to the delegation pattern. The caching approach maintains full behavioral compatibility while eliminating redundant computations.
✅ Correctness verification report:
🌀 Generated Regression Tests and Runtime
To edit these changes, run `git checkout codeflash/optimize-EvaluationDataset._to_pyfunc_dataset-mhx2xczx` and push.