
Conversation


@vincentkoc vincentkoc commented Nov 25, 2025


This PR centralizes dataset sampling, seeding, and shuffling logic across the opik_optimizer SDK to ensure consistent and reproducible evaluation runs. It introduces new utility modules (rng.py and sampling.py) that provide deterministic random number generation and unified sampling plan resolution, replacing ad-hoc implementations scattered across different optimizers.

Key changes include:

Change checklist

  • User facing
  • Documentation update

Issues

  • Resolves #
  • OPIK-0000

Testing

Local

Documentation

tbd

Copilot AI review requested due to automatic review settings November 25, 2025 21:05
@vincentkoc vincentkoc requested review from a team and dsblank as code owners November 25, 2025 21:05

Copilot AI left a comment


Pull request overview


Key changes include:

  • New utils/rng.py module providing deterministic RNG helpers with seed derivation and batching
  • New utils/sampling.py module for consistent dataset sampling plan resolution
  • Base optimizer enhancements to support centralized sampling and RNG derivation
  • Updates to all optimizer algorithms to use the new centralized utilities
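The seed-derivation idea behind such an `rng.py` module can be sketched as follows. This is illustrative only; the PR's actual helpers are not shown here, and the function names below are assumptions. Hashing the base seed together with a label gives each component (shuffling, subset selection, etc.) its own stable, independent random stream:

```python
import hashlib
import random


def derive_seed(base_seed: int, label: str) -> int:
    # Hash the base seed together with a label so each component
    # (e.g. "shuffle", "eval-subset") gets its own stable stream.
    digest = hashlib.sha256(f"{base_seed}:{label}".encode()).digest()
    return int.from_bytes(digest[:8], "big")


def make_rng(base_seed: int, label: str) -> random.Random:
    # Child RNGs are reproducible across runs and independent per label.
    return random.Random(derive_seed(base_seed, label))
```

Because the derivation is a pure function of `(base_seed, label)`, two runs with the same seed produce identical evaluation subsets without any global `random.seed()` calls.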

Reviewed changes

Copilot reviewed 18 out of 18 changed files in this pull request and generated 5 comments.

| File | Description |
| --- | --- |
| `sdks/opik_optimizer/src/opik_optimizer/utils/rng.py` | Adds deterministic RNG utilities including seed hashing, child RNG derivation, and batching with optional shuffling |
| `sdks/opik_optimizer/src/opik_optimizer/utils/sampling.py` | Implements sampling plan resolution with support for explicit IDs, sample counts, and dataset size clamping |
| `sdks/opik_optimizer/src/opik_optimizer/base_optimizer.py` | Integrates centralized RNG and sampling plan preparation methods into the base optimizer class |
| `sdks/opik_optimizer/src/opik_optimizer/task_evaluator.py` | Adds a compatibility wrapper for the upcoming `evaluate_on_dict_items` SDK feature |
| `sdks/opik_optimizer/src/opik_optimizer/algorithms/meta_prompt_optimizer/ops/evaluation_ops.py` | Migrates to the centralized sampling plan and deterministic RNG for evaluation subset selection |
| `sdks/opik_optimizer/src/opik_optimizer/algorithms/evolutionary_optimizer/evolutionary_optimizer.py` | Removes global random seeding in favor of optimizer-scoped RNG derivation |
| `sdks/opik_optimizer/src/opik_optimizer/algorithms/few_shot_bayesian_optimizer/few_shot_bayesian_optimizer.py` | Replaces the local `_make_rng` method with the centralized `rng_utils.make_rng` |
| `sdks/opik_optimizer/src/opik_optimizer/algorithms/gepa_optimizer/gepa_optimizer.py` | Adopts the sampling plan approach for train/validation splits with deterministic ID selection |
| `sdks/opik_optimizer/src/opik_optimizer/utils/dataset_utils.py` | Adds resilience for dataset size mismatches and an early exit when skipping existing datasets |
| `sdks/opik_optimizer/tests/unit/test_*.py` | Comprehensive unit tests for the new RNG and sampling utilities |


logger = logging.getLogger(__name__)

try:
A collaborator commented:

We can bump the required opik version.

return 1.0


def test_evaluate_passes_ids_and_samples(monkeypatch: Any) -> None:
A collaborator commented:

FYI: such a complex unit-test setup just to check that an argument is passed through to another function is a symptom of overloaded functions. It's not part of this PR specifically, but after looking at the test and at the function, I'd say this logic could be moved out to another Python module and tested separately (we could also turn task_evaluator into a namespace for better grouping):

    # part of _evaluate_internal function

    if dataset_item_ids:
        # FIXME: In rare cases sometimes dataset ids are missing (cause unknown, skip those for now)
        available_ids = {item.get("id") for item in items}
        missing_ids = [
            item_id for item_id in dataset_item_ids if item_id not in available_ids
        ]
        if missing_ids:
            logger.warning(
                "Dropping %s dataset_item_ids not present in dataset %s (showing first 5): %s",
                len(missing_ids),
                getattr(dataset, "name", None) or "<unknown>",
                missing_ids[:5],
            )
        dataset_item_ids = [
            item_id for item_id in dataset_item_ids if item_id in available_ids
        ]
        if not dataset_item_ids:
            logger.warning(
                "All provided dataset_item_ids were missing; evaluating on full dataset instead."
            )
            dataset_item_ids = None
        else:
            items = [item for item in items if item.get("id") in dataset_item_ids]

P.S. I don't know why we need this logic here at all; opik.evaluate_optimization_trial already has working dataset_items_ids and dataset_sampler parameters. But maybe I'm not aware of some peculiar use case. Can we improve the behavior in the core SDK if we're missing something here?

from ....api_objects import chat_prompt
from .... import _llm_calls
from ...._llm_calls import StructuredOutputParsingError
from ....api_objects import chat_prompt
A collaborator commented:

If it's more than two `..`, it's better to just use an absolute import: `from opik_optimizer.api_objects import chat_prompt`.

return random.Random(derived)


def sample_ids(rng: random.Random, ids: Sequence[T], k: int) -> list[T]:
A collaborator commented:

No need to call it `sample_ids` rather than just `sample`; it isn't specific to IDs and works for anything :)
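A generic version of the helper might look like this (a sketch under the reviewer's suggestion; the PR's actual implementation isn't shown in the diff):

```python
import random
from typing import Sequence, TypeVar

T = TypeVar("T")


def sample(rng: random.Random, items: Sequence[T], k: int) -> list[T]:
    # Deterministic for a given rng state; clamps k to the population
    # size so oversampling requests don't raise.
    return rng.sample(list(items), min(k, len(items)))
```

Since it only depends on `Sequence[T]`, the same helper covers dataset IDs, dataset items, or any other population.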


def resolve_sampling(
dataset: Any,
n_samples: int | str | None,
A collaborator commented:

We should use `Literal` types in such cases instead of a bare `str`.

serialized_tools.append({k: v for k, v in tool.items() if k})
return serialized_tools

# ------------------------------------------------------------------
A collaborator commented:

BaseOptimizer is already overloaded and not easy to follow. Maybe it's worth encapsulating the random-sampling logic in a separate class with a compact API rather than continuing to grow the base class.
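One shape such an extraction could take (purely illustrative; `SamplingPlanner` and its method names are assumptions, not part of the PR) is a small class that owns the seed and exposes only the two operations optimizers need:

```python
import random
from typing import Sequence, TypeVar

T = TypeVar("T")


class SamplingPlanner:
    # Hypothetical compact wrapper for the seeding/sampling concerns
    # currently spread across BaseOptimizer; names are illustrative.

    def __init__(self, seed: int) -> None:
        self._seed = seed
        self._rng = random.Random(seed)

    def child_rng(self, label: str) -> random.Random:
        # Derive a reproducible child RNG from the base seed and a label,
        # so independent components don't share a stream.
        return random.Random(f"{self._seed}:{label}")

    def sample(self, items: Sequence[T], k: int) -> list[T]:
        # Draw without replacement, clamped to the population size.
        return self._rng.sample(list(items), min(k, len(items)))
```

BaseOptimizer would then hold a single `SamplingPlanner` instance instead of a collection of RNG- and sampling-related methods, which also makes the logic testable in isolation.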
