
Conversation


@vincentkoc vincentkoc commented Nov 24, 2025

Details

This PR provides the first fully native, end-to-end multi-agent optimization support. It is designed for HotpotQA dataset benchmarks but can be replicated on other datasets.

  • Introduces a new utils/tools/wikipedia.py module with a unified search interface supporting API, ColBERT, and BM25 backends
  • Adds BM25 index building and optimization tooling with Parquet compression (40-50% size reduction), using the newly minted dataset https://huggingface.co/datasets/Comet/wikipedia-2017-bm25
  • Provides a factory for creating sequential agents for benchmarks, SequencedOptimizableAgent(OptimizableAgent), and a HotpotMultiHopAgent class which inherits from this factory to create a multi-agent system
  • Adapts the ChatPrompt handling so a dict[str, ChatPrompt] "bundle" of named prompts can be passed around and used by agents (see the sketch after this list)
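
A minimal sketch of the prompt-bundle shape described above, assuming ChatPrompt accepts system/user keyword arguments as in the opik_optimizer examples; the agent names and prompt texts here are hypothetical:

```python
from opik_optimizer import ChatPrompt

# A "bundle" maps agent names to their ChatPrompt, mirroring the
# `for name, prompt in best_prompt.items()` iteration seen later in the review.
prompt_bundle: dict[str, ChatPrompt] = {
    "planner": ChatPrompt(
        system="Decompose the question into retrieval hops.",
        user="{question}",
    ),
    "answerer": ChatPrompt(
        system="Answer using only the retrieved passages.",
        user="Passages:\n{passages}\n\nQuestion: {question}",
    ),
}

for name, prompt in prompt_bundle.items():
    print(name, prompt.get_messages())
```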

Change checklist

  • User facing
  • Documentation update

Issues

  • Resolves #
  • OPIK-0000

Testing

Locally verified

Documentation

TBD

Screenshots

Screenshot 2025-11-24 at 15 38 49
Screenshot 2025-11-24 at 15 40 28

@vincentkoc vincentkoc marked this pull request as ready for review November 24, 2025 09:08
@vincentkoc vincentkoc requested review from a team and dsblank as code owners November 24, 2025 09:08
Copilot AI review requested due to automatic review settings November 24, 2025 09:08



Copilot AI left a comment


Pull request overview

Copilot reviewed 37 out of 39 changed files in this pull request and generated 6 comments.

@vincentkoc vincentkoc requested a review from Copilot November 24, 2025 23:55

Copilot AI left a comment


Pull request overview

Copilot reviewed 40 out of 45 changed files in this pull request and generated 7 comments.

Comment on lines +578 to +582
# NOTE: Some models return fenced JSON blocks or prepend/append prose.
# Collect likely JSON snippets and try them in order of size.
candidates: list[str] = []

# Try fenced ```json ... ``` blocks

Copilot AI Nov 24, 2025


The regex pattern r"```json\s*(\{.*?\})\s*```" is non-greedy (.*?), which will only match the smallest JSON object within fenced blocks. For complex nested JSON, this could fail to capture the entire structure. Consider using a greedy pattern or a proper brace-counting approach.

vincentkoc (Member Author)

@jverre will get you to eyeball this. The issue is that I get three-backtick ```json fences in the responses, so I'm trying to be lax about the response structure.
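
A brace-counting extractor along the lines Copilot suggests might look like the sketch below. This is illustrative only (the function name and usage are assumptions, not the PR's code), and it deliberately ignores braces inside JSON string literals, which is usually acceptable for salvaging LLM output:

```python
import json

def extract_json_candidates(text: str) -> list[str]:
    """Collect balanced {...} spans by tracking brace depth."""
    candidates: list[str] = []
    depth = 0
    start = -1
    for i, ch in enumerate(text):
        if ch == "{":
            if depth == 0:
                start = i
            depth += 1
        elif ch == "}" and depth > 0:
            depth -= 1
            if depth == 0:
                candidates.append(text[start : i + 1])
    # Try larger snippets first, matching the "in order of size" note in the diff.
    return sorted(candidates, key=len, reverse=True)

for candidate in extract_json_candidates('prose {"a": {"b": 1}} trailing prose'):
    try:
        parsed = json.loads(candidate)
        break  # first candidate that parses wins
    except json.JSONDecodeError:
        continue
```

Unlike the non-greedy regex, this captures nested objects in full because a candidate only closes when the brace depth returns to zero.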

vincentkoc and others added 17 commits November 24, 2025 16:01
…optimizer/meta_prompt_optimizer.py

Co-authored-by: Copilot <[email protected]>
…l/opik into vk/optimizer-bm25-hotpot

* 'vk/optimizer-bm25-hotpot' of https://github.com/comet-ml/opik:
  [NA] [BE] Performant ClickHouse bulk update queries to ensure latest row selection (#4206)
  [issue-1239] [DOCS] Improve console logging level documentation (#4201)
  [OPIK-3110] [BE/FE] Add threshold support for trace and thread feedback score alerts (#4168)
  [NA] [DOCS] Fix Update experiment docs (#4199)
  [issue-2572] [FE] Update experiment name metadata in UI (#3136)
  Bump opentelmetry.version from 2.21.0 to 2.22.0 in /apps/opik-backend (#4194)
  [NA][SDK] Optimizer Benchmarks Modal Timeout (#4197)
  Bump org.apache.maven.plugins:maven-jar-plugin in /apps/opik-backend (#4190)
  [OPIK-3116][FE] user friendly error message for duplicate dataset name (#4188)
  [NA][BE] Cursor rules update for BE java code (#4155)
  Bump software.amazon.awssdk:bom in /apps/opik-backend (#4192)
  Bump com.diffplug.spotless:spotless-maven-plugin in /apps/opik-backend (#4193)
  [NA] [BE] Update model prices file (#4189)
…l/opik into vk/optimizer-bm25-hotpot

* 'vk/optimizer-bm25-hotpot' of https://github.com/comet-ml/opik:
  Update hotpot_multihop_benchmark.py
name: prompt.get_messages() for name, prompt in best_prompt.items()
}
first_prompt = next(iter(best_prompt.values()))
best_prompt_messages = first_prompt.get_messages()
Collaborator

We have quite a bit of logic around getting the first prompt out of the bundle and using it; that feels risky.
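
One hedged alternative, assuming a naming convention that is not in the PR: make the bundle's entry point explicit instead of relying on dict insertion order.

```python
# Hypothetical helper; "primary" is an assumed convention, not the PR's API.
def primary_prompt(bundle: dict[str, "ChatPrompt"]) -> "ChatPrompt":
    try:
        return bundle["primary"]
    except KeyError:
        raise ValueError(
            f"Prompt bundle must declare a 'primary' entry; got keys {list(bundle)}"
        )
```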

return " ".join(parts).strip()

# Parallel-friendly task wrapper so we can use task_evaluator with num_threads.
def _evaluated_task(dataset_item: dict[str, Any]) -> dict[str, Any]:
Collaborator

I think this logic might be overcomplicating things. Could we not simply update the OptimizableAgent class to receive the chat prompt or chat prompt bundle, depending on what was passed to the optimize_prompt method? That way we keep everything contained in a custom agent class passed to optimize_prompt.
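
A minimal sketch of that suggestion, assuming a simplified constructor rather than the real OptimizableAgent interface:

```python
from typing import Union

class BundleAwareAgent:  # hypothetical name, not the PR's class
    def __init__(self, prompt: Union["ChatPrompt", dict[str, "ChatPrompt"]]):
        # Normalize at the boundary: wrap a single ChatPrompt into a
        # one-entry bundle so every downstream code path sees a dict.
        self.prompts = prompt if isinstance(prompt, dict) else {"default": prompt}
```

With normalization done once in the agent, the evaluation wrapper above would not need to branch on prompt type.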

trace = run_result.get("trace") or {"system": _bundle_system_context()}
return {"llm_output": final_output_str, "trace": trace}

# Parallel evaluation with trace preservation; falls back to sequential if needed.
Collaborator

Not sure why we need this; if we need to slow down the evaluations we can just reduce the thread parameter in the evaluate method.

As we move to cost and latency optimization, being able to use the trace collection methods is going to be very useful. If the issue is orphan spans and traces, we should first try to move to the internal litellm integration.

return "\n\n".join(blocks)


def generate_candidate_prompts(
Collaborator

To simplify all this, why not convert the single chat prompt optimization into an optimization with just one prompt in the bundle? The public API stays the same, but it would avoid a lot of duplication and if/else statements throughout the code.
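
A sketch of that normalization at the optimize_prompt boundary; the names here are assumptions for illustration, not the PR's code:

```python
def as_bundle(prompt_or_bundle) -> dict[str, "ChatPrompt"]:
    """Coerce the public API's single ChatPrompt into a one-entry bundle."""
    if isinstance(prompt_or_bundle, dict):
        return prompt_or_bundle
    return {"default": prompt_or_bundle}

# Inside optimize_prompt (hypothetical):
# bundle = as_bundle(prompt)
# candidates = generate_candidate_prompts(bundle, ...)  # single code path
```

After this coercion, generate_candidate_prompts and the prompt templates below only ever need the "bundle" branch.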

}"""
)
if mode == "bundle"
else """
Collaborator

Similar to the above: all of this goes away if we just convert single chat prompt optimization into a chat bundle of size 1.

