New metric definitions for llama-3-3-70b as judge in Arena Hard benchmark by kmazrolina · Pull Request #1949 · IBM/unitxt

kmazrolina · 2025-10-27T13:32:39Z

New metric definitions for llama-3-3-70b as judge in Arena Hard benchmark

Adds metric definitions that use llama-3-3-70b as the judge in the Arena Hard benchmark, with support for:
- WML Inference Engine
- Generic Inference Engine

New metric definitions for llama-3-3-70b as judge in Arena Hard bench…

ee613cf

…mark
* Added metric definitions for llama-3-3-70b as judge in Arena Hard benchmark supporting:
- WML Inference Engine
- Generic Inference Engine

Signed-off-by: karolina.zrobek <[email protected]>

kmazrolina marked this pull request as draft October 27, 2025 13:46

kmazrolina marked this pull request as ready for review October 27, 2025 13:46

kmazrolina wants to merge 1 commit into IBM:main from kmazrolina:llm-as-judge-metric-update