
Conversation

@Susan9001
Contributor


This PR is a follow-up to #4229 and makes DashScope Qwen more robust as a GEval judge model when used via LiteLLMChatModel.

Currently, when a model advertises logprobs and top_logprobs support, GEval enables the logprobs-aware scoring path. For DashScope Qwen this can occasionally lead to MetricComputationError("Failed to calculate g-eval score") because the returned logprobs do not always match the OpenAI-style format expected by the parser.

This PR treats DashScope Qwen as not logprobs-supported in this context, so GEval falls back to the standard text/JSON-based parsing path instead of relying on logprobs.
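
For illustration, the core idea is roughly the following (a minimal sketch, not the actual diff: the helper name _supports_logprobs and the substring check are assumptions, and the real logic in LiteLLMChatModel may differ):

    import litellm

    def _supports_logprobs(model_name: str) -> bool:
        # DashScope Qwen is served through an OpenAI-compatible endpoint, but its
        # logprobs payload does not always match the OpenAI-style format that
        # GEval's parser expects, so report it as unsupported and let GEval fall
        # back to the text/JSON parsing path.
        if "dashscope" in model_name.lower():
            return False
        supported = litellm.get_supported_openai_params(model=model_name) or []
        return "logprobs" in supported and "top_logprobs" in supported

With this check returning False for DashScope Qwen, GEval never enters the logprobs-aware scoring path for these models.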

Change checklist

  • User facing
  • Documentation update

Issues

Testing

Locally:

  • pytest tests/unit/evaluation/models/test_litellm_chat_model.py
  • Ran additional examples with dashscope/qwen-flash as the judge model, created with the following snippet:
          # assumed context: "import os" and "from opik.evaluation import models" at module level
          self.judge_model = models.LiteLLMChatModel(
              model_name=judge_model_name,
              api_base="https://dashscope.aliyuncs.com/compatible-mode/v1",
              api_key=os.getenv("DASHSCOPE_API_KEY"),
          )
    All samples now score successfully without "Failed to calculate g-eval score" errors; an end-to-end sketch follows below.
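
For reference, an end-to-end check looks roughly like this (a sketch under assumptions: the criteria text is invented, and passing the judge model to GEval via model= follows Opik's metric API):

    import os
    from opik.evaluation import models
    from opik.evaluation.metrics import GEval

    # Assumes DASHSCOPE_API_KEY is set in the environment.
    judge = models.LiteLLMChatModel(
        model_name="dashscope/qwen-flash",
        api_base="https://dashscope.aliyuncs.com/compatible-mode/v1",
        api_key=os.getenv("DASHSCOPE_API_KEY"),
    )

    metric = GEval(
        task_introduction="Evaluate whether the OUTPUT is faithful to the CONTEXT.",
        evaluation_criteria="The OUTPUT must not contradict or add facts beyond the CONTEXT.",
        model=judge,
    )
    print(metric.score(output="CONTEXT: ...\nOUTPUT: ..."))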

@Susan9001 Susan9001 requested a review from a team as a code owner November 28, 2025 20:37
@yaricom
Contributor

yaricom commented Nov 30, 2025

Hi @Susan9001! Thank you for the contribution! Please fix the merge conflicts with the current branch.

Cheers,
Iaroslav



Development

Successfully merging this pull request may close these issues.

[Bug]: GEval LiteLLMChatModel with DashScope Qwen sometimes fails to calculate g-eval score
