An Empirical Study of LLM-as-a-Judge for LLM Evaluation: Fine-tuned Judge Models are Task-specific Classifiers
Abstract
Recently, there has been a growing trend of using Large Language Models (LLMs) to evaluate the quality of other LLMs. Many studies employ proprietary closed-source models, especially GPT-4, as the evaluator, while other works fine-tune judge models based on open-source LLMs to serve this role. In this study, we conduct an empirical study of the evaluation capabilities of different judge models. Our findings indicate that although fine-tuned judge models achieve high accuracy on in-domain test sets, even surpassing GPT-4, they are inherently task-specific classifiers, and their generalizability and fairness fall severely short of GPT-4.
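For readers unfamiliar with the LLM-as-a-judge setup the abstract refers to, the sketch below shows a generic pairwise evaluation flow: a judge model is prompted to compare two candidate responses and its reply is mapped to a discrete preference label. The prompt wording, function names, and parsing logic are illustrative assumptions, not the paper's actual protocol.

```python
# Illustrative sketch of a pairwise LLM-as-a-judge evaluation (assumed setup,
# not the paper's exact prompt or pipeline).

JUDGE_PROMPT = """You are an impartial judge. Compare the two responses to the
user question below and decide which one is better.

[Question]
{question}

[Response A]
{answer_a}

[Response B]
{answer_b}

Reply with exactly one of: "A", "B", or "tie"."""


def build_judge_prompt(question: str, answer_a: str, answer_b: str) -> str:
    """Fill the pairwise comparison template for a single evaluation item."""
    return JUDGE_PROMPT.format(
        question=question, answer_a=answer_a, answer_b=answer_b
    )


def parse_verdict(judge_output: str) -> str:
    """Naively map the judge model's free-form reply onto a preference label."""
    tokens = judge_output.strip().split()
    first = tokens[0].strip('."').lower() if tokens else ""
    if first == "a":
        return "A"
    if first == "b":
        return "B"
    return "tie"
```

In practice, the prompt is sent to the evaluator (a proprietary model such as GPT-4 or a fine-tuned open-source judge), and the parsed verdicts are scored against human preference labels to measure agreement; the paper compares judge models along exactly this kind of in-domain accuracy as well as generalization and fairness.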