LLM-as-a-Judge: Where Do Its Signals Break, When Do They Hold, and What Should “Evaluation” Mean?

LLM-as-a-Judge: Where Do Its Signals Break, When Do They Hold, and What Should “Evaluation” Mean?

What exactly is being measured when a judge LLM assigns a 1–5 (or pairwise) score? Most “correctness/faithfulness/completeness” rubrics are project-specific. Without task-grounded definitions, a scalar score can drift from business outcomes (e.g., “useful marketing post” vs. “high completeness”). Surveys of LLM-as-a-judge (LAJ) note that rubric ambiguity and prompt template choices materially shift scores and human…

Read More