Problem
LLM rubrics are semantic graders, but the current authoring surface makes richer rubric shapes awkward. Code graders can receive structured pass-through config objects, while llm-grader usage is mostly centered on prompt strings, criteria, and AgentV's built-in rubric object shape.
This creates friction for datasets like Dexter that already carry rubric metadata as structured objects, for example { operator, criteria }, where correctness and contradiction need different prompt semantics without adding dataset-specific core fields.
Desired DX
Make llm-grader feel as composable as code graders for semantic grading:
- allow a custom prompt/template to receive a structured custom input object
- preserve arbitrary rubric metadata in a documented config field, rather than forcing it into built-in rubric fields
- support template variables for both common text fields and serialized structured data
- keep the core primitive generic, with no Dexter-specific schema or operator semantics baked into AgentV
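A minimal sketch of what this could look like, not AgentV's actual API. The `{{name}}` placeholder syntax, the `input_json` variable name, and the `render_prompt` helper are assumptions made for illustration, and the rubric text is invented sample data. The structured object is serialized once and exposed to the template as JSON, so the core stays unaware of fields such as `operator`.

```python
import json
import re

# Hypothetical placeholder syntax: {{name}}. AgentV's real template syntax may differ.
_PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def render_prompt(template, variables, structured_input=None):
    """Fill text variables and, if given, a JSON-serialized structured input.

    Unknown placeholders are left untouched so that a typo stays visible
    in the rendered prompt instead of silently becoming an empty string.
    """
    values = dict(variables)
    if structured_input is not None:
        # Assumed variable name; the object is passed through without interpretation.
        values["input_json"] = json.dumps(structured_input, sort_keys=True)
    return _PLACEHOLDER.sub(lambda m: values.get(m.group(1), m.group(0)), template)


template = (
    "Grade the answer against the rubric.\n"
    "Answer: {{answer}}\n"
    "Rubric (JSON): {{input_json}}\n"
)
# Illustrative Dexter-style metadata; the criteria text is made up.
rubric = {"operator": "contradiction", "criteria": "States that revenue declined"}
prompt = render_prompt(template, {"answer": "Revenue grew 12%."}, rubric)
print(prompt)
```

Because the prompt text owns the meaning of `operator`, a dataset can give correctness and contradiction different semantics without any new core fields.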
Acceptance Criteria
- An eval author can write an llm-grader assertion with a custom prompt and a structured input object.
- The prompt can reference that object through a documented template variable, likely as JSON.
- Existing llm-grader behavior remains backwards compatible.
- Built-in rubrics remain available for standard checklist/score-range grading.
- Docs/examples show when to use built-in rubrics versus a custom prompt + structured input object.
- A regression test covers a custom llm-grader prompt receiving structured rubric metadata.
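The regression test in the last criterion might take roughly the following shape. This is a sketch under stated assumptions: `build_grader_prompt` is a hypothetical stand-in for the grader's prompt-building step, and the assertion keys `type`, `prompt`, and `input`, the `{{input_json}}` variable, and the rubric text are all invented for illustration.

```python
import json


def build_grader_prompt(assertion):
    """Hypothetical stand-in for the prompt-building step of an llm-grader."""
    prompt = assertion["prompt"]
    if "input" in assertion:
        # Assumed template variable; the structured object is passed through as JSON.
        serialized = json.dumps(assertion["input"], sort_keys=True)
        prompt = prompt.replace("{{input_json}}", serialized)
    return prompt


def test_custom_prompt_receives_structured_rubric_metadata():
    assertion = {
        "type": "llm-grader",
        "prompt": "Apply this rubric: {{input_json}}",
        # Illustrative metadata; the criteria text is made up.
        "input": {"operator": "correctness", "criteria": "Reports net income"},
    }
    prompt = build_grader_prompt(assertion)
    payload = json.loads(prompt.split("Apply this rubric: ", 1)[1])
    # The metadata must reach the prompt unchanged, including unknown keys.
    assert payload == assertion["input"]


def test_prompt_without_structured_input_is_unchanged():
    # Backwards compatibility: a plain prompt string renders exactly as before.
    assertion = {"type": "llm-grader", "prompt": "Is the answer polite?"}
    assert build_grader_prompt(assertion) == "Is the answer polite?"


test_custom_prompt_receives_structured_rubric_metadata()
test_prompt_without_structured_input_is_unchanged()
```

Round-tripping the payload through `json.loads` checks that the metadata arrives intact without pinning the test to a particular key order or whitespace.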
Notes
This is motivated by the Dexter financial eval conversion. Dexter-style rubrics should be representable as data passed to an LLM grader prompt, rather than using unsupported fields on AgentV built-in rubric objects or routing semantic grading through a code-grader shim.