AgentV companion eval project for a public coding/web financial research agent.
This repository is not a fork of Dexter and does not own Dexter's agent code or dataset. It uses Dexter's public src/evals/ dataset as a pinned benchmark fixture and golden-answer source so the AgentV Dashboard can show a realistic public domain-agent project.
The first public demo is pinned to Dexter commit:
8d9419829f443f84b804d033bb2c3b1fbd788629
Dexter's own eval flow at that commit uses:
- bun run src/evals/run.ts
- optional sampling with --sample N
- src/evals/dataset/finance_agent.csv
- CSV columns: Question, Answer, Question Type, Expert time (mins), Rubric
- an LLM-as-judge correctness check, with CSV rubric metadata containing correctness and contradiction criteria
The committed AgentV eval keeps the question/answer fixture shape for every row in the pinned CSV: Dexter questions become AgentV input, and Dexter answers become expected_output. Dexter's runtime evaluator ignores the CSV Rubric column, but this project intentionally preserves those entries as native AgentV llm-grader rubrics. The shared prompt in prompts/dexter-grader.md receives AgentV's {{ rubrics_json }} and {{ metadata_json }} structured variables, so the eval does not duplicate question/answer data into grader-only payloads.
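The row-to-test mapping described above can be sketched roughly as follows. This is a minimal illustration, not the project's generator: the `DexterRow`, `RubricItem`, and `AgentVTest` shapes and the `rowToTest` and `parseRubric` helpers are hypothetical names, and the sketch assumes each Rubric cell holds a JSON array of `{ operator, criteria }` objects, which may not match the real CSV encoding.

```typescript
// Hypothetical shapes: field names mirror this README, not a confirmed AgentV schema.
interface DexterRow {
  Question: string;
  Answer: string;
  Rubric: string; // ASSUMPTION: a JSON array of { operator, criteria } objects
}

interface RubricItem {
  operator: string; // e.g. "correctness" or "contradiction"
  criteria: string;
}

interface AgentVTest {
  id: string;
  input: string;
  expected_output: string;
  rubrics: RubricItem[];
}

// Parse one Rubric cell into rubric items; an empty cell yields no rubrics.
function parseRubric(cell: string): RubricItem[] {
  const trimmed = cell.trim();
  if (trimmed === "") return [];
  const parsed: unknown = JSON.parse(trimmed);
  if (!Array.isArray(parsed)) {
    throw new Error("Rubric cell is not a JSON array");
  }
  return parsed.map((item) => {
    const record = item as Record<string, unknown>;
    return {
      operator: String(record["operator"]),
      criteria: String(record["criteria"]),
    };
  });
}

// Question becomes input, Answer becomes expected_output, and the rubric
// entries are carried over without copying question/answer text into them.
function rowToTest(id: string, row: DexterRow): AgentVTest {
  return {
    id,
    input: row.Question,
    expected_output: row.Answer,
    rubrics: parseRubric(row.Rubric),
  };
}
```

Keeping the rubric items separate from `input` and `expected_output` is what lets a shared grader prompt receive them as structured data instead of a duplicated payload.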
By default, the eval does not run Dexter. It runs a coding/web research agent against Dexter's public golden answers, so the demo does not require FINANCIAL_DATASETS_API_KEY. The real dexter-agent target remains available as an optional compatibility target for users who have Dexter's paid data prerequisites configured.
Install AgentV separately.
For the default financial-research-agent target, configure a Codex-style coding agent plus a grader:
AGENT_TARGET=financial-research-agent
CODEX_EXECUTABLE=codex-eng
CODEX_MODEL=gpt-5.5
CODEX_REASONING_EFFORT=low
CODEX_WORKSPACE_DIR=.agentv/codex-workspaces
CODEX_LOG_DIR=.agentv/logs/codex
GRADER_TARGET=openai-grader
OPENAI_API_KEY=...
OPENAI_MODEL=gpt-5.5
Clone and pin Dexter only when regenerating eval YAML from Dexter's CSV or when running the optional real dexter-agent target:
git clone https://github.com/virattt/dexter.git ../dexter
git -C ../dexter checkout 8d9419829f443f84b804d033bb2c3b1fbd788629
cd ../dexter
bun install
Create local env for this project:
cp .env.example .env
Fill in only local values in .env. Do not commit .env, resolved provider endpoints, API keys, Bitwarden output, or result-repo tokens.
Required variables for the default public-demo target:
- AGENT_TARGET=financial-research-agent
- CODEX_EXECUTABLE
- CODEX_MODEL
- CODEX_WORKSPACE_DIR
- CODEX_LOG_DIR
- GRADER_TARGET
- grader model variables for the selected grader target
- for GRADER_TARGET=azure: AZURE_OPENAI_RESPONSES_BASE_URL, AZURE_OPENAI_API_KEY, and AZURE_DEPLOYMENT_NAME
Additional variables for optional AGENT_TARGET=dexter-agent:
- DEXTER_REPO_PATH
- OPENAI_API_KEY
- FINANCIAL_DATASETS_API_KEY
- EXASEARCH_API_KEY or TAVILY_API_KEY
Preflight:
bun run setup
Run the full AgentV eval:
agentv eval evals/financial-research-agent.eval.yaml --targets .agentv/targets.yaml --target financial-research-agent
During AgentV repository development, prefer the source CLI from the AgentV checkout:
bun /path/to/agentv/apps/cli/src/cli.ts eval financial-research-agent/evals/financial-research-agent.eval.yaml --targets financial-research-agent/.agentv/targets.yaml --target financial-research-agent
For quick verification, run one committed test by ID:
agentv eval evals/financial-research-agent.eval.yaml --targets .agentv/targets.yaml --target financial-research-agent --test-id us-steel-nippon-merger
To run the real Dexter agent instead, use --target dexter-agent after setting the optional Dexter variables above.
After updating DEXTER_REPO_PATH and DEXTER_COMMIT, regenerate the full AgentV eval from Dexter's public CSV:
bun run scripts/generate-eval-from-dexter.ts --out evals/financial-research-agent.eval.yaml
Use --sample N --out <path> only for local experiments or quick generator checks; do not use a sampled file as the committed dataset boundary.
Review the generated eval before committing. The generator intentionally keeps the conversion conservative and AgentV-native: it preserves Dexter rubric entries as { operator, criteria }-style llm-grader rubric items, uses suite-level source metadata for the pinned CSV, and reuses prompts/dexter-grader.md by file reference.
Setup and target scripts print variable names and missing prerequisite guidance only. They must not print resolved secret values, private endpoints, or Bitwarden-derived output.
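A names-only preflight check along these lines might look like the sketch below. It is an illustration under stated assumptions, not the project's actual setup script: `missingVariables` and `isSet` are hypothetical helpers, the environment is passed in as a plain object, and the base variables are treated as required for every target because this README lists the Dexter variables as additional ones.

```typescript
type Env = Record<string, string | undefined>;

// Variable names taken from the lists in this README.
const BASE_VARS = [
  "AGENT_TARGET",
  "CODEX_EXECUTABLE",
  "CODEX_MODEL",
  "CODEX_WORKSPACE_DIR",
  "CODEX_LOG_DIR",
  "GRADER_TARGET",
];
const AZURE_VARS = [
  "AZURE_OPENAI_RESPONSES_BASE_URL",
  "AZURE_OPENAI_API_KEY",
  "AZURE_DEPLOYMENT_NAME",
];
const DEXTER_VARS = ["DEXTER_REPO_PATH", "OPENAI_API_KEY", "FINANCIAL_DATASETS_API_KEY"];

// A variable counts as set only if it is defined and not blank.
function isSet(env: Env, name: string): boolean {
  const value = env[name];
  return value !== undefined && value.trim() !== "";
}

// Returns the NAMES of missing variables; resolved values are never included.
function missingVariables(env: Env): string[] {
  const missing = BASE_VARS.filter((name) => !isSet(env, name));
  if (env["GRADER_TARGET"] === "azure") {
    missing.push(...AZURE_VARS.filter((name) => !isSet(env, name)));
  }
  if (env["AGENT_TARGET"] === "dexter-agent") {
    missing.push(...DEXTER_VARS.filter((name) => !isSet(env, name)));
    // Either search key satisfies the requirement.
    if (!isSet(env, "EXASEARCH_API_KEY") && !isSet(env, "TAVILY_API_KEY")) {
      missing.push("EXASEARCH_API_KEY or TAVILY_API_KEY");
    }
  }
  return missing;
}
```

Because the function only ever returns names, a caller can print its result directly without risking a leaked key or endpoint.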
Public result synchronization belongs to the downstream financial-research-agent-evals work. Before publishing any run artifact, scan it for API keys, provider endpoints, private paths, and sensitive data.
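A pre-publish scan could start from something like the sketch below. The patterns are illustrative heuristics chosen for this example (an `sk-` style key prefix, an Azure OpenAI hostname, and home-directory paths); they are assumptions, not an exhaustive or project-defined list, and `scanArtifact` is a hypothetical helper. A clean result does not prove that an artifact is safe to publish.

```typescript
interface Finding {
  label: string;
  line: number; // 1-based line number within the artifact text
}

// ASSUMPTION: example heuristics only; extend them for the providers you use.
const PATTERNS: { label: string; regex: RegExp }[] = [
  { label: "API-key-like token", regex: /sk-[A-Za-z0-9_-]{20,}/ },
  { label: "Azure OpenAI endpoint", regex: /https?:\/\/[A-Za-z0-9.-]+\.openai\.azure\.com/i },
  { label: "private home path", regex: /(?:\/home\/|\/Users\/)[^\s"']+/ },
];

// Reports which pattern matched on which line, without echoing the matched
// text, so the scan report itself does not repeat a secret.
function scanArtifact(text: string): Finding[] {
  const findings: Finding[] = [];
  text.split(/\r?\n/).forEach((content, index) => {
    for (const { label, regex } of PATTERNS) {
      if (regex.test(content)) {
        findings.push({ label, line: index + 1 });
      }
    }
  });
  return findings;
}
```

Treat any finding as a reason to stop and inspect the artifact by hand before it leaves the machine.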
The Dexter adaptation uses AgentV's native llm-grader primitive. Each assertion references prompts/dexter-grader.md and passes Dexter CSV rubric entries through rubrics, preserving operator plus criteria so the prompt can distinguish correctness checks from contradiction guards. Suite-level metadata carries the pinned Dexter source fields, while per-test metadata only carries row-specific fields such as source_row, question_type, and expert_time_mins.
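The split between suite-level and per-test metadata can be pictured with the sketch below. The suite-level key names (`source_repo`, `source_commit`, `source_path`) are hypothetical, as is the `buildTestMetadata` helper; only the per-test field names and the pinned values come from this README. Whether `source_row` counts from the header or from the first data row is also left as an assumption for the caller.

```typescript
// Suite-level block: stated once for the whole eval.
// ASSUMPTION: these key names are illustrative; the values are the pinned source.
const suiteMetadata = {
  source_repo: "https://github.com/virattt/dexter.git",
  source_commit: "8d9419829f443f84b804d033bb2c3b1fbd788629",
  source_path: "src/evals/dataset/finance_agent.csv",
} as const;

// Per-test block: only row-specific fields, as named in this README.
interface TestMetadata {
  source_row: number;
  question_type: string;
  expert_time_mins: number;
}

interface CsvRowFields {
  "Question Type": string;
  "Expert time (mins)": string;
}

// Builds the per-test metadata for one CSV row and rejects a blank or
// non-numeric expert time instead of silently recording 0 or NaN.
function buildTestMetadata(sourceRow: number, row: CsvRowFields): TestMetadata {
  const rawMinutes = row["Expert time (mins)"].trim();
  const minutes = Number(rawMinutes);
  if (rawMinutes === "" || !Number.isFinite(minutes)) {
    throw new Error(`Row ${sourceRow}: expert time is not numeric`);
  }
  return {
    source_row: sourceRow,
    question_type: row["Question Type"].trim(),
    expert_time_mins: minutes,
  };
}
```

Keeping the pinned commit out of the per-test block means that re-pinning to a new Dexter commit touches one place in the eval file.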