LLM Judges¶
LLM judges provide qualitative analysis of your AI's brand voice alignment. They complement quantitative metrics with human-readable feedback.
Overview¶
While Alignmenter's core metrics (authenticity, safety, stability) are fast and deterministic, LLM judges add:
- Explanations: Why did a session score high or low?
- Nuance: Catches subtleties that formulas might miss
- Validation: Confirms metric accuracy against human-like judgment
When to Use Judges¶
✅ Good use cases:

- Validating metric calibration
- Diagnosing unexpected scores
- Getting qualitative feedback for stakeholders
- Research and experimentation

❌ Avoid judges for:

- Every test run (expensive and slow)
- Real-time monitoring (use metrics instead)
- Large-scale batch processing (cost prohibitive)
Basic Usage¶
Validate Metrics¶
Check if LLM judge agrees with your quantitative scores:
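An illustrative invocation (the alignmenter entry point and the validate-judge subcommand are assumptions; --judge-sample is documented under Cost Control below):

```bash
# Hypothetical subcommand and paths; check the CLI reference for exact names.
alignmenter validate-judge \
  --persona configs/persona/brand.yaml \
  --judge-sample 0.2
```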
Output:
Analyzing 12 sessions (20% sample) with LLM judge...
✓ Judge agreement with metrics: 87.5%
→ "Tone matches brand guidelines well, but occasionally too formal in casual contexts"
Cost: $0.032 (12 judge calls)
Diagnose Errors¶
Find sessions where judge disagrees with metrics:
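For example, assuming a diagnose-errors subcommand (it is referenced in Troubleshooting below); the report path flag is illustrative:

```bash
alignmenter diagnose-errors --report reports/latest.json
```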
Shows sessions with large score discrepancies for manual review.
Analyze Scenarios¶
Deep dive into specific test cases:
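A sketch of what this might look like (the subcommand and scenario id are hypothetical):

```bash
alignmenter analyze --scenario pricing_questions --judge
```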
Cost Control¶
LLM judge calls can get expensive. Alignmenter provides multiple cost controls:
1. Sampling Strategies¶
Only analyze a subset of sessions:
# Random 20% sample
--judge-sample 0.2
# First N sessions
--judge-sample 10
# Stratified sampling (coming soon)
2. Budget Guardrails¶
Set maximum spend:
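For example, with the --judge-budget flag (the amount is illustrative):

```bash
--judge-budget 5.00   # cap judge spend at $5 for this run
```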
Alignmenter halts at 90% of budget to prevent overruns.
3. Cost Estimation¶
Preview costs before running:
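The exact flag is not shown in this extract; a hypothetical form might be:

```bash
# Hypothetical flag name; check alignmenter --help for the real option.
alignmenter run --estimate-judge-cost
```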
This reports the estimated judge cost up front, before any judge calls are made, so you can adjust sampling or budget first.
4. Smart Sampling¶
on_failure strategy only judges low-scoring sessions:
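How the strategy is selected is not shown here; one plausible form, reusing the sampling flag from above, would be:

```bash
# Assumption: the strategy name is accepted by the sampling flag.
--judge-sample on_failure
```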
Typically saves 90% of cost while catching issues.
Supported Providers¶
OpenAI¶
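Setup is typically just the provider's standard environment variable plus a model choice; the model-selection flag shown here is an assumption:

```bash
export OPENAI_API_KEY=sk-...          # standard OpenAI environment variable
# Illustrative flag name; see the CLI reference for the real option.
--judge-model gpt-4o-mini
```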
Pricing (as of Nov 2025):

- gpt-4o: $2.50 / 1M input tokens, $10.00 / 1M output
- gpt-4o-mini: $0.15 / 1M input tokens, $0.60 / 1M output
Anthropic¶
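Similarly for Anthropic (the model flag and the model alias shown are assumptions):

```bash
export ANTHROPIC_API_KEY=sk-ant-...   # standard Anthropic environment variable
--judge-model claude-3-5-haiku-latest
```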
Pricing:

- Claude 3.5 Sonnet: $3.00 / 1M input, $15.00 / 1M output
- Claude 3.5 Haiku: $0.80 / 1M input, $4.00 / 1M output
Judge Output Format¶
Judges return structured JSON:
{
  "score": 8,
  "reasoning": "The response maintains a professional tone and uses preferred terminology. However, it could be slightly more conversational in casual contexts.",
  "strengths": [
    "Uses preferred lexicon ('baseline', 'signal')",
    "Professional and precise language",
    "Clear structure and logic"
  ],
  "weaknesses": [
    "Occasionally too formal for casual questions",
    "Could use more varied sentence structure"
  ],
  "suggestion": "Consider adjusting formality based on user's tone. Casual questions could receive friendlier responses."
}
This appears in HTML reports and JSON exports.
Calibration Workflow¶
Recommended process for using judges:
1. Initial Run¶
Get baseline metrics without judges:
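An illustrative baseline run (entry point, subcommand, and dataset path are assumptions):

```bash
alignmenter run \
  --persona configs/persona/brand.yaml \
  --dataset datasets/sessions.jsonl
```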
Review the quantitative scores first.
2. Validate Metrics¶
Check if metrics align with LLM judgment:
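Using the same illustrative validation command as in Basic Usage:

```bash
alignmenter validate-judge --judge-sample 0.2
```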
Aim for ≥85% agreement rate.
3. Diagnose Issues¶
If agreement is low, find problematic sessions:
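For example, with the diagnose-errors command mentioned in Troubleshooting (exact flags may differ):

```bash
alignmenter diagnose-errors
```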
Review these manually to understand discrepancies.
4. Adjust Persona¶
Based on judge feedback, refine your persona config:
# Before
voice:
  tone: ["professional"]

# After (based on judge feedback)
voice:
  tone: ["professional", "approachable"]
5. Re-validate¶
Run validation again to confirm improvement:
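Repeat the illustrative validation command from step 2 and compare the new agreement rate:

```bash
alignmenter validate-judge --judge-sample 0.2
```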
Advanced: Custom Prompts¶
You can customize the judge prompt (advanced users):
from alignmenter.judges import AuthenticityJudge

# judge_provider is the provider client configured earlier in your setup.
judge = AuthenticityJudge(
    persona_path="configs/persona/brand.yaml",
    judge_provider=judge_provider,
    custom_prompt="Evaluate if this response matches our brand voice..."
)
Cost Optimization¶
Recommended Strategies¶
Development:
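A plausible combination while iterating, built from the cost controls above (values are illustrative, not documented defaults):

```bash
--judge-sample 0.1 --judge-budget 1.00   # small random sample, tight spending cap
```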
Pre-deployment validation:
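A heavier, still capped pass before shipping (again illustrative):

```bash
--judge-sample 0.5 --judge-budget 10.00   # larger sample, higher cap
```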
Production monitoring:
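Judge only the failures, as described under Smart Sampling (how on_failure is passed is an assumption):

```bash
--judge-sample on_failure --judge-budget 5.00
```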
Typical Costs¶
For a 60-session dataset:
| Strategy | Sessions Judged | Cost (gpt-4o-mini) | Cost (gpt-4o) |
|---|---|---|---|
| Full (100%) | 60 | $0.10 | $0.60 |
| Sample 20% | 12 | $0.02 | $0.12 |
| On failure (10% fail) | 6 | $0.01 | $0.06 |
Troubleshooting¶
"Judge API key not found"¶
Set your API key:
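Export the standard variable for whichever provider backs your judge (assuming Alignmenter reads the providers' usual environment variables):

```bash
export OPENAI_API_KEY=sk-...        # OpenAI judges
export ANTHROPIC_API_KEY=sk-ant-... # Anthropic judges
```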
Judge calls are slow¶
- Use gpt-4o-mini instead of gpt-4o
- Reduce sample size
- Run in parallel (coming soon)

High cost¶

- Use --judge-budget to set limits
- Try the on_failure strategy
- Use mini/haiku models
Low agreement rate¶
- Review diagnose-errors output
- Check if persona definition is clear
- Ensure reference examples are representative
- Consider adjusting metric weights
Next Steps¶
- Calibration Guide - Full calibration workflow
- Persona Guide - Improve persona definitions
- CLI Reference - Full command options