# Core Concepts
Understanding how Alignmenter evaluates your AI chatbot.
## The Three Metrics
Alignmenter measures AI behavior across three dimensions:
### 1. Authenticity (Brand Voice)

**Question:** Does the AI sound like your brand?

**How it works:**

- Compares AI responses to reference examples using semantic embeddings
- Checks for personality traits (formal vs. casual, technical vs. simple, etc.)
- Validates lexicon usage (preferred words vs. avoided words)
- An optional LLM judge provides qualitative analysis
**Formula:**
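As an illustrative sketch (the weights and term names are assumptions, not Alignmenter's exact formula), the component signals blend into a weighted score:

$$
\text{authenticity} = w_e \cdot \text{sim}_{\text{embedding}} + w_t \cdot \text{traits} + w_l \cdot \text{lexicon}, \qquad w_e + w_t + w_l = 1
$$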
**Score Range:** 0.0 to 1.0 (higher is better)
**Example:**

```yaml
# Brand voice wants a professional tone
preferred_words: ["baseline", "signal", "analysis"]
avoided_words: ["lol", "hype", "crushing it"]

# AI response: "Our baseline analysis shows strong signal"
# ✓ High authenticity (uses preferred terms, professional tone)

# AI response: "We're totally crushing it with these results lol"
# ✗ Low authenticity (uses avoided slang, informal tone)
```
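A minimal Python sketch of the lexicon part of that check, assuming a simple hit/miss count (the function name and weights here are illustrative, not Alignmenter's API):

```python
def lexicon_score(response: str, preferred: list[str], avoided: list[str]) -> float:
    """Reward preferred terms, penalize avoided ones, clamp to [0, 1]."""
    text = response.lower()
    hits = sum(1 for word in preferred if word in text)
    misses = sum(1 for word in avoided if word in text)
    # Illustrative weights; Alignmenter's real scoring is internal to the tool.
    return max(0.0, min(1.0, 0.5 + 0.1 * hits - 0.2 * misses))

# All three preferred terms appear, no avoided terms -> high score (0.8)
print(lexicon_score("Our baseline analysis shows strong signal",
                    ["baseline", "signal", "analysis"],
                    ["lol", "hype", "crushing it"]))
```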
### 2. Safety

**Question:** Does the AI avoid harmful or inappropriate content?

**How it works:**

- Keyword pattern matching for known harmful phrases
- Optional LLM judge for nuanced safety evaluation
- Optional offline ML classifier (`distilled-safety-roberta`)
- Tracks agreement between the different safety checks
**Formula:**
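One plausible way the individual checks roll up into a single score (an illustrative assumption, not the documented formula) is to subtract a capped, weighted penalty:

$$
\text{safety} = 1 - \min\Bigl(1,\ \sum_{c \in \{\text{keyword},\, \text{judge},\, \text{classifier}\}} w_c \cdot \text{penalty}_c\Bigr)
$$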
**Score Range:** 0.0 to 1.0 (higher is safer)
**Categories:**

- Harmful content (violence, self-harm)
- Offensive language (profanity, slurs)
- Misinformation (medical, financial)
- Policy violations (custom rules)
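To make the keyword pattern matching concrete, here is a minimal sketch (the patterns and category names are made up for illustration; Alignmenter's actual rules are internal):

```python
import re

# Hypothetical patterns for illustration only.
PATTERNS = {
    "offensive_language": re.compile(r"\b(jerk|idiot)\b", re.IGNORECASE),
    "misinformation": re.compile(r"\bguaranteed (cure|returns)\b", re.IGNORECASE),
}

def keyword_flags(response: str) -> list[str]:
    """Return the safety categories whose patterns match the response."""
    return [name for name, pattern in PATTERNS.items() if pattern.search(response)]

print(keyword_flags("This supplement is a guaranteed cure."))  # ['misinformation']
```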
### 3. Stability (Consistency)

**Question:** Does the AI behave consistently across sessions?

**How it works:**

- Measures response variance within a single conversation
- Compares behavior across different test sessions
- Detects semantic drift over time
- Useful for regression testing when updating models
**Formula:**
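A minimal sketch, assuming embedding-based comparison (the exact aggregation is internal to Alignmenter): stability as the mean pairwise cosine similarity between response embeddings $e_1, \dots, e_n$:

$$
\text{stability} = \frac{2}{n(n-1)} \sum_{i < j} \cos(e_i, e_j)
$$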
**Score Range:** 0.0 to 1.0 (higher is more consistent)
**Use cases:**

- Detect when a model update changes behavior unexpectedly
- Ensure consistent responses to similar questions
- Validate that fine-tuning didn't break existing behavior
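A short sketch of the cross-response comparison, assuming NumPy and precomputed embeddings (illustrative, not Alignmenter's internal implementation):

```python
import numpy as np

def consistency(embeddings: np.ndarray) -> float:
    """Mean pairwise cosine similarity for an (n_responses, dim) array."""
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = normed @ normed.T                    # pairwise cosine similarities
    n = len(embeddings)
    return float(sims[~np.eye(n, dtype=bool)].mean())  # drop self-similarity

# Identical responses embed identically, so the score is ≈ 1.0.
print(consistency(np.tile([0.3, 0.4, 0.5], (4, 1))))
```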
## Personas
A persona defines your brand voice in YAML format:
```yaml
id: my-brand
name: "My Brand Assistant"
description: "Friendly, helpful, professional"

voice:
  tone: ["friendly", "professional", "helpful"]
  formality: "business_casual"

lexicon:
  preferred:
    - "happy to help"
    - "let me assist you"
  avoided:
    - "no problem"
    - "sure thing"

examples:
  - "I'd be happy to help you with that request."
  - "Let me assist you in finding the right solution."
```
Personas are stored in `configs/persona/` and referenced in run configs.
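For example, a run config might reference the persona by path (the key names below are an assumption for illustration; check your own run config schema):

```yaml
# Hypothetical run config snippet; key names are illustrative.
persona: configs/persona/my-brand.yaml
dataset: datasets/conversations.jsonl
```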
## Datasets
Test datasets are JSONL files containing conversation turns:
{"session_id": "001", "turn": 1, "user": "Hello!", "assistant": "Hi! How can I help?"}
{"session_id": "001", "turn": 2, "user": "What's the weather?", "assistant": "I can help with that..."}
Key fields:

- `session_id` - Groups turns into conversations
- `turn` - Order within a session
- `user` - The user message
- `assistant` - The AI response (optional; can be generated)
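A short Python sketch of how a dataset in this shape can be loaded and grouped by session (illustrative; this is not Alignmenter's internal loader):

```python
import json
from collections import defaultdict

def load_sessions(path: str) -> dict[str, list[dict]]:
    """Group JSONL records by session_id, ordered by turn."""
    sessions: dict[str, list[dict]] = defaultdict(list)
    with open(path, encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            sessions[record["session_id"]].append(record)
    # Keep turns in conversational order within each session.
    for turns in sessions.values():
        turns.sort(key=lambda record: record["turn"])
    return dict(sessions)
```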
Datasets support:

- **Regeneration**: add `--generate-transcripts` to call the configured provider and refresh assistant turns
- **Caching**: by default, Alignmenter reuses recorded transcripts for deterministic scoring
- **Sanitization**: built-in PII scrubbing
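Combining regeneration with the run command shown in the Calibration section below (the exact flag combination is an assumption; adjust for your setup):

```bash
# Re-generate assistant turns instead of reusing cached transcripts
alignmenter run --model gpt-4o --config configs/brand.yaml --generate-transcripts
```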
## LLM Judges (Optional)
LLM judges provide qualitative analysis alongside quantitative metrics:
Benefits:

- Human-readable explanations of scores
- Catches nuanced brand-voice issues
- Validates metric accuracy

Cost control:

- Sampling strategies (`on_failure`, `random`, `stratified`)
- Budget guardrails ($1, $5, $10, etc.)
- Cost estimation before running
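As a sketch of what `on_failure` sampling means in practice (illustrative, not Alignmenter's implementation), only sessions scoring below a threshold are sent to the judge:

```python
def select_for_judging(session_scores: dict[str, float], threshold: float = 0.7) -> list[str]:
    """Pick only the sessions whose metric score fell below the threshold."""
    return [sid for sid, score in session_scores.items() if score < threshold]

print(select_for_judging({"001": 0.91, "002": 0.55, "003": 0.74}))  # ['002']
```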
Example output:

```json
{
  "score": 8,
  "reasoning": "Tone matches brand guidelines well, but occasionally too formal in casual contexts",
  "strengths": ["Professional language", "Clear structure"],
  "weaknesses": ["Could be more conversational"],
  "suggestion": "Consider softening language in informal scenarios"
}
```
## Reports
Every test run generates an interactive HTML report:
### Report Sections
1. **Summary Card**
   - Overall grade (A/B/C/D/F)
   - Metric breakdown
   - Model and dataset info
2. **Score Distribution**
   - Interactive charts (Chart.js)
   - Session-level breakdowns
   - Trend analysis
3. **Session Details**
   - Turn-by-turn analysis
   - Flagged issues
   - Judge feedback (if enabled)
4. **Reproducibility**
   - Python version
   - Model details
   - Timestamps
   - Config snapshot
5. **Exports**
   - Download as CSV
   - Download as JSON
   - Share URL (if hosted)
## Calibration
Calibration validates that your metrics match human judgment:
### Validation Workflow
```bash
# 1. Run the base evaluation
alignmenter run --model gpt-4o --config configs/brand.yaml

# 2. Validate with an LLM judge
alignmenter calibrate validate --judge gpt-4o --judge-sample 0.2

# 3. Check the agreement rate
# Output: ✓ Judge agreement: 87.5%
```
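The agreement rate is the fraction of sampled sessions where the judge's verdict matches the metric's pass/fail call. A minimal sketch, assuming a 0.7 metric threshold and a 7/10 judge threshold (both thresholds are assumptions):

```python
def agreement_rate(metric_scores: list[float], judge_scores: list[int]) -> float:
    """Fraction of sessions where metric pass/fail matches judge pass/fail."""
    agree = sum((m >= 0.7) == (j >= 7) for m, j in zip(metric_scores, judge_scores))
    return agree / len(metric_scores)

print(f"{agreement_rate([0.9, 0.6, 0.8, 0.4], [9, 5, 8, 8]):.1%}")  # 75.0%
```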
### Calibration Commands
- `calibrate validate` - Check judge agreement with the metrics
- `calibrate diagnose-errors` - Find sessions where the judge disagrees
- `calibrate analyze-scenarios` - Deep-dive into specific test cases
See the Calibration Guide for details.
## Next Steps
- Quick Start - Run your first test
- Persona Guide - Customize your brand voice
- LLM Judges - Add qualitative analysis
- CLI Reference - Full command docs