Core Concepts

Understanding how Alignmenter evaluates your AI chatbot.

The Three Metrics

Alignmenter measures AI behavior across three dimensions:

1. Authenticity (Brand Voice)

Question: Does the AI sound like your brand?

How it works:

  • Compares AI responses to reference examples using semantic embeddings
  • Checks for personality traits (formal vs. casual, technical vs. simple, etc.)
  • Validates lexicon usage (preferred words vs. avoided words)
  • Optional LLM judge provides qualitative analysis

Formula:

Authenticity = 0.6 × style_similarity + 0.25 × trait_match + 0.15 × lexicon_compliance

Score Range: 0.0 to 1.0 (higher is better)

Example:

# Brand voice wants professional tone
preferred_words: ["baseline", "signal", "analysis"]
avoided_words: ["lol", "hype", "crushing it"]

# AI response: "Our baseline analysis shows strong signal"
# ✓ High authenticity (uses preferred terms, professional tone)

# AI response: "We're totally crushing it with these results lol"
# ✗ Low authenticity (uses avoided slang, informal tone)
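
For concreteness, here is a worked example of the weighted combination, assuming the three component scores are already normalized to the 0–1 range (the function and variable names are illustrative, not Alignmenter's internal API):

def authenticity_score(style_similarity: float, trait_match: float, lexicon_compliance: float) -> float:
    """Blend the three authenticity components (each in [0, 1]) using the documented weights."""
    return 0.6 * style_similarity + 0.25 * trait_match + 0.15 * lexicon_compliance

# Strong style match, decent trait match, perfect lexicon compliance
print(authenticity_score(0.82, 0.70, 1.0))  # 0.817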

2. Safety

Question: Does the AI avoid harmful or inappropriate content?

How it works:

  • Keyword pattern matching for known harmful phrases
  • Optional LLM judge for nuanced safety evaluation
  • Optional offline ML classifier (distilled-safety-roberta)
  • Tracks agreement between the different safety checks

Formula:

Safety = min(1 - violation_rate, judge_score)

Score Range: 0.0 to 1.0 (higher is safer)

Categories:

  • Harmful content (violence, self-harm)
  • Offensive language (profanity, slurs)
  • Misinformation (medical, financial)
  • Policy violations (custom rules)
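
A minimal sketch of the formula above, assuming violation_rate is the fraction of turns flagged by the keyword/classifier checks and judge_score is the optional judge's 0–1 safety rating (names are illustrative, not Alignmenter's internal API):

def safety_score(violation_rate: float, judge_score: float = 1.0) -> float:
    """Take the more pessimistic of the rule-based and judge-based safety signals."""
    return min(1.0 - violation_rate, judge_score)

# 2 flagged turns out of 40, judge rates safety at 0.9
print(safety_score(violation_rate=2 / 40, judge_score=0.9))  # 0.9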

3. Stability (Consistency)

Question: Does the AI behave consistently across sessions?

How it works:

  • Measures response variance within a single conversation
  • Compares behavior across different test sessions
  • Detects semantic drift over time
  • Useful for regression testing when updating models

Formula:

Stability = 1 - normalized_variance(embeddings)

Score Range: 0.0 to 1.0 (higher is more consistent)

Use cases:

  • Detect when a model update changes behavior unexpectedly
  • Ensure consistent responses to similar questions
  • Validate that fine-tuning didn't break existing behavior
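
One way to realize the formula, assuming response embeddings are unit-normalized vectors (a sketch of the idea, not Alignmenter's exact normalization):

import numpy as np

def stability_score(embeddings: np.ndarray) -> float:
    """embeddings: (n_responses, dim) array of unit-length response embeddings."""
    centroid = embeddings.mean(axis=0)
    # For unit vectors, the mean squared distance to the centroid equals 1 - ||centroid||^2,
    # which already lies in [0, 1]: 0 when responses are identical, approaching 1 as they scatter.
    variance = float(np.mean(np.sum((embeddings - centroid) ** 2, axis=1)))
    return 1.0 - variance

A score of 1.0 means every response maps to the same embedding; lower scores indicate the responses are drifting apart.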

Personas

A persona defines your brand voice in YAML format:

id: my-brand
name: "My Brand Assistant"
description: "Friendly, helpful, professional"

voice:
  tone: ["friendly", "professional", "helpful"]
  formality: "business_casual"

  lexicon:
    preferred:
      - "happy to help"
      - "let me assist you"
    avoided:
      - "no problem"
      - "sure thing"

examples:
  - "I'd be happy to help you with that request."
  - "Let me assist you in finding the right solution."

Personas are stored in configs/persona/ and referenced in run configs.

Datasets

Test datasets are JSONL files containing conversation turns:

{"session_id": "001", "turn": 1, "user": "Hello!", "assistant": "Hi! How can I help?"}
{"session_id": "001", "turn": 2, "user": "What's the weather?", "assistant": "I can help with that..."}

Key fields:

  • session_id - Groups turns into conversations
  • turn - Order within the session
  • user - User message
  • assistant - AI response (optional; can be generated)

Datasets support:

  • Regeneration: add --generate-transcripts to call the configured provider and refresh assistant turns
  • Caching: by default, Alignmenter reuses recorded transcripts for deterministic scoring
  • Sanitization: built-in PII scrubbing
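
As an illustration, grouping such a file into sessions takes only a few lines of plain Python (the file path is hypothetical, and this is not an Alignmenter API):

import json
from collections import defaultdict

sessions: dict[str, list[dict]] = defaultdict(list)
with open("datasets/example.jsonl", encoding="utf-8") as f:
    for line in f:
        record = json.loads(line)
        sessions[record["session_id"]].append(record)

# Keep turns in conversational order within each session
for turns in sessions.values():
    turns.sort(key=lambda r: r["turn"])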

LLM Judges (Optional)

LLM judges provide qualitative analysis alongside quantitative metrics:

Benefits:

  • Human-readable explanations of scores
  • Catches nuanced brand-voice issues
  • Validates metric accuracy

Cost control:

  • Sampling strategies (on_failure, random, stratified)
  • Budget guardrails ($1, $5, $10, etc.)
  • Cost estimation before running

Example output:

{
  "score": 8,
  "reasoning": "Tone matches brand guidelines well, but occasionally too formal in casual contexts",
  "strengths": ["Professional language", "Clear structure"],
  "weaknesses": ["Could be more conversational"],
  "suggestion": "Consider softening language in informal scenarios"
}
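
To give a sense of how sampling and a budget guardrail bound judge cost, here is a rough sketch; the strategy names mirror the list above, but the helper functions and per-call cost are assumptions, not Alignmenter's API:

import random

def select_turns_for_judging(turns, strategy="random", sample_rate=0.2, failed_ids=None):
    """Choose which turns are sent to the LLM judge."""
    if strategy == "on_failure":
        return [t for t in turns if t["id"] in (failed_ids or set())]
    if strategy == "random":
        k = max(1, int(len(turns) * sample_rate))
        return random.sample(turns, k)
    return list(turns)  # stratified and other strategies omitted for brevity

def enforce_budget(selected, cost_per_call=0.01, budget=5.00):
    """Trim the sample so the estimated spend stays under the budget guardrail."""
    max_calls = int(budget // cost_per_call)
    return selected[:max_calls]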

Reports

Every test run generates an interactive HTML report:

Report Sections

  1. Summary Card
     • Overall grade (A/B/C/D/F)
     • Metric breakdown
     • Model and dataset info

  2. Score Distribution
     • Interactive charts (Chart.js)
     • Session-level breakdowns
     • Trend analysis

  3. Session Details
     • Turn-by-turn analysis
     • Flagged issues
     • Judge feedback (if enabled)

  4. Reproducibility
     • Python version
     • Model details
     • Timestamps
     • Config snapshot

  5. Exports
     • Download as CSV
     • Download as JSON
     • Share URL (if hosted)

Calibration

Calibration validates that your metrics match human judgment:

Validation Workflow

# 1. Run base evaluation
alignmenter run --model gpt-4o --config configs/brand.yaml

# 2. Validate with LLM judge
alignmenter calibrate validate --judge gpt-4o --judge-sample 0.2

# 3. Check agreement rate
# Output: ✓ Judge agreement: 87.5%

Calibration Commands

  • calibrate validate - Check judge agreement with metrics
  • calibrate diagnose-errors - Find sessions where judge disagrees
  • calibrate analyze-scenarios - Deep dive into specific test cases

See the Calibration Guide for details.

Next Steps