Alignmenter

Automated testing for AI chatbots. Measure brand voice, safety, and consistency across model versions.

Overview

Alignmenter is a production-ready evaluation toolkit for teams shipping AI copilots and chat experiences. It helps you verify that your AI stays on-brand, safe, and stable across model updates.

Three Core Metrics

  • 🎨 Authenticity – Does the AI match your brand voice? Measures semantic similarity, linguistic traits, and lexicon compliance (see the sketch after this list).
  • 🛡️ Safety – Does it avoid harmful outputs? Combines keyword rules, LLM judges, and offline ML classifiers.
  • ⚖️ Stability – Are responses consistent? Detects semantic drift and variance across sessions.
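
To make the lexicon-compliance part of Authenticity concrete, here is a minimal, self-contained Python sketch. It is illustrative only, not Alignmenter's actual scorer; the term lists mirror the persona pack format shown later on this page.

import re

def lexicon_compliance(text: str, preferred: list[str], avoided: list[str]) -> float:
    """Return a 0-1 score: reward preferred terms, penalize avoided ones."""
    tokens = re.findall(r"[a-z']+", text.lower())
    hits = sum(tokens.count(term) for term in preferred)
    misses = sum(tokens.count(term) for term in avoided)
    if hits + misses == 0:
        return 1.0  # neutral text: nothing to reward or penalize
    return hits / (hits + misses)

score = lexicon_compliance(
    "Our baseline analysis shows a strong signal, no hype here.",
    preferred=["baseline", "signal", "alignment"],
    avoided=["lol", "bro", "hype"],
)
print(f"lexicon compliance: {score:.2f}")  # 2 preferred hits vs 1 avoided -> 0.67

The semantic-similarity and trait components are more involved; this sketch only shows the simplest piece of the metric.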

Why Alignmenter?

Unlike generic LLM evaluation frameworks, Alignmenter is purpose-built for persona alignment:

  • Persona packs: Define your brand voice in YAML with examples, lexicon, and traits
  • Local-first: Works without constant API calls (optional LLM judge for qualitative analysis)
  • Budget-aware: Built-in cost tracking and guardrails
  • Reproducible: Deterministic scoring, full audit trails
  • Privacy-focused: Run with local models and sanitize production data before evaluation

Quick Example

# Install
pip install alignmenter

# Initialize project
alignmenter init

# Run test (regenerate transcripts)
alignmenter run --config configs/run.yaml --generate-transcripts

# Default run (reuses cached transcripts)
alignmenter run --config configs/run.yaml

# View report
alignmenter report --last

Output:

Loading dataset: 60 turns across 10 sessions
Running model: openai:gpt-4o-mini
✓ Brand voice score: 0.83 (range: 0.79-0.87)
✓ Safety score: 0.95
✓ Consistency score: 0.88
Report written to: reports/2025-11-06_14-32/index.html
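
Because Alignmenter is CLI-driven, the same commands slot into a script or CI job. The sketch below is one way to gate a build on the brand voice score; the score-parsing regex assumes the console output format shown above, and a real pipeline would more robustly read the CSV/JSON export described under Interactive Reports.

import re
import subprocess
import sys

# Run the evaluation exactly as in the Quick Example (reuses cached transcripts).
result = subprocess.run(
    ["alignmenter", "run", "--config", "configs/run.yaml"],
    capture_output=True,
    text=True,
)
print(result.stdout)
if result.returncode != 0:
    sys.exit("alignmenter run failed")

# Fail the build if the brand voice score drops below a chosen threshold.
match = re.search(r"Brand voice score:\s*([0-9.]+)", result.stdout)
if match and float(match.group(1)) < 0.80:
    sys.exit(f"Brand voice score {match.group(1)} is below the 0.80 gate")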

Key Features

🎯 Persona-First Design

Define your brand voice declaratively:

# configs/persona/mybot.yaml
id: mybot
name: "MyBot Assistant"
description: "Professional, evidence-driven, technical"

voice:
  tone: ["professional", "precise", "measured"]
  formality: "business_casual"

  lexicon:
    preferred:
      - "baseline"
      - "signal"
      - "alignment"
    avoided:
      - "lol"
      - "bro"
      - "hype"

examples:
  - "Our baseline analysis indicates a 15% improvement."
  - "The signal-to-noise ratio suggests this approach is viable."

📊 Interactive Reports

  • Report cards with overall grades (A/B/C)
  • Interactive charts (Chart.js visualizations)
  • Calibration diagnostics (bootstrap confidence intervals, judge agreement)
  • Reproducibility section (Python version, model, timestamps)
  • Export to CSV/JSON for custom analysis
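
The CSV/JSON export is what makes custom analysis easy to bolt on. The sketch below assumes a JSON file of per-session scores with an authenticity field; the actual file name and schema depend on your report directory, so treat both as placeholders.

import json

# Hypothetical path and field names; adjust to the files your report actually emits.
with open("reports/2025-11-06_14-32/scores.json") as fh:
    rows = json.load(fh)

per_session = [row["authenticity"] for row in rows]
print(f"min/max authenticity across sessions: {min(per_session)} / {max(per_session)}")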

🔧 Production-Ready

  • Multi-provider support: OpenAI, Anthropic, vLLM, Ollama
  • Budget guardrails: Halt runs at 90% of the judge API budget (illustrated in the sketch after this list)
  • Cost projection: Estimate expenses before execution
  • PII sanitization: Built-in scrubbing with alignmenter dataset sanitize
  • Offline mode: Works without internet using local models
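
The 90% halt rule mentioned above can be pictured with a small conceptual sketch; this illustrates the idea of a budget guardrail rather than Alignmenter's internal implementation, and the per-call cost is made up.

class BudgetGuard:
    """Track judge spend and halt once a fraction of the budget is reached."""

    def __init__(self, budget_usd: float, halt_fraction: float = 0.9):
        self.budget_usd = budget_usd
        self.halt_fraction = halt_fraction
        self.spent_usd = 0.0

    def record(self, cost_usd: float) -> None:
        self.spent_usd += cost_usd
        if self.spent_usd >= self.halt_fraction * self.budget_usd:
            raise RuntimeError(
                f"Judge spend ${self.spent_usd:.2f} reached "
                f"{self.halt_fraction:.0%} of the ${self.budget_usd:.2f} budget; halting."
            )

guard = BudgetGuard(budget_usd=5.00)
try:
    for _ in range(40):
        guard.record(0.12)  # hypothetical per-judge-call cost
except RuntimeError as err:
    print(err)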

Use Cases

🏢 Enterprise AI Teams

  • Pre-deployment testing: Verify brand voice before shipping
  • Regression testing: Catch drift when updating models
  • A/B testing: Compare GPT-4 vs Claude vs fine-tuned models
  • Compliance audits: Generate safety scorecards for regulators

🚀 Startups Building AI Products

  • Rapid iteration: Test persona changes in CI/CD
  • Budget constraints: Use offline classifiers to reduce API costs
  • Multi-tenant: Different personas for different customers
  • Quality assurance: Automated checks on every release

🎓 Research & Academia

  • Persona fidelity studies: Measure alignment with human raters
  • Safety benchmarks: Compare classifier performance
  • Reproducible results: Deterministic scoring with fixed seeds

Getting Started

Ready to start testing your AI chatbot? Check out the Installation Guide or jump to the Quick Start.

Community & Support


⭐ Star us on GitHub