Calibration Toolkit

The calibration toolkit optimizes persona scoring parameters (component weights, normalization bounds, trait models) using labeled data to improve scoring accuracy.

Why Calibrate?

Without calibration, authenticity scores are compressed in the 0.5-0.7 range because:

  • Traits default to 0.5 (neutral) without training data
  • Normalization bounds are guesses (may not match your embedding model)
  • Component weights are hardcoded (may not be optimal for your persona)

With proper calibration, you get:

  • Better score separation: on-brand > 0.75, off-brand < 0.40
  • Higher accuracy: ROC-AUC > 0.75 (ideally > 0.85)
  • Persona-specific tuning: weights tailored to your brand voice

Quick Start

1. Generate Candidates for Labeling

Bootstrap unlabeled candidates from your existing dataset:

alignmenter calibrate generate \
  --dataset alignmenter/datasets/demo_conversations.jsonl \
  --persona alignmenter/configs/persona/default.yaml \
  --output calibration_data/unlabeled/candidates.jsonl \
  --num-samples 50 \
  --strategy diverse

Strategies:

  • diverse: Sample across all scenario tags (recommended)
  • edge_cases: Prioritize brand_trap and safety_trap scenarios
  • random: Random sampling

2. Label the Data

Interactively label responses as on-brand (1) or off-brand (0):

alignmenter calibrate label \
  --input calibration_data/unlabeled/candidates.jsonl \
  --persona alignmenter/configs/persona/default.yaml \
  --output calibration_data/labeled/default_v1_labeled.jsonl \
  --labeler your_name

Interactive Prompts:

  • Shows persona exemplars and lexicon for context
  • Asks: "1 = On-brand, 0 = Off-brand, s = Skip, q = Quit"
  • Optionally records confidence (high/medium/low) and notes
  • Saves progress after each label (safe to interrupt)

Labeling Guidelines:

  • On-brand (1): Uses preferred vocabulary, matches exemplar style/tone
  • Off-brand (0): Uses avoided words, wrong tone, generic/bland
  • Borderline: Mark confidence="low" and add notes

  • Minimum: 50 examples (25 on-brand, 25 off-brand)
  • Recommended: 100-200 examples for robust calibration
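
A labeled record ends up looking roughly like this; the field names here are illustrative, so check the JSONL that the label command actually writes for the exact schema:

{
  "response": "Absolutely, let me walk you through how to configure that.",
  "scenario_tags": ["scenario:support"],
  "label": 1,
  "confidence": "high",
  "notes": "",
  "labeler": "your_name"
}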

3. Estimate Normalization Bounds

Compute empirical min/max for style similarity:

alignmenter calibrate bounds \
  --labeled calibration_data/labeled/default_v1_labeled.jsonl \
  --persona alignmenter/configs/persona/default.yaml \
  --output calibration_data/reports/bounds_report.json

Output:

{
  "style_sim_min": 0.08,
  "style_sim_max": 0.28,
  "style_sim_mean": 0.18,
  "on_brand_style": {"mean": 0.22, ...},
  "off_brand_style": {"mean": 0.14, ...}
}
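
These bounds are used to rescale raw style similarity onto a 0-1 scale before weighting, presumably via a clamped min-max rescale along the lines of this sketch (the exact formula lives in the scorer, not this toolkit):

def normalize_style(sim, lo=0.08, hi=0.28):
    # Min-max rescale raw style similarity into [0, 1], clamped at the bounds.
    return min(max((sim - lo) / (hi - lo), 0.0), 1.0)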

4. Optimize Component Weights

Run a grid search to find the best component weights (style, traits, lexicon):

alignmenter calibrate optimize \
  --labeled calibration_data/labeled/default_v1_labeled.jsonl \
  --persona alignmenter/configs/persona/default.yaml \
  --bounds calibration_data/reports/bounds_report.json \
  --output calibration_data/reports/weights_report.json \
  --grid-step 0.1

Output:

{
  "best_weights": {
    "style": 0.5,
    "traits": 0.3,
    "lexicon": 0.2
  },
  "metrics": {
    "roc_auc": 0.87,
    "f1": 0.82,
    "correlation": 0.78
  },
  "confusion_matrix": {...}
}

Grid step: 0.1 evaluates ~66 combinations (faster), 0.05 evaluates ~231 (more thorough)
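
The combination counts follow from enumerating every non-negative (style, traits, lexicon) triple that sums to 1.0 at the chosen step; a quick way to check the numbers yourself:

def weight_grid(step):
    # All non-negative weight triples on the unit simplex at the given step size.
    n = round(1 / step)
    return [
        (i * step, j * step, (n - i - j) * step)
        for i in range(n + 1)
        for j in range(n + 1 - i)
    ]

print(len(weight_grid(0.1)))   # 66
print(len(weight_grid(0.05)))  # 231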

5. Train Trait Model

Train a logistic regression trait model on token features with the existing calibrate-persona command:

alignmenter calibrate-persona \
  --persona-path alignmenter/configs/persona/default.yaml \
  --dataset calibration_data/labeled/default_v1_labeled.jsonl \
  --out alignmenter/configs/persona/default.traits.json

6. Merge Calibration Results

Manually merge bounds + weights into the trait model file:

# Edit alignmenter/configs/persona/default.traits.json
{
  "weights": {
    "style": 0.5,
    "traits": 0.3,
    "lexicon": 0.2
  },
  "trait_model": {
    "bias": -0.123,
    "token_weights": {...},
    "phrase_weights": {}
  },
  "style_sim_min": 0.08,
  "style_sim_max": 0.28
}

TODO: Automate this merge step
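
Until the merge is automated, a short script along these lines can do it, assuming the report layouts shown above (this is a sketch, not a packaged command):

import json

traits_path = "alignmenter/configs/persona/default.traits.json"

with open(traits_path) as f:
    traits = json.load(f)
with open("calibration_data/reports/weights_report.json") as f:
    weights_report = json.load(f)
with open("calibration_data/reports/bounds_report.json") as f:
    bounds_report = json.load(f)

# Copy the optimized weights and empirical bounds into the trait-model file.
traits["weights"] = weights_report["best_weights"]
traits["style_sim_min"] = bounds_report["style_sim_min"]
traits["style_sim_max"] = bounds_report["style_sim_max"]

with open(traits_path, "w") as f:
    json.dump(traits, f, indent=2)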

7. Validate Calibration

Test calibration quality on a held-out validation set:

alignmenter calibrate validate \
  --labeled calibration_data/labeled/default_v1_labeled.jsonl \
  --persona alignmenter/configs/persona/default.yaml \
  --output calibration_data/reports/diagnostics.json \
  --train-split 0.8

Output:

{
  "validation_metrics": {
    "roc_auc": 0.85,
    "f1": 0.80,
    "optimal_threshold": 0.52
  },
  "score_distributions": {
    "on_brand": {"mean": 0.78, "std": 0.12},
    "off_brand": {"mean": 0.32, "std": 0.15}
  },
  "error_analysis": {
    "false_positives": [...],
    "false_negatives": [...]
  }
}

8. Re-run Evaluation

Use the calibrated persona:

alignmenter run --config alignmenter/configs/run.yaml

The scorer will automatically load default.traits.json if it exists.


Directory Structure

calibration_data/
├── unlabeled/
│   └── candidates.jsonl          # Generated candidates for labeling
├── labeled/
│   └── default_v1_labeled.jsonl  # Your labeled data
└── reports/
    ├── bounds_report.json        # Normalization bounds
    ├── weights_report.json       # Optimized component weights
    └── diagnostics.json          # Validation metrics

Note: calibration_data/ is gitignored to protect proprietary labeled data


CLI Reference

alignmenter calibrate generate

Generate candidate responses for labeling.

Options:

  • --dataset PATH: Input JSONL dataset (required)
  • --persona PATH: Persona YAML (required)
  • --output PATH: Output unlabeled candidates (required)
  • --num-samples INT: Number of candidates (default: 50)
  • --strategy STR: diverse | random | edge_cases (default: diverse)
  • --seed INT: Random seed (default: 42)

alignmenter calibrate label

Interactively label responses.

Options:

  • --input PATH: Unlabeled candidates JSONL (required)
  • --persona PATH: Persona YAML (required)
  • --output PATH: Output labeled JSONL (required)
  • --append: Append to existing labeled data
  • --labeler STR: Name of person labeling

alignmenter calibrate bounds

Estimate normalization bounds from labeled data.

Options:

  • --labeled PATH: Labeled JSONL (required)
  • --persona PATH: Persona YAML (required)
  • --output PATH: Output bounds report JSON (required)
  • --embedding STR: Embedding provider
  • --percentile-low FLOAT: Lower percentile (default: 5.0)
  • --percentile-high FLOAT: Upper percentile (default: 95.0)

alignmenter calibrate optimize

Optimize component weights via grid search.

Options:

  • --labeled PATH: Labeled JSONL (required)
  • --persona PATH: Persona YAML (required)
  • --output PATH: Output weights report JSON (required)
  • --bounds PATH: Bounds report JSON (optional but recommended)
  • --embedding STR: Embedding provider
  • --grid-step FLOAT: Grid step size (default: 0.1)

alignmenter calibrate validate

Validate calibration quality.

Options:

  • --labeled PATH: Labeled JSONL (required)
  • --persona PATH: Persona YAML with .traits.json (required)
  • --output PATH: Output diagnostics JSON (required)
  • --embedding STR: Embedding provider
  • --train-split FLOAT: Train fraction (default: 0.8)
  • --seed INT: Random seed (default: 42)


Best Practices

Data Collection

  1. Stratify by scenario: Ensure all scenario tags are represented
  2. Include edge cases: brand_trap and safety_trap are valuable
  3. Balance classes: Aim for 50/50 on-brand vs off-brand
  4. Quality over quantity: 50 high-quality labels > 200 rushed ones

Labeling

  1. Be consistent: Use exemplars as your north star
  2. Trust your judgment: If it feels off-brand, it probably is
  3. Mark uncertainty: Use confidence="low" for borderline cases
  4. Take breaks: Labeling fatigue leads to inconsistency

Calibration

  1. Start simple: Get 50 labels, calibrate, evaluate
  2. Iterate: Add more labels in areas of confusion
  3. Validate regularly: Use held-out validation set
  4. Version control: Keep dated snapshots of calibration files

Deployment

  1. Test before production: Run on full dataset, inspect score distributions
  2. Monitor drift: Re-calibrate if persona evolves
  3. Document changes: Note what changed between calibration versions

Troubleshooting

"ROC-AUC is low (< 0.7)"

Causes:

  • Too few labeled examples
  • Inconsistent labeling
  • Persona lexicon doesn't match actual brand voice

Solutions:

  • Add more labeled data (aim for 100+)
  • Re-label with stricter guidelines
  • Update persona exemplars and lexicon

"Scores still compressed after calibration"

Causes:

  • Trait model not trained (still using heuristic)
  • Bounds are wrong (check bounds_report.json)

Solutions:

  • Train the trait model with alignmenter calibrate-persona (Step 5)
  • Verify bounds match your embedding model's actual range

"False positives (off-brand scored high)"

Causes:

  • Lexicon weight too high
  • Missing avoided words in persona

Solutions:

  • Reduce lexicon weight, increase style/traits
  • Add more avoided words to the persona YAML

"False negatives (on-brand scored low)"

Causes:

  • Style weight too high, or bounds too narrow
  • Not enough exemplars

Solutions:

  • Widen normalization bounds (lower percentile_low, raise percentile_high)
  • Add more diverse exemplars to the persona


Advanced Topics

Scenario-Specific Calibration

To calibrate per-scenario (future):

# Filter labeled data by scenario tag
jq -c 'select(.scenario_tags | contains(["scenario:support"]))' \
  calibration_data/labeled/default_v1_labeled.jsonl \
  > calibration_data/labeled/support_only.jsonl

# Calibrate for support scenario
alignmenter calibrate optimize \
  --labeled calibration_data/labeled/support_only.jsonl \
  --persona alignmenter/configs/persona/default.yaml \
  --output calibration_data/reports/weights_support.json

Active Learning

Identify uncertain examples for labeling (future):

# Score unlabeled pool with current calibration
# Select examples where score is close to threshold (e.g., 0.45-0.55)
# Prioritize these for manual labeling
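
A minimal sketch of that selection loop, with the scorer passed in as a callable (the hook and the "response" field name below are hypothetical, not part of the current CLI):

import json

def select_uncertain(pool_path, out_path, score_fn, lo=0.45, hi=0.55):
    # score_fn: any callable that returns the calibrated authenticity score for a text.
    # Records whose score falls inside the uncertainty band are written out for labeling.
    with open(pool_path) as f, open(out_path, "w") as out:
        for line in f:
            record = json.loads(line)
            if lo <= score_fn(record["response"]) <= hi:
                out.write(line)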

Bayesian Optimization

For finer-grained weight search (future):

import optuna

def objective(trial):
    style_w = trial.suggest_float("style", 0.0, 1.0)
    traits_w = trial.suggest_float("traits", 0.0, 1.0 - style_w)
    lexicon_w = 1.0 - style_w - traits_w
    # ... evaluate ROC-AUC with these weights
    return auc

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=100)

Calibration File Format

{persona}.traits.json

{
  "weights": {
    "style": 0.5,
    "traits": 0.3,
    "lexicon": 0.2
  },
  "trait_model": {
    "bias": -0.123,
    "token_weights": {
      "absolutely": 1.2,
      "configure": 0.8,
      "lol": -2.5,
      "bro": -2.0
    },
    "phrase_weights": {
      "let me know": 0.5
    }
  },
  "style_sim_min": 0.08,
  "style_sim_max": 0.28
}

Fields:

  • weights: Component weights (must sum to 1.0)
  • trait_model.bias: Logistic regression bias term
  • trait_model.token_weights: Per-token coefficients
  • trait_model.phrase_weights: Per-phrase coefficients (optional)
  • style_sim_min/max: Normalization bounds for style similarity
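
A quick sanity check before deploying a hand-edited file (a sketch, not a built-in command):

import json

with open("alignmenter/configs/persona/default.traits.json") as f:
    traits = json.load(f)

total = sum(traits["weights"].values())
assert abs(total - 1.0) < 1e-6, f"component weights sum to {total}, expected 1.0"
assert traits["style_sim_min"] < traits["style_sim_max"], "style bounds are inverted"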


Contributing

Ideas for improving the calibration toolkit:

  • [ ] Automate merge of bounds + weights into .traits.json
  • [ ] Add alignmenter calibrate all to run full pipeline
  • [ ] Generate HTML diagnostics report with charts
  • [ ] Support cross-validation (k-fold)
  • [ ] Add active learning recommendations
  • [ ] Bayesian optimization integration
  • [ ] Scenario-specific calibration helpers

See docs/calibration_requirements.md for full design docs.