Framework evaluation
A score only counts if it's anchored
in the competency framework.
Every roleplay template declares which competencies each scenario tests. The AI scores exactly those criteria: no keyword heuristics, and no platform-wide catalogue imposed on you.
Session report
Medical Visit, Sceptical Cardiologist
Trainee: Marcela R. · Channel: Voice · 12 min
87
passed
Competencies locked when the session starts
PROD-001
Product mastery
92
OBJ-003
Objection handling
78
COMP-014
Label compliance
95
Evaluated criteria (rubric)
AI insights · Strengths
Anchored the pitch in the HCP's hypertensive patient profile by 1:15. Cited a phase-3 study when challenged on efficacy.
Areas to improve
At 4:32 the HCP asked about interaction with beta-blockers and the response was vague ("I'll check and get back to you"). Recommendation: targeted training on drug interactions.
Your company's competency framework
Every company has its own catalogue of competencies and criteria. It starts from ready-made industry templates at onboarding and is fully editable from there: you can add competencies specific to your business that don't exist in any catalogue.
The AI scores. Code decides.
The AI owns scoring. The pass/fail rule is auditable code, including "compliance blockers" that fail the session even with a high score (e.g. violating the label → failed, even with 95 overall).
Frozen for audit
Criteria locked when the session starts. Prompt pinned to a specific version. Transcript + audio + report stored with configurable retention. Audit comes out of the box.
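The "code decides" rule described above can be pictured as a small pure function. This is a minimal sketch, not the platform's actual code: the `CriterionResult` type, the `is_blocker` and `violated` fields, and the pass mark of 70 are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CriterionResult:
    code: str                  # competency code, e.g. "COMP-014"
    score: int                 # 0-100 score returned by the AI
    is_blocker: bool = False   # hypothetical: marks a compliance blocker
    violated: bool = False     # hypothetical: set when the blocker was breached


def decide(overall: int, results: list[CriterionResult], pass_mark: int = 70) -> str:
    """Return "passed" or "failed"; a violated blocker always fails."""
    if any(r.is_blocker and r.violated for r in results):
        return "failed"
    return "passed" if overall >= pass_mark else "failed"


clean = [CriterionResult("COMP-014", 95, is_blocker=True)]
breach = [CriterionResult("COMP-014", 95, is_blocker=True, violated=True)]
print(decide(87, clean))    # passed
print(decide(95, breach))   # failed
```

Because the rule is plain code with no model call inside it, the same inputs always produce the same verdict, which is what makes it auditable.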
From framework to report.
The entire chain is deterministic and auditable.
01
Framework curation
Tenant admin edits competencies, criteria, and scenario contexts. Add, edit, or deactivate: every change is versioned.
02
Template declares
In the wizard, the author picks which competencies each scenario of the template tests. Each criterion's weight is configurable.
03
Roleplay freezes
At dispatch, the criteria are snapshotted on the roleplay. Even if the template is edited later, the session runs against the snapshot.
04
AI scores, code decides
An async job assembles the prompt and transcript, asks the AI for structured JSON, parses it, applies the pass/fail rules, and persists the full aggregate.
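Steps 03 and 04 above can be sketched roughly as follows. Everything named here is an assumption for illustration: the `Criterion` type, the `freeze` and `score_session` helpers, the JSON shape, the integer weights, and the weighted mean used as the aggregate (this page does not state how the overall score is computed). The AI call and the async job machinery are left out.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class Criterion:
    code: str
    weight: int   # hypothetical relative weight set by the template author


def freeze(template_criteria: list[Criterion]) -> tuple[Criterion, ...]:
    """Snapshot the criteria at dispatch; later template edits don't touch it."""
    return tuple(template_criteria)


def score_session(ai_json: str, frozen: tuple[Criterion, ...]) -> dict:
    """Parse the AI's structured JSON and aggregate against the snapshot."""
    scores = {item["code"]: item["score"] for item in json.loads(ai_json)["scores"]}
    expected = {c.code for c in frozen}
    if set(scores) != expected:
        raise ValueError(f"AI scored {sorted(scores)}, expected {sorted(expected)}")
    total_weight = sum(c.weight for c in frozen)
    overall = round(sum(scores[c.code] * c.weight for c in frozen) / total_weight)
    return {"overall": overall, "scores": scores}


template = [Criterion("PROD-001", 3), Criterion("OBJ-003", 4), Criterion("COMP-014", 3)]
snapshot = freeze(template)
template.append(Criterion("NEW-999", 5))   # template edited after dispatch

ai_json = json.dumps({"scores": [
    {"code": "PROD-001", "score": 92},
    {"code": "OBJ-003", "score": 78},
    {"code": "COMP-014", "score": 95},
]})
report = score_session(ai_json, snapshot)
print(report["overall"])   # 87
```

Rejecting any response whose competency codes differ from the snapshot is one way to enforce "scores exactly those criteria". The pass/fail rules would then run on the parsed result.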
Why not multi-AI consensus
Multiple AIs don't add up, they diverge.
We tried it: we ran 4 models in parallel and took the average. The problem is that each model has a different systematic bias, and averaging dilutes the signal from whichever model got it right.
Instead: one curated model per surface, with a versioned prompt vetted against the rubric. Deterministic, debuggable, comparable across sessions.
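A toy calculation shows the dilution effect described above. All numbers are made up for illustration: `reference` stands in for a human-graded rubric score, and the four model scores are hypothetical, each with its own bias.

```python
# Hypothetical numbers: a human-graded reference score and four model outputs.
reference = 60
model_scores = {"model_a": 60, "model_b": 78, "model_c": 74, "model_d": 80}

average = sum(model_scores.values()) / len(model_scores)
best_model = min(model_scores, key=lambda m: abs(model_scores[m] - reference))

print(average)                    # 73.0
print(abs(average - reference))   # 13.0
print(best_model)                 # model_a
```

Here one model matches the reference exactly, yet the averaged score lands 13 points away from it because the other three biases pull in the same direction.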
Multi-AI consensus
- ✗ 4× the cost without 4× the confidence
- ✗ Averaging dilutes the signal from the model that got it right
- ✗ Hard to debug a single score
- ✗ Inconsistent comparison across sessions
Single-provider per surface
- ✓ Cost controlled per call
- ✓ Versioned and auditable prompt
- ✓ Reproducible result
- ✓ Consistent comparison across sessions
Pairs perfectly with
Adaptive Track
Framework gap → automatic roleplay
The framework on this page is the input the Adaptive Track uses to map competency gaps.
Learn more →
Dashboards
Progress per competency
Watch every team member rise (or drop) in each framework criterion over time.
Learn more →
Compliance
Audit log for every call
Prompt, model, tokens, cost, latency, all logged for regulatory audit.
Learn more →
Ready to transform how your team trains?
For organisations with 50+ employees. Book 45 minutes and we'll think the setup through with you.