Framework evaluation

A score only counts if it's anchored
in the competency framework.

Every roleplay template declares which competencies each scenario tests. The AI scores exactly those criteria: no keyword heuristics and no platform-wide catalogue imposed on you.

Session report

Medical Visit, Sceptical Cardiologist

Trainee: Marcela R. · Channel: Voice · 12 min

87

passed

Competencies locked when the session starts

PROD-001

Product mastery

92

OBJ-003

Objection handling

78

COMP-014

Label compliance

95

Evaluated criteria (rubric)

Arguments grounded in clinical evidence 92
Understanding of the HCP routine 85
Recovery after a strong objection 78
Closing with a clear next step 88
Label compliance RDC 658 (compliance blocker) 95

AI insights · Strengths

Anchored the pitch in the HCP's hypertensive patient profile by 1:15. Cited a phase-3 study when challenged on efficacy.

Areas to improve

At 4:32 the HCP asked about interaction with beta-blockers and the response was vague ("I'll check and get back to you"). Recommendation: targeted training on drug interactions.

Your company's competency framework

Every company has its own catalogue of competencies and criteria. You start from ready-made industry templates at onboarding, and everything is fully editable from there: you can add competencies specific to your business that don't exist in any catalogue.

The AI scores. Code decides.

The AI owns scoring. The pass/fail rule is auditable code, including "compliance blockers" that fail the session even with a high score (e.g. violating the label → failed, even with 95 overall).
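As a rough illustration of the "code decides" idea, the sketch below separates the AI's scores from a small pass/fail rule. The names `CriterionResult` and `decide`, the `is_blocker` flag, and both thresholds are hypothetical; the page does not specify how a blocker violation is detected, so this assumes a blocker criterion scoring below a floor.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CriterionResult:
    name: str
    score: int                 # 0-100, as returned by the AI
    is_blocker: bool = False   # hypothetical flag marking a compliance blocker


def decide(overall: int, results: list[CriterionResult],
           pass_mark: int = 70, blocker_floor: int = 60) -> str:
    """Return "passed" or "failed". Both thresholds are assumed values."""
    # A blocker below its floor fails the session regardless of the overall score.
    for result in results:
        if result.is_blocker and result.score < blocker_floor:
            return "failed"
    return "passed" if overall >= pass_mark else "failed"


# High overall score, but the label criterion was violated.
violation = [CriterionResult("Label compliance", 20, is_blocker=True)]
print(decide(95, violation))  # failed
```

Because the rule is plain code, it can be reviewed, versioned, and unit-tested independently of the model that produced the scores.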

Frozen for audit

Criteria are locked when the session starts. The prompt is pinned to a specific version. Transcript, audio, and report are stored with configurable retention. Auditability comes out of the box.

From framework to report.

The entire chain is deterministic and auditable.

01

Framework curation

The tenant admin edits competencies, criteria, and scenario contexts. Add, edit, deactivate: everything is versioned.

02

Template declares

In the wizard, the author picks which competencies each scenario of the template tests. Each criterion's weight is configurable.
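The page does not say how per-criterion weights combine into an overall score. One plausible reading is a weighted mean, sketched here with a hypothetical `weighted_score` helper and made-up weights.

```python
def weighted_score(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted mean of criterion scores (an assumed aggregation rule)."""
    total_weight = sum(weights[name] for name in scores)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(scores[name] * weights[name] for name in scores) / total_weight


# Two criteria from the sample rubric, with illustrative weights.
scores = {"Recovery after a strong objection": 78,
          "Closing with a clear next step": 88}
weights = {"Recovery after a strong objection": 2.0,
           "Closing with a clear next step": 1.0}
overall = weighted_score(scores, weights)
```

Doubling a criterion's weight pulls the overall score towards that criterion, which is how a template author could emphasise objection handling in one scenario and closing in another.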

03

Roleplay freezes

At dispatch, the criteria are snapshotted on the roleplay. Even if the template is edited later, the session runs against the snapshot.
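A minimal sketch of the snapshot step, assuming criteria are immutable records that get copied onto the roleplay at dispatch. The `Criterion`, `Template`, `Roleplay`, and `dispatch` names are hypothetical and stand in for whatever the platform uses internally.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Criterion:
    code: str
    name: str
    weight: float


@dataclass
class Template:
    name: str
    criteria: list[Criterion] = field(default_factory=list)


@dataclass(frozen=True)
class Roleplay:
    template_name: str
    criteria: tuple[Criterion, ...]  # immutable snapshot taken at dispatch


def dispatch(template: Template) -> Roleplay:
    # Copy the criteria into a tuple so later template edits cannot reach them.
    return Roleplay(template.name, tuple(template.criteria))


template = Template("Medical Visit", [Criterion("OBJ-003", "Objection handling", 1.0)])
roleplay = dispatch(template)

# Editing the template after dispatch leaves the snapshot untouched.
template.criteria[0] = Criterion("OBJ-003", "Objection handling", 2.0)
template.criteria.append(Criterion("PROD-001", "Product mastery", 1.0))
print(len(roleplay.criteria), roleplay.criteria[0].weight)  # 1 1.0
```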

04

AI scores, code decides

Async job: builds the prompt + transcript, asks the AI for structured JSON, parses it, applies pass/fail rules, persists the full aggregate.
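The parsing step of that job could look roughly like the sketch below. It checks that the AI's JSON covers exactly the snapshotted criteria before any pass/fail rule runs. The response shape (`{"scores": {...}}`) and the `parse_scores` helper are assumptions for illustration, not the platform's actual contract.

```python
import json


def parse_scores(raw: str, expected_codes: set[str]) -> dict[str, int]:
    """Parse the model's JSON and validate it against the frozen criteria."""
    data = json.loads(raw)
    scores = data["scores"]  # hypothetical response shape
    if set(scores) != expected_codes:
        raise ValueError("response does not match the snapshotted criteria")
    for code, value in scores.items():
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, int):
            raise ValueError(f"{code}: score must be an integer")
        if not 0 <= value <= 100:
            raise ValueError(f"{code}: score out of range")
    return dict(scores)


raw = '{"scores": {"PROD-001": 92, "OBJ-003": 78, "COMP-014": 95}}'
parsed = parse_scores(raw, {"PROD-001", "OBJ-003", "COMP-014"})
```

Rejecting a response that adds or omits a criterion keeps the model from silently scoring something the template never declared.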

Why not multi-AI consensus

Multiple AIs don't add up, they diverge.

We tried it: we ran four models in parallel and took the average. The problem is that each model has a different systematic bias, and the average dilutes the signal from whichever model got it right.

Instead: one curated model per surface, with a versioned prompt vetted against the rubric. Deterministic, debuggable, comparable across sessions.
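The averaging problem can be illustrated with a few lines of arithmetic. All numbers below are invented for the example: suppose a human reviewer would score a criterion 78, one model agrees, and three others are off in different directions.

```python
from statistics import mean

reference = 78  # hypothetical score a human reviewer would give

# Invented per-model scores: model_a is right, the others carry their own biases.
model_scores = {"model_a": 78, "model_b": 88, "model_c": 70, "model_d": 90}

consensus = mean(list(model_scores.values()))
consensus_error = abs(consensus - reference)
best_single_error = min(abs(score - reference) for score in model_scores.values())

print(consensus, consensus_error, best_single_error)  # 81.5 3.5 0
```

In this toy case the average lands 3.5 points away from the reference even though one model matched it exactly, and nothing in the averaged number tells you which model that was.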

Multi-AI consensus

  • ✗ 4× the cost without 4× the confidence
  • ✗ Averaging divergent biases dilutes the signal
  • ✗ Hard to debug a single score
  • ✗ Inconsistent comparison across sessions

Single provider per surface

  • ✓ Cost controlled per call
  • ✓ Versioned and auditable prompt
  • ✓ Reproducible result
  • ✓ Consistent comparison across sessions

Ready to transform how your team trains?

For organisations with 50+ employees. Book 45 minutes and we'll think the setup through with you.