Why We Use One AI Provider Per Surface (And Measured the Tradeoff)

Multi-model consensus scoring sounds safer, but we measured it. The accuracy gain was marginal while cost and latency roughly doubled or tripled. Here is why one well-prompted model per surface wins.

Roleplays Team

March 18, 2026 5 min read

If your organization uses AI to evaluate employee training, there is a question worth asking: does running more models actually make the score more reliable, or just more expensive?

“Multi-AI consensus” has an intuitive appeal. Route every answer through several models, average the results, and surely the bias washes out. We tested exactly that. The accuracy gain was real but marginal, while cost and latency climbed sharply. So we made a deliberate choice: Roleplays scores each session with a single specialized model per surface, chat and voice, and invests the saved budget in sharper criteria and faster feedback.

Here is the reasoning, with the numbers we actually saw.

0-100: the score per criterion, each backed by a quote from the transcript (how Roleplays grades every session).

The Appeal of Consensus (And Where It Breaks Down)

The argument for multi-model scoring is straightforward. Every model is a product of its training data and objectives, so every model has blind spots:

Verbosity bias: certain models reward longer answers even when a concise response is objectively better
Formality preference: models trained on academic text may penalize conversational language, even in sales contexts where warmth matters
Cultural framing: a model may default to North American business norms, unfairly penalizing approaches effective in Latin American or Asian markets
Anchoring to phrasing: some models latch onto specific keywords rather than evaluating the substance of a response

If you average several models, the logic goes, these quirks cancel out. The problem is what averaging costs you, and how little it buys.

“Running three models to grade one answer triples your cost and latency. The question is whether the score gets three times better. It does not.”

What We Measured

We ran a controlled comparison: the same scenarios and trainee responses scored by a single well-prompted model versus a consensus of multiple models, both validated against expert human graders.

The accuracy gain was marginal

A single model with a strong, explicit rubric already correlated tightly with human graders. Adding more models nudged that correlation up by a few points at best, and most of the disagreement they “resolved” came from vague criteria, not from model bias. Tighten the rubric and the disagreement largely disappears, with one model.
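
To make "correlated tightly with human graders" concrete, here is a minimal sketch of how such agreement can be computed as a Pearson correlation. The function name and the sample scores are assumptions for illustration; they are placeholder values, not the benchmark data, and this is not necessarily the exact statistic used in the comparison above.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of scores."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / sqrt(var_x * var_y)


# Placeholder scores for illustration only (hypothetical, not measured data).
human_scores = [82, 45, 90, 60, 73]
model_scores = [80, 50, 88, 65, 70]

agreement = pearson(human_scores, model_scores)
print(f"model vs. human correlation: {agreement:.2f}")
```

Running the same calculation once for a single model and once for a consensus score, against the same human grades, is one way to see how many points the extra models actually add.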

The cost and latency did not

Every extra evaluator is another full inference pass. Two models roughly double the cost per session. Three roughly triple it, and you wait for the slowest one to finish before you can synthesize a score. For learners practicing in real time, that delay is the difference between feedback that lands and feedback that arrives after they have moved on.
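
The arithmetic can be sketched in a few lines. This is an illustrative model under stated assumptions, not a description of any production pipeline: the function name, the cost units, and the latency figures are all hypothetical. Cost adds up across evaluators; latency is set by the slowest evaluator when they run in parallel, or by their sum when they run one after another, plus a synthesis step in either case.

```python
def consensus_cost_and_latency(evaluators, synthesis_latency_s=0.0, parallel=True):
    """Estimate the cost and latency of scoring one session.

    evaluators: list of (cost_units, latency_seconds) tuples, one per model.
    Cost always sums. Latency is the slowest model when run in parallel,
    or the sum of all models when run sequentially, plus synthesis time.
    """
    total_cost = sum(cost for cost, _ in evaluators)
    latencies = [latency for _, latency in evaluators]
    model_latency = max(latencies) if parallel else sum(latencies)
    return total_cost, model_latency + synthesis_latency_s


# Hypothetical figures: each model costs 1 unit per session.
single = [(1.0, 2.0)]
three = [(1.0, 2.0), (1.0, 3.0), (1.0, 2.5)]

print(consensus_cost_and_latency(single))
print(consensus_cost_and_latency(three, synthesis_latency_s=1.0))
print(consensus_cost_and_latency(three, synthesis_latency_s=1.0, parallel=False))
```

Even in the parallel case, the learner waits for the slowest model and then for the synthesis pass, while the cost grows with every evaluator added.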

2-3x

the cost and latency of consensus scoring, for a few points of accuracy at best

Source: internal evaluation benchmark, single vs multi-model

Where the Reliability Actually Comes From

Bias does not get fixed by adding more graders. It gets fixed by telling the grader exactly what good looks like. That is where we put the work.

Per-criterion scoring, not a single number

Roleplays does not return one opaque grade. Each session is scored per criterion, from 0 to 100, against a rubric the team defines. “Demonstrates product knowledge” becomes “correctly states the three primary indications and at least one contraindication.” Explicit criteria are what make a single model consistent, and what make a consensus mostly redundant.

Evidence, not vibes

Every criterion score is backed by a direct quote from the transcript. That is what makes a score auditable: a reviewer can see the exact moment that earned or lost the points, instead of trusting an aggregate.
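
One way to enforce that rule mechanically is to check that each cited quote really appears in the transcript. The sketch below is an assumption about how such a check could look, not the Roleplays implementation; the field names (`name`, `score`, `quote`) and the sample transcript are hypothetical.

```python
def normalize(text):
    """Collapse whitespace and ignore case so line breaks do not break matching."""
    return " ".join(text.split()).casefold()


def unsupported_scores(transcript, criterion_scores):
    """Return the names of criteria whose quote is missing or not in the transcript."""
    haystack = normalize(transcript)
    return [
        item["name"]
        for item in criterion_scores
        if not item.get("quote") or normalize(item["quote"]) not in haystack
    ]


# Hypothetical transcript and scores, for illustration only.
transcript = (
    "Rep: This product is indicated for adults only.\n"
    "Rep: Please note it must not be taken with alcohol."
)
scores = [
    {"name": "states indication", "score": 90, "quote": "indicated for adults only"},
    {"name": "mandatory disclosure", "score": 85, "quote": "side effects include drowsiness"},
    {"name": "builds rapport", "score": 70, "quote": ""},
]

print(unsupported_scores(transcript, scores))
```

Any criterion this check flags has a score that a reviewer cannot trace back to the conversation, so it should be rejected or re-scored.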

Weighting and critical criteria

Criteria are weighted by what matters for the competency. A criterion can be marked critical, meaning failing it fails the session on its own, regardless of the rest. In compliance contexts, a missed mandatory disclosure should sink the session even if everything else was excellent. Consensus averaging actively works against this: it smooths over the single failure you most need to catch.
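
A minimal sketch of weighting with a critical-criterion override is shown here. The class, the pass mark of 70, and the rule that a critical criterion "fails" when it scores below the pass mark are all assumptions made for illustration; the article does not specify how Roleplays defines these thresholds.

```python
from dataclasses import dataclass


@dataclass
class CriterionResult:
    name: str
    score: int          # 0-100, per criterion
    weight: float       # relative importance
    critical: bool = False


PASS_MARK = 70  # hypothetical threshold, chosen only for this example


def session_result(results, pass_mark=PASS_MARK):
    """Weighted average of criterion scores, with critical criteria as a hard gate."""
    total_weight = sum(r.weight for r in results)
    weighted = sum(r.score * r.weight for r in results) / total_weight
    failed_critical = [r.name for r in results if r.critical and r.score < pass_mark]
    # A session passes only if the average clears the bar AND no critical criterion failed.
    passed = weighted >= pass_mark and not failed_critical
    return {
        "weighted_score": weighted,
        "passed": passed,
        "failed_critical": failed_critical,
    }


results = [
    CriterionResult("builds rapport", 95, weight=1),
    CriterionResult("product knowledge", 90, weight=2),
    CriterionResult("mandatory disclosure", 20, weight=1, critical=True),
]
print(session_result(results))
```

In this example the weighted average is 73.75, which would pass on its own; the critical rule is what catches the missed disclosure that an average hides.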

Strengths, gaps, and recommendations

The report returns more than a number: strengths, gaps, and concrete recommendations. That is the feedback that changes behavior, and it comes from one model reasoning over the whole transcript, not from reconciling several partial views.

What About Compliance-Driven Training?

In regulated industries, training evaluation is not just a development tool; it is auditable evidence. When a pharmaceutical company demonstrates GMP compliance to ANVISA, or a financial institution proves suitability training to a regulator, the integrity of the method matters.

Here is the thing: auditability comes from traceability, not from the number of models. A score tied to a specific quote, against a published rubric, with a clear critical-criterion rule, is far more defensible than “three models agreed on a 70.” The first one a regulator can verify line by line. The second one is a black box wearing a committee’s coat.

Requirement	How we meet it
Traceability	Every criterion score links to the exact transcript quote that justifies it
Consistency	A single specialized model per surface applies the same rubric the same way every time
Defensible methodology	Explicit weights and critical criteria, documented and reproducible

One Provider Per Surface

Chat and voice are different problems. A voice transcript carries pacing, interruptions, and filler that text does not. So rather than forcing one general model to do everything, Roleplays uses one specialized model per surface, tuned for that channel. That is specialization, not consensus: each surface gets the best single grader for its medium, scoring against the same kind of explicit, evidence-backed rubric.

Getting Evaluation Right

You do not need a fleet of models to grade training well. You need:

Explicit rubrics: Replace vague criteria with behavioral specifics that a single model, and a human, can apply unambiguously.
Evidence requirements: Every score should point to a quote. If a grade cannot cite the transcript, it should not stand.
Critical-criterion rules: Decide upfront which failures should fail the whole session, and make them non-negotiable.
Validation against humans: Before going live, correlate model scores with expert graders. A well-prompted single model clears this bar.

Conclusion

Consensus scoring sounds safer than it is. We measured the tradeoff and chose the option that is more reliable, cheaper, and faster: a single specialized model per surface, scoring per criterion from 0 to 100, every score backed by transcript evidence, with weighted and critical criteria, returning strengths, gaps, and recommendations.

Roleplays grades every simulated conversation this way, so the score is something you can read, verify, and act on.
