Most interview tools still try to detect or ban AI — a losing arms race. The newer ones lifted the ban but kept the same rubric underneath, so the scores became noise. We don't measure whether candidates can code. AI can. We measure instinct, rationale, judgment, and the ability to adapt under pressure — calibrated to the seniority you're hiring for, with every score traced to the moments in the session that produced it.
The four things AI can't replicate. The four things we measure.
The interview rubric was designed for a world where candidates work alone. That world doesn't exist anymore.
Paste any job description — Google, Stripe, a Series-B startup, an internal req doc. Caliber 8 reads it, picks the right archetype, sets the seniority bar, selects the attributes to probe, and tailors the problem to the company and the work. You review and ship.
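As a rough sketch of what that inferred configuration could look like — the field names and values below are illustrative assumptions, not the actual Caliber 8 schema:

```typescript
// Hypothetical shape of the role configuration inferred from a pasted JD.
// All names and values here are illustrative assumptions, not the production schema.
interface RoleConfig {
  roleFamily: string;                       // e.g. "ml-engineering"
  archetype: string;                        // e.g. "Ranking & Recommendations"
  level: "L4" | "L5" | "L6" | "L7" | "L8";  // the seniority bar scores are anchored to
  attributes: string[];                     // the 5–7 behavioral attributes selected for this role
  problemSeed: string;                      // seed problem, rewritten for the company and the work
}

const exampleConfig: RoleConfig = {
  roleFamily: "ml-engineering",
  archetype: "Ranking & Recommendations",
  level: "L7",
  attributes: ["scoping", "tradeoff-articulation", "eval-rigor", "ai-collaboration"],
  problemSeed: "fraud-ranking-under-latency-budget",
};
```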
The candidate works through the tailored problem with their AI agent. Mid-session, the interviewer throws curveballs that force specific tradeoff decisions. Every keystroke, prompt, AI response, and decision is captured — then graded against your selected attributes at the seniority bar you set.
Every score is anchored to the seniority bar and cites the specific trace events that produced it. No black-box rating, no vibes-based debrief — evidence in, decision out.
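To make "evidence in, decision out" concrete, here is a hedged sketch of how a single attribute score might link back to trace events; the structure and field names are assumptions for illustration, not the actual report format:

```typescript
// Illustrative only: one attribute score citing the trace events that produced it.
interface TraceEvent {
  id: string;
  at: string;                               // offset into the session, e.g. "00:41:12"
  kind: "prompt" | "ai_response" | "edit" | "scratchpad" | "curveball";
  summary: string;
}

interface AttributeScore {
  attribute: string;                        // e.g. "eval-rigor"
  score: 1 | 2 | 3 | 4 | 5;                 // read against the level anchor, not an absolute scale
  levelAnchor: string;                      // e.g. "L7"
  evidence: TraceEvent[];                   // every score cites the moments behind it
}

const evalRigor: AttributeScore = {
  attribute: "eval-rigor",
  score: 4,
  levelAnchor: "L7",
  evidence: [
    { id: "e-212", at: "00:23:40", kind: "scratchpad", summary: "plans held-out eval on manually reviewed slice" },
    { id: "e-305", at: "00:41:12", kind: "prompt", summary: "asks the AI for a feature-provenance audit before training" },
  ],
};
```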
Ranking & Recommendations archetype · 58 min · 4 curveballs fired
Confidence: High · Avg 4.4 / 5
Opened by mapping ambiguity in the scratchpad rather than touching code — listed metric, latency, label noise, scope, and FP/FN cost as unknowns, then committed to an audit-first plan. Caught the planted investigation_outcome_code leakage before training by prompting the AI for a feature-provenance audit. Re-planned cleanly under the 8ms latency curveball — abandoned the deep model, articulated the tradeoff, switched to a LightGBM GBDT with dollar-weighted sample weights. Held the strategic plan throughout; used the AI as a fast executor on well-scoped subtasks.
Mapped 5 dimensions of ambiguity in scratchpad before any code. Named explicit assumptions, picked scope with stated rationale, revisited under latency curveball.
Designed clean held-out eval on manually reviewed slice after label-noise curveball. Considered multiple metrics but didn't articulate slice-level eval explicitly — falls just shy of 5 at L7 bar.
Under latency curveball, named two alternatives (GBDT vs. two-stage), articulated tradeoff axis (recall vs. complexity), picked with stated criterion (shippable v1 with clean extension path).
Re-planned deliberately when 8ms budget landed. Identified what survived (feature audit) vs. what didn't (deep model). Rewrote sandbox in 6 minutes.
Mentioned monitoring and retraining cadence when asked about adversarial drift. Could have proposed on-call playbook unprompted to hit 5 at L7.
Caught planted leakage proactively via own audit prompt before AI surfaced it. Did not need interviewer to flag.
Articulated tradeoffs on most decisions in scratchpad. Did not explicitly state FN cost when defending dollar-weighted metric.
Direction quality: Candidate held the plan and used the AI as labor on well-scoped subtasks. Pushed back on the AI's suggestion to use investigation_outcome_code after their own audit flagged it as post-hoc. Did not chase the AI on architectural choices.
The defensible difference is what gets measured, calibrated, and cited. Allowing AI is the entry ticket. The rubric is the product.
| Dimension | Take-home | Live, AI banned | AI-allowed live (newer tools) | Caliber 8 |
|---|---|---|---|---|
| What's measured | Final output. AI use unobservable. | Correctness on narrow problems. Memorization. | Same correctness frame, with an AI toggle bolted on. | 14 behavioral attributes, including scoping, tradeoffs, eval rigor, AI-collaboration. |
| Level calibration | None — same task, every level. | Interviewer's gut. | Single bar; gut for the rest. | 1–5 anchors per attribute, per level (L4–L8). A 5 at L4 ≠ a 5 at L7. |
| Evidence behind scores | Hard to challenge after the fact. | “Solved or didn't.” | Headline composite score. | Every score links to specific trace events. No black box. |
| Per-role tailoring | Generic problem, every candidate. | Generic problem, every candidate. | Same fixed problem catalog. | Seed problem rewritten from your JD. Curveballs probe the attributes that matter. |
| AI-collaboration measured | No — you can't observe it. | No — AI is banned. | Allowed but not scored as a behavior. | Yes — prompt deliberation, hallucination catching, AI direction all scored. |
| Curveballs / live probes | None. | Interviewer riffs. | Interviewer riffs. | Pre-authored library tagged to specific attributes — fire when ready. |
| Time to decision | 3–7 days end-to-end. | 60 min + 30 min debrief. | 60 min + 30–60 min debrief. | 60 min session + scorecard in ~25s. |
“AI-allowed live” covers the newer wave (Probe, HackerRank's AI mode, Karat AI, etc.). They lifted the AI ban — a necessary step — but kept the old rubric underneath. The hard and defensible work is rebuilding what gets measured.
Instinct, rationale, judgment, adaptability: those four words are the why. Below is the how — the behavioral attributes we score, calibrated 1–5 against the seniority bar you're hiring for. Hiring managers select 5–7 per role; the platform probes exactly those.
AI-Collaboration measures what only an AI-allowed interview can: does the candidate prompt deliberately? Catch hallucinations? Direct the agent — or chase it? Every other interview tool is blind to these. They are the skills that matter most on the job now.
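As a hedged illustration of what level calibration could look like for this attribute — the anchor wording below is invented for the sketch, not the actual rubric:

```typescript
// Invented anchor text, shown only to illustrate "a 5 at L4 ≠ a 5 at L7".
const aiCollaborationAnchors: Record<"L4" | "L7", Record<3 | 5, string>> = {
  L4: {
    3: "Prompts are specific; sanity-checks AI output before accepting it.",
    5: "Catches at least one hallucinated API or fabricated fact without being told to look.",
  },
  L7: {
    3: "Directs the agent from a written plan and verifies its output against an eval of the candidate's own design.",
    5: "Decomposes the problem so the agent executes well-scoped subtasks, audits provenance proactively, and overrides the agent on architectural calls.",
  },
};
```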
30-minute walkthrough on a JD you're actually hiring for. See the session, read the scored report, decide if it's worth a design-partner pilot.