Built for hiring ML engineers and applied scientists

Stop detecting AI.
Start measuring judgment.

Most interview tools still try to detect or ban AI — a losing arms race. The newer ones lifted the ban but kept the same rubric underneath, so the scores became noise. We don't measure whether a candidate can code; AI can. We measure instinct, rationale, judgment, and the ability to adapt under pressure — calibrated to the seniority you're hiring for, with every score traced to the moments in the session that produced it.

Instinct
Rationale
Judgment
Adaptability

The four things AI can't replicate. The four things we measure.

Banning AI was the wrong answer. Allowing it isn't enough.

The interview rubric was designed for a world where candidates work alone. That world doesn't exist anymore.

The AI ban is a losing arms race.
Lockdown browsers, eye tracking, plagiarism detection — all easily bypassed. Top candidates use AI anyway. You can't gate against the future, and trying signals to candidates that you're behind it.
Allowing AI on an old rubric just creates noise.
Toy problems are solved instantly by AI. Bolting an "AI mode" onto LeetCode-style assessments and headline correctness scores tells you less than nothing about the engineer underneath.
The rubric itself needs to change.
Scoping ambiguity, catching AI hallucinations, picking the right metric, defending tradeoffs — these are what distinguish strong hires now. They need calibrated 1–5 anchors per seniority level, with evidence behind every score.
Step 1 — Configure

Paste a JD. Get a tailored interview in seconds.

Paste any job description — Google, Stripe, a Series-B startup, an internal req doc. Caliber 8 reads it, picks the right archetype, sets the seniority bar, selects the attributes to probe, and tailors the problem to the company and the work. You review and ship.

caliber8.app / start
Interviewer setup
Job description
Paste the full JD here. Caliber 8 will infer the role family, level, archetype, and attribute focus.
Drag to step through the setup flow · auto-playing · hover to pause
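
For illustration only, here is a minimal sketch of the kind of interview configuration that setup flow could produce from a pasted JD. The field names and values are assumptions made for this sketch, not Caliber 8's actual schema; the attribute codes and seniority bar mirror the example session and scorecard shown below.

```python
# Hypothetical illustration -- field names are assumptions, not Caliber 8's real schema.
interview_config = {
    "role_family": "ML Engineer",
    "archetype": "Ranking & Recommendations",
    "seniority_bar": "L7",                       # L4 Junior through L8 Principal
    "attributes": ["A1", "B2", "C2", "D1", "D2", "E2", "F1"],  # the 5-7 you want probed
    "problem_seed": "rewritten from the pasted JD",
    "curveball_library": "pre-authored probes tagged to the selected attributes",
}
```
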
Step 2 — Run the session

Then watch the trace tell the story.

The candidate works through the tailored problem with their AI agent. Mid-session, the interviewer throws curveballs that force specific tradeoff decisions. Every keystroke, prompt, AI response, and decision is captured — then graded against your selected attributes at the seniority bar you set.

caliber8.app / candidate / b6b98d95…
Candidate
Problem · Payments Fraud at PaySwift
You've joined the fraud team at PaySwift, a payments processor handling ~8M transactions per day. The current rule-based system catches ~70% of confirmed fraud but generates a 3% false-positive rate — customer support is buckling and merchants are complaining. In the next 60 minutes: show how you'd build something better, how you'd validate it, and what you'd actually ship for v1. You have 6 months of historical transactions, the rule engine's outputs and decisions, labels from the investigation team, and the AI agent of your choice.
Scratchpad
(empty — candidate hasn't started)
Shared sandbox · main.py
shared with interviewer
(empty — both you and the interviewer can edit here)
AI Agent
No prompts yet.
caliber8.app / interviewer / b6b98d95…
Interviewer
Live Trace · 0 events
Waiting for candidate to start.
Shared sandbox · main.py
live · type here to send edits to candidate
(empty — waiting for candidate or your edits)
Curveball Library
Label provenance broken · fire
Latency drops to 8ms · fire
Wrong metric · fire
Drag the bar to scrub through the session · auto-playing · hover to pause
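
To make the ev_ citations in the scorecard below concrete, here is a rough sketch of what one captured trace event might look like. The shape and field names are assumptions for illustration; only the event-ID convention comes from the product.

```python
# Hypothetical sketch of one captured trace event -- the fields are assumed for
# illustration; scores in the report cite events by IDs like this one.
trace_event = {
    "id": "ev_4a1c8e",
    "t_offset_s": 192,                   # seconds into the 60-minute session
    "actor": "candidate",                # candidate | ai_agent | interviewer
    "kind": "scratchpad_edit",           # also: prompt, ai_response, sandbox_edit, curveball_fired
    "payload": "Unknowns: metric, latency budget, label noise, scope, FP/FN cost",
    "attributes": ["A1"],                # later cited as evidence for the A1 score
}
```
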
Step 3 — What you get back

A scorecard you could defend in a debrief.

Every score is anchored to the seniority bar and cites the specific trace events that produced it. No black-box rating, no vibes-based debrief — evidence in, decision out.

Devesh A. — Staff ML Engineer, YouTube Discovery Efficiency

Ranking & Recommendations archetype · 58 min · 4 curveballs fired

L7 · Calibrated to Staff bar
Hire

Confidence: High · Avg 4.4 / 5

Trajectory

Opened by mapping ambiguity in the scratchpad rather than touching code — listed metric, latency, label noise, scope, and FP/FN cost as unknowns, then committed to an audit-first plan. Caught the planted investigation_outcome_code leakage before training by prompting the AI for a feature-provenance audit. Re-planned cleanly under the 8ms latency curveball — abandoned the deep model, articulated the tradeoff, switched to a LightGBM GBDT with dollar-weighted sample weights. Held the strategic plan throughout; used the AI as a fast executor on well-scoped subtasks.
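
As a rough illustration of the re-plan described in the trajectory, a dollar-weighted LightGBM GBDT might look like the sketch below. The tiny synthetic frame and the column names (amount_usd, is_fraud) are placeholders standing in for the historical transaction data in the problem.

```python
import numpy as np
import pandas as pd
import lightgbm as lgb

# Minimal sketch of the dollar-weighted GBDT re-plan described above.
# Synthetic data and column names are placeholders for illustration only.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "amount_usd": rng.lognormal(3.0, 1.0, 5000),
    "merchant_risk": rng.random(5000),
    "txn_velocity_1h": rng.poisson(2, 5000),
})
df["is_fraud"] = (rng.random(5000) < 0.02).astype(int)

features = ["amount_usd", "merchant_risk", "txn_velocity_1h"]
weights = df["amount_usd"].clip(lower=1.0)      # weight errors by dollars at risk

model = lgb.LGBMClassifier(n_estimators=200, num_leaves=63, learning_rate=0.05)
model.fit(df[features], df["is_fraud"], sample_weight=weights)

# A shallow GBDT keeps single-transaction scoring comfortably inside an 8ms
# latency budget, unlike the abandoned deep model.
```

The point of the sketch is the sample_weight term: misclassifying a $5,000 transaction should cost more than misclassifying a $5 one, which is what the dollar-weighted metric in the report is getting at.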

Per-attribute scores · calibrated to L7
A1
Scoping under ambiguity
5 / 5

Mapped 5 dimensions of ambiguity in scratchpad before any code. Named explicit assumptions, picked scope with stated rationale, revisited under latency curveball.

ev_4a1c8e · ev_e9a4b2
B2
Eval design
4 / 5

Designed clean held-out eval on manually-reviewed slice after the label-noise curveball. Considered multiple metrics but didn't articulate slice-level eval explicitly — falls just shy of 5 at the L7 bar.

ev_895d5781
C2
Approach justification
5 / 5

Under the latency curveball, named two alternatives (GBDT vs two-stage), articulated the tradeoff axis (recall vs. complexity), picked with a stated criterion (shippable v1 with clean extension path).

ev_e9a4b2 · ev_f1b277
D1
Constraint adaptation
5 / 5

Re-planned deliberately when 8ms budget landed. Identified what survived (feature audit) vs. what didn't (deep model). Rewrote sandbox in 6 minutes.

ev_e9a4b2 · ev_f1b277
D2
Operational thinking
4 / 5

Mentioned monitoring and retraining cadence when asked about adversarial drift. Could have proposed on-call playbook unprompted to hit 5 at L7.

ev_a23b91
E2
Hallucination catching
5 / 5

Caught planted leakage proactively via own audit prompt before AI surfaced it. Did not need interviewer to flag.

ev_d72091 · ev_3c8d12
F1
Tradeoff articulation
4 / 5

Articulated tradeoffs on most decisions in scratchpad. Did not explicitly state FN cost when defending dollar-weighted metric.

ev_895d5781
AI-collaboration analysis
12
prompts to AI
2 / 2
hallucinations caught
8
AI suggestions accepted
3
AI suggestions overruled

Direction quality: Candidate held the plan and used the AI as labor on well-scoped subtasks. Pushed back on the AI's suggestion to use investigation_outcome_code after their own audit flagged it as post-hoc. Did not chase the AI on architectural choices.

Graded against the L7 bar · Every score links to trace events · Generated in 24.7s
Versus everything else

Other platforms allow AI. None of them rebuilt the rubric.

The defensible difference is what gets measured, calibrated, and cited. Allowing AI is the entry ticket. The rubric is the product.

Dimension · Take-home · Live, AI banned · AI-allowed live (newer tools) · Caliber 8
What's measured · Final output. AI use unobservable. · Correctness on narrow problems. Memorization. · Same correctness frame, with an AI toggle bolted on. · 14 behavioral attributes — scoping, tradeoffs, eval rigor, AI-collaboration.
Level calibration · None — same task, every level. · Interviewer's gut. · Single bar; gut for the rest. · 1–5 anchors per attribute, per level (L4–L8). A 5 at L4 ≠ a 5 at L7.
Evidence behind scores · Hard to challenge after the fact. · “Solved or didn't.” · Headline composite score. · Every score links to specific trace events. No black box.
Per-role tailoring · Generic problem, every candidate. · Generic problem, every candidate. · Same fixed problem catalog. · Seed problem rewritten from your JD. Curveballs probe the attributes that matter.
AI-collaboration measured · No — you can't observe it. · No — AI is banned. · Allowed but not scored as a behavior. · Yes — prompt deliberation, hallucination catching, AI direction all scored.
Curveballs / live probes · None. · Interviewer riffs. · Interviewer riffs. · Pre-authored library tagged to specific attributes — fire when ready.
Time to decision · 3–7 days end-to-end. · 60 min + 30 min debrief. · 60 min + 30–60 min debrief. · 60 min session + scorecard in ~25s.

“AI-allowed live” covers the newer wave (Probe, HackerRank's AI mode, Karat AI, etc.). They lifted the AI ban — a necessary step — but kept the old rubric underneath. The hard and defensible work is rebuilding what gets measured.

The rubric

Instinct, rationale, judgment, adaptability — broken down.

Those four words are the why. Below is the how — the behavioral attributes we score, calibrated 1–5 against the seniority bar you're hiring for. Hiring managers select 5–7 per role; the platform probes exactly those.

Problem Framing
A1 · Scoping under ambiguity
A2 · Goal interrogation
Data and Evaluation
B1 · Data skepticism
B2 · Eval design
B3 · Metric judgment
Modeling Judgment
C1 · Baseline taste
C2 · Approach justification
Production Realism
D1 · Constraint adaptation
D2 · Operational thinking
AI-Collaboration
E1 · Prompt deliberation
E2 · Hallucination catching
E3 · AI direction
Communication
F1 · Tradeoff articulation
F2 · Stakeholder translation
The novel category

AI-Collaboration measures what only an AI-allowed interview can: does the candidate prompt deliberately? Catch hallucinations? Direct the agent — or chase it? Every other interview tool is blind to these. They are the skills that matter most on the job now.

How it works

Set up in 5 minutes. Real signal in 60.

1
Pick the level + attributes.
Choose the target seniority (L4 Junior through L8 Principal) and which 5–7 of the 14 attributes matter for this role. The platform configures the problem and curveball tree to probe them.
2
Run the live session.
Candidate gets a meaty, underspecified problem and the AI agent of their choice. You watch the trace fill in real time. Throw curveballs when they create the right moment to test a specific attribute.
3
Get an evidence-cited report.
Our evaluation model reads the trace and scores every attribute against the level bar — with citations to specific trace events as evidence. You confirm or override.

Hire the people who can actually
work with AI well.

30-minute walkthrough on a JD you're actually hiring for. See the session, read the scored report, decide if it's worth a design-partner pilot.