Built for hiring ML engineers and applied scientists

Stop detecting AI.
Start measuring judgment.

Most interview tools still try to detect or ban AI — a losing arms race. The newer ones lifted the ban but kept the same rubric underneath, so the scores became noise. We don't measure whether a candidate can code; AI can. We measure instinct, rationale, judgment, and the ability to adapt under pressure — calibrated to the seniority you're hiring for, with every score traced to the moments in the session that produced it.

Instinct
Rationale
Judgment
Adaptability

The four things AI can't replicate. The four things we measure.

Banning AI was the wrong answer. Allowing it isn't enough.

The interview rubric was designed for a world where candidates work alone. That world doesn't exist anymore.

The AI ban is a losing arms race.
Lockdown browsers, eye tracking, plagiarism detection — all easily bypassed. Top candidates use AI anyway. You can't gate against the future, and trying signals to candidates that you're behind it.
Allowing AI on an old rubric just creates noise.
Toy problems are solved instantly by AI. Bolting an "AI mode" onto LeetCode-style assessments and headline correctness scores tells you less than nothing about the engineer underneath.
The rubric itself needs to change.
Scoping ambiguity, catching AI hallucinations, picking the right metric, defending tradeoffs — these are what distinguish strong hires now. They need calibrated 1–5 anchors per seniority level, with evidence behind every score.
Step 1 — Configure

Paste a JD. Get a tailored interview in seconds.

Paste any job description — Google, Stripe, a Series-B startup, an internal req doc. Caliber 8 reads it, picks the right archetype, sets the seniority bar, selects the attributes to probe, and tailors the problem to the company and the work. You review and ship.

caliber8.app / start
Interviewer setup
Job description
Paste the full JD here. Caliber 8 will infer the role family, level, archetype, and attribute focus.
Drag to step through the setup flow · auto-playing · hover to pause
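
For illustration only, here is a minimal sketch of the kind of interview configuration that setup flow could produce from a pasted JD. The field names and values are assumptions made for this sketch, not Caliber 8's actual schema; the attribute codes and seniority bar mirror the example session and scorecard shown below.

```python
# Hypothetical illustration -- field names are assumptions, not Caliber 8's real schema.
interview_config = {
    "role_family": "ML Engineer",
    "archetype": "Ranking & Recommendations",
    "seniority_bar": "L7",                       # L4 Junior through L8 Principal
    "attributes": ["A1", "B2", "C2", "D1", "D2", "E2", "F1"],  # the 5-7 you want probed
    "problem_seed": "rewritten from the pasted JD",
    "curveball_library": "pre-authored probes tagged to the selected attributes",
}
```
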
Step 2 — Run the session

Then watch the trace tell the story.

The candidate works through the tailored problem with their AI agent. Mid-session, the interviewer throws curveballs that force specific tradeoff decisions. Every keystroke, prompt, AI response, and decision is captured — then graded against your selected attributes at the seniority bar you set.

caliber8.app / candidate / b6b98d95…
Candidate
Problem · Payments Fraud at PaySwift
You've joined the fraud team at PaySwift, a payments processor handling ~8M transactions per day. The current rule-based system catches ~70% of confirmed fraud but generates a 3% false-positive rate — customer support is buckling and merchants are complaining. In the next 60 minutes: show how you'd build something better, how you'd validate it, and what you'd actually ship for v1. You have 6 months of historical transactions, the rule engine's outputs and decisions, labels from the investigation team, and the AI agent of your choice.
Scratchpad
(empty — candidate hasn't started)
Shared sandbox · main.py
shared with interviewer
(empty — both you and the interviewer can edit here)
AI Agent
No prompts yet.
caliber8.app / interviewer / b6b98d95…
Interviewer
Live Trace · 0 events
Waiting for candidate to start.
Shared sandbox · main.py
live · type here to send edits to candidate
(empty — waiting for candidate or your edits)
Curveball Library
Label provenance broken · fire
Latency drops to 8ms · fire
Wrong metric · fire
Drag the bar to scrub through the session · auto-playing · hover to pause
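
To make the ev_ citations in the scorecard below concrete, here is a rough sketch of what one captured trace event might look like. The shape and field names are assumptions for illustration; only the event-ID convention comes from the product.

```python
# Hypothetical sketch of one captured trace event -- the fields are assumed for
# illustration; scores in the report cite events by IDs like this one.
trace_event = {
    "id": "ev_4a1c8e",
    "t_offset_s": 192,                   # seconds into the 60-minute session
    "actor": "candidate",                # candidate | ai_agent | interviewer
    "kind": "scratchpad_edit",           # also: prompt, ai_response, sandbox_edit, curveball_fired
    "payload": "Unknowns: metric, latency budget, label noise, scope, FP/FN cost",
    "attributes": ["A1"],                # later cited as evidence for the A1 score
}
```
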
Step 3 — What you get back

A scorecard you could defend in a debrief.

Every score is anchored to the seniority bar and cites the specific trace events that produced it. No black-box rating, no vibes-based debrief — evidence in, decision out.

Devesh A. — Staff ML Engineer, YouTube Discovery Efficiency

Ranking & Recommendations archetype · 58 min · 4 curveballs fired

L7 · Calibrated to Staff bar
Hire

Confidence: High · Avg 4.4 / 5

Trajectory

Opened by mapping ambiguity in the scratchpad rather than touching code — listed metric, latency, label noise, scope, and FP/FN cost as unknowns, then committed to an audit-first plan. Caught the planted investigation_outcome_code leakage before training by prompting the AI for a feature-provenance audit. Re-planned cleanly under the 8ms latency curveball — abandoned the deep model, articulated the tradeoff, switched to a LightGBM GBDT with dollar-weighted sample weights. Held the strategic plan throughout; used the AI as a fast executor on well-scoped subtasks.
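
As a rough illustration of the re-plan described in the trajectory, a dollar-weighted LightGBM GBDT might look like the sketch below. The tiny synthetic frame and the column names (amount_usd, is_fraud) are placeholders standing in for the historical transaction data in the problem.

```python
import numpy as np
import pandas as pd
import lightgbm as lgb

# Minimal sketch of the dollar-weighted GBDT re-plan described above.
# Synthetic data and column names are placeholders for illustration only.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "amount_usd": rng.lognormal(3.0, 1.0, 5000),
    "merchant_risk": rng.random(5000),
    "txn_velocity_1h": rng.poisson(2, 5000),
})
df["is_fraud"] = (rng.random(5000) < 0.02).astype(int)

features = ["amount_usd", "merchant_risk", "txn_velocity_1h"]
weights = df["amount_usd"].clip(lower=1.0)      # weight errors by dollars at risk

model = lgb.LGBMClassifier(n_estimators=200, num_leaves=63, learning_rate=0.05)
model.fit(df[features], df["is_fraud"], sample_weight=weights)

# A shallow GBDT keeps single-transaction scoring comfortably inside an 8ms
# latency budget, unlike the abandoned deep model.
```

The point of the sketch is the sample_weight term: misclassifying a $5,000 transaction should cost more than misclassifying a $5 one, which is what the dollar-weighted metric in the report is getting at.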

Per-attribute scores · calibrated to L7
A1
Scoping under ambiguity
5 / 5

Mapped 5 dimensions of ambiguity in scratchpad before any code. Named explicit assumptions, picked scope with stated rationale, revisited under latency curveball.

ev_4a1c8e · ev_e9a4b2
B2
Eval design
4 / 5

Designed clean held-out eval on manually-reviewed slice after the label-noise curveball. Considered multiple metrics but didn't articulate slice-level eval explicitly — falls just shy of 5 at the L7 bar.

ev_895d5781
C2
Approach justification
5 / 5

Under the latency curveball, named two alternatives (GBDT vs two-stage), articulated the tradeoff axis (recall vs. complexity), picked with a stated criterion (shippable v1 with clean extension path).

ev_e9a4b2 · ev_f1b277
D1
Constraint adaptation
5 / 5

Re-planned deliberately when 8ms budget landed. Identified what survived (feature audit) vs. what didn't (deep model). Rewrote sandbox in 6 minutes.

ev_e9a4b2 · ev_f1b277
D2
Operational thinking
4 / 5

Mentioned monitoring and retraining cadence when asked about adversarial drift. Could have proposed on-call playbook unprompted to hit 5 at L7.

ev_a23b91
E2
Hallucination catching
5 / 5

Caught planted leakage proactively via own audit prompt before AI surfaced it. Did not need interviewer to flag.

ev_d72091 · ev_3c8d12
F1
Tradeoff articulation
4 / 5

Articulated tradeoffs on most decisions in scratchpad. Did not explicitly state FN cost when defending dollar-weighted metric.

ev_895d5781
AI-collaboration analysis
12
prompts to AI
2 / 2
hallucinations caught
8
AI suggestions accepted
3
AI suggestions overruled

Direction quality: Candidate held the plan and used the AI as labor on well-scoped subtasks. Pushed back on the AI's suggestion to use investigation_outcome_code after their own audit flagged it as post-hoc. Did not chase the AI on architectural choices.

Graded against the L7 bar · Every score links to trace events · Generated in 24.7s
Versus everything else

Other platforms allow AI. None of them rebuilt the rubric.

The defensible difference is what gets measured, calibrated, and cited. Allowing AI is the entry ticket. The rubric is the product.

Dimension · Take-home · Live, AI banned · AI-allowed live (newer tools) · Caliber 8
What's measured · Final output. AI use unobservable. · Correctness on narrow problems. Memorization. · Same correctness frame, with an AI toggle bolted on. · 14 behavioral attributes — scoping, tradeoffs, eval rigor, AI-collaboration.
Level calibration · None — same task, every level. · Interviewer's gut. · Single bar; gut for the rest. · 1–5 anchors per attribute, per level (L4–L8). A 5 at L4 ≠ a 5 at L7.
Evidence behind scores · Hard to challenge after the fact. · “Solved or didn't.” · Headline composite score. · Every score links to specific trace events. No black box.
Per-role tailoring · Generic problem, every candidate. · Generic problem, every candidate. · Same fixed problem catalog. · Seed problem rewritten from your JD. Curveballs probe the attributes that matter.
AI-collaboration measured · No — you can't observe it. · No — AI is banned. · Allowed but not scored as a behavior. · Yes — prompt deliberation, hallucination catching, AI direction all scored.
Curveballs / live probes · None. · Interviewer riffs. · Interviewer riffs. · Pre-authored library tagged to specific attributes — fire when ready.
Time to decision · 3–7 days end-to-end. · 60 min + 30 min debrief. · 60 min + 30–60 min debrief. · 60 min session + scorecard in ~25s.

“AI-allowed live” covers the newer wave (Probe, HackerRank's AI mode, Karat AI, etc.). They lifted the AI ban — a necessary step — but kept the old rubric underneath. The hard and defensible work is rebuilding what gets measured.

The rubric

Instinct, rationale, judgment, adaptability — broken down.

Those four words are the why. Below is the how — the behavioral attributes we score, calibrated 1–5 against the seniority bar you're hiring for. Hiring managers select 5–7 per role; the platform probes exactly those.

Problem Framing
A1 · Scoping under ambiguity
A2 · Goal interrogation
Data and Evaluation
B1 · Data skepticism
B2 · Eval design
B3 · Metric judgment
Modeling Judgment
C1 · Baseline taste
C2 · Approach justification
Production Realism
D1 · Constraint adaptation
D2 · Operational thinking
AI-Collaboration
E1 · Prompt deliberation
E2 · Hallucination catching
E3 · AI direction
Communication
F1 · Tradeoff articulation
F2 · Stakeholder translation
The novel category

AI-Collaboration measures what only an AI-allowed interview can: does the candidate prompt deliberately? Catch hallucinations? Direct the agent — or chase it? Every other interview tool is blind to these. They are the skills that matter most on the job now.

How it works

Set up in 5 minutes. Real signal in 60.

1
Pick the level + attributes.
Choose the target seniority (L4 Junior through L8 Principal) and which 5–7 of the 14 attributes matter for this role. The platform configures the problem and curveball tree to probe them.
2
Run the live session.
Candidate gets a meaty, underspecified problem and the AI agent of their choice. You watch the trace fill in real time. Throw curveballs when they create the right moment to test a specific attribute.
3
Get an evidence-cited report.
Our evaluation model reads the trace and scores every attribute against the level bar — with citations to specific trace events as evidence. You confirm or override.

Hire the people who can actually
work with AI well.

30-minute walkthrough on a JD you're actually hiring for. See the session, read the scored report, decide if it's worth a design-partner pilot.