Research

AI detection vs oral defense: a better way to verify student understanding.

Published May 12, 2026 · 6 minute read

When written submissions stopped being a reliable proxy for understanding, schools reached for AI text detectors. The detection workflow is structurally limited: it estimates authorship, not comprehension. This page explains why detection answers the wrong question, what an oral defense actually measures, and how Tolus automates that workflow for classroom use.

The category

What AI detectors actually do.

An AI text detector estimates the probability that a passage was produced by a large language model, based on statistical features of the text such as token-level perplexity and burstiness. These features are correlations, not proofs: human-written text that happens to be uniform in style can score high, and AI-written text that has been mildly edited can score low. The vendors concede the limit. OpenAI released its own AI Text Classifier in early 2023 and then discontinued it the same year, citing its low rate of accuracy. The errors are also not evenly distributed: Liang and colleagues (2023) found that several GPT detectors routinely flagged writing by non-native English speakers as AI-generated.

Sources: OpenAI, “AI Text Classifier” (released and discontinued in 2023 for low accuracy); Liang et al., “GPT detectors are biased against non-native English writers,” Patterns, 2023.
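To make the two features named above concrete, here is a minimal sketch. It assumes you already have per-token log-probabilities from some language model; the numbers below are invented for illustration. It also assumes one common reading of burstiness, the spread of perplexity across sentences. It is not any vendor's actual detector.

```python
import math
from statistics import mean, pstdev


def perplexity(token_logprobs):
    """Perplexity: exp of the negative mean natural-log probability per token."""
    return math.exp(-mean(token_logprobs))


def burstiness(sentence_perplexities):
    """Assumed definition: population std. dev. of per-sentence perplexity."""
    return pstdev(sentence_perplexities)


def features(sentences):
    """Return (overall perplexity, burstiness) for a list of sentences,
    where each sentence is a list of token log-probabilities."""
    per_sentence = [perplexity(s) for s in sentences]
    all_tokens = [lp for s in sentences for lp in s]
    return perplexity(all_tokens), burstiness(per_sentence)


# Hypothetical log-probabilities, grouped by sentence.
uniform_text = [[-1.0, -1.0, -1.0], [-1.0, -1.0, -1.0]]
varied_text = [[-0.5, -0.5, -0.5], [-3.0, -3.0, -3.0]]

print(features(uniform_text))
print(features(varied_text))
```

The uniform passage has zero burstiness under this definition, which is the kind of signal a detector may read as machine-like, whoever actually wrote it.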

The detector’s output is a probability. The teacher is left to translate that probability into an action: ignore it, investigate informally, or escalate to a discipline process that can take hours and, if the flag is wrong, bruise trust.
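A short worked example shows why a flag is hard to act on. The sketch below applies Bayes' rule to a flagged submission. All three input rates are hypothetical, chosen only to illustrate the arithmetic; they are not measurements of any real detector or classroom.

```python
def posterior_ai_given_flag(base_rate, true_positive_rate, false_positive_rate):
    """P(text is AI-written | detector flagged it), by Bayes' rule."""
    flagged_ai = base_rate * true_positive_rate
    flagged_human = (1 - base_rate) * false_positive_rate
    return flagged_ai / (flagged_ai + flagged_human)


# Hypothetical inputs: 10% of submissions are AI-written, the detector
# catches 90% of those, and it wrongly flags 5% of human-written work.
p = posterior_ai_given_flag(
    base_rate=0.10,
    true_positive_rate=0.90,
    false_positive_rate=0.05,
)
print(f"P(AI | flagged) = {p:.2f}")
```

With these made-up inputs, about one flagged submission in three was written by the student, even though the detector looks accurate on paper.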

The category limit matters more than any specific false-positive rate. Even a perfect AI-text classifier would not tell you whether the student understands the argument they handed in. It would only tell you who or what produced the text.

The failure modes

Why detection-only workflows are limited.

Detection-only workflows put the teacher in an awkward position: holding a probability with no audit trail, no quoted evidence, and no way to give the student a fair chance to respond. In practice, three failure modes recur:

False accusations. The probability is wrong, the investigation goes badly, the school loses parental trust.
Drift to inaction. Teachers stop acting on the probability because it cannot be defended. The signal becomes noise.
Decay over time. Detection accuracy slips as models evolve. The same workflow returns less actionable signal every year.

The deeper issue is that detection answers a question the classroom no longer needs answered. Whether the student used AI to write the first draft is increasingly unanswerable, and also increasingly beside the point. The grade-relevant question is whether the student understands the work.

The alternative

What oral defense measures.

Oral defense is older than software. Doctoral candidates have defended their dissertations for centuries. It works because the questions are generated from the candidate’s own work and asked under genuine assessment pressure, and the examiner is free to follow up where reasoning is shallow. What the candidate says becomes the evidence.

At classroom scale, three properties matter:

Paraphrase as a proxy for understanding. A student who can express a concept without using its label has internalized it; a student who can only repeat the phrase has not. Defense forces paraphrase.
Transfer as a proxy for depth. Asked for an unprompted second example, a student who understands the concept supplies one. A student who has memorized the surface cannot.
Voice as a fairness primitive. Voice removes the easiest cheating shortcut and produces a transcript that is auditable and citable. The student is heard directly, not judged by inference.

These properties are durable. They do not degrade as models improve, because they measure the student, not the writing.

The behavior change

Why oral defense changes how students study.

When students know a short voice exam follows every written submission, the rational study strategy shifts. Optimizing the submission no longer pays; optimizing comprehension does. A student who uses AI to learn the material defends fluently. A student who outsources the writing without engaging cannot.

This is among the most durable findings in assessment research, often called the backwash (or washback) effect: the form of the assessment, far more than any instruction to study deeply, governs how students actually prepare. Make the assessment a defense of one’s own reasoning, and the rational incentive shifts from producing a clean artifact to being able to explain it.

Sources: Biggs and Tang, Teaching for Quality Learning at University (on assessment backwash); Boud and Falchikov, eds., Rethinking Assessment in Higher Education: Learning for the Longer Term (Routledge, 2007).

The fairness rules

How teachers can run oral defense fairly.

Set the rubric in advance. Defense should score the same criteria the writing would have.
Keep it short. Two to four minutes per student is enough to discriminate between rote recall and real understanding.
Show the transcript. Students should see the conversation their score came from.
Cite every sub-score. The teacher (and the parent, if escalated) should be able to read the moment that produced a low number.
Provide accessible alternatives. Allow text input where a microphone is unavailable, replays of the question, and standard accommodations.
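Several of the rules above can be checked mechanically. The following is a minimal sketch under stated assumptions: the `SubScore` and `DefenseResult` types and the `fairness_problems` function are hypothetical names invented for this example. They do not describe any particular product's data model.

```python
from dataclasses import dataclass, field


@dataclass
class SubScore:
    criterion: str  # should be one of the rubric criteria set in advance
    points: int
    quote: str      # verbatim transcript excerpt that justifies the points


@dataclass
class DefenseResult:
    rubric: list            # criteria fixed before the defense
    transcript: str         # full conversation, shown to the student
    duration_seconds: int
    sub_scores: list = field(default_factory=list)


def fairness_problems(result, min_seconds=120, max_seconds=240):
    """Return a list of rule violations; an empty list means none found."""
    problems = []
    if not min_seconds <= result.duration_seconds <= max_seconds:
        problems.append("duration outside two to four minutes")
    if {s.criterion for s in result.sub_scores} != set(result.rubric):
        problems.append("scored criteria do not match the rubric")
    for s in result.sub_scores:
        # Every sub-score must point at a moment that is really in the transcript.
        if not s.quote or s.quote not in result.transcript:
            problems.append(f"no transcript citation for {s.criterion}")
    return problems
```

A result passes only if each rubric criterion has a sub-score and each sub-score quotes the transcript, so a low number can always be traced to the moment that produced it.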

In practice

How Tolus automates the workflow.

Tolus puts the approach above into practice. In each defense, it:

Reads the student’s submission and generates three to five voice questions specific to the work, weighted by the teacher’s rubric.
Listens to the student’s spoken answer, follows up where reasoning is shallow, and stops at the rubric depth the teacher set.
Scores only what the student explicitly said, with each rubric category cited from a quoted moment in the transcript.
Returns the transcript and the score to the teacher and writes the result back into Google Classroom.
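The follow-up step in the list above can be pictured as a bounded loop. The sketch below is illustrative only and is not Tolus's implementation. It assumes that "rubric depth" can be modelled as a cap on follow-up questions, and it uses a crude word-count check as a stand-in for whatever judges an answer shallow. `run_question`, `answer_fn`, and `is_shallow` are hypothetical names.

```python
def run_question(question, answer_fn, is_shallow, max_follow_ups):
    """Ask one question, then follow up while the answer looks shallow,
    stopping once the follow-up cap is reached.

    Returns the transcript as a list of (prompt, answer) pairs.
    """
    transcript = []
    prompt = question
    for _ in range(max_follow_ups + 1):
        answer = answer_fn(prompt)
        transcript.append((prompt, answer))
        if not is_shallow(answer):
            break
        # Ask for a paraphrase rather than accepting a repeated label.
        prompt = f"Can you explain that in your own words? ({question})"
    return transcript


def too_short(answer):
    """Stand-in shallowness check: fewer than eight words."""
    return len(answer.split()) < 8


# Scripted answers stand in for a student's transcribed speech.
scripted = iter([
    "It is osmosis.",
    "Water moves across a membrane toward the side with more dissolved solute.",
])
exchange = run_question(
    "Why did the cell swell?",
    answer_fn=lambda prompt: next(scripted),
    is_shallow=too_short,
    max_follow_ups=2,
)
print(len(exchange))
```

Keeping the cap explicit means the teacher, not the software, decides how far a student is pushed on any one question.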

The teacher’s time per submission is the time it takes to read the receipt (the transcript and the cited score) before posting the score. The student’s time per submission is two to four minutes.

Pilot Tolus on one of your courses.

Free for one course through the term. Run it on the assignments where you most need a defensible verdict.
