AI Cheating Doubled on Coding Tests. Here's Why System Design Is Cheat-Proof.
AI cheating on technical assessments doubled in one year. System design assessments are structurally resistant — here's why reasoning can't be faked.
In 2025, CodeSignal reported that suspected cheating on its coding assessments rose from 16% to 35% in a single year. That is not a gradual uptick. That is more than a doubling. And every assessment platform is seeing the same trend line.
Anthropic, the company behind Claude, had to rewrite their own engineering interview questions because candidates were using Claude to pass them. Think about that for a second. The company that builds the AI had to redesign its hiring process because its own product made the existing process unreliable.
Fifty-nine percent of hiring managers now say they suspect AI misrepresentation during technical screens. And Integrity Advocate reports that 73% of higher education students use AI during coursework — with institutions setting a 40% threshold before flagging. The normalization of AI assistance in any assessment context is not a prediction. It is already the baseline.
This article is not about whether candidates should or should not use AI. That debate is settled. AI is everywhere, candidates are using it, and pretending otherwise is a waste of organizational energy. The real question is: what types of assessments still produce a reliable signal when candidates have access to the same AI tools they will use on the job every day?
The answer is structural. Some assessment formats are fundamentally easy to cheat. Others are fundamentally hard. And the difference has nothing to do with proctoring software.
Why Coding Tests Are Easy to Cheat
Coding assessments have a design flaw that no amount of proctoring can fix: they have deterministic outputs.
When you ask a candidate to "implement a function that returns the longest palindromic substring," there is a finite set of correct answers. The algorithm is known. The implementation patterns are documented on thousands of websites. And an AI can generate syntactically correct, test-passing code for this problem in under three seconds.
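The claim is easy to verify: the standard expand-around-center solution is a handful of lines, and any current LLM reproduces it on the first try. A sketch of that canonical answer:

```python
def longest_palindromic_substring(s: str) -> str:
    """Classic expand-around-center solution; O(n^2) time, O(1) extra space."""
    if not s:
        return ""
    best = s[0]
    for i in range(len(s)):
        # Check both odd-length centers (i, i) and even-length centers (i, i+1).
        for left, right in ((i, i), (i, i + 1)):
            while left >= 0 and right < len(s) and s[left] == s[right]:
                left -= 1
                right += 1
            # The loop overshoots by one position on each side.
            candidate = s[left + 1:right]
            if len(candidate) > len(best):
                best = candidate
    return best
```

This is exactly the kind of well-documented pattern an AI emits in seconds, which is the point: the artifact alone tells you nothing about the candidate.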
This is not hypothetical. It is happening at scale right now.
The Structural Problem
Coding tests are cheat-friendly for three specific reasons:
Single correct answer. Most coding challenges converge on one or two optimal solutions. An AI does not need to understand the problem deeply — it needs to recognize the pattern and output the known solution. Pattern matching is what large language models do better than anything else.
Copy-paste output. The deliverable in a coding test is text — code that can be copied directly from an AI output into a code editor. There is no transformation step. There is no interpretation. The AI's output is the final answer.
Binary evaluation. Code either passes the test cases or it does not. There is no rubric that evaluates how the candidate arrived at the solution. The reasoning process is invisible. Only the artifact matters.
These three properties mean that AI assistance on coding tests is not just possible — it is trivial. A candidate with a second browser tab open to ChatGPT, Claude, or Copilot can pass most standard coding assessments without understanding the underlying algorithms at all.
Browser Lockdowns Are Security Theater
The standard response from assessment platforms is to lock down the testing environment. Disable copy-paste. Monitor browser tabs. Flag alt-tabs.
This does not work. Candidates use a second device. They use a phone. They describe the problem verbally to an AI assistant. The attack surface is too large to patrol.
More fundamentally, browser lockdowns address the wrong problem. The vulnerability is not in the delivery mechanism. It is in the question format. You cannot lock down your way out of a structural weakness.
Why System Design Assessments Are Structurally Cheat-Proof
System design assessments test something that AI cannot fake: the reasoning process itself.
When you ask a candidate to "Design a real-time collaboration tool like Google Docs," there is no single correct answer. There is no algorithm to look up. The quality of the response depends entirely on how the candidate thinks through a series of interconnected decisions — and every decision creates a different design space.
Here is why this format is structurally resistant to AI cheating:
No Single Right Answer
A system design problem has hundreds of valid approaches. Do you use CRDTs or operational transforms for conflict resolution? WebSockets or server-sent events for real-time sync? PostgreSQL or DynamoDB for the persistence layer? Each choice is defensible depending on the constraints the candidate chooses to prioritize.
An AI can generate a plausible-sounding answer. But "plausible-sounding" is not the same as "demonstrates understanding." The evaluation is not about what the candidate chose — it is about why they chose it, what they considered and rejected, and how they would adapt if the constraints changed.
Reasoning IS the Output
In a coding test, the deliverable is code. The reasoning is optional.
In a system design assessment, the reasoning is the deliverable. A candidate who writes "Use Kafka for the message queue" without explaining the tradeoff against RabbitMQ or a simple Redis pub/sub has not produced a useful answer. The answer is the analysis, not the technology name.
This inverts the cheat-ability equation. An AI can produce the names of technologies and draw boxes on a diagram. It cannot produce the nuanced, context-specific reasoning about why this architecture serves this particular set of requirements better than the alternatives.
Follow-Up Probes Expose Shallow Understanding
The most powerful anti-cheating mechanism in a system design assessment is the follow-up question.
"You chose eventual consistency for the collaboration layer. What happens when two users edit the same paragraph simultaneously? Walk me through the conflict resolution path."
"You mentioned caching the document tree in Redis. What is your invalidation strategy? What happens during a cache miss when the document has 500 active collaborators?"
A candidate who copied an architecture from an AI will collapse under probing. They do not know why the choices were made. They cannot trace the implications of a design decision through the system. They cannot adapt when a new constraint is introduced.
This is the structural advantage. Follow-up questions are cheap to generate and devastating to unprepared candidates. They turn a static assessment into a dynamic conversation where depth of understanding is continuously tested.
Open-Ended Means Cheat-Resistant
The broader principle here is simple: the more open-ended a question is, the harder it is to cheat on. Closed-ended questions (what is the time complexity of binary search?) can be answered by looking up the answer. Open-ended questions (how would you design the notification system for a ride-sharing app at 10M daily active users?) require the candidate to make a series of judgment calls that reveal how they think.
AI can generate plausible responses to open-ended questions. But "plausible" is not the same as "specific to this candidate's experience and reasoning patterns." And when you add structured evaluation dimensions — problem framing, system decomposition, tradeoff analysis, scalability, user-centric design — the gap between an AI-generated answer and a genuine expert answer becomes measurable.
The Integrity Layer: Explanation Over Surveillance
Humanly, a conversational AI hiring platform, introduced a concept they call "integrity layers." The idea: instead of surveilling candidates with webcams and keystroke tracking, design the assessment itself to surface honesty through explanation.
Ask WHY, not just WHAT.
A proctoring camera captures whether a candidate looked away from the screen. It does not capture whether they understood their own answer. A follow-up question that asks "Walk me through how you decided on this database schema" captures both — and it does it without the adversarial dynamic of surveillance.
This reframes the problem entirely. Instead of investing in better monitoring technology to catch cheaters, invest in better assessment design that makes cheating structurally useless. The integrity comes from the format, not the proctor.
This is consistent with what the research shows. Candidates cheat more when they feel the assessment is arbitrary or disconnected from real work. They cheat less when the assessment feels like a genuine test of their abilities. System design assessments — because they mirror the actual work of senior engineering — tend to engage candidates rather than alienate them.
There is a practical implication here for hiring teams: every dollar spent on proctoring technology would produce a better return if spent on assessment design instead. A webcam feed that catches a candidate glancing at a second monitor tells you nothing about their competence. A well-designed follow-up question tells you everything.
The approach extends to AI-as-judge evaluation as well. When the evaluator is analyzing reasoning quality rather than checking code against test cases, the evaluation itself becomes an integrity layer. It is very difficult to game a rubric that scores the quality of your tradeoff analysis when the rubric is analyzing the substance of your argument, not just whether you mentioned the right buzzwords.
What the Platforms Are Doing (And Why It Is Not Enough)
The major assessment platforms are aware of the AI cheating problem. Their responses fall into a pattern.
HackerRank: harder problems with AI available. Their AI-Assisted IDE Assessments give candidates access to AI coding tools but increase problem difficulty. The theory: if the AI is a known variable, test how well candidates direct it. The problem: this still tests algorithmic problem-solving. A candidate who prompts Claude to solve a hard graph traversal problem is still demonstrating prompt engineering, not architectural reasoning.
CodeSignal: cheating detection telemetry. CodeSignal has built detection systems that flag suspicious behavior — unusual typing patterns, copy-paste events, answer speed anomalies. This is sophisticated engineering, and it catches some cheaters. But it creates an arms race. As detection improves, cheating methods improve. The fundamental incentive structure is not changed.
Codility: behavioral analysis and keystroke patterns. Codility's approach to detecting AI cheating involves analyzing how candidates write code — typing cadence, pause patterns, editing behavior. If a candidate pastes a fully-formed solution after 30 seconds of inactivity, that is suspicious. The limitation: this catches unsophisticated cheaters. A candidate who types out AI-generated code manually, or who uses an AI to guide their thinking and then implements it themselves, passes the behavioral screen.
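To make the limitation concrete, here is a minimal sketch of the kind of heuristic this style of behavioral analysis relies on. The event format and thresholds are illustrative assumptions, not Codility's actual implementation:

```python
def flag_suspicious_inserts(events, idle_threshold_s=30.0, burst_chars=200):
    """Flag edit events where a large block of text appears after a long pause.

    `events` is a list of (timestamp_seconds, chars_inserted) tuples -- an
    assumed, simplified edit log, not any platform's real schema.
    """
    flags = []
    prev_time = None
    for ts, chars in events:
        idle = ts - prev_time if prev_time is not None else 0.0
        # A fully formed solution pasted after long inactivity is the
        # unsophisticated-cheater signature described above.
        if idle >= idle_threshold_s and chars >= burst_chars:
            flags.append(ts)
        prev_time = ts
    return flags
```

Notice what this catches and what it misses: a 500-character paste after 30 seconds of silence gets flagged, but a candidate who retypes AI output at a natural cadence sails through.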
These are not bad approaches. They are rational responses to a real problem. But they are all band-aids on a structural wound. The structural problem is that coding tests have deterministic outputs that AI can produce. No amount of behavioral detection changes that.
It is like putting better locks on a house with glass walls. The locks are not the vulnerability.
The common thread across all three platforms: they are optimizing within the constraints of a format that is fundamentally vulnerable. Harder coding problems are still coding problems. Better detection is still detection of a symptom. Behavioral analysis still relies on the assumption that cheaters behave differently than honest candidates — an assumption that erodes as cheating tools become more sophisticated.
The alternative is not better monitoring. It is a different assessment format entirely.
How to Design Cheat-Proof Assessments
If you are responsible for technical hiring and you want assessments that produce a reliable signal regardless of AI access, here are five principles that work:
1. Open-Ended Questions with No Single Correct Answer
Replace "implement this algorithm" with "design this system." The wider the solution space, the harder it is for AI to produce an answer that passes structured evaluation. A good system design question has hundreds of valid approaches — and the evaluation criteria are about reasoning quality, not answer correctness.
2. Structured Answer Sections
Break the response into sections: Requirements, High-Level Design, Low-Level Design, Tradeoffs, Scalability. This forces candidates to show their work at each stage. An AI can generate a plausible HLD. It is much harder to generate a coherent response that flows logically from requirements through tradeoffs to scalability — because that coherence requires understanding that AI fakes poorly.
3. AI Follow-Up Questions
Generate probing follow-up questions based on the candidate's specific responses. "You chose Redis for session storage — what is your eviction strategy under memory pressure?" These questions are specific to what the candidate wrote, which means pre-generated AI answers are useless. The candidate has to understand their own design.
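The key design constraint is that the probe must be assembled from the candidate's own text, so a pre-generated answer cannot anticipate it. A hedged sketch of that assembly step (the prompt wording and section names are illustrative; the model call itself is out of scope):

```python
def build_followup_prompt(section_name: str, candidate_answer: str) -> str:
    """Assemble a probe-generation prompt grounded in the candidate's exact
    wording. Illustrative template, not a specific vendor's API."""
    return (
        "You are a senior engineer reviewing a system design answer.\n"
        f"Section: {section_name}\n"
        f"Candidate's answer:\n{candidate_answer}\n\n"
        "Generate two follow-up questions that probe WHY the candidate made "
        "these specific choices and what happens when the constraints change. "
        "Reference their exact technology choices by name."
    )
```

Because the prompt embeds the candidate's specific choices, the resulting questions are unique to each submission, which is what makes them impossible to prepare for.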
4. Time Allocation Tracking
How a candidate allocates their time across sections reveals their priorities. A senior engineer will spend significant time on requirements and tradeoffs. A junior engineer — or someone copying from an AI — will rush through requirements and spend all their time on implementation details. Time allocation is a behavioral signal that is very difficult to fake.
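Computing this signal from a focus log is straightforward. A minimal sketch, assuming a simplified event format (timestamped section switches) rather than any real platform's telemetry:

```python
from collections import defaultdict

def time_allocation(section_events):
    """Compute the fraction of total working time spent in each section.

    `section_events` is an assumed log of (timestamp_seconds, section_name)
    tuples, one entry each time the candidate switches focus; the final
    entry marks the end of the session.
    """
    totals = defaultdict(float)
    # Each pair of consecutive events bounds one stretch of focused time.
    for (ts, section), (next_ts, _) in zip(section_events, section_events[1:]):
        totals[section] += next_ts - ts
    grand = sum(totals.values()) or 1.0
    return {section: t / grand for section, t in totals.items()}
```

A profile where Requirements gets 5% of the session and Low-Level Design gets 70% is not proof of anything on its own, but it is exactly the kind of anomaly worth a human look.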
5. Answer Version History
Auto-snapshot the candidate's work every few minutes. Real thinking is iterative. A candidate who genuinely designs a system will revise their HLD after thinking through the LLD. They will add edge cases to their scalability section after reconsidering a tradeoff. A candidate who pastes in AI-generated content will have a version history that shows sudden, fully-formed sections appearing without iteration.
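The detection step over those snapshots can be sketched simply. The snapshot shape and the character thresholds here are illustrative assumptions:

```python
def detect_sudden_sections(snapshots, jump_chars=400):
    """Flag sections that jump from near-empty to fully formed in one snapshot.

    `snapshots` is an assumed list of dicts mapping section name -> text,
    captured every few minutes. Genuine work grows incrementally; a section
    that materializes fully formed between snapshots deserves review.
    """
    flags = []
    for i, (prev, curr) in enumerate(zip(snapshots, snapshots[1:]), start=1):
        for section, text in curr.items():
            before = len(prev.get(section, ""))
            # Near-empty before, fully formed after: the paste signature.
            if before < 50 and len(text) - before >= jump_chars:
                flags.append((i, section))
    return flags
```

As with time allocation, a flag is a prompt for review, not a verdict: the point of the format is that even a candidate who pastes must then survive follow-up probes on their own design.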
These five features, working together, create an assessment format where AI assistance does not help in a meaningful way. The candidate still has to think. The thinking is still visible. And the evaluation still measures reasoning quality.
This is where the future of AI collaboration assessments points: not away from AI, but toward formats where AI cannot substitute for genuine expertise.
A Checklist for Hiring Managers
Before your next technical assessment cycle, ask these questions:
- [ ] Does our assessment allow multiple valid approaches, rather than converging on a single correct answer?
- [ ] Are we evaluating the reasoning process, not just the final output?
- [ ] Would an AI struggle to generate a passing response in under 60 seconds?
- [ ] Do we have follow-up probing to test depth of understanding?
- [ ] Are we tracking how candidates build their answer, not just what they submit?
- [ ] Would this assessment still produce useful signal if the candidate had full AI access?
If you answered "no" to more than two of these, your assessment is structurally vulnerable to AI cheating. Not because your candidates are dishonest — but because the format makes dishonesty trivially easy.
What We Built
AssessAI is designed around these principles. Recruiters paste a job description. AI generates tailored system design questions. Candidates answer in structured sections. The evaluation scores reasoning quality across five dimensions: problem framing, system decomposition, tradeoff analysis, scalability and edge cases, and user-centric design.
There is no single correct answer to game. There is no code to copy-paste. Follow-up probes test whether candidates understand their own designs. And the version history shows how the thinking evolved — not just what was submitted.
The assessment is structurally cheat-proof, not because we built better proctoring, but because we designed the format so that cheating does not help.
If you are tired of assessments where you cannot tell whether the candidate or the AI did the work, try one.
Start a free assessment at getassessai.com
Rohan Bharti is the founder of AssessAI. He builds tools for engineering teams that take hiring seriously.