
NAS vs Claude vs Gemini vs ChatGPT — Same Real Business

Nasser Bishan
Mojarrb Research · Mojarrb Lab
Riyadh, Saudi Arabia · 2026

Abstract. Four AI diagnostic systems (NAS 2.0AAP, Claude Sonnet 4.6, Gemini 3 Flash, and GPT-5.5) were evaluated on the same real business case across eight quality dimensions. The primary finding: among the 10 tasks each system recommended, NAS 2.0AAP produced zero wrong-direction recommendations, versus four for Claude, five for Gemini, and three for GPT-5.5. Speed and fluency are not substitutes for depth.

AI Benchmarking · Business Diagnosis · AAP Protocol · Diagnostic Quality
Sessions
The business: Specialty café · Riyadh · 30K SAR/mo · 1 branch · 8 months
Mojarrb: NAS 2.0AAP · Gemini Pro 3.1 + Flash 3
Anthropic: Claude Sonnet 4.6
Google: Gemini 3 Flash
OpenAI: GPT-5.5 Instant
Wrong-direction recommendations
⚠ Actively harmful if executed

After completing the diagnosis, each system was asked to produce 10 actionable tasks. The numbers below count how many were wrong-direction — not just unhelpful, but actively consuming the owner's time, money, and energy on a path that makes things worse.

NAS: 0 wrong-direction tasks. Every recommendation backed by facts.
Claude: 4 wrong-direction tasks. Prescribed before understanding.
Gemini: 5 wrong-direction tasks. Built on invented problems.
ChatGPT: 3 wrong-direction tasks. Read the surface, not the business.
Here's what makes it worse:

Bad advice that looks bad is easy to ignore. Bad advice that looks right is what destroys businesses.

The other systems produced 4, 5, and 3 wrong-direction tasks. Every one of them looked logical. A business owner following that advice wouldn't know they were heading in the wrong direction — until the damage was already done.

NAS 2.0AAP produced zero.

Diagnostic Value Gap
NAS over Claude: NAS surfaced the traps. Claude caught the symptoms.
NAS over Gemini: NAS diagnosed reality. Gemini diagnosed one it imagined.
NAS over ChatGPT: NAS went underneath. ChatGPT read it but stopped there.
Dimension Scores
Scores reflect the full conversation — from first question to final task list.
| Dimension | What it measures | NAS | Claude | Gemini | ChatGPT |
|---|---|---|---|---|---|
| Question depth | How probing and specific the diagnostic questions were. | 9 | 5 | 4 | 6.5 |
| Insight quality | Depth and relevance of observations produced. | 9 | 5.5 | 6.5 | 7.5 |
| Financial precision | Accuracy in identifying cost and revenue dynamics. | 9.5 | 4 | 3 | 5 |
| Actionability of tasks | How executable and specific the final task list was. | 9 | 6 | 5.5 | 6.5 |
| Root cause accuracy | Whether it identified causes, not just symptoms. | 9 | 4.5 | 5 | 6 |
| Context retention | How well it connected earlier details throughout the conversation. | 8 | 6 | 4 | 7 |
| Speed to insight | How quickly it reached a useful diagnostic conclusion. | 6 | 9 | 9.5 | 8 |
| User experience | Clarity, structure, and flow of the interaction. | 7 | 8.5 | 6.5 | 7.5 |
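One way to read the eight dimension scores together is an unweighted mean per system. The page does not say how the overall winner was determined, so the equal weighting here is an assumption, and `unweighted_means` and `ranking` are hypothetical helper names. A minimal sketch using the scores listed above:

```python
from statistics import mean

# Scores copied from the dimension scores above, in this order:
# question depth, insight quality, financial precision, actionability,
# root cause accuracy, context retention, speed to insight, user experience.
SCORES = {
    "NAS": [9, 9, 9.5, 9, 9, 8, 6, 7],
    "Claude": [5, 5.5, 4, 6, 4.5, 6, 9, 8.5],
    "Gemini": [4, 6.5, 3, 5.5, 5, 4, 9.5, 6.5],
    "ChatGPT": [6.5, 7.5, 5, 6.5, 6, 7, 8, 7.5],
}


def unweighted_means(scores):
    """Average the eight dimensions with equal weight (an assumption)."""
    return {system: mean(values) for system, values in scores.items()}


def ranking(scores):
    """Return system names ordered from highest to lowest mean score."""
    means = unweighted_means(scores)
    return sorted(means, key=means.get, reverse=True)


if __name__ == "__main__":
    means = unweighted_means(SCORES)
    for system in ranking(SCORES):
        print(f"{system}: {means[system]:.2f}")
```

Under equal weighting, NAS still ranks first despite its lower scores on speed and user experience. A different weighting would change the means, so the numbers are best treated as a reading aid rather than a reported result.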
NAS: Overall Winner
What NAS uncovered that others missed:
Structural blind spots
Broken unit economics
Hidden time pressure
Embedded operational waste
Claude: Most readable output, least diagnostic depth.
Gemini: Fastest and most creative, but diagnosed problems that didn't exist.
ChatGPT: Sharpest framing, but stayed on the surface.
How We Judged
Anonymized Files
The conversation files contain no AI system names. This was intentional — so the judging AI could not recognize its own output and score it higher. Without knowing which output was its own, the judging AI had no reward to chase — only the work to evaluate.
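The anonymization step could be sketched roughly as follows. This is not the procedure actually used, which the page does not describe. The name list, the `[AI system]` placeholder, and the `anonymize` helper are all assumptions made for illustration.

```python
import re

# Assumed list of identifying strings, taken from the system and vendor
# names mentioned on this page. A real run would need to review this list,
# since a name like "Google" can also appear as legitimate business context.
SYSTEM_NAMES = [
    "NAS 2.0AAP", "NAS", "Mojarrb",
    "Claude Sonnet 4.6", "Claude", "Anthropic",
    "Gemini 3 Flash", "Gemini", "Google",
    "GPT-5.5 Instant", "GPT-5.5", "ChatGPT", "OpenAI",
]

PLACEHOLDER = "[AI system]"  # hypothetical neutral label

# Longest names first, so "Claude Sonnet 4.6" is matched before "Claude".
_ordered = sorted(SYSTEM_NAMES, key=len, reverse=True)
_PATTERN = re.compile(
    r"\b(?:" + "|".join(re.escape(name) for name in _ordered) + r")\b",
    re.IGNORECASE,
)


def anonymize(text):
    """Replace every known system or vendor name with the placeholder."""
    return _PATTERN.sub(PLACEHOLDER, text)


if __name__ == "__main__":
    sample = "I am Claude Sonnet 4.6, built by Anthropic."
    print(anonymize(sample))
```

Name replacement only removes explicit mentions. A judge could still recognize a system from its writing style, so this kind of scrubbing reduces the risk of self-preference but does not eliminate it.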
The Prompt

I have conversation files from two AI diagnostic sessions conducted with a real business. Review both files and give me your assessment on the following: quality and depth of the recommended tasks, which session produced tasks that are more realistic for the business owner to actually execute, which session contains recommendations that could lead the business owner in the wrong direction — meaning tasks that consume their time, money, or energy without moving the business forward — and rate which one is actually considering the reality of the business owner, not just applying a known playbook.

Conversation Files