The debate over whether Claude or ChatGPT writes better code has intensified in 2026 as Anthropic released Claude 3.7 Sonnet and OpenAI shipped several GPT-4o updates. We stopped arguing about benchmarks and ran both through 8 real-world coding tasks — the kind of work actual developers do every day. Here's what happened.
Test Setup
- Claude 3.7 Sonnet (via Claude.ai Pro, extended thinking enabled)
- GPT-4o (via ChatGPT Plus, latest version as of May 2026)
- All tasks run independently with identical prompts
- Code was tested for correctness where applicable
- No "best of N" — first response only, to simulate real usage
Task 1: Bug Fix — Silent Failure in Async Python
We provided a Python function that silently swallowed exceptions in an async context, causing an API to return 200 even when the underlying database write failed.
async def save_user(user_data: dict):
try:
await db.users.insert_one(user_data)
except:
pass
return {"status": "ok"}
Claude 3.7: Identified the silent exception swallowing immediately. Rewrote with specific exception types (pymongo.errors.DuplicateKeyError, pymongo.errors.WriteError), added proper HTTP status codes (400 vs 500 based on error type), and added logging. Also flagged that return {"status": "ok"} was misleading and suggested returning the created user ID.
GPT-4o: Also caught the bare except: pass. Rewrote correctly with exception handling and HTTP status codes. Did not proactively catch the misleading success message or suggest returning the user ID.
Winner: Claude — went further with the fix and showed deeper understanding of the API design issue.
Task 2: Refactor — Spaghetti React Component
We provided a 180-line React component that mixed data fetching, UI state, business logic, and rendering in a single function. Asked both to refactor it.
Claude 3.7: Produced a clean split: extracted a useUserData custom hook, a UserCard presentational component, and a separate userService.ts file for API calls. The refactored code was immediately runnable with no import errors. Used TypeScript interfaces throughout.
GPT-4o: Also split into custom hook + component, but kept some business logic in the hook that should have gone in the service layer. Did not create a separate service file unprompted. TypeScript types were present but less strict.
Winner: Claude — cleaner architecture and more complete separation of concerns.
Task 3: REST API Endpoint — Node.js/Express
Prompt: "Write an Express.js endpoint to handle user registration: validate email + password, hash password with bcrypt, save to MongoDB, return JWT token. Include error handling."
Claude 3.7:
- Input validation with Joi (unprompted, noted it was a best practice)
- bcrypt with 12 salt rounds
- Duplicate email check with specific error message
- JWT signed with RS256 (noted RS256 is more secure than HS256, explained why)
- Comprehensive error handling with appropriate HTTP codes
- Noted that the JWT secret should come from environment variables, showed the
process.envpattern
GPT-4o:
- Input validation with express-validator
- bcrypt with 10 salt rounds
- Duplicate email handling
- JWT with HS256
- Good error handling
- Environment variable usage for JWT secret
Both produced working code. Claude's use of RS256 and Joi, plus the security commentary, showed stronger security awareness.
Winner: Claude (narrowly) — security-first approach was impressive.
Task 4: SQL Query — Complex Analytics
We provided a PostgreSQL schema with 4 tables (orders, customers, products, order_items) and asked: "Write a query that finds customers who have spent more than $500 total in the last 90 days, broken down by their most purchased product category, including customers who have made at least 3 orders."
Claude 3.7: Produced a correct CTE-based query on the first try. Used window functions appropriately (ROW_NUMBER() to find the top category per customer). Explained each CTE's purpose inline. Query ran without errors on our test schema.
GPT-4o: Also produced a CTE query. Had a subtle error in the HAVING clause — it filtered at the wrong aggregation level, which would have returned incorrect results for customers who bought across many categories. When we pointed it out, it fixed it immediately.
Winner: Claude — correct on the first try for a genuinely complex query.
Task 5: Regex — Email + URL Validation
Prompt: "Write a regex in Python for: (1) validating email addresses per RFC 5321 (simplified, no quoted strings), and (2) extracting all URLs from a block of text, including http, https, and www. variants."
Claude 3.7:
- Email regex was precise, with a clear explanation of each group
- Correctly noted that "simplified RFC 5321" is still complex and called out edge cases (IP literals, TLDs with numbers)
- URL extraction regex handled http/https/www variants, trailing punctuation stripping, and parentheses in URLs (common in markdown)
- Included test cases
GPT-4o:
- Both regexes were correct and functional
- Less nuanced on edge cases
- URL regex didn't handle trailing punctuation correctly for URLs followed by periods
Winner: Claude — edge case handling and the test case inclusion were valuable.
Task 6: Code Explanation — Legacy C Code
We provided 60 lines of legacy C code involving manual memory management and bit manipulation. Asked both to explain what it does and flag any potential issues.
Claude 3.7: Produced a methodical explanation, identifying the code as a circular buffer implementation. Flagged a potential off-by-one error in the ring buffer overflow check, a missing null check, and a potential integer overflow when head + 1 wraps. Recommended adding assert statements for debug builds.
GPT-4o: Correctly identified the circular buffer. Caught the null check omission. Did not flag the integer overflow potential. Explanation was clear but slightly less detailed.
Winner: Claude — more thorough bug identification.
Task 7: Shell Script — Automated Deployment
Prompt: "Write a bash script to deploy a Node.js app: pull latest git changes, install dependencies, run database migrations, restart the PM2 process, and notify a Slack webhook if any step fails."
Claude 3.7: Wrote a well-structured script with set -euo pipefail, individual error functions for each step, Slack notification with specific failure context (which step failed, exit code), colorized terminal output, and a rollback hook that reverts the git pull if migration fails.
GPT-4o: Good script with error checking and Slack notification. Did not include set -euo pipefail (a critical bash safety flag). No rollback logic. Slack message was less detailed.
Winner: Claude — the rollback logic and bash safety flags show real-world experience.
Task 8: Algorithm — Implement LRU Cache
Standard CS task: implement an LRU (Least Recently Used) cache in Python with O(1) get and put.
Claude 3.7: Used OrderedDict with a clean move_to_end approach. Implementation was correct, O(1), and included a __repr__ method for debugging. Added type hints throughout. Included usage examples.
GPT-4o: Also used OrderedDict, correct implementation, clean code. No __repr__, no usage examples.
Both were correct. Claude's was more complete.
Winner: Tie (Claude slightly ahead on completeness)
Aggregate Results
| Task | Claude 3.7 | GPT-4o |
|---|---|---|
| Bug fix (async Python) | ✅ Winner | ✅ Correct |
| Refactor (React) | ✅ Winner | ✅ Good |
| REST API (Express) | ✅ Winner | ✅ Good |
| SQL analytics query | ✅ Winner | ❌ Error on first try |
| Regex (email + URL) | ✅ Winner | ✅ Partial |
| Code explanation (C) | ✅ Winner | ✅ Good |
| Bash deployment script | ✅ Winner | ✅ Good |
| LRU cache | ✅ Tie | ✅ Tie |
| Score | 7.5/8 | 3.5/8 |
Where GPT-4o Still Has Advantages
Claude winning 7/8 tasks doesn't tell the whole story. GPT-4o has real advantages in:
- Speed: GPT-4o is faster than Claude 3.7 with extended thinking enabled
- Code interpreter: ChatGPT's code interpreter can actually run Python in a sandbox, test outputs, and iterate. Claude can't do this natively (without tools)
- Image understanding for code: GPT-4o can read a screenshot of code or a UI wireframe and generate code from it
- Plugin/tool ecosystem: ChatGPT has more third-party integrations available
Pricing
| Model | Price |
|---|---|
| Claude Pro (3.7) | $20/month |
| ChatGPT Plus (GPT-4o) | $20/month |
| Claude API (3.7 Sonnet) | $3/$15 per 1M tokens (in/out) |
| OpenAI API (GPT-4o) | $2.50/$10 per 1M tokens (in/out) |
At the API level, GPT-4o is cheaper. At the consumer level, they're identical.
Final Verdict
For pure code quality and reasoning about complex codebases, Claude 3.7 is the better coding AI in 2026. Its outputs are more thorough, its bug detection is sharper, and its architectural instincts are stronger.
That said, ChatGPT Plus is the better coding environment — the code interpreter, integrated browser, and tool use make it more versatile for interactive coding sessions.
Our recommendation: Use Claude via the Cursor editor (which lets you choose Claude as the underlying model) to get the best of both worlds — Claude's code quality with a proper development environment around it.