AI Agent Showdown: Claude, Gemini, and Codex Compete to Capture a Writer’s Voice

A recent experiment set out to determine which of the leading AI coding agents could best understand and preserve a human author’s voice across a sprawling, multi‑platform archive. The task was deceptively simple: read every post from three blogs, analyze tone, vocabulary, and stylistic nuances, then produce a consolidated style guide. The agents—Claude (Sonnet 4.5), Gemini (Gemini 3), and Codex (GPT‑5)—were evaluated on instruction adherence, depth of research, nuance capture, and speed.

The Experiment

| Agent | Original Model | Fixed Model | Autonomy | Score |
| --- | --- | --- | --- | --- |
| Claude | Sonnet 4.5 | Sonnet 4.5 | Full | 9.5/10 |
| Gemini | Default | Gemini 3 | Full | 7.5/10 |
| Codex | Codex (code-specialized) | GPT-5 | Full | 7.5/10 |

The scoring rubric gave a maximum of 10 points, allocating 0‑3 points each for following instructions, thoroughness, and nuance, and 0‑1 point for speed. Claude’s 2,555 lines of analysis earned it a perfect 3 on instruction adherence and a 3.5 on thoroughness (a bonus for exceeding expectations). Gemini’s 198 lines captured key elements but were brief, while Codex’s 570 lines—generated in 1.5 minutes—showcased a fast, API‑driven approach that missed some fine details.
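
To make the rubric's arithmetic concrete, here is a minimal sketch of how the component scores add up to the totals quoted above. The field names and the example component values are illustrative assumptions, not the author's actual scoring code.

```python
from dataclasses import dataclass

@dataclass
class RubricScore:
    """Hypothetical breakdown of the 10-point rubric described above."""
    instructions: float  # 0-3: how closely the agent followed the prompt
    thoroughness: float  # 0-3, with a bonus above 3 possible for exceeding expectations
    nuance: float        # 0-3: how well stylistic subtleties were captured
    speed: float         # 0-1: time efficiency

    def total(self) -> float:
        return self.instructions + self.thoroughness + self.nuance + self.speed

# Example: a Claude-like result (component values are illustrative guesses)
claude = RubricScore(instructions=3.0, thoroughness=3.5, nuance=3.0, speed=0.0)
print(claude.total())  # 9.5
```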

“May the force be with you.” – a recurring closing phrase that Claude correctly identified as a signature element of the author’s style.

Configuration Matters

A pivotal moment came when the author discovered that Gemini and Codex were not running with their optimal settings. Gemini had been limited to a default model and required manual approvals, while Codex used a code-specialized model with restricted autonomy. After reconfiguring Gemini to Gemini 3 and Codex to GPT-5, Gemini's score rose from 6.5 to 7.5 and Codex's from 4.5 to 7.5, illustrating that model choice and autonomy level can outweigh raw context size.
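
A lightweight pre-flight check could catch this kind of misconfiguration before committing to a long run. The sketch below is purely illustrative: the config file paths, key names, and model identifiers are assumptions, not the actual settings used in the experiment.

```python
import json
import tomllib  # Python 3.11+
from pathlib import Path

# Expected setup per agent. The config paths, key names, and model
# identifiers here are assumptions for illustration, not verified CLI settings.
EXPECTED = {
    "codex": (Path.home() / ".codex" / "config.toml", "gpt-5"),
    "gemini": (Path.home() / ".gemini" / "settings.json", "gemini-3"),
}

def configured_model(path: Path) -> str:
    """Read the 'model' entry from a TOML or JSON config file."""
    text = path.read_text()
    data = tomllib.loads(text) if path.suffix == ".toml" else json.loads(text)
    return data.get("model", "<default>")

for agent, (path, wanted) in EXPECTED.items():
    if not path.exists():
        print(f"[warn] {agent}: no config file at {path}")
        continue
    model = configured_model(path)
    status = "ok" if model == wanted else "warn"
    print(f"[{status}] {agent}: configured={model!r}, expected={wanted!r}")
```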

The Bug Fix in Numbers

| Metric | Claude | Codex (Before) | Codex (After) |
| --- | --- | --- | --- |
| Lines Output | 2,555 | 570 | 570 |
| Style Guide Lines | 784 | 194 | 194 |
| Time to Complete | ~10 min | ~1.5 min | ~1.5 min |
| Final Score | 9.5 | 4.5 | 7.5 |

The fix not only improved Codex’s score but also highlighted that misconfiguration can masquerade as agent failure.

Agent Strengths and Trade‑Offs

| Agent | Ideal Use Case | Strength | Limitation |
| --- | --- | --- | --- |
| Claude | Deep, nuanced analysis | Strategic reading, thoroughness | Slower execution |
| Gemini | Fast turnaround with full autonomy | Quick synthesis | Requires correct configuration |
| Codex | High-speed, API-driven workflows | Time efficiency | May shortcut content ingestion |

The experiment underscored that the “best” agent depends on the task at hand. For foundational work that informs future projects, investing in a slower but more thorough agent can pay dividends.

Implications for Developers

  1. Verify Configuration First – Before blaming an agent for poor results, confirm that the correct model and autonomy settings are in place.
  2. Match Model to Task – Code‑specialized models may falter on literary analysis; general‑purpose models like GPT‑5 or Gemini 3 perform better.
  3. Question Completion Claims – AI agents may declare themselves finished while omitting critical content; human verification is essential (see the sketch after this list).
  4. Balance Speed and Quality – Fast output can be tempting, but nuanced understanding often requires more time and deeper engagement.
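
One way to make point 3 concrete is to cross-check the agent's output against the source material. The sketch below is a hypothetical example that counts how many of a blog's post URLs are actually referenced in the generated analysis; the file names and inputs are assumptions.

```python
from pathlib import Path

# Hypothetical inputs: a list of every post URL the agent was asked to read,
# and the analysis file the agent produced. Both file names are assumptions.
post_urls = Path("all_post_urls.txt").read_text().splitlines()
analysis = Path("style_analysis.md").read_text()

missing = [url for url in post_urls if url not in analysis]
coverage = 1 - len(missing) / max(len(post_urls), 1)

print(f"Referenced {len(post_urls) - len(missing)}/{len(post_urls)} posts "
      f"({coverage:.0%} coverage)")
if missing:
    print("Posts never mentioned in the analysis:")
    for url in missing:
        print(f"  - {url}")
```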

Final Thoughts

The study demonstrates that configuration and model selection are as critical as the underlying AI architecture. Claude's superior performance stemmed from its strategic approach and proper setup, while Codex's and Gemini's improvements after the bug fix illustrate how even powerful agents can under-deliver without the right parameters.

For developers looking to harness AI for complex analytical tasks, the takeaway is clear: start with a solid configuration, choose the right model for the job, and never trust a completion claim without verification.

Source: https://prashamhtrivedi.in/ai-agent-comparison-claude-gemini-codex/