The AI chatbot landscape has evolved dramatically since ChatGPT's seismic debut in 2022. Today's free offerings aren't just curiosities—they're powerful tools reshaping how developers prototype, researchers explore ideas, and professionals streamline workflows. ZDNET recently subjected eight major players—ChatGPT, Microsoft Copilot, xAI Grok, Google Gemini, Perplexity, Anthropic Claude, DeepSeek, and Meta AI—to 112 rigorous tests across text generation, coding, image creation, and specialized tasks. The results reveal a maturing market where free tiers deliver unprecedented value, though with distinct strengths and weaknesses.

Methodology: Beyond Benchmarks to Real-World Application

ZDNET's testing avoided abstract model specifications in favor of practical, developer-relevant scenarios:
- Ten text-based challenges spanning summarization, academic explanation, mathematical reasoning, cultural analysis, literary critique, travel planning, emotional support, translation, coding, and long-form storytelling
- Four image-generation prompts testing visual creativity and adherence to complex specifications
- Evaluation focused on accuracy, creativity, practicality, and user experience—all conducted without logins where possible to assess true free-tier accessibility

"We experienced almost no throttling through our series of 10 back-to-back prompts. The second surprise was how much the AIs let you do without requiring you to create an account. The third big surprise was just the overall quality of responses," noted Senior Contributing Editor David Gewirtz.

Top Performers: Where Each Excels

  1. OpenAI ChatGPT (Score: 109/120)
    The pioneer retains its crown with exceptional all-around performance. It aced child-friendly explanations, mathematical sequences, cultural discussions, and coding tests—generating functional JavaScript regex solutions. Its 1,500-word storytelling capability demonstrated strong narrative coherence, though it slightly undershot the word count. Image generation proved robust, accurately rendering complex prompts like a Back to the Future-themed scene. Weaknesses included occasional login prompts and a web lookup misfire redirecting to Taiwanese news.

    • Text: 91/100 | Images: 18/20
    • Pro: Deep ecosystem support, reliable coding, strong images
    • Con: Aggressive login nudges, occasional web lookup errors
  2. Microsoft Copilot (Score: 97/120)
    A standout for developers in Microsoft ecosystems, Copilot delivered polished responses with minimal login friction. It uniquely identified Boston events matching a March travel timeframe and offered practical job interview strategies beyond platitudes. Coding results disappointed given Microsoft's developer tools pedigree—edge case handling and performance optimizations were lacking. Image generation was painfully slow (~5 minutes per image), and it blocked the Back to the Future prompt on copyright grounds.

    • Text: 87/100 | Images: 10/20
    • Pro: Seamless Microsoft integration, insightful responses
    • Con: Sluggish images, complex premium tiers, coding inconsistencies
  3. xAI Grok (Score: 96/120)
    The dark horse contender delivered the most human-like travel itineraries, complete with pricing, weather considerations, and personalized dining recommendations (even spotlighting Boston's historic Union Oyster House). Its charming quirk: appending ELI5 (Explain Like I'm 5) explanations to most responses, including coding solutions that clarified bug fixes. Story generation hit the exact 1,500-word target. Image quality suffered without an X/Twitter login, and its code contained minor whitespace and decimal-handling bugs.

    • Text: 86/100 | Images: 10/20
    • Pro: Exceptional personalization, conversational tone, zero login nagging
    • Con: Image access tied to X/Twitter, inconsistent coding execution
  4. Google Gemini (Score: 95/120)
    Despite Google's AI ambitions, Gemini placed fourth—a significant disappointment. It excelled at factual queries (math sequences, theme analysis) and produced best-in-class images via its Nano Banana model in seconds. However, it failed spectacularly at translating its own Latin output back via Google Translate—an ironic stumble. Travel planning felt robotic ("history mornings, tech afternoons"), and web summarization ignored specified sources. Login requirements hampered image testing.

    • Text: 77/100 | Images: 18/20
    • Pro: Superb visual generation, deep Google integration
    • Con: Login-dependent images, poor itinerary structuring, translation flaws
  5. Perplexity (Score: 93/120)
    Billed as an AI search engine, Perplexity uniquely surfaced sources upfront—valuable for developers validating information. However, it deviated from requested summarization tasks and delivered a phoned-in travel plan (suggesting "visit Google's Cambridge office" as a highlight). Coding was adequate but unremarkable. Free-tier image generation defaulted to web results unless logged in.

    • Text: 81/100 | Images: 12/20
    • Pro: Transparent sourcing, clean interface
    • Con: Superficial responses, limited image generations, frequent login prompts

Key Takeaways for Technical Users

  • Free Tiers Are Viable: Resource limits were rarely encountered during intensive testing, making these tools practical for daily prototyping and research.
  • Specialization Over Universality: Grok's itineraries and Gemini's images prove niche excellence often trumps general competence.
  • Coding Isn't Guaranteed: Even top performers like Copilot generated buggy code—developers should validate outputs rigorously.
  • The Login Tradeoff: Privacy-conscious users face friction; Grok and Copilot were least aggressive about credentials.
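The "validate outputs rigorously" takeaway is worth making concrete. Below is a minimal sketch in TypeScript (the article's coding tests involved JavaScript regex). `parsePrice` is a hypothetical stand-in for the kind of AI-generated helper under review; the function name and task are illustrative, not ZDNET's actual prompt. The point is the edge-case assertions, which target exactly the whitespace and decimal slips the testing surfaced.

```typescript
// Hypothetical AI-generated helper: parse a price string like
// "$1,234.56" into a number, or return null on invalid input.
// (Illustrative only -- not the actual ZDNET test prompt.)
function parsePrice(input: string): number | null {
  const match = input
    .trim() // stray whitespace was one of the reported failure modes
    .match(/^\$?(\d{1,3}(?:,\d{3})*|\d+)(\.\d{1,2})?$/);
  if (!match) return null;
  // Strip thousands separators before converting.
  return parseFloat(match[1].replace(/,/g, "") + (match[2] ?? ""));
}

// Edge-case checks to run before trusting generated code:
console.assert(parsePrice("$1,234.56") === 1234.56, "comma grouping");
console.assert(parsePrice("  42  ") === 42, "surrounding whitespace");
console.assert(parsePrice("12.345") === null, "three decimal places rejected");
console.assert(parsePrice("") === null, "empty input rejected");
```

A generated solution that passes a happy-path demo can still fail boundary cases like these; writing the assertions first, then pasting in the model's code, keeps the human in the review loop.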

The Bottom Line

Free AI chatbots have evolved from novelties into genuinely useful tools. While ChatGPT remains the most versatile option, Copilot's integration depth and Grok's surprising empathy demonstrate that specialization matters. For developers, these tools offer on-demand brainstorming partners, but as Gewirtz cautions:

"Don't assume correctness. The best chatbot is the one you temper with your own expertise."
As open-weight models proliferate and competition intensifies, this free-tier renaissance hints at an increasingly accessible—but discerning—AI future.

Source: ZDNET comprehensive testing of eight AI chatbots (October 2025), conducted by David Gewirtz. Full methodology and prompt details available in original article.