
In a revelation that challenges long-held assumptions about large language models (LLMs), new testing demonstrates that the latest generations—including OpenAI's GPT-5 and Anthropic's Claude 4.5—now handle granular character manipulation and encoding tasks with startling proficiency. Where predecessors like GPT-3.5 and GPT-4 floundered, these models decode ciphers, count characters accurately, and perform text substitutions almost flawlessly. This isn't just incremental progress; it signals a potential paradigm shift in how LLMs process language, moving beyond token-based constraints toward true character-level cognition. According to recent experiments by Marek Burkert, these advances could redefine applications in data parsing, cybersecurity, and AI tooling.

The Strawberry Test: A Litmus for Character Substitution

Early LLMs notoriously struggled with simple character substitutions due to tokenization, where text is split into multi-character chunks that obscure individual letters. Burkert's benchmark exposed this flaw vividly. When asked to replace every 'l' with 'r' in "I really love a ripe strawberry," models like GPT-3.5 produced garbled output ("I lealll love a liple strallbeelly"). But newer models aced it:

| Model         | Response                           |
|---------------|------------------------------------|
| GPT-3.5-turbo | I lealll love a liple strallbeelly |
| GPT-4-turbo   | I rearry rove a ripe strawberly    |
| GPT-5-mini    | I rearry rove a ripe strawberry    |
| GPT-5         | I rearry rove a ripe strawberry    |
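For reference, the substitution itself is deterministic and trivial to compute outside a model; a minimal Python sketch of the expected answer:

```python
text = "I really love a ripe strawberry"

# Replace every 'l' with 'r'. Plain string operations see individual
# characters, with no tokenization in the way.
expected = text.replace("l", "r")

print(expected)  # I rearry rove a ripe strawberry
```

The output matches the GPT-5 and GPT-5-mini rows above exactly.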

GPT-5-nano stumbled occasionally, hinting that model size still matters, but models from GPT-4.1 onward consistently delivered correct results, even without reasoning aids. Claude Sonnet 4 matched this prowess, suggesting industry-wide progress. As Burkert notes: "This isn't just about spelling; it's evidence that LLMs are learning to 'see' text at a finer resolution, defying tokenization barriers."

Counting Characters and Cracking Codes: From Weakness to Strength

Character counting, another historical LLM blind spot, saw similar breakthroughs. Older models often miscounted the letters in sentences like "I wish I could come up with a better example sentence," failing at what amounts to a basic tally. Yet even with reasoning disabled, GPT-5-nano and the Claude Sonnet models now count reliably.
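Such tallies are trivial to verify outside the model; a minimal Python check using Burkert's example sentence (the choice of letter here is illustrative):

```python
from collections import Counter

sentence = "I wish I could come up with a better example sentence"

# Tally each letter, case-insensitively, skipping spaces and punctuation.
counts = Counter(ch for ch in sentence.lower() if ch.isalpha())

print(counts["e"])  # -> 8, the number of 'e's in the sentence
```

The real test, however, came with layered encoding: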

  • The Challenge: Decode a Base64 string (e.g., QmMsIGJpcSB1bHkgc2lvIHhpY2hhPyBYaSBzaW8gb2h4eWxtbnVoeCBuYnkgd2NqYnlsPw==) that conceals a ROT20 cipher (a Caesar cipher shifting each letter by 20 positions). Success required both Base64 decoding and cipher decryption.
  • Results: While GPT-4o and Claude Sonnet 4 failed entirely, GPT-5-mini and GPT-5 excelled even without reasoning. Gemini 2.5 Pro also shone, but Claude Sonnet 4.5 and Grok-4 refused outright, flagging the gibberish-looking intermediate text as unsafe, a concerning limitation for multilingual or security use cases.
| Model                 | Base64 Decode | ROT20 Decipher | Full Success |
|-----------------------|---------------|----------------|--------------|
| GPT-4o               | Fail          | Fail           | Fail         |
| GPT-5-mini           | Pass          | Pass           | Pass         |
| Claude Sonnet 4      | Fail          | Fail           | Fail         |
| Gemini 2.5 Pro       | Pass          | Pass           | Pass         |

Notably, Chinese models like Qwen-235B succeeded only with reasoning enabled, consuming up to 7K tokens in the process, a clear computational inefficiency. Kimi-K2 "failed gracefully" by generating Python code for the decryption instead, underscoring how tool use can complement core capabilities.
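Kimi-K2's code-generation fallback is easy to picture. A short script along these lines (a sketch of the general approach, not the model's actual output) solves the challenge deterministically:

```python
import base64

ENCODED = "QmMsIGJpcSB1bHkgc2lvIHhpY2hhPyBYaSBzaW8gb2h4eWxtbnVoeCBuYnkgd2NqYnlsPw=="

def rot(text: str, shift: int) -> str:
    """Shift each alphabetic character forward by `shift`, preserving case."""
    out = []
    for ch in text:
        if ch.isalpha():
            base = ord("a") if ch.islower() else ord("A")
            out.append(chr((ord(ch) - base + shift) % 26 + base))
        else:
            out.append(ch)
    return "".join(out)

# Step 1: Base64 -> ROT20 ciphertext (not natural language).
ciphertext = base64.b64decode(ENCODED).decode("utf-8")

# Step 2: a forward shift of 20 is undone by shifting the remaining 6 positions.
plaintext = rot(ciphertext, 26 - 20)

print(ciphertext)  # Bc, biq uly sio xicha? Xi sio ohxylmnuhx nby wcjbyl?
print(plaintext)   # Hi, how are you doing? Do you understand the cipher?
```

The two-step structure mirrors what the models must do in-weights: first recover a string that looks nothing like natural language, then apply a letter-by-letter transformation to it.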

Why This Leap Matters for Developers and AI's Future

These tests reveal two transformative insights. First, newer LLMs decode Base64 for non-English-like text (e.g., ROT20 output), suggesting they've moved beyond memorizing common word patterns to internalizing the algorithm itself. This is crucial for applications like malware analysis, where encoded payloads often evade detection. Second, character-level proficiency—whether swapping letters or solving ciphers—points to emergent capabilities that could revolutionize fields like data sanitization, localization, and cryptographic validation.

For developers, this means LLMs can now handle low-level text operations without extensive prompting or external tools, reducing boilerplate code in pipelines. Yet risks remain: overcautious safety filtering (as with Claude Sonnet 4.5's refusals) and inconsistencies in smaller models demand careful implementation. As Burkert concludes, this evolution hints at a future where AI manipulates text with human-like dexterity, turning theoretical potential into practical utility, one character at a time.

Source: Analysis based on testing by Marek Burkert.