Language Model Tokenization Fails on Armenian Script, User Report Shows
#LLMs

AI & ML Reporter

A simple social media post shows how low-resource language support remains a fundamental weakness in commercial LLMs, with Armenian text reportedly causing Anthropic's Claude to fail outright.

A user report from November 25, 2025 shows that providing Armenian text input to Anthropic's Claude AI assistant causes the system to completely break down. Dyusha Gritsevskiy posted on X (formerly Twitter) asking "guys why does armenian completely break Claude" alongside a screenshot of the failure.

This isn't just a localization bug. It's a window into how modern language models handle scripts that fall outside their primary training distribution.

What's Actually Happening

When users input Armenian text—specifically the Armenian alphabet, which contains 39 distinct characters—Claude appears to encounter a processing error that prevents any response generation. The failure isn't partial; according to the report, it "completely break[s]" the model.

This type of failure typically occurs at the tokenization layer. Large language models don't read text like humans do. They break input into tokens—subword units that the model has learned to process. Most commercial LLMs, including Claude, are primarily trained on English and other high-resource languages. Their tokenizers are optimized for scripts like Latin, Cyrillic, and common CJK characters.
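Claude's tokenizer is not public, but the pattern is easy to illustrate with an open BPE tokenizer (OpenAI's cl100k_base vocabulary via the tiktoken library, used here purely as a stand-in). Scripts that were rare in a tokenizer's training data tend to fall back to byte-level pieces and inflate into far more tokens per character:

```python
# pip install tiktoken  (illustrative only; Claude's own tokenizer is not public)
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # open BPE vocabulary used by GPT-4

english = "Hello, how are you today?"
armenian = "Բարև, ինչպե՞ս ես այսօր"  # roughly the same greeting in Armenian

for label, text in [("English", english), ("Armenian", armenian)]:
    tokens = enc.encode(text)
    print(f"{label}: {len(text)} chars -> {len(tokens)} tokens")

# A rarely seen script typically decomposes into byte-level pieces, so each
# Armenian character can cost two or more tokens instead of a fraction of one.
```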

Armenian uses its own unique Unicode block (U+0530–U+058F). If the tokenizer wasn't properly trained on Armenian text, or if the vocabulary doesn't include Armenian character sequences, the model can't convert the input into meaningful token IDs. The result is either garbage output or, in this case, a complete system failure.
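A quick check confirms where those characters live; the snippet below is purely illustrative:

```python
# Armenian letters all fall in the dedicated Unicode block U+0530-U+058F.
for ch in "Հայերեն":  # "Armenian", written in Armenian
    cp = ord(ch)
    print(f"{ch}  U+{cp:04X}  in Armenian block: {0x0530 <= cp <= 0x058F}")
```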

The Technical Root Cause

Modern LLM tokenizers are typically built with algorithms such as Byte-Pair Encoding (BPE) or SentencePiece's unigram model; BPE, for example, builds its vocabulary by iteratively merging the most frequent character pairs in the training corpus (a toy sketch follows the list below). For Armenian, several factors compound:

  1. Sparse training data: Armenian appears only minimally in Common Crawl and other web scrapes used for LLM training, so the tokenizer likely never learned merges for Armenian character sequences.
  2. Unicode handling: Armenian letters are in a separate Unicode range. If the tokenizer's preprocessing pipeline doesn't properly handle this range, it may produce invalid token sequences.
  3. Vocabulary gaps: Even if the tokenizer can process Armenian characters, the model's embedding layer needs to have learned representations for those tokens. Without proper training data, these embeddings are either missing or randomly initialized.
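
The toy trainer below (a sketch, not Anthropic's actual tokenizer) makes point 1 concrete: when English overwhelms the corpus, every learned merge is a Latin-script pair, and the Armenian word never earns a vocabulary entry of its own.

```python
from collections import Counter

def learn_bpe_merges(words, num_merges):
    """Toy BPE trainer: repeatedly merge the most frequent adjacent symbol pair."""
    vocab = Counter(tuple(w) for w in words)  # each word as a tuple of symbols
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for word, freq in vocab.items():
            for a, b in zip(word, word[1:]):
                pair_counts[(a, b)] += freq
        if not pair_counts:
            break
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        new_vocab = Counter()
        for word, freq in vocab.items():
            merged, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    merged.append(word[i] + word[i + 1])
                    i += 2
                else:
                    merged.append(word[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges

# Hypothetical corpus: English dominates, Armenian appears twice. Every learned
# merge is a Latin-script pair; the Armenian word stays as single characters.
corpus = ["the"] * 1000 + ["they"] * 400 + ["them"] * 300 + ["Բարև"] * 2
print(learn_bpe_merges(corpus, 4))
# [('t', 'h'), ('th', 'e'), ('the', 'y'), ('the', 'm')]
```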

The "complete break" suggests the error occurs before generation—likely during input encoding. If the tokenizer returns an empty sequence or invalid tokens, the model has nothing to work with.

Why This Matters

This failure exposes a broader pattern in AI development: the gap between marketing claims of "multilingual" capability and actual performance on low-resource languages.

Anthropic markets Claude as capable of understanding multiple languages. However, this incident suggests that basic support for an entire script, and for the millions of Armenian speakers who rely on it, may be non-existent or brittle.

For practical applications, this means:

  • Translation services using Claude cannot handle Armenian
  • Content moderation systems would fail to flag Armenian spam or abuse
  • Educational tools built on Claude can't assist Armenian speakers

Broader Context in AI

This isn't unique to Claude. Similar issues have been reported with other models:

  • GPT-4 struggles with certain Indic scripts
  • Many models perform poorly on African languages
  • Even major European languages like Hungarian or Czech show degraded performance compared to English

The problem stems from data distribution in training sets. The Pile, Common Crawl, and other major datasets are overwhelmingly English-centric. Even "multilingual" models often prioritize the top 20-50 languages by web presence.

Potential Solutions

For users: There is currently no workaround; Armenian input simply fails to process.

For model providers: The fix requires:

  1. Expanded training data: Including Armenian web content, books, and other text in future training runs
  2. Tokenizer updates: Retraining tokenizers on more diverse scripts
  3. Unicode preprocessing: Ensuring robust handling of all Unicode ranges, for example through byte-level fallback (see the sketch after this list)
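
On point 3, a common mitigation is byte-level fallback, used for instance by SentencePiece's byte_fallback option and by GPT-style byte-level BPE: any character outside the learned vocabulary decomposes into its UTF-8 bytes, so nothing is ever unrepresentable. A minimal sketch with a hypothetical two-word vocabulary:

```python
def tokenize_with_byte_fallback(text, vocab):
    """Hypothetical tokenizer: in-vocabulary words become single tokens; anything
    else decomposes into UTF-8 byte tokens, so no character is ever 'unknown'."""
    tokens = []
    for word in text.split():
        if word in vocab:
            tokens.append(word)
        else:
            tokens.extend(f"<0x{b:02X}>" for b in word.encode("utf-8"))
    return tokens

vocab = {"hello", "world"}  # stand-in for a learned subword vocabulary
print(tokenize_with_byte_fallback("hello Բարև", vocab))
# ['hello', '<0xD4>', '<0xB2>', ...]  (the Armenian word survives as byte tokens)
```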

For the community: Projects like BLOOM and OpenLLM have made efforts toward more inclusive language support, but commercial models lag behind.

What Comes Next

Anthropic will likely address this through a model update. However, the fundamental issue—data scarcity for low-resource languages—won't be solved quickly. True multilingual support requires deliberate collection and curation of non-English text, which is expensive and time-consuming.

Until then, users of Armenian and other underrepresented languages will continue to encounter these barriers, reminding us that AI's "universal" capabilities are still very much a work in progress.

This article was generated based on a user report from X (formerly Twitter). For the latest updates on Claude's language capabilities, check Anthropic's official documentation.
