NeuralMorse: Reinventing Telegraphy with AI-Powered Tokenization and Semantic Encoding

A researcher has reimagined Morse code using neural networks and NLP techniques, creating NeuralMorse—a system that dynamically tokenizes text into sequences of eight tonal elements optimized for efficiency and learnability. By combining SentencePiece tokenization, word embeddings, and assignment problem optimization, it assigns semantically related tokens to similar-sounding symbols while prioritizing brevity for common words.

Morse code, with its elegant use of dots and dashes to encode language, has long been admired as an early example of efficient coding. Yet its static 19th-century design leaves room for optimization using modern computational techniques. Researcher Masato Hagiwara presents NeuralMorse, a complete reimagining of telegraphic communication that uses neural networks, statistical tokenization, and optimization algorithms to create a more efficient and semantically intuitive encoding system.

Chart of traditional Morse code letters and numerals (Wikipedia)

The Core Innovation: Beyond Dots and Dashes

NeuralMorse replaces Morse's two elements (dot and dash) with eight tonal components: four pitches, labeled A through D (523 Hz, 587 Hz, 659 Hz, and 698 Hz, roughly the musical notes C5 through F5), each with a short (a, b, c, d) and a long (A, B, C, D) duration. The letters are symbol labels, not note names. This expands the symbol space dramatically, allowing 1,800 unique symbols within a 9-dot duration limit. Unlike Morse’s letter-frequency mapping, NeuralMorse uses a three-stage pipeline:

Dynamic Tokenization: A SentencePiece model trained on OpenWebText2 and Reddit data tokenizes text into 1,900 subword units (e.g., "neural" → "ne ur al"), so that frequent tokens such as "the" can receive shorter encodings.
Semantic Clustering: fastText word embeddings represent each token's meaning, and hierarchical clustering converts those embeddings into binary codes.
Optimal Symbol Assignment: Tokens are matched to symbols by solving an assignment problem that minimizes:

\sum_{(t, s)} \left[ \mathrm{dist}(\mathrm{code}(t), \mathrm{code}(s)) + \alpha \cdot \mathrm{freq}(t) \cdot \mathrm{len}(s) \right]

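The first term keeps each token's code close to the code of its assigned symbol, and the second, weighted by α, penalizes long symbols for frequent tokens. As a result, frequent tokens get shorter symbols ("is" → "bd"), while related words ("become" → "bdBa", "became" → "bdBc") sound similar.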
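The objective above can be tried out on a toy instance. The following is a minimal sketch, not Hagiwara's implementation: the frequencies, the 3-bit codes, the use of Hamming distance for dist, the use of element count for len, and the value of α are all assumptions made for illustration. Only the three token/symbol pairs come from the article. It brute-forces every one-to-one matching, which is only feasible for a handful of items; for a vocabulary of realistic size one would use a dedicated assignment solver such as scipy.optimize.linear_sum_assignment.

```python
from itertools import permutations

ALPHA = 1.0  # assumed trade-off weight between similarity and brevity

# Toy data (hypothetical): token -> (relative frequency, semantic bit code)
TOKENS = {
    "is": (0.60, "000"),
    "become": (0.05, "110"),
    "became": (0.05, "111"),
}

# Toy data (hypothetical codes): symbol -> bit code
SYMBOLS = {
    "bd": "000",
    "bdBa": "110",
    "bdBc": "111",
}


def hamming(x, y):
    """Number of positions at which two equal-length bit strings differ."""
    return sum(c1 != c2 for c1, c2 in zip(x, y))


def pair_cost(token, symbol, alpha=ALPHA):
    """dist(code(t), code(s)) + alpha * freq(t) * len(s) for one pair."""
    freq, token_code = TOKENS[token]
    symbol_code = SYMBOLS[symbol]
    # len(s) is taken here as the number of tonal elements in the symbol.
    return hamming(token_code, symbol_code) + alpha * freq * len(symbol)


def best_assignment(alpha=ALPHA):
    """Try every one-to-one matching and return the cheapest one."""
    tokens = list(TOKENS)
    best = min(
        permutations(SYMBOLS),
        key=lambda perm: sum(pair_cost(t, s, alpha) for t, s in zip(tokens, perm)),
    )
    return dict(zip(tokens, best))


if __name__ == "__main__":
    print(best_assignment())
```

With these toy numbers the frequent token "is" ends up with the two-element symbol, and "become" and "became" receive symbols that differ only in their final tone.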

NeuralMorse symbol examples and encoding pipeline
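To make the symbol notation concrete, here is a small sketch that expands a symbol string into the tones it stands for. The pitch values are the ones given in the article; the millisecond durations (and the 3:1 long-to-short ratio, borrowed from Morse's dash-to-dot convention) are assumptions, and the function name symbol_to_tones is hypothetical.

```python
# Pitches in Hz as listed in the article, keyed by lowercase symbol label.
PITCH_HZ = {"a": 523, "b": 587, "c": 659, "d": 698}

SHORT_MS = 100  # assumed duration of a short element
LONG_MS = 300   # assumed duration of a long element (3x short, as in Morse)


def symbol_to_tones(symbol):
    """Expand a symbol such as 'bdBa' into (frequency_hz, duration_ms) pairs.

    Lowercase letters are short tones and uppercase letters are long tones.
    Raises KeyError for characters outside a-d / A-D.
    """
    tones = []
    for ch in symbol:
        pitch = PITCH_HZ[ch.lower()]
        duration = LONG_MS if ch.isupper() else SHORT_MS
        tones.append((pitch, duration))
    return tones


if __name__ == "__main__":
    for word, symbol in [("become", "bdBa"), ("became", "bdBc")]:
        print(word, symbol, symbol_to_tones(symbol))
```

Under these assumptions, "bdBa" and "bdBc" share their first three tones and differ only in the last one, which is the sense in which related words sound similar.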

Practical Implications and Challenges

Initial tests suggest humans can learn to decode NeuralMorse, with frequent patterns becoming recognizable—akin to language acquisition. However, challenges remain:

Bias Mitigation: Original models underrepresented female-gendered terms, requiring dataset oversampling.
Temporal Efficiency: Although the encoding is theoretically more compact, decoding it in real time demands dedicated training tools (e.g., an equivalent of a Morse typing trainer).
Multilingual Potential: The framework could extend to other languages or even unified cross-lingual encodings where "cat" and "gato" share acoustic patterns.

Hagiwara’s GitHub repository and Colab notebook enable experimentation, inviting the community to explore variations like Braille optimization or musical chord mappings. NeuralMorse isn’t just a nostalgic tribute to telegraphy—it demonstrates how transformer-era NLP can breathe new life into fundamental communication paradigms, balancing compression efficiency with the brain’s affinity for semantic patterns.