A fascinating experiment reveals AI's surprising capabilities and limitations in music transcription, from correctly identifying Bach to hallucinating two-part harmonies.
When I first heard about using AI to generate LilyPond code, I was skeptical. LilyPond is a TeX-like typesetting language for sheet music with a steep learning curve and a relatively small community of users. How could AI possibly understand such a niche domain well enough to produce usable output? The results of my experiments were both disappointing and fascinating, revealing the current limitations of AI in specialized domains while also showcasing some unexpected capabilities.
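For readers who haven't seen the language, here is roughly what a minimal LilyPond file looks like. This is a hand-written illustration of the format, not output from either model:

```lilypond
\version "2.24.0"  % LilyPond files declare the version they target

\relative c' {
  \key c \major
  \time 4/4
  % Pitches are letter names; the number after a note is its duration
  % (4 = quarter, 2 = half, 1 = whole). "|" marks a bar line check.
  c4 d e f | g2 g | a4 a a a | g1 |
}
```

Even this tiny example hints at why transcription is hard: every pitch, duration, and bar line has to be exactly right, or the output diverges from the source.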
The Lilypond Experiment
I've had surprisingly good results asking AI to generate LilyPond code for music theory posts, like the one on the James Bond chord. Given the obscurity of LilyPond, there can't be that much publicly available code to train on. This made me curious about how well AI would work if I uploaded an actual image of sheet music and asked it to produce corresponding LilyPond code.
I tested this with two images: one classical excerpt and one jazz piece. I used the same prompt for both images with both Grok and ChatGPT: "Write Lilypond code corresponding to the attached sheet music image."
Classical Music: Bach Meets AI
When I uploaded a Bach excerpt, Grok's output was hilariously bad from a transcription standpoint. The generated code turned one measure into eight, bearing no resemblance to the original. However, Grok correctly inferred that the excerpt was by Bach, and the music it composed was actually in the style of Bach—just not what I asked for. It's as if the AI recognized the composer and decided to write its own Bach-inspired piece instead of transcribing the given music.
ChatGPT's attempt was even more bizarre. Not only did it hallucinate notes that weren't in the original, but it hallucinated in two-part harmony! The AI seemed to have decided that Bach's music should be more elaborate than it actually was, adding layers that transformed a simple excerpt into something far more complex.
Jazz Standards: Chords vs. Melody
For the jazz example, I was particularly curious about how the AI would handle chord symbols. Grok's results were interesting: the notes were almost completely unrelated to the original, but the chords were correct. Grok used the notation Δ for major 7th chords, which is a common alternative to writing "maj7." More impressively, Grok correctly identified the song title and even credited Johnny Burke and Jimmy Van Heusen for the lyrics and music.
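For what it's worth, the Δ convention is easy to express in LilyPond itself. A sketch of a maj7 chord in LilyPond's \chordmode (the chords here are chosen for illustration, not taken from the actual chart):

```lilypond
\version "2.24.0"

\new ChordNames {
  % By default, LilyPond renders :maj7 chords with a triangle (Δ);
  % the majorSevenSymbol property of the ChordNames context controls
  % the exact markup if you prefer "maj7" spelled out.
  \chordmode { c1:maj7 | f1:maj7 | }
}
```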
ChatGPT's jazz transcription showed a different kind of creativity. While the chords were correct, the notes bore only a loose resemblance to the original. ChatGPT took the liberty of changing the key and time signature, and the last measure had seven and a half beats—a clear violation of standard musical notation. When I asked what song the fragment was from, ChatGPT identified it as "Misty," demonstrating that it could recognize the source material even when its transcription was wildly inaccurate.
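Amusingly, LilyPond itself can flag this kind of error: a bar check (the "|" symbol) at the end of an over-full measure produces a warning at compile time. A minimal sketch of my own, not ChatGPT's actual output:

```lilypond
\version "2.24.0"

{
  \time 4/4
  % Seven quarter notes plus an eighth: 7.5 beats in a 4/4 measure.
  % LilyPond still renders this, but the bar check at the end
  % triggers a "barcheck failed" warning during compilation.
  c'4 d' e' f' g' a' b' c''8 |
}
```

So even if a model can't count beats, the toolchain can catch the arithmetic error for you.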
What This Tells Us About AI and Music
These experiments reveal several important insights about AI's current capabilities with specialized domains like music notation:
First, AI can recognize and identify music styles and composers even when it cannot accurately transcribe the notes. This suggests that the models have learned patterns and characteristics of different composers' work, even if they haven't mastered the technical skill of accurate transcription.
Second, AI tends to hallucinate when faced with tasks it cannot complete accurately. Rather than admitting uncertainty or producing a rough approximation, the models often generate confident but incorrect output. This is particularly evident in the two-part harmony hallucination and the creation of measures with seven and a half beats.
Third, AI shows surprising competence in understanding musical structures like chord progressions and song identification, even when the detailed notation is wrong. The correct identification of "Misty" and the accurate chord symbols suggest that the models have learned important aspects of music theory and common practice.
The Path Forward
These results point to both the promise and limitations of current AI for specialized tasks. While AI cannot yet reliably transcribe sheet music from images, it demonstrates an understanding of musical context, style, and structure that could be valuable for composers and musicians.
The gap between recognizing that something is Bach and actually transcribing Bach accurately is significant. It suggests that while AI has learned patterns from its training data, it hasn't necessarily learned the precise skills needed for accurate notation. This is perhaps unsurprising given that LilyPond is a specialized language with limited online presence.
For now, AI-generated LilyPond code remains more of a curiosity than a practical tool. But the ability to identify composers, recognize songs, and understand chord progressions from sheet music images hints at the potential for more sophisticated music AI tools in the future. As models continue to improve and perhaps gain access to more specialized training data, we may see AI that can not only recognize Bach but also accurately transcribe his music—and perhaps even compose in his style with greater fidelity to the original.
The journey from recognizing musical patterns to accurately transcribing them is still ongoing, but these experiments show we're making progress, even if that progress sometimes takes the form of hilariously wrong but stylistically appropriate compositions.