LLM poetry and the 'greatness' question
#LLMs

Trends Reporter
9 min read

Two distinct approaches—Gwern's artisanal prompting and Mercor's scalable evaluation—are testing whether AI can produce poetry that transcends technical competence to achieve genuine artistic greatness, revealing a fundamental tension between particularity and pattern.

The question isn't whether large language models can write poetry. They can. They handle rhyme, meter, and metaphor with increasing sophistication. The real question is whether any LLM output rises to the level of great poetry.

Great poetry, by a definition refined over thirty years of reading, teaching, and writing, is both particular and universal. It emerges from a specific person, moment, and culture, yet resonates across time and distance. This resonance happens through an invisible network: poems activate prior reading in the mind of the reader, creating echoes between old and new work. That network is where the particular becomes universal.

LLMs have read most digitized poetry. When asked to write about grief, they draw on images and phrasings from earlier poems. But they don't yet have culture. And without culture, the path to greatness remains unclear.

Gwern's Workshop

Gwern stands as the most thoughtful explorer of how LLMs actually handle poetry. His experiments—completing William Empson's "This Last Pain," writing romantic verse, generating Pindaric odes—are required reading for anyone working at the intersection of AI and poetry.

The Empson project reveals the evolution of LLM capability. Early models couldn't follow instructions reliably. You'd provide a fragment and they might continue it, veer into fake commentary, or fabricate a bibliography. Tokenization fractured words, breaking rhyme. The results had a "wonderful strangeness," but no control.

Later models like ChatGPT brought obedience but lost creativity. Trained via reinforcement learning from human feedback (RLHF) to satisfy human raters, they produced safe, sentimental, generic verse. Gwern calls this "mode collapse."

Creativity returned through three factors: reasoning models using chain-of-thought (giving models time to escape the "default basin of chatbot blandness"), continued capability scaling, and explicit attention to mode collapse at frontier labs. Moonshot's "rubric training" for Kimi K2 allowed qualitative feedback rather than binary preference.

Gwern's process mirrors human poetry workshops. For the Empson completion, he constructed a multi-stage prompt (a rough orchestration sketch in code follows the list):

  1. Analyze style, content, and intent of the original
  2. Brainstorm 10+ diverse directions
  3. Critique each, rate 1-5 stars
  4. Write the best one
  5. Critique and edit line by line
  6. Generate a clean draft
  7. Repeat at least twice
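
Nothing in this loop requires bespoke tooling; it can be expressed as a short orchestration script. Here is a minimal sketch, assuming a generic `llm(prompt)` callable standing in for whatever model handles each step; the function and the step prompts are illustrative paraphrases, not Gwern's actual prompt text.

```python
# Minimal sketch of the analyze / brainstorm / critique / draft / edit loop.
# `llm` stands in for any chat-model call; the step prompts paraphrase the
# list above and are illustrative, not Gwern's actual prompt text.

from typing import Callable

def workshop(fragment: str, llm: Callable[[str], str], rounds: int = 2) -> str:
    analysis = llm(f"Analyze the style, content, and intent of this fragment:\n{fragment}")
    ideas = llm(f"Given this analysis:\n{analysis}\nBrainstorm 10+ diverse directions for completing it.")
    rated = llm(f"Critique each direction and rate it 1-5 stars:\n{ideas}")
    draft = llm(f"Write a completion following the best-rated direction.\nDirections:\n{rated}\nFragment:\n{fragment}")
    for _ in range(rounds):  # "repeat at least twice"
        critique = llm(f"Critique and edit this draft line by line:\n{draft}")
        draft = llm(f"Produce a clean draft incorporating these edits:\n{critique}")
    return draft

if __name__ == "__main__":
    def echo(prompt: str) -> str:  # toy stand-in so the sketch runs offline
        return prompt.splitlines()[-1]
    print(workshop("<poem fragment>", echo))
```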

This is what poets do: write many rough drafts, murder your darlings, assess, publish. Gwern and his LLM sidekicks would fit right into any literary journal selection committee—except for one thing: discussion of the poet.

Committees discuss the poet's trajectory, their mentors, and how their work moves the field; the best indicator of future greatness is early work. Gwern sees this and treats the different models like students with different strengths in a seminar of budding poets: Claude for taste and curation, o1-pro for brainstorming, Kimi K2 for critique. He matches model to task and revisits poems as tools improve, treating the work as living documents.

The Pindaric Ode Project

Gwern's recent project to generate Pindaric odes "to lab animals, praising them and their unwitting yet noble sacrifices" demonstrates the full system. The pressure-cooker prompt (v4.2) defines form strictly (a toy checker for a few of these constraints follows the list):

  • Triadic structure: Strophe, Antistrophe, Epode with identical line counts in first two sections
  • Stress-based meter: Strophe 6 stresses (5-7 acceptable), Antistrophe 8 (7-9), Epode 4 (3-5)
  • Mandatory caesura marked with ||, serving as semantic hinge
  • Alliteration rules with density limits (2-3 stresses per sound target, 4 allowed, 5 flagged)
  • Enjambment requirements: Antistrophe needs one sentence spanning 4+ lines, 50% lines without terminal punctuation, 33% caesuras splitting phrases
  • Scansion comments on every line
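
Several of these constraints are mechanical enough to check outside the model. Below is a toy validator for three of them (the mandatory caesura, the per-section stress ranges, and the antistrophe's enjambment quota); it assumes each line carries a scansion comment of the form `# stresses: N`, which is an invented convention for illustration, not the prompt's actual annotation format.

```python
# Toy validator for a few of the mechanical constraints above: the mandatory
# "||" caesura, the per-section stress ranges, and the antistrophe's quota of
# lines without terminal punctuation. Assumes each line ends with a scansion
# comment like "# stresses: 7" -- an invented convention for illustration.

import re

STRESS_RANGES = {"strophe": (5, 7), "antistrophe": (7, 9), "epode": (3, 5)}
END_PUNCT = (".", "!", "?", ";", ":")  # commas treated as non-terminal

def check_section(name: str, lines: list[str]) -> list[str]:
    lo, hi = STRESS_RANGES[name]
    problems, unpunctuated = [], 0
    for i, line in enumerate(lines, 1):
        text = line.split("#")[0].rstrip()  # strip the scansion comment
        if "||" not in text:
            problems.append(f"{name} line {i}: missing caesura '||'")
        m = re.search(r"#\s*stresses:\s*(\d+)", line)
        if m and not (lo <= int(m.group(1)) <= hi):
            problems.append(f"{name} line {i}: {m.group(1)} stresses outside {lo}-{hi}")
        if not text.endswith(END_PUNCT):
            unpunctuated += 1
    if name == "antistrophe" and unpunctuated < len(lines) / 2:
        problems.append("antistrophe: fewer than 50% of lines lack terminal punctuation")
    return problems

if __name__ == "__main__":
    sample = ["Sing the shaven field, || the wire and the white coat  # stresses: 6"]
    print(check_section("strophe", sample) or "ok")
```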

The LLMs researched laboratory animal history, compiling databases of proper nouns and images. They created categories: geography (Vivarium, Laminar Flow), heroes (Laika, Dolly, OncoMouse), tribes (C57BL/6, Wistar Rat), priests (Abbie Lathrop, Louis Pasteur), rituals (LD50, Cervical Dislocation), concepts (The Warm Cage, Blood-Price).

This databank prevented generic imagery while keeping the poem in its lane. The LLMs generated multiple drafts, critiqued and rated each, rewrote in light of the critiques, selected the best, then iterated the draft-critique-revise cycle. For the final brainstorming, Gwern prompted the model to evaluate the poem as if it were being submitted to Poetry magazine, then requested a "reviewer #2" report. The persona "unhobbles" the model's feedback, making it more critical and energetic.
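
Both devices can be sketched directly: the databank as a plain dictionary seeded into the drafting prompt, and the "reviewer #2" persona as a critique prompt. The category names and entries below come from the project as described above; the prompt wording is illustrative, not the actual prompt.

```python
# Sketch of the databank and the persona critique. Category names and entries
# follow the lists above; the prompt text is illustrative, not the actual prompt.

DATABANK = {
    "geography": ["Vivarium", "Laminar Flow"],
    "heroes": ["Laika", "Dolly", "OncoMouse"],
    "tribes": ["C57BL/6", "Wistar Rat"],
    "priests": ["Abbie Lathrop", "Louis Pasteur"],
    "rituals": ["LD50", "Cervical Dislocation"],
    "concepts": ["The Warm Cage", "Blood-Price"],
}

def drafting_prompt(form_spec: str) -> str:
    """Seed the formal spec with the databank so imagery stays in its lane."""
    nouns = "; ".join(f"{cat}: {', '.join(items)}" for cat, items in DATABANK.items())
    return f"{form_spec}\nDraw proper nouns and imagery only from: {nouns}"

def reviewer_two_prompt(draft: str) -> str:
    """Persona prompt meant to 'unhobble' the critique."""
    return ("You are reviewer #2 evaluating a submission to Poetry magazine. "
            "Write a blunt, detailed report on this draft:\n" + draft)
```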

The output is very good. Occasionally great. Gwern's engineering of prompts is, in effect, writing poetry. He thinks like a poet: what form? What words? What's clichéd? What will editors think?

Mercor's Scale

Mercor takes a different path. Tyler Cowen's conversation with founder Brendan Foody reveals Mercor hires poets at $150/hour to teach AI models. The logic: when AI labs want better poetry, Mercor finds the best poets to create evals and examples. Once poets teach models, that knowledge scales across billions of users.

Mercor isn't trying to appreciate fine poetry. Poetry is a test case for understanding what expert knowledge brings. The bet: if training models to write better poetry works, the same approach can train them to produce better legal drafts, medical diagnoses, and financial analyses. Professional judgment (a lawyer framing an argument) and aesthetic judgment (a poet breaking a line) are computationally similar: both navigate unbounded decision spaces with no single correct answer, only better or worse based on expert consensus.

The process starts with a rubric, a scoring guide like teachers use. For poetry: reward appropriate structure, penalize clichéd imagery, penalize mixed metaphors, reward endings that reframe openings. These aren't subjective criteria. Experts take AI-generated responses that scored well and explain why the best is best. Rankings feed back into the rubric. Future models are tested against the eval and rubric; outputs that match expert preferences score higher.

This is RLHF. Expert creates rubric => model generates => expert grades => rubric refined => model updates.
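
In practice a rubric like this can be written down as weighted, named criteria, with per-criterion grades (from an expert or a grader model) aggregated into a single reward signal. Below is a minimal sketch using the example criteria above; the weights, the grade scale, and the aggregation are assumptions for illustration, not Mercor's actual schema.

```python
# Toy representation of a poetry rubric as weighted criteria. Criterion names
# follow the examples above; the weights, dataclass, and aggregation are
# illustrative assumptions, not Mercor's actual schema.

from dataclasses import dataclass

@dataclass
class Criterion:
    name: str
    weight: float  # positive = reward, negative = penalty

RUBRIC = [
    Criterion("appropriate structure", +1.0),
    Criterion("cliched imagery", -1.0),
    Criterion("mixed metaphors", -0.5),
    Criterion("ending reframes the opening", +1.5),
]

def score(grades: dict[str, float]) -> float:
    """Combine per-criterion grades (0-1, from an expert or grader model)
    into a single scalar usable as a reward signal."""
    return sum(c.weight * grades.get(c.name, 0.0) for c in RUBRIC)

if __name__ == "__main__":
    # Example: strong structure and ending, a touch of cliche.
    print(score({"appropriate structure": 0.9,
                 "cliched imagery": 0.2,
                 "ending reframes the opening": 0.8}))
```

The refinement step in the loop then amounts to adjusting the criteria and weights until the aggregate ordering matches the experts' rankings.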

Mercor solves a "last mile problem." Current models generate average work. The gap between average and expert—what Foody calls the "last 25%"—is recognizing subtle errors or edge cases. Poetry evals are a compressed test of capabilities that matter commercially: stylistic control, emotional tone, constraint satisfaction, long-range coherence, nonliteral language. Improve poetry, improve ad copy, UX text, marketing emails, fiction, scripts, corporate communications.

The market value of poetry is small. But the indirect value of poetic capabilities is infinite. Foody's pitch for APEX is "measure the things customers actually care about." Poetry is a public benchmark for "this model feels creative and emotionally intelligent." Better poetry supports product stickiness and subscription revenue.

But Foody's "traction" metric mathematically incentivizes regression toward the mean of reader preference. This builds an engine for eliminating the "strangeness" Gwern preserves. Edge cases are errors to be pruned. For poets, edge cases are the poem.

The Greatness Problem

Aristotle wrote that poetry is more philosophical and more serious than history. History records what happened to specific individuals. Poetry takes a particular case and makes claims beyond it. Shakespeare didn't write a treatise on ambition; he wrote about Richard III and Macbeth. The ambition matters because it's embedded in those lives, cultures, situations, language.

Allegory and poetry run in opposite directions. Allegory starts from abstract ideas like greed and invents characters to illustrate them. Great poems start with particulars and gesture toward the universal.

LLMs are like allegory: algorithmic generation moves from general pattern toward manufactured particulars. Most readers can't always tell the difference. A well-formed allegory and good poem look similar on the page. This makes LLM outputs seductive; they inherit surface craft from training data.

Yeats's "For Anne Gregory" (1933) is a test case:

'Never shall a young man,
Thrown into despair
By those great honey-coloured
Ramparts at your ear,
Love you for yourself alone
And not your yellow hair.'

'But I can get a hair-dye
And set such colour there,
Brown, or black, or carrot,
That young men in despair
May love me for myself alone
And not my yellow hair.'

'I heard an old religious man
But yesternight declare
That he had found a text to prove
That only God, my dear,
Could love you for yourself alone
And not your yellow hair.'

Any good LLM can produce competent variation. Swap yellow hair for blue eyes, freckles, billion-dollar dowry, large social media following. Keep antiphonal structure, final turn through theology or therapy. Get something that reads smoothly, feels like a poem.

The particularity that makes Yeats great isn't that Anne Gregory was real. She's a particular subject in a particular milieu, with a particular relation to Yeats and to the social codes around beauty, hair, and youth in County Galway. Particularity means all the ways this girl was embedded in her culture. Not every culture fixates on blond hair. Not every culture has its young women go bareheaded. Not every culture features an "old religious man" speaking archly about a young woman's looks.

A great deal of Yeats's culture is in the poem.

For Mercor's process, that layer of meaning is mostly an obstacle. Their poets define rubrics that reward good structure, emotional tone, and technique across broad classes of poems. The rubric likely doesn't register the importance of particularity, culture, or local knowledge. That's not the point.

Models can imitate poem patterns, swap in tokens that fit syntactic and tonal molds, even gesture at cultural context if prompted. But without prompting from a human like Gwern, an LLM cannot originate a poem whose particularity pushes back on the pattern, a poem that belongs to a specific life and historical network and then radiates out.

Two Paths Forward

Gwern treats models as collaborators inside the workshop where particular poems are made. He partners for analysis, brainstorming, self-critique, drafting, redrafting, editorial argument. He matches different models to different roles, like a teacher matching students to exercises. He produces work that feels alive and might, over time and revision, move toward greatness because it remains anchored in specific formal problems and imaginative projects.

Mercor uses poems inside the workshop of generalized reward models. The object isn't any individual poem or poet. It's a set of signals reusable for law, medicine, consulting, marketing, power-user satisfaction. If greatness means a poem about one life resonating across cultures, Mercor's system can't capture that at scale. Particularity and culture cannot scale.

Rubrics and evals won't produce something for which a reader says: "I believe this poet captured this truth there, in a place that has nothing to do with me, and yet it touches me here." That's greatness, not desirability.

Gwern may get there.

Operationally, greatness is measured when tastemakers put poems in anthologies so generations can read them, tear out resonant ones, tape them to refrigerators. That's the test. The ice cream, as Gwern notes, can't be eaten for you.
