24 Years of Blog Posts, One Markov Model, and the Birth of Digital Dadaism

In a fascinating experiment that blurs the line between coherent thought and algorithmic nonsense, programmer Susam Pal has fed his entire 24-year blog archive into a minimalist Markov text generator. The result? A collection of sometimes coherent, often bizarre, and occasionally profound digital creations that reveal the hidden patterns in a lifetime of writing.

The experiment centers on a small program called "Mark V. Shaney Junior," a modern implementation of the legendary Markov text generator from the 1980s. What makes this experiment particularly compelling is not just the technical implementation but the sheer volume of data being processed: over 200 posts, roughly 200,000 words, spanning more than two decades of technical writing.

The Art of Exploratory Programming

Pal describes himself as an avid "exploratory programmer," someone who writes computer programs not to solve specific problems but simply to explore particular ideas for recreation. "I must have written small programs to explore Markov chains for various kinds of state spaces over a dozen times by now," Pal explains. "Every time, I just pick my last experimental code and edit it to encode the new state space I am exploring."

This approach has resulted in hundreds of tiny experimental programs scattered across Pal's disk. Occasionally, one of these exploratory projects receives the finishing touches needed for public consumption, wrapped up in a proper Git repository with documentation and shared on platforms like GitHub and Codeberg. The Mark V. Shaney Junior program is one such project.

Understanding the Markov Model

At its core, the Markov model employed here is remarkably simple. As Pal describes it, "By default, this program looks at trigrams (all sequences of three adjacent words) and creates a map where the first two words of the trigram are inserted as the key and the third word is appended to its list value."
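
To make the idea concrete, here is a minimal Python sketch of the kind of map Pal describes. The function name and structure are illustrative, not taken from Mark V. Shaney Junior itself:

import collections

def build_model(words):
    # For each trigram (a, b, c), the pair (a, b) is the key and c is
    # appended to that key's list of followers.
    model = collections.defaultdict(list)
    for a, b, c in zip(words, words[1:], words[2:]):
        model[(a, b)].append(c)
    return model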

This approach captures the statistical relationships between words, allowing the model to generate new text based on learned patterns. The text generation process begins with a random pair of words, then selects a subsequent word from the list of possible followers, creating a chain of statistically plausible word sequences.

"The model captures each pair of adjacent words along with the words that immediately follow each pair," Pal explains. "The text generator then chooses a key (a pair of words) at random and looks for a word which follows. If there are multiple followers, it picks one at random."

This simplicity is precisely what makes the model both elegant and revealing. "That is pretty much the whole algorithm," Pal notes. "There isn't much more to it. It is as simple as it gets. For that reason, I often describe a simple Markov model like this as the 'hello, world' of language modelling."
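
The generation loop can be sketched just as briefly. Again, this is a hypothetical rendering of the steps Pal describes rather than his actual code; it assumes the pair-to-followers map built above:

import random

def generate(model, length=100):
    key = random.choice(list(model))        # start from a random pair of words
    output = list(key)
    for _ in range(length):
        followers = model.get(key)
        if not followers:                   # no known follower: stop here
            break
        word = random.choice(followers)     # multiple followers: pick one at random
        output.append(word)
        key = (key[1], word)                # slide the two-word window forward
    return " ".join(output)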

The Algorithmic Muse

When applied to Pal's extensive blog archive, the model produces results that range from the coherent to the delightfully absurd. Consider this example generated from the default 2nd-order model:

while a query replace operation is approved by the user. The above
variable defines the build job. It can be incredibly useful while
working on assembly language and machine code. In fact, all internal
resources like the result to refine the search prompt changes from
bck-i-search: to fwd-i-search:. Now type C-SPC (i.e. ctrl+space) to
set a mark causes Emacs to use 32-bit registers like EBP, ESP,
etc. Thus the behaviour is undefined. Such code may behave differently
when compiled with the readily available GNU tools like the shape
of 8. Flipping "P" horizontally makes it a proper quine: cat $0

This text combines technical concepts from Pal's programming posts—Emacs commands, assembly language, build systems—into a surreal tapestry that somehow maintains a technical coherence despite its nonsensical flow.

Another example reveals the model's ability to stitch together disparate topics:

Then open a new Lisp source file and the exact answer could harm
students' self-esteem. Scientists have arbitrarily assumed that an
integral domain. However, the string and comment text. To demonstrate
how a build job can trigger itself, pass input to standard output or
standard error), Eshell automatically runs the following command in
Vim and Emacs will copy the message length limit of 512 characters,
etc. For example, while learning to play the game between normal mode
to move the point is on an old dictionary lying around our house and
that is moving to the small and supportive community

Pal notes the particularly amusing juxtaposition: "No, I have never written anywhere that opening a Lisp source file could harm anyone's self-esteem. The text generator has picked up the 'Lisp source file' phrase from my Lisp in Vim post and the 'self-esteem' bit from the From Perl to Pi post."

The Order of Chaos

The model's behavior changes dramatically when adjusting its "order", the number of preceding words used as the key when choosing the next word. By default, the order is 2 (using word pairs as keys). Increasing this value to 3 or 4 produces more coherent results:

It is also possible to search for channels by channel names. For
example, on Libera Chat, to search for all channels with "python" in
its name, enter the IRC command: /msg alis list python. Although I
have used Libera Chat in the examples above, there are plenty of
infinite fields, so they must all be integral domains too. Consider
the field of rational numbers Q . Another quick way to arrive at this
fact is to observe that when one knight is placed on a type D square,
only two positions for the second knight such that the two knights

Pal notes that with higher-order settings, "Except for a couple of abrupt transitions, the text is mostly coherent." However, there's a sweet spot to be found. "We need to be careful about not increasing the order too much. In fact, if we increase the order of the model to 5, the generated text becomes very dry and factual because it begins to quote large portions of the blog posts verbatim. Not much fun can be had like that."
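
In code, raising the order simply means using longer tuples as keys. A hedged sketch, generalising the illustrative build_model function from above (order=2 reproduces the word-pair behaviour):

import collections

def build_model(words, order=2):
    model = collections.defaultdict(list)
    for i in range(len(words) - order):
        key = tuple(words[i:i + order])      # an order-word key
        model[key].append(words[i + order])  # the word that follows it
    return model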

The model can even generate text from a specific starting prompt:

Finally we divide this number by a feed aggregrator for Emacs-related
blogs. The following complete key sequences describe the effects of
previous evaluations shall have taken a simple and small to contain
bad content. This provides an interactive byte-compiled Lisp function
in MATLAB and GNU bash 5.1.4 on Debian is easily reproducible in
Windows XP. Older versions might be able to run that server for me it
played a significant burden on me as soon as possible. C-u F: Visit
the marked files or directories in the sense that it was already
initiated and we were to complete the proof.

"Apparently, this is how I would sound if I ever took up speaking gibberish!" Pal quips.

The Hidden Architecture of Thought

Beyond the amusement value, this experiment reveals something profound about how we construct language and ideas. The Markov model, in its simplicity, exposes the statistical underpinnings of writing style and topic association.

When the model generates coherent technical passages, it demonstrates that our writing follows predictable patterns—certain word combinations appear more frequently, and certain topics cluster together in predictable ways. The abrupt transitions between topics highlight the associative nature of thought and writing, where one concept can lead to another through seemingly arbitrary connections.

This has implications for understanding language models more broadly. Modern large language models operate on similar principles but with vastly increased complexity and training data. Pal's minimalist implementation serves as a perfect "hello, world" example for understanding how these systems work at their most fundamental level.

The experiment also speaks to the nature of creativity itself. While the Markov model doesn't truly "create" in the human sense, it produces text that can occasionally surprise and delight. This suggests that creativity might be, at least in part, a matter of pattern recognition and association—a process that algorithms can simulate, if not replicate.

The Digital Dadaist

In feeding his life's work to an algorithm, Pal has become a digital Dadaist, creating through constraint and randomness rather than conscious intent. The resulting text—sometimes coherent, sometimes nonsensical, always fascinating—challenges our understanding of authorship, creativity, and meaning.

As we continue to develop increasingly sophisticated language models, experiments like this serve as both educational tools and artistic statements. They remind us that beneath the complexity of human language lies a mathematical structure, and within the randomness of algorithmic generation can emerge moments of unexpected coherence.

In the end, Pal's experiment is more than just a technical demonstration; it's a meditation on the nature of writing, thought, and the strange alchemy that occurs when human expression meets algorithmic processing. The generated text may be gibberish, but in its creation, we find a mirror reflecting the hidden patterns in our own minds.