# Markov Chains are the Original Language Models

By Elijah Potter

---

## Overview

This article revisits a senior-year high school Linear Algebra project on Markov chains, republished with minor edits. The author reflects on the current hype around AI language models and shares a homemade Markov chain implementation for text auto-completion, explaining the math behind Markov chains and their application to language modeling and text generation.

---

## AI Hype Cycle in Language Models

The author identifies four mental stages when interacting with modern large language models:

1. **Amazement.** Fascination with conversing with a computer as if it were human, and with the seemingly endless possibilities.
2. **Frustration.** The realization that the models err frequently and add real value mainly on mundane tasks.
3. **Confusion.** Even after moving on, the pervasive social hype leads the author to question their own stance.
4. **Boredom.** Oversaturation with the models, and a desire to understand fundamentals: transparent, hands-on systems without "magic."

The author finds this last stage the most compelling and chooses to explore Markov chains as a foundational language model.

---

## Markov Chains Explained

### Basic Concept

A Markov chain models probabilistic sequences in which the next state depends only on the current state, not on past history.

### Example: Alice's locations

- At the grocery store: a 70% chance of going to the planetarium in the next hour, a 30% chance of staying.
- At the planetarium: a 90% chance of staying, a 10% chance of returning to the grocery store.

This is represented as a transition matrix \( T \), whose columns correspond to the starting location and whose rows correspond to the ending location:

| | Grocery Store (start) | Planetarium (start) |
|---------------------|-----------------------|---------------------|
| Grocery Store (end) | 0.3 | 0.1 |
| Planetarium (end) | 0.7 | 0.9 |

Given a vector \( \vec{s} \) encoding Alice's current location, the matrix-vector product \( T \vec{s} \) is the probability distribution over her location one hour from now. Repeated multiplication predicts farther into the future:

\[ T^n \vec{s} \]

---

## Application to Text Completion

- **Dictionary:** maps each word to an index (for example, "orange" → 0, "fruit" → 1, and so on).
- **Training text:** translated into indices so that word-to-word transitions can be tracked.
- **Counting transitions:** build a matrix \( C \) whose entry \( C_{ij} \) counts how often word \( i \) follows word \( j \).
- **Converting to probabilities:** normalize each column of \( C \) by its sum to obtain the probability matrix \( M = C D \), where \( D \) is a diagonal matrix holding the reciprocals of the column sums.
- **Prediction:** given the last word typed, form a vector \( \vec{s} \) with a 1 at that word's index and 0 elsewhere; the product \( M \vec{s} \) is a distribution over probable next words.

---

## Text-Generation Challenges

- **Naive approach:** always pick the most probable next word (or pick randomly from the top few options).
- **Problem:** a Markov chain converges toward its steady-state distribution, so deterministic selection produces repetitive, dull text.
- **Proposed solution:** introduce randomness by multiplying the predicted distribution \( M \vec{s} \) by a random diagonal matrix \( R \), then selecting the word corresponding to the maximum entry of \( R M \vec{s} \).

---

## Interactive Demo

The author implemented a Markov-chain text auto-completer in Rust and WebAssembly. It can be driven from the keyboard or by clicking suggested next words. The implementation demonstrates the ideas but is not optimized for efficiency; a minimal sketch of the same pipeline follows below.
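To make the pipeline concrete, here is a minimal, self-contained Rust sketch of the steps above: building the dictionary, counting transitions into \( C \), column-normalizing to get \( M \), and jittering the prediction with a random diagonal matrix before taking the argmax. The toy corpus, the seed, and the tiny xorshift generator (a stand-in for a real RNG crate such as `rand`) are illustrative assumptions, not the author's actual implementation.

```rust
use std::collections::HashMap;

/// Tiny xorshift PRNG so the sketch needs only the standard library
/// (an illustrative stand-in for a real RNG crate such as `rand`).
struct XorShift(u64);

impl XorShift {
    /// Returns a pseudo-random f64 in [0, 1).
    fn next_f64(&mut self) -> f64 {
        self.0 ^= self.0 << 13;
        self.0 ^= self.0 >> 7;
        self.0 ^= self.0 << 17;
        (self.0 >> 11) as f64 / (1u64 << 53) as f64
    }
}

fn main() {
    // Toy training corpus (illustrative only).
    let text = "the orange is a fruit the orange is round the fruit is sweet";
    let words: Vec<&str> = text.split_whitespace().collect();

    // Dictionary: map each distinct word to an index.
    let mut index: HashMap<&str, usize> = HashMap::new();
    for &w in &words {
        let next_id = index.len();
        index.entry(w).or_insert(next_id);
    }
    let n = index.len();

    // Count matrix C: c[i][j] counts how often word i follows word j.
    let mut c = vec![vec![0.0_f64; n]; n];
    for pair in words.windows(2) {
        let (from, to) = (index[pair[0]], index[pair[1]]);
        c[to][from] += 1.0;
    }

    // M = C D: divide each column by its sum so columns become distributions.
    let mut m = c.clone();
    for col in 0..n {
        let sum: f64 = (0..n).map(|row| c[row][col]).sum();
        if sum > 0.0 {
            for row in 0..n {
                m[row][col] /= sum;
            }
        }
    }

    // Prediction: s is one-hot at the last word, so M s is just that column of M.
    // "is" has three successors here (a, round, sweet), each with probability 1/3.
    let last = index["is"];
    let dist: Vec<f64> = (0..n).map(|row| m[row][last]).collect();

    // Jitter with a random diagonal R and take the argmax of R (M s),
    // so repeated generation does not settle into a fixed loop.
    let mut rng = XorShift(0x9E37_79B9_7F4A_7C15);
    let mut best = 0;
    let mut best_score = -1.0;
    for (row, &p) in dist.iter().enumerate() {
        let score = p * rng.next_f64();
        if score > best_score {
            best = row;
            best_score = score;
        }
    }

    // Invert the dictionary to print the chosen word.
    let mut vocab = vec![""; n];
    for (&w, &i) in &index {
        vocab[i] = w;
    }
    println!("after \"is\", predicted next word: {}", vocab[best]);
}
```

Because \( \vec{s} \) is one-hot, \( M \vec{s} \) reduces to reading off a single column of \( M \), so the sketch never materializes a full matrix-vector product; likewise, multiplying by the diagonal \( R \) is just an entry-wise scaling of that column.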
---

## Summary

This article revisits Markov chains as transparent probabilistic models underpinning early language modeling. It walks readers through their mathematical formulation using transition matrices and probability vectors, and shows how the same machinery drives a simple text auto-completer, from counting word transitions to injecting randomness during generation.