Notes on Computer-Generated Text

Claude Shannon

My prediction (remember, you heard it here first) is that the name of Claude Elwood Shannon (1916- ) will be a household word 250 years from now. Historians of the future will hail Shannon as the father of the Information Age, much as we now see Newton as the author of the Scientific Revolution, some 250 years previously. Both Shannon and Newton created a solid mathematical foundation for the key technologies of their age (Shannon for information and communication theory, Newton for mechanical physics), thus turning them into true sciences and unleashing vast changes in knowledge, industry, and society.

Shannon's master's thesis (MIT, 1937) investigated the use of electrical circuits to model logical statements. Shannon was one of the few people of his day familiar with both electrical engineering and the mathematical logic of Boolean algebra. He showed how electrical switches could be used to carry out calculations and detailed instructional procedures, thus foreshadowing the electronic computer.

After MIT, Shannon joined Bell Telephone Laboratories in 1941 as a research mathematician. In 1948 he published the paper "A Mathematical Theory of Communication"; in 1949, with the strong urging of his colleagues, it appeared in book form as The Mathematical Theory of Communication, with Warren Weaver. This work (probably inspired by his wartime work in cryptography) gave the first mathematical underpinning to the study of communications.

One of Shannon's major discoveries is that information has a very precise relationship to entropy. Entropy - a measure of the disorder of a system, and before Shannon used only in classical thermodynamics - can be seen as a lack of information about a system. This relationship implies that disordered systems can be "cleaned up" by using information about the system. It also shows that information about the world is never "free"; every bit of information gathered causes a tiny but definite disordering of the system under investigation.
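To make the idea of "lack of information" a little more concrete, here is a minimal sketch of Shannon's entropy measure, the sum of p * log2(1/p) over the observed symbol frequencies. The function name shannon_entropy is my own invention for illustration, and it only looks at single-symbol frequencies, so it is a rough estimate rather than a full treatment.

```python
import math
from collections import Counter

def shannon_entropy(symbols):
    """Entropy in bits per symbol, estimated from observed symbol frequencies."""
    counts = Counter(symbols)
    total = sum(counts.values())
    # Each symbol with probability p contributes p * log2(1/p) bits.
    return sum((n / total) * math.log2(total / n) for n in counts.values())

print(shannon_entropy("aaaaaaaa"))  # one symbol only: nothing left to learn
print(shannon_entropy("abababab"))  # two equally likely symbols
print(shannon_entropy("abcdefgh"))  # eight equally likely symbols
```

The more evenly the symbols are spread, the higher the entropy, and the more information each new symbol carries.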

Shannon also showed that much of any "real world" communication is taken up with redundancy. (Shall I repeat that in different words?) He analyzed a huge number of communications, from code transmissions to telephone conversations to James Joyce novels, in order to understand the relationship between a) the message intended for transmission and b) the redundant information tacked on to ensure that the message is understood correctly. Redundancy is crucial for clear communication in a "noisy" environment, but when noise is low the redundancy can be stripped out and the message highly compressed (as PKZIP users can attest).
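The compression point is easy to try for yourself. The sketch below uses Python's standard zlib module (which implements DEFLATE, the same family of compression that ZIP tools use) to compare a very repetitive message with random bytes of the same length. The sample text and the helper name compression_ratio are assumptions made for this example.

```python
import os
import zlib

def compression_ratio(data):
    """Compressed size divided by original size; smaller means more redundancy removed."""
    return len(zlib.compress(data, 9)) / len(data)

# A highly redundant message: the same sentence repeated many times.
redundant = b"the quick brown fox jumps over the lazy dog. " * 200
# Random bytes have essentially no redundancy to strip out.
random_bytes = os.urandom(len(redundant))

print(compression_ratio(redundant))
print(compression_ratio(random_bytes))
```

The repetitive message shrinks to a small fraction of its size, while the random bytes barely shrink at all (they may even grow slightly).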

In this connection, Shannon examined the frequency of word correlations in the English language. Pairs of words which often appear together (for instance, "gutless" and "wonder") show a higher degree of redundancy than less common pairs (perhaps "elegant" and "quicksand"). Shannon showed that a randomly generated string of words could sound remarkably like meaningful English, so long as each word had a high correlation with the word before it. The resemblance to English is even greater if the nonsense string is generated using word triplets, rather than word pairs. The Shannonizer uses this effect to analyze texts and mimic their style using word pairs.

Not surprisingly, perhaps, there's a fair amount of material about Claude Shannon on the Web. The Shannon links cover the gamut from the serious to the seriously silly. You'll learn that Shannon's grandfather was reputed to have invented the washing machine, and that Shannon himself invented motorized pogo sticks and chess-playing robots. A lifelong juggler, Shannon even devised a unicycle designed for riding while juggling, not to mention a tiny clockwork circus where three clowns juggle tiny clubs, balls and rings simultaneously... more grist for the Shannon legend in the 22nd century, no doubt.

Word pairs

The Shannonizer uses a very simple trick to mimic different styles of English: it simply analyzes the frequency with which words follow each other in pairs. Take any word -- let's say the word "no." When Raymond Chandler is writing, the following word is more likely to be "sleep" or "fear"; when Miss Manners is writing, it's more likely to be something like "doubt" or "regard." The Shannonizer builds up a table of probabilities for all the word-pairs in a text to be mimicked. It can then generate text which is random and nonsensical, but maintains the same probability between subsequent pairs of words. (Actually, the Shannonizer combines two different styles; the style of the "editor," and the style of the document that you've chosen to Shannonize.) The result can sometimes be very striking, even if silly.
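For the curious, the word-pair trick can be sketched in a few lines of Python. This is not the Shannonizer's actual code, just an illustration of the idea under simple assumptions: words are split on whitespace, the names build_pair_table and generate are made up for this example, and the mixing of "editor" and document styles is left out.

```python
import random
from collections import defaultdict

def build_pair_table(text):
    """Map each word to the list of words that follow it in the text.

    Repeated followers are kept, so choosing uniformly from the list
    reproduces the observed word-pair frequencies.
    """
    words = text.split()
    table = defaultdict(list)
    for current, following in zip(words, words[1:]):
        table[current].append(following)
    return table

def generate(table, start, length, rng=random):
    """Walk the table, picking each next word at random from the followers."""
    word = start
    output = [word]
    for _ in range(length - 1):
        followers = table.get(word)
        if not followers:  # dead end: the word was only seen at the very end
            break
        word = rng.choice(followers)
        output.append(word)
    return " ".join(output)

sample = "no doubt you have no fear and no regard for no sleep at all"
table = build_pair_table(sample)
print(table["no"])  # every word seen right after "no"
print(generate(table, "no", 12, random.Random(0)))
```

Storing followers as a plain list with repeats is a lazy way of keeping the probabilities: a follower that appears three times is three times as likely to be picked.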

Claude Shannon actually achieved an effect like this without a computer. Taking a novel, he picked a first word at random, then found the next place in the novel where this word appeared. He added the following word to the generated text, then repeated the process until he had a fair-sized sentence such as:

The head and in frontal attack on an English writer that the character of this point is therefore another method for the letters that the time of who ever told the problem for an unexpected.

(This is the kind of geeky fun that Shannon seemed to have all the time. He got his friends to join in all sorts of odd, semi-surreal word games like this and somehow turned the results into ground-breaking research. He also managed to retire at the age of fifty. It's enough to make you want to take up juggling.)
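Shannon's book-in-hand procedure needs no probability table at all, and it translates almost directly into code. The sketch below follows the steps as described above: start at a random word, scan forward to the next place that word appears, and write down the word after it. The function names and the tiny stand-in "novel" are hypothetical, and wrapping around to the beginning when the scan runs off the end is my own assumption.

```python
import random

def next_word_by_hand(words, pos):
    """Find the next occurrence of words[pos] and return the index just after it.

    The scan wraps around to the start of the text, an assumption made
    here so that the search always succeeds.
    """
    n = len(words)
    target = words[pos]
    for step in range(1, n + 1):
        i = (pos + step) % n
        if words[i] == target:
            return (i + 1) % n

def shannon_by_hand(words, length, rng=random):
    """Build a word string the pencil-and-paper way, starting at a random word."""
    pos = rng.randrange(len(words))
    output = [words[pos]]
    while len(output) < length:
        pos = next_word_by_hand(words, pos)
        output.append(words[pos])
    return " ".join(output)

novel = ("the head of the class and the character of the problem "
         "is the time of the attack on the letters").split()
print(shannon_by_hand(novel, 15, random.Random(0)))
```

With only one random choice (the starting word), a short text like this one soon starts going in circles; a real novel gives the walk far more room to wander. Flipping to a random page before each search would be an easy variation.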

Presumably, the mimicry would be even more accurate if we analyzed word triplets instead of just word pairs. This would take up a good deal more computing time. I also suspect that such a "third-order" approximation might make the Shannonizer a little too accurate to be amusing!
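Here is a rough way to see both effects at once. The sketch builds a follower table for any amount of context (one word for pairs, two words for triplets) and compares how many contexts have to be stored with how much choice each context leaves. The names build_table and average_choices and the toy sentence are assumptions for illustration only.

```python
from collections import defaultdict

def build_table(words, order):
    """Map each run of `order` consecutive words to the words seen right after it."""
    table = defaultdict(list)
    for i in range(len(words) - order):
        context = tuple(words[i:i + order])
        table[context].append(words[i + order])
    return table

def average_choices(table):
    """Mean number of distinct followers per context; lower means less freedom."""
    return sum(len(set(f)) for f in table.values()) / len(table)

words = ("the cat sat on the mat and the dog sat on the rug "
         "and the cat saw the dog on the mat").split()
pairs = build_table(words, 1)     # word pairs: one word of context
triplets = build_table(words, 2)  # word triplets: two words of context

print(len(pairs), average_choices(pairs))
print(len(triplets), average_choices(triplets))
```

Even on this toy sentence the triplet table has more contexts to store and fewer choices per context, which is the trade-off in miniature: more bookkeeping, and output that sticks closer to the original text.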

But is that computer actually saying anything?

This topic to be developed in the very near future....


This page is a cached copy on a different server than the original site (http://www.nightgarden.com/infosci.htm). The original site should be used at all times except when unavailable. No copyright notice was included in the original document when this was duplicated, but I will respect the original author if they request that I remove this copy. The maintainer of the locally cached copy can be reached here.