Don Knuth email collection now available for research
Stanford Libraries’ Department of Special Collections and University Archives is pleased to announce that the email collection of Don Knuth has been processed. This collection consists of email from January 1999 through January 2019. Users can preview Knuth’s email corpus via Stanford’s ePADD discovery website. The full text of the emails is accessible only on a workstation in the Field Reading Room, which is open to all members of the general public.
Content of the collection
Donald Ervin Knuth, professor emeritus at Stanford University, is a computer scientist whose work established the analysis of algorithms as an academic field. He is well known for his influential multi-volume work The Art of Computer Programming (TAOCP); the computer typesetting system TeX, its companion Metafont, and the typeface Computer Modern; the WEB and CWEB literate programming systems; and the MIX/MMIX educational computer models.
Knuth officially eschews email. As he describes on his faculty website, “Email is a wonderful thing for people whose role in life is to be on top of things. But not for me; my role is to be on the bottom of things.” His secretaries (including Phyllis Winkler and Maggie McLoughlin) long monitored his email, printed out important messages, and conveyed his handwritten responses. However, this collection represents email sent and received by Knuth himself over the past 20 years, sometimes using his secretary’s email account.
The emails in this collection are a survey of Knuth’s wide-ranging passions, from the expected (mathematical puzzles) to the more esoteric: Sanskrit poems, typesetting non-Latin languages, and music composition (his multimedia composition for organ, Fantasia Apocalyptica, premièred in 2018). His correspondence reflects his detailed perfectionism (he offers a monetary reward to anyone who finds an error in one of his books), but the joy he takes in his work is palpable on a much larger scale. In one email, he writes:
“I have always been attracted to computer science because it involves beautiful patterns, rather like the way dancers enjoy choreography, and because questions such as "What can be computed efficiently?" are profoundly interesting and challenging...Alas, people these days rarely measure a computer scientist by standards of beauty and interest; they measure us by dollars or by applications rather than by contributions to knowledge, even though contributions to knowledge are the necessary ingredient to make previously unthinkable applications possible.”
(Email message to Richard Bright, 22 Nov. 2009, Knuth (Donald E.) Papers, SC0097, message id 7fbf2f9e22c2fb9468a98a549e06ea55cc9688b78a52188accd584be78d8d068)
Knuth’s email corpus has also helped us improve ePADD, a free and open source software platform developed by Stanford Libraries’ Department of Special Collections and University Archives to process, preserve, and deliver historical email archives. The ePADD project team has previously written about fixing problems uncovered while processing Knuth’s email.
Knuth operates multiple email accounts, one using Gmail and another using rmail, the default email client for users of GNU Emacs. The Gmail messages came to Stanford Libraries in mbox format, a common email format that stores email messages as text files. ePADD is built to import and read mbox-formatted email, which made ingesting Knuth’s Gmail account easy.
An email message in mbox format:
An email message in rmail format:
However, we found that the rmail files could not be ingested properly into our software. Given their basic similarity to mbox files—both rmail and mbox files are essentially stored as plaintext—we first tried to import the rmail as-is. When that ingest failed, we tried a number of different conversions from rmail to mbox. We were eventually able to load the converted rmail files into ePADD, but ePADD still failed to interpret the date of each message.
A sad fate for one group of rmail messages.
Emails with no dates would make the corpus useless, so we asked ePADD Technical Advisor Sudheendra Hangal and Software Developer Chinmay Narayan for an assessment. While investigating the issue, Chinmay fixed a parsing bug, and with that fix both rmail and mbox files could be ingested more easily. Unfortunately, the messages were still left without dates.
Looking closer at one of Knuth’s rmail files reveals an existing date line in each message, appended automatically by Emacs. This date line is very similar to the date line for messages in an mbox file, but for one crucial difference—date lines in mbox files begin with the string “Date:”, while date lines in rmail files begin with information about the sender instead. ePADD didn’t know to interpret those lines as date information, because the information was not flagged in the way it expected.
Date information as stored in the mbox format:
Date information as stored in the rmail format:
Once Chinmay discovered this mismatch, he wrote a small script to read the rmail date line, convert it into the mbox “Date:” format, and append the new line to the message header. (This fix is not incorporated into ePADD, because it is specific to the way Knuth used Emacs.)
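The conversion described above can be illustrated with a short Python sketch. This is not Chinmay's script, which is not reproduced here. It assumes the sender-first line has the shape "From <sender> <weekday month day time year>", which may not match Knuth's files exactly, and the function names are hypothetical.

```python
import re
from datetime import datetime
from email.utils import format_datetime

# ASSUMPTION: the sender-first line looks like
# "From knuth@example.org Mon Jan  4 10:15:00 2010".
SENDER_LINE = re.compile(
    r"^From (\S+) +(\w{3} \w{3} [ \d]\d \d{2}:\d{2}:\d{2} \d{4})\s*$"
)


def date_header_from_sender_line(line):
    """Build a 'Date:' header from a sender-first line, or return None."""
    match = SENDER_LINE.match(line)
    if match is None:
        return None
    parsed = datetime.strptime(match.group(2), "%a %b %d %H:%M:%S %Y")
    # The assumed line carries no time zone, so the datetime is naive and
    # format_datetime marks the zone as unknown ("-0000").
    return "Date: " + format_datetime(parsed)


def add_date_header(message_lines):
    """Return a copy of the message with a 'Date:' header appended to the
    header block, unless one already exists or no date can be derived.

    message_lines: lines of one message without trailing newlines, with the
    sender-first line as the first element (hypothetical input layout).
    """
    lines = list(message_lines)
    if not lines:
        return lines
    header = date_header_from_sender_line(lines[0])
    # The header block ends at the first blank line, if there is one.
    end = lines.index("") if "" in lines else len(lines)
    has_date = any(l.lower().startswith("date:") for l in lines[1:end])
    if header is None or has_date:
        return lines
    return lines[:end] + [header] + lines[end:]


if __name__ == "__main__":
    message = [
        "From knuth@example.org Mon Jan  4 10:15:00 2010",
        "Subject: example",
        "",
        "Body text.",
    ]
    print("\n".join(add_date_header(message)))
```

Leaving messages that already have a "Date:" header untouched keeps the sketch safe to run over a file that mixes both kinds of message.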
A collection for research and development
The Don Knuth email collection is invaluable to researchers interested in the history of computer science—but it has also been invaluable to the Stanford ePADD team. Widespread testing and use cases make open source software stronger. If you are using ePADD and you run into problems ingesting email, please file an issue at the ePADD GitHub issue tracker.
For more (paper) correspondence to and from Donald Knuth, see the Donald E. Knuth papers, held in the Stanford University Archives. You may also be interested in the 1970s and 1980s backups of the Stanford Artificial Intelligence Laboratory, where you can browse files related to Don Knuth and Phyllis Winkler (Knuth’s secretary at the time); more information on that archive is available in this talk by Bruce Baumgart.