Experimenting with ePADD: finding strategies for screening and processing email

August 2, 2018
Sally DeBauche

I’m excited to make my debut post in my new role as the Digital Archivist for Special Collections!  Since I’m the newest member of team ePADD I thought it would be only fitting to write my inaugural post on the subject of email.  I recently worked with the email contained in the Robert Creeley papers and it gave me the opportunity to experiment with the ePADD software and find some effective strategies for processing email.  Working on this project also gave me a chance to think a little more deeply about how we process email and how we document the decisions that we make during processing.  As the practice of collecting, processing, and providing access to email archives expands, archivists will need to develop approaches for processing large email collections efficiently and establish new policies for documenting that work. 

The main goal when processing a collection of email is usually to identify and restrict messages that contain sensitive content.  This could include social security numbers, credit card numbers, and other personal information related to either the email account owner or their correspondents.  Although the goal is pretty straightforward, the task of finding that information within a large collection of email and tracking those messages is a challenge.  First, I wanted to make sure that I had a reliable way of keeping track of the messages that I would be restricting.  ePADD allows the user to attach labels to messages, either individually or as a bulk action, and these labels can be used to note restrictions.  The labels can also be customized, so I created several new labels for my project including “do not publish” and “restrict for 80 years.”

I also created a spreadsheet to track the messages that I chose to restrict.  I simply recorded the message ID generated by ePADD and a brief description of my reason for restricting it.  This spreadsheet can serve as a log of the actions that I took during processing, providing a comprehensive list of all restricted messages, the reasons for restriction, and the length of restriction.  The ability to export labels and annotations directly from ePADD rather than creating a separate document would allow archivists to easily document the chain of custody of their email collections.  In fact, the latest version of ePADD has been released and includes a new function to export labels, so the next time I process an email collection, I can skip the spreadsheet!
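
For anyone who would rather keep this kind of log as a plain CSV file than in a spreadsheet application, here is a minimal Python sketch of the idea. The column names, message IDs, and reasons below are made up for illustration; they are not ePADD's actual ID format or export layout.

```python
import csv
import io

# Hypothetical column names; adjust them to local documentation practice.
FIELDS = ["message_id", "label", "reason", "restriction_years"]


def write_restriction_log(entries, handle):
    """Write one row per restricted message to an open text handle."""
    writer = csv.DictWriter(handle, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(entries)


def read_restriction_log(handle):
    """Read the log back as a list of dictionaries (all values are strings)."""
    return list(csv.DictReader(handle))


# Made-up example entries, not taken from any real collection.
entries = [
    {"message_id": "msg-0001", "label": "do not publish",
     "reason": "contains a social security number", "restriction_years": ""},
    {"message_id": "msg-0002", "label": "restrict for 80 years",
     "reason": "names job applicants", "restriction_years": "80"},
]

buffer = io.StringIO()  # stands in for a file opened with newline=""
write_restriction_log(entries, buffer)
buffer.seek(0)
rows = read_restriction_log(buffer)
```

In practice the in-memory buffer would be replaced by a real file, so that the log can be kept alongside the other processing documentation for the collection.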

My primary means of identifying sensitive content in the email collection was using ePADD's lexicons, which are predefined (but customizable) sets of terms organized under themes such as "personal," "medical," and "academic."  The lexicons are easy to edit; terms can be added or deleted as necessary, and I chose to add several terms to the lexicons that were specific to the collection I was processing.  Using lexicon analysis I was able to identify and restrict messages that fell into these categories.  One technique that I found especially helpful was to use the “test” function, which allows the user to limit their search to messages related to specific terms within a chosen lexicon.  Since the lexicons cast a wide net and include many terms that did not relate to the types of messages I was looking for, searching the collection in this more targeted way saved significant time and clicking. 

Even though I felt confident in the job I had done in identifying and restricting sensitive messages, when it came to the documents attached to messages I was not so sure.  ePADD uses the Apache Tika toolkit to index text in files and allow the lexicons to search them.  Although Apache Tika recognizes most of the common file types, I wanted to try another approach for screening email attachments to add a layer of redundancy to my review process, just to be safe.  My colleagues in the Born Digital Forensics Lab have been experimenting with Bulk Extractor, a computer forensics tool that scans files for specific terms, to use when processing other types of born digital materials, and I thought it might have a useful application for processing email as well.  To test Bulk Extractor on the email attachments, I first needed to export the email attachments from ePADD.  This is a simple process, because ePADD allows the user to export just the subset of files not recognized by Apache Tika.  Once I had exported the files that I wanted to review, I ran Bulk Extractor over them.  When I reviewed the reports that Bulk Extractor generated, I did find one email attachment that contained a social security number that I had missed when I performed my initial lexicon analysis.  This extra step certainly isn’t necessary to complete a thorough screening of an email collection, but it may be a good practice for collections that have a large number of files that are not recognized by Apache Tika.
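
To give a sense of what this kind of second pass looks for, here is a simplified Python sketch that scans a folder of exported attachments for strings shaped like social security numbers. It is not Bulk Extractor and does not reproduce how Bulk Extractor works; it is a plain regular-expression search that only sees text readable as UTF-8, and the function names and the pattern are my own assumptions for illustration. A match is only a candidate that still needs human review.

```python
import re
from pathlib import Path

# Assumed pattern: three digits, two digits, four digits, separated by hyphens,
# and not embedded in a longer run of digits.
SSN_PATTERN = re.compile(r"(?<!\d)\d{3}-\d{2}-\d{4}(?!\d)")


def find_ssn_candidates(text):
    """Return every substring of text that is shaped like an SSN."""
    return SSN_PATTERN.findall(text)


def scan_directory(root):
    """Map each file under root (by relative path) to its SSN-shaped matches.

    Files with no matches are left out of the result.
    """
    hits = {}
    for path in sorted(Path(root).rglob("*")):
        if path.is_file():
            # Undecodable bytes are skipped, so binary formats are mostly missed.
            text = path.read_text(encoding="utf-8", errors="ignore")
            matches = find_ssn_candidates(text)
            if matches:
                hits[str(path.relative_to(root))] = matches
    return hits
```

Calling `scan_directory` on the folder of exported attachments returns a dictionary of file names and candidate matches that can be reviewed by hand, much as I reviewed the reports from Bulk Extractor.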

While there are common types of information that we tend to restrict and tools geared towards identifying them, I also found sensitive information in some unexpected places.  When I began processing this email collection, I largely ignored messages and attachments sent through listservs, reasoning that they likely would not include any personal information.  This assumption held true until I stumbled upon a message sent through the Brown University English department’s listserv that included an attachment of the department’s meeting minutes where specific job applicants were named and their candidacy discussed in detail.  Since information related to job applications is typically considered private, I decided that these messages and attachments needed to be restricted for 80 years after the date of the message, by which time, hopefully, those candidates would be retired.  The lesson I learned from this case is that not all listservs are created equal, and ones targeted at smaller communities where members participate actively require greater scrutiny. 
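
The arithmetic behind a restriction like this is simple, but it is easy to get wrong at the edges, so here is a small Python sketch of it. The function names are hypothetical, and the choice to roll a February 29 message date forward to March 1 when the end year is not a leap year is my own assumption, not a documented policy.

```python
from datetime import date


def restriction_end(message_date, years=80):
    """Return the first date on which the restriction no longer applies."""
    try:
        return message_date.replace(year=message_date.year + years)
    except ValueError:
        # The message is dated February 29 and the end year is not a leap
        # year; assume the restriction lifts on March 1 instead.
        return date(message_date.year + years, 3, 1)


def is_restricted(message_date, today, years=80):
    """True while today falls before the end of the restriction period."""
    return today < restriction_end(message_date, years)
```

For example, a message dated May 14, 2001 would remain restricted until May 14, 2081 under this rule.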

Over the course of processing the Robert Creeley email collection, I found several strategies for reliably identifying sensitive information.  I also found a simple way to track the messages that I restricted and my reasons for doing so, which can serve as a document of the actions I took to process this collection.  I hope to refine these practices as I work to make more email collections available to researchers.  I also hope to contribute to a broader discussion on defining best practices for archiving email so that we can do so efficiently, ethically, and transparently.