Everything is data, except when it isn’t
In April 2017, I had a debate with David McClure and Karl Grossner, both of whom were Stanford colleagues at the time. They argued that everything is data. I vehemently opposed the notion.
For years, I had been working on digital humanities projects in which we struggled with the process of creating structured, computational data out of rich and complex source materials: letters, journal entries, trip logs, etc. The challenge we faced was how to choose what becomes data, and those decisions were intertwined with the affordances of the computational tools we were using. First, one has to get historical, incomplete, sometimes ambiguous information from sources into a regular format around which one can build a model. That is the foundational intellectual work that is too often dismissed as ‘cleaning,’ as if it were a rote task rather than part of both the analysis and the argument, which it is. Second, one has to use computational tools in ways that take the uncertainties and absences into account. Those experiences taught me that converting rich sources to data is always, inevitably reductive, and that while tracking what is lost through that process is challenging, knowing what is lost is critical to understanding what can and cannot be learned from the extracted and chosen data.
David and Karl were focused on a momentous technological change in what can be computational. 2017 was the same year that Miroslav Kubat declared, “Machine Learning has come of age.” My colleagues were expressing the justifiable excitement that converting digital images back to their numeric form makes them analyzable and computational in new ways, and that word embeddings (mathematical representations of words in vector space) would amplify the potential of natural language processing and dramatically change how we interact computationally with human language. They were right. Every book, pamphlet, manuscript, piano roll, and photograph held in the library can be computed against, based not merely on the metadata but on patterns recognizable within the object and across collections of objects.
In retrospect, it is clear that David, Karl, and I were not in opposition so much as we were having two different conversations. But we need to bring those different perspectives on what data is, how it comes to be, and what we can learn from it into alignment, into one conversation, if we are to implement AI in the library in ways that reflect the ethos of the library.
While the three of us were having that conversation, the Always Already Computational: Collections as Data team was making the case that we need to make our cultural heritage collections computational to support changing research practices. A subsequent grant award, currently in its fourth year, has supported twelve projects, each of which demonstrates the value of making collections computational. The question that remains is how to scale from individual projects to integral services provided by the library. While learning about the technology and becoming proficient in it is an important part of that move for libraries, it is not, in itself, the answer. Libraries already have the most critical knowledge and skills that integrating AI requires. We have to recognize that and acknowledge our responsibility to apply that expertise.
Fei-Fei Li, co-director of the Human-Centered AI Institute at Stanford, has often pointed out the interdisciplinary origins of AI, emerging out of and in concert with work in psychology, neuroscience, cognitive science, and statistics. In her popular 2015 TED Talk, “How we’re teaching computers to understand pictures,” the ‘understanding’ in the title is approached as a mechanical task based on replicating human vision. “Vision,” she says in the talk, “begins with the eyes, but truly takes place in the brain.” One of the research projects she cited in that talk eventually published this result: “We show that socioeconomic attributes such as income, race, education, and voting patterns can be inferred from cars detected in Google Street View images using deep learning. Our model works by discovering associations between cars and people. For example, if the number of sedans in a city is higher than the number of pickup trucks, that city is likely to vote for a Democrat in the next presidential election (88% chance); if not, then the city is likely to vote for a Republican (82% chance).” Computer vision and deep learning were being put to use right away on predictive classification tasks with significant and potentially harmful implications for individuals and society. The critical response came quickly, too, mostly addressing the issues of privacy, surveillance, and monopolistic control of information collection and use. The critiques were, in large part, in the domain of policy and governance.
But there is an even more fundamental problem in the underlying assumption that vision, seeing, and understanding are mechanical processes, and it relates to the work of libraries. First, current research in computer vision continues to make the leap from vision to understanding pictures without any discussion of meaning and understanding as contested social constructs. Here I am not making a theoretical argument about social constructionism. I have in mind the very practical work that happens every day in our metadata department. Metadata librarians need to make decisions about how to describe things. Those decisions are partly rule-based and partly interpretive.
Second, a leap is made from what it means for us to see something to what it means for a machine to process a digital scan of a scene, without adequate attention to the ways in which those things are different. In the library, when we describe an image, we start with questions about context. Witnessing a scene is not the same as deciphering a photograph or video of a scene. That image, even if it was captured milliseconds ago by a self-driving car, has a history, a means of production, and human intention behind its creation.
Libraries and their close companions, humanities scholars, have been thinking about these problems for quite some time. And it is not just hand-wringing. There are theories, practices, and methods that have developed around the challenges of organizing, categorizing, and classifying, and they give careful attention to the way that work both helps us understand what we see and limits what we see. The same theories, practices, and methods can be the foundation of how we manage collections as data for computational research.
The recently published article “The agency of computer vision models as optical instruments” by Thomas Smits and Melvin Wevers provides critical insight into the data problem behind AI. They demonstrate that the use of data in the computer vision benchmarks is decidedly unscientific. They also make the point that comparing aggregate accuracy rates of models to ‘human’ performance introduces a false dichotomy between the agency of computer vision models and human observers. The work of Smits and Wevers is evidence that the humanities should not be relegated to critique of the ethical implications of AI after the fact, but can benefit the field of AI in essential ways by shaping what it is. Smits and Wevers are both trained in the discipline of History and their work is built upon work done by librarians and archivists.
In contradiction to what I wrote above, the books, pamphlets, manuscripts, piano rolls, and photographs held in the library cannot be computed against unless they are digital or digitized and then transformed into data. Making those transformations is exceedingly consequential. It is a matter of deciding what counts and what does not count, who is in and who is out. Managing, describing, and making those data accessible for research will be as essential to future applications of AI as our current collections are to humanities research.
When I say, in my title, “Everything is data, except when it isn’t,” I’m saying that, try as we might, we cannot reduce our cultural experience, our scientific discoveries, or our lived experience to a set of data points. But I am also saying, to libraries in particular, that even “except when it isn’t” (that is, even when we keep in mind that we cannot capture everything as data), it is nonetheless critical that we embrace the fact that just about everything in our holdings can be made computational, and that doing that work will benefit libraries while also advancing the creation of new knowledge. When we convert our library collections to machine-readable data, the result will be something very different from the original, and many decisions have to be accounted for along the way.
This is based on the prelude to a talk given as part of the “Curating the Campus” series organized by CU Boulder University Libraries, Research Computing, and the CU Museum of Natural History to explore how digital cultural heritage collections can contribute to the campus’s research and teaching mission and the community at large. The full talk was recorded and is available here: https://youtu.be/hxyDnIlQOao