Some of the latest work underway in Digital Library Systems and Services involves adding digital collections to SearchWorks. Last week saw the addition of five new collections to SearchWorks, all created and deposited to the Stanford Digital Repository using the Self-Deposit web application.
Of the five, we’re highlighting Preserving Virtual Worlds, a collection produced by curator Henry Lowood and a team of collaborators in a multi-institution project funded by the Library of Congress. Original software, gameplay samples, technical documentation, web sites, and other contextual information for games like SimCity, DOOM, and Star Raiders are archived for the ages. Henry’s blog announcement sums up the project and collection nicely.
Today marks a major milestone in Stanford University Libraries' ability to provide easy and seamless access to digital collections. As of today, digital collections will begin appearing in SearchWorks, the Libraries' discovery interface. This means that collections can be discovered in the course of searching and browsing through the totality of Stanford's library collection.
We've been examining whether or not to restore stopwords to the SearchWorks index. Stopwords are words ignored by a search engine when matching queries to results. Any list of terms can be a stopword list; most often the stopwords comprise the most commonly occurring words in a language, occasionally limited to certain grammatical functions (articles and prepositions, say, as opposed to verbs and nouns).
The original usage of stopwords in search engines was to improve index performance (query matching time and disk usage) without degrading result relevancy (and possibly improving it!). It is common practice for search engines to employ stopwords; in fact Solr (http://lucene.apache.org/solr), the search engine behind SearchWorks, has English stopwords turned on as the default setting.
In our implementation of SearchWorks, there was no compelling reason to change most of the default Solr settings; thus, since SearchWorks's inception we have been using the following stopword list: a, an, and, are, as, at, be, but, by, for, if, in, into, is, it, no, not, of, on, or, s, such, t, that, the, their, then, there, these, they, this, to, was, will, with.
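To make the effect concrete, here is a minimal Python sketch of stopword filtering using the list above. It is only an illustration: the tokenizer is a simple lowercase-and-split-on-non-alphanumerics stand-in, not Solr's actual analysis chain, and the function names are made up for this example.

```python
import re

# The stopword list quoted above (the one SearchWorks has been using).
STOPWORDS = frozenset(
    "a an and are as at be but by for if in into is it no not of on or s "
    "such t that the their then there these they this to was will with".split()
)


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens.

    Assumption: a simplified stand-in for a real search-engine analyzer.
    """
    return re.findall(r"[a-z0-9]+", text.lower())


def remove_stopwords(text):
    """Return the tokens of `text` that survive stopword removal."""
    return [token for token in tokenize(text) if token not in STOPWORDS]


if __name__ == "__main__":
    for query in ["The Art of War", "To be or not to be", "A is for Alibi"]:
        print(repr(query), "->", remove_stopwords(query))
```

With this list, "The Art of War" is reduced to the two terms "art" and "war", while "To be or not to be" is reduced to nothing at all, since every word in it is a stopword. Queries like the latter are the ones most affected by the choice to use stopwords.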
What follows is an analysis of how stopwords are currently affecting SearchWorks, and what might happen if we restore stopwords to SearchWorks, making every word significant for every search.
The (meta)data underneath SearchWorks is largely based on our MARC records from Symphony. MARC records are exported from Symphony, then slurped up by an application called SolrMarc, which transforms the MARC data into an index for the Solr search engine used by SearchWorks.
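As a rough picture of that transformation step, the Python sketch below flattens a MARC-like record into a flat document of the kind a search index stores. This is not SolrMarc's code or configuration: the record layout, the function name, and the output field names (`title_display`, `author_display`) are all hypothetical, chosen only for illustration. The one detail taken from MARC itself is that field 245 subfield a holds the title and field 100 subfield a holds the main author.

```python
def marc_to_index_doc(record):
    """Flatten a MARC-like record (a plain dict here) into a flat index document.

    Assumption: `record` looks like
    {"id": ..., "fields": [{"tag": "245", "subfields": {"a": "..."}}, ...]}.
    Real MARC records and SolrMarc's mappings are considerably richer.
    """

    def first_subfield(tag, code):
        # Return the first matching subfield value, or None if absent.
        for field in record.get("fields", []):
            if field["tag"] == tag and code in field["subfields"]:
                return field["subfields"][code]
        return None

    doc = {"id": record["id"]}
    title = first_subfield("245", "a")
    author = first_subfield("100", "a")
    if title is not None:
        doc["title_display"] = title    # hypothetical index field name
    if author is not None:
        doc["author_display"] = author  # hypothetical index field name
    return doc


if __name__ == "__main__":
    sample = {
        "id": "example-1",  # made-up identifier
        "fields": [
            {"tag": "100", "subfields": {"a": "Sunzi"}},
            {"tag": "245", "subfields": {"a": "The art of war"}},
        ],
    }
    print(marc_to_index_doc(sample))
```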
SolrMarc is open source software made available by Bob Haschart of the University of Virginia Libraries. SolrMarc is used by all(?) VuFind sites as well as most Blacklight sites built on MARC data (e.g. SearchWorks). SolrMarc has been great for us -- it gave us an enormous jump start for SearchWorks. Bob is also a great guy, and made me a "committer" almost immediately -- so I can make contributions to the open source code.
Open Source Software does best when there is a critical mass of developers: group wisdom rocks, as does sharing the work. To date, SolrMarc is very much Bob's project, despite a number of committers such as myself. There are some ... interesting ... practices as to how SolrMarc is organized and how it is tested. I've even contributed a bit to some of its squirreliness. Occasionally, changes to the SolrMarc codebase break the code I've written especially for Stanford.