Tutorials and examples
Common Crawl: Demonstration of using Python to process the Common Crawl dataset with the mrjob framework
Sample Python code for use with mrjob to count HTML tags in WARC files, analyze web servers in WAT files, or count words in WET/parsed text files.
Sample Java and Clojure code to read WARC files either directly on Amazon S3 or from local copies downloaded from S3.
Orientation to web archive data and formats available from Common Crawl with links to related utilities.
Sample Java code for use with Hadoop to count HTML tags in WARC files, analyze HTTP responses in WAT files, or count words in WET/parsed text files.
Workshop curriculum for the RESAW Web Archives as Scholarly Sources conference, including software setup instructions and exercises for processing and analyzing web archive-derived data using Elasticsearch, IPython Notebooks, Kibana, and Pig.
Research collaboration among JISC; the Institute of Historical Research; the British Library (UK Web Archive); and the Centre for Research in the Arts, Social Sciences, and Humanities (CRASSH), yielding multiple research projects and insights into web archive access requirements. Recent blog posts include links to the resulting research reports.
Case studies of arts and humanities research using the UK web domain dataset.
Research projects and publications based on Common Crawl datasets.
Google Scholar search results for variations of the term "web archive".
Examples of analyses performed using LGA files, including longitudinal link clustering and calculating and plotting the most popular images.
Examples of analyses performed using WANE files, including extraction of the top person, organization, and place named entities from two web archive collections.
Examples of analyses performed using WAT files, including web server geo-location, longitudinal term frequency counting, and link graph evolution.
Oxford Internet Institute: Big Data: Demonstrating the Value of the UK Web Domain Dataset for Social Science Research
Research collaboration between the Oxford Internet Institute and the British Library (UK Web Archive) to support longitudinal link analysis for social science research. Blog posts include links to the resulting research articles.
Blog focused on use of web archives for historical research.