Tutorials and examples


Common Crawl: Demonstration of using Python to process the Common Crawl dataset with the mrjob framework

Sample Python code for use with mrjob to count HTML tags in WARC files, analyze web servers in WAT files, or count words in WET/parsed text files.

Common Crawl: Java and Clojure examples for processing Common Crawl WARC files

Sample Java and Clojure code to read WARC files, either directly on Amazon S3 or after they have been copied locally from it.

Common Crawl: So you're ready to get started

Orientation to web archive data and formats available from Common Crawl with links to related utilities.

Common Crawl: WARC/WET/WAT examples and processing code for Java + Hadoop

Sample Java code for use with Hadoop to count HTML tags in WARC files, analyze HTTP responses in WAT files, or count words in WET/parsed text files.

Internet Archive: Archive Research Services Workshop

Workshop curriculum for the RESAW Web Archives as Scholarly Sources conference, including software setup instructions and exercises for processing and analysis of web archive-derived data using Elasticsearch, IPython Notebooks, Kibana, and Pig.

Internet Archive: Web Archive Analysis Workshop

Workshop curriculum featuring code samples and examples for extraction and analysis of entities, links, text, and other web archive-derived data using Hadoop, Pig, Giraph, Hive, and Mahout.
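
Several of the Common Crawl samples listed above count HTML tags across archived pages. As a rough illustration of that task only, the following minimal sketch counts opening tags per document and then sums the counts, mirroring a map step and a reduce step. It is not taken from any of the linked tutorials: the names TagCounter, count_tags, and merge_counts are hypothetical, it uses only the Python standard library rather than mrjob or a WARC reader, and it assumes the HTML payloads have already been extracted into in-memory strings.

```python
from collections import Counter
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Hypothetical helper: counts every opening tag in the HTML fed to it."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names before calling this hook.
        self.counts[tag] += 1


def count_tags(html_text):
    """Map step (assumed shape): return tag counts for one HTML document."""
    parser = TagCounter()
    parser.feed(html_text)
    parser.close()
    return parser.counts


def merge_counts(per_document_counts):
    """Reduce step (assumed shape): sum per-document counters into one total."""
    total = Counter()
    for counts in per_document_counts:
        total.update(counts)
    return total


if __name__ == "__main__":
    # Stand-in documents; real input would come from WARC response records.
    pages = [
        "<html><body><p>one</p><p>two</p></body></html>",
        "<html><body><a href='#'>link</a></body></html>",
    ]
    totals = merge_counts(count_tags(page) for page in pages)
    print(totals.most_common())
```

In a framework such as mrjob or Hadoop, the per-document step would typically run in the mapper and the summation in the reducer; the linked samples show how each project actually wires this up.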


Analytical Access to the Domain Dark Archive

Research collaboration between JISC; Institute of Historical Research; British Library (UK Web Archive); and the Centre for Research in the Arts, Social Sciences, and Humanities (CRASSH), yielding multiple research projects and insights into web archive access requirements. Recent blog posts include links to resulting research reports.

Big UK Domain Data for the Arts and Humanities (BUDDAH): Project case studies now available

Case studies of arts and humanities research using the UK web domain dataset.

Examples using Common Crawl Data

Research projects and publications based on Common Crawl datasets.

Google Scholar search results

Google Scholar search results list for variations of "web archive".

Internet Archive: LGA Example Use Cases

Examples of analyses performed using LGA files, including longitudinal link clustering and most popular image calculation and plotting.

Internet Archive: WANE Example Use Cases

Examples of analyses performed using WANE files, including extraction of top people, organization, and place named entities from two web archive collections.

Internet Archive: WAT Example Use Cases

Examples of analyses performed using WAT files, including web server geo-location, longitudinal term frequency counting, and link graph evolution.

Oxford Internet Institute: Big Data: Demonstrating the Value of the UK Web Domain Dataset for Social Science Research

Research collaboration between Oxford Internet Institute and British Library (UK Web Archive) to support longitudinal link analysis for social science research. Blog posts include links to resulting research articles.

Web Archives for Historians

Blog focused on use of web archives for historical research.