Guest blogger: Rohan Cherivirala (University Archives student employee)
(University Archives student employees Avi Udash on the left and blog post author Rohan Cherivirala on the right)
Hi! My name is Rohan Cherivirala. I am currently a freshman at Stanford University. As of now, I plan on double majoring in math and computer science. Since my second quarter at Stanford, I have been working in the Stanford University Archives and have enjoyed every single minute of it.
Over the past few weeks, I have been working on a program to convert old archive file formats into newer ones. More specifically, the goal of my project was to automatically bulk-convert old HTTrack archive files into WARC files without losing any information. This matters because archived websites stored as HTTrack files are no longer easily accessible. Generating a WARC file, which is a newer archive file format, makes this information accessible once more.
The first part of the project was adapting a program created by the National Library of Australia for our purposes, or as I like to call it, adding copious amounts of print statements while staring blankly at the screen. The main issue with this step was that the National Library of Australia had written the program to convert an older version of the HTTrack format, which left me with a plethora of compatibility issues to fix. Luckily, through the power of guessing and checking coupled with an extreme stroke of luck, I was able to solve these issues in a rather timely manner.
Having completed this portion, the next task was to create a program that could use the adapted converter to convert HTTrack files in bulk. For this step, I wanted to make a program that would work for a wide variety of use cases, so that it could be used in more than just this situation. Along the way, I encountered various obstacles, with one of the most difficult being how different arguments and options should be processed. After hours upon hours of googling, frustration, and bug-fixing, the program was finally completed. In addition to giving users a wide variety of controls over the output format and other features, the bulk conversion program runs separately from the file conversion program, so it can process HTTrack files with whichever suitable conversion program it is given.
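To give a rough idea of the shape of such a bulk wrapper, here is a minimal sketch in Python. It is an illustration rather than the actual program. It assumes that an HTTrack project folder can be recognized by the hts-cache directory HTTrack creates inside it, and that the converter is supplied as a command template with hypothetical {input} and {output} placeholders. The function names and options are made up for this example.

```python
import argparse
import shlex
import subprocess
from pathlib import Path


def find_httrack_archives(root):
    """Return folders under root that contain an hts-cache directory.

    Assumption: the presence of hts-cache marks an HTTrack project folder.
    """
    root = Path(root)
    return sorted(p.parent for p in root.rglob("hts-cache") if p.is_dir())


def build_command(template, archive, out_dir):
    """Fill the hypothetical {input}/{output} placeholders in the template."""
    out = Path(out_dir) / archive.name
    return [part.format(input=str(archive), output=str(out))
            for part in shlex.split(template)]


def main(argv=None):
    parser = argparse.ArgumentParser(description="Bulk-convert HTTrack archives.")
    parser.add_argument("root", help="folder to search for HTTrack archives")
    parser.add_argument("out_dir", help="folder that receives the converted output")
    parser.add_argument("--converter", required=True,
                        help="converter command with {input} and {output} placeholders")
    parser.add_argument("--dry-run", action="store_true",
                        help="print the commands instead of running them")
    args = parser.parse_args(argv)

    for archive in find_httrack_archives(args.root):
        cmd = build_command(args.converter, archive, args.out_dir)
        if args.dry_run:
            print(shlex.join(cmd))
        else:
            # The converter runs as a separate process, so it can be swapped out.
            subprocess.run(cmd, check=False)
```

Calling main(["archives", "warcs", "--converter", "my-converter {input} {output}", "--dry-run"]) would print one command per archive found, where my-converter stands in for whatever conversion program is available. Keeping the converter as an external command is what lets the wrapper stay independent of any one conversion tool.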
The final step was to actually run the program. Looking back, this was much easier said than done. The first obstacle that we faced was logging into a computer powerful enough to run the program, but we were eventually able to figure it out and get Java installed. From there, we ran into another compatibility issue: the program did not run properly on Macs, leading us to reluctantly switch to the dust-ridden Windows machine that stood beside it. Once everything was set up, we ran the program and eagerly awaited the myriad of errors that were sure to follow. But, to everyone’s surprise (especially my own), it actually worked on the first try!
Following the completion of this project, I quickly began working on a similar program, this time to convert Wget files into WARC files. These Wget files were similar to HTTrack files in that the information within them was not easily accessible, so converting them to WARC files would once more make it easy to access.
Since the structure of this project was very similar to the last one, I was able to reuse a large portion of the code, making this project go by a lot faster than the previous one. For this project, I was also working with my co-worker Avi Udash. Using the knowledge gained from completing the previous conversion program, we quickly created a new program to determine whether a specific folder represented a Wget archive and convert it to a WARC if it did. Together, we were able to complete this project, which was estimated to take a couple of weeks to a month, in two days!
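As an illustration of the detection step, here is a small Python sketch. It is not the check we actually wrote. It relies on one assumption: that a recursive Wget download saves its files in a subfolder named after the website's host (for example www.example.com), which is Wget's default behavior. The function name and the hostname pattern are made up for this example, and a real check would likely need more than this.

```python
import re
from pathlib import Path

# Rough pattern for a folder name that looks like a hostname, e.g. www.example.com.
HOSTNAME = re.compile(r"^([a-z0-9-]+\.)+[a-z]{2,}$", re.IGNORECASE)


def looks_like_wget_archive(folder):
    """Guess whether folder holds a Wget mirror.

    Heuristic (an assumption): it has a hostname-like subfolder
    that contains at least one file.
    """
    folder = Path(folder)
    if not folder.is_dir():
        return False
    for child in folder.iterdir():
        if child.is_dir() and HOSTNAME.match(child.name):
            if any(p.is_file() for p in child.rglob("*")):
                return True
    return False
```

A bulk script could call looks_like_wget_archive on each folder and hand only the matching ones to the converter, skipping everything else.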
Going forward, I hope that these projects can help other libraries and organizations convert old HTTrack and Wget files to WARC files, so that older versions of websites can be viewed with today’s web archive infrastructure. I also hope that this project shows how working in the library can be a multi-faceted experience, and that there is a lot more to do than just working with books.
In the future, I hope to continue working on projects similar to these at the Stanford University Archives.