Web archiving

Web archiving is the process of creating reliable copies of web-based content for long-term preservation.

Paper records may seem delicate, but digital content is often at a higher risk of loss due to fragile physical carriers and rapid technological obsolescence. Under average conditions, paper records can survive for decades stashed away in a box or filing cabinet, but digital content can quickly become inaccessible without appropriate action. This is particularly true for websites, as content is changed, updated, and removed frequently – the average lifespan of a website is only around two and a half years.

How does web archiving work?

The most common type of web archiving is ‘crawling’. A web crawler is a tool that navigates through sites via links, making copies of the content as it goes. Copied pages are collated into a standardised file format, WARC (Web ARChive), for long-term preservation. WARC files also include useful metadata for archivists and future users, such as information about how a website works or when it was captured, which allows us to assert the authenticity of the captured resources. The archived files can then be loaded into a playback tool, such as Wayback or ReplayWeb.page, where they can be viewed.
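The crawl-and-collate process described above can be illustrated with a short Python sketch. This is a simplified illustration under stated assumptions, not the tooling used by the University or the UKWA: the names `crawl`, `warc_record`, `LinkExtractor` and the in-memory `SITE` are hypothetical, the pages are fetched from a dictionary rather than the live web, and the records follow the general shape of a WARC record without aiming to be a complete implementation of the standard. Real crawls also store the full HTTP response and handle images, scripts, and stylesheets.

```python
from collections import deque
from datetime import datetime, timezone
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse
import uuid


class LinkExtractor(HTMLParser):
    """Collects the href of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def warc_record(url, body, captured_at):
    """Build one simplified WARC 'resource' record as bytes."""
    headers = [
        "WARC/1.0",
        "WARC-Type: resource",
        f"WARC-Target-URI: {url}",
        f"WARC-Date: {captured_at}",  # when the page was captured
        f"WARC-Record-ID: <urn:uuid:{uuid.uuid4()}>",
        "Content-Type: text/html",
        f"Content-Length: {len(body)}",
    ]
    return "\r\n".join(headers).encode("utf-8") + b"\r\n\r\n" + body + b"\r\n\r\n"


def crawl(seed, fetch, max_pages=50):
    """Breadth-first crawl from the seed URL, staying on the seed's host.

    `fetch` is any callable that returns a page's HTML, or None if missing.
    """
    host = urlparse(seed).netloc
    queue, seen, records = deque([seed]), {seed}, []
    while queue and len(records) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue  # nothing to capture for this URL
        captured_at = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
        records.append(warc_record(url, html.encode("utf-8"), captured_at))
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links and drop any #fragment
            link, _fragment = urldefrag(urljoin(url, href))
            if urlparse(link).netloc == host and link not in seen:
                seen.add(link)
                queue.append(link)
    return records


# Hypothetical two-page site held in memory, standing in for the live web
SITE = {
    "https://example.org/": '<a href="/about">About</a> <a href="https://other.example/">Out</a>',
    "https://example.org/about": '<a href="/">Home</a> <a href="/missing">Gone</a>',
}

records = crawl("https://example.org/", SITE.get)
archive = b"".join(records)  # the records concatenated, as they would be in a file
```

In this sketch the crawler captures the two pages it can reach, skips the link to another host, and records nothing for the missing page. Each record carries the address and capture date alongside the copied content, which is the kind of metadata that supports later claims about authenticity.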

Archiving a website isn’t the same as saving its HTML files: it involves capturing the website as it appeared on the live web, preserving as much of its functionality as possible. Nor does web archiving take a static image or screenshot of a site – instead, it aims to reproduce archived websites as they functioned on the live web.

What web archiving is the University doing?

The University first started archiving the web in April 2020 as part of the Collecting Covid-19 Initiative, which aimed to document the University’s response to the coronavirus pandemic. To collect web-based submissions to the Initiative, the Centre for Research Collections (CRC) joined forces with the UK Web Archive (UKWA).

The University partnered with the UKWA again in 2022 as part of the Wellcome Trust-funded Archive of Tomorrow project to explore how the internet is used to access, understand, and share health information. The ‘Talking About Health’ collection can be viewed on the UKWA's public site, along with hundreds of other collections curated by staff at the British Library and other repositories across the country.

In 2023, the University appointed its first web archivist to capture and preserve important content from University-managed websites, and staff across the University are now working together to develop a programme to ensure the University’s web-based content is preserved for the future. You can find out more about this at the Digital Preservation blog, Bits and Pieces.

Please note:

Access to the UK Web Archive service is temporarily unavailable following a recent cyber-attack on the British Library. For more information, please visit the UK Web Archive blog. If you are a University web publisher with questions about what this means for captures of your pages, please contact the Web Archivist.

 

Frequently asked questions about how web archiving works and how the University of Edinburgh is preserving historic web-based content.

Following a few simple steps during the design process can make your website more crawler-friendly and improve any copy that is made for preservation.