Frequently Asked Questions | Library

Some commonly-asked questions about web archiving and the University of Edinburgh's web archiving programme.

How does web archiving work?

Archiving a website isn’t the same as saving the HTML files from a website: it involves creating a file of the website as it appeared on the live web, preserving as much of its functionality as possible. Web archiving does not take a static image or screenshot of a site – instead, we aim to reproduce archived websites in the way they functioned on the live web.

The most common type of web archiving is ‘crawling’. A web crawler is a tool that navigates through sites via links, making copies of the content as it goes. Copied pages are collated into a standardised file format (WARC) for long-term preservation. WARC files also include useful metadata for archivists and future users, such as information about how a website works or when it was captured, which allows us to assert the authenticity of the captured resources. The archived files can then be loaded into a playback tool, such as Wayback or ReplayWeb, where they can be viewed.

How can I find out if my site/page has been captured?

If you are planning on deleting content from your site, it is good practice to first check that content has been archived.

Finding an archived page in the Internet Archive

Search the Internet Archive on a URL-by-URL basis by entering the URL of the page you want to cite into the search box on the landing page, or by appending the URL to the following (making sure to include the correct https:// protocol):

https://web.archive.org/web/*/URL HERE

This will display a calendar view of all the captures that have been made of that URL over time. Select the most recent and click through the site to make sure that all content has been properly captured.

Finding an archived page in the UK Web Archive

To view captures of a site in the UK Web Archive, append the URL of your site to the following:

https://www.webarchive.org.uk/wayback/archive/*?url=URL HERE

This will bring up a calendar view showing all the captures of that URL over time.

If you can't find any captures of your site, it may need to be manually added to the UK Web Archive - please contact the Web Archivist who can can assist you with this. Please note that as of March 2025, captures in the UK Web Archive are unavailable following a cyber attack on the British Library.

Why are some captures incomplete?

We make every effort to ensure all content has been captured, but some content types or ways of building sites are difficult or impossible to capture. For example, we are aware that content hosted on the MediaHopper platform cannot be captured through web archiving.

There are some limitations to what a web crawler can do, so an archived website won’t always look and behave the same as the live site did. A web crawler can’t interact with a website, so it can’t fill out a form, search a database, or scroll down a page to load more content, for example.

Finally, while the UKWA may also capture non-University sites as part of their normal collecting approach, manual crawls will usually capture only the domain or site in focus due to copyright and space restrictions.

The archived copy of my site doesn’t look right. What can I do to change this?

The best way to get a good archived copy is to build a site that is ‘archivable’. There are a few simple steps that you can follow during the design process that can make your website more crawler-friendly and improve any copy that is made for preservation. You can find more information on our page 'Making archive-friendly websites'.

Some websites might be better suited to bespoke capture using open access web archiving tools. If you think your site would benefit from being captured in this way, please contact the Web Archivist who can assist with this.

My site has been archived but I don’t want it to be. How do I get it removed from the archive?

The University of Edinburgh expects all website owners and content contributors to adhere to the Website Terms and Conditions and ensure that all content on University sites is fit to be in the public domain before it is published.

The UK Web Archive is committed to ensuring that UK-published web material that is collected under legal deposit legislation is preserved and made available for researchers to use in the Libraries’ premises. It is possible for access to a specified page or site to be restricted to users at computer terminals onsite in Legal Deposit Libraries in exceptional circumstances.

If you feel you have reasonable grounds for requesting that content preserved in the UK Web Archive or in the University Archives be restricted, please contact the Web Archivist who can assist with formulating a takedown request.

Does the University archive social media?

All web content within the collecting scope of the University Archives – regardless of the platform where its published – may be archived. Due to the technical and regulatory restrictions around some platforms, it may not be possible to capture content to a standard quality or at all. Content published on Facebook, for example, cannot be easily captured, even if pages are shared publicly. Content shared on Twitter/X can only be captured as web pages and not harvested as data due to platform restrictions around API access.

Some Facebook or other social media content may be collected as personal digital archives. This process requires account owners to download their own data via the platform’s own download function and to donate these downloaded files to the Archives. The Digital Preservation Coalition has some great guidance on personal digital archiving for social media that can be found here.

This article was published on 2024-08-21