Making archive-friendly websites

There are a few simple steps you can follow during the design process to make your website more crawler-friendly and improve the quality of any copy made for preservation.

Make sure your content can be found

  • Use a sitemap. Crawlers use links to move around a site, so listing all the pages of your website in a sitemap (in XML or HTML) can help them to find everything. The sitemap should ideally be named ‘sitemap.xml’ or ‘sitemap.html’ to match its format and placed at the top level of your web server (e.g. http://www.example.com/sitemap.xml) – a minimal example is sketched after this list. You can find guidance on creating a sitemap for WordPress sites on the Blogs.ed support pages.
  • Provide standard links to content which would otherwise only be accessed through dynamic navigation like search forms or drop-down menus – crawlers can’t interact with a site, so make sure any pages that would normally be accessed this way are represented in your sitemap.
  • Simple HTML is easiest to archive. Crawlers cannot locate and follow URLs that are embedded in JavaScript or Flash rather than referenced explicitly in the HTML. If your site uses JavaScript, it is a good idea to repeat the same content, or at least plain links to it, inside a <noscript> element – see the sketch after this list.
  • Make sure your site is accessible. Ensuring your website adheres to accessibility standards helps to make your site more usable by everyone, including the web crawlers used for archiving. The University has guidance and policy to support this.
  • Use robots.txt to help the crawler find the right content. Crawlers navigate using links, so if your site includes features like databases or infinitely-running calendars, the crawler may get stuck in a loop. robots.txt exclusions allow you to block crawlers from accessing specific pages or directories, preventing these crawler traps. You can also use your robots.txt file to make sure that any directories containing stylesheets or images are not restricted. The UK Web Archive uses the Heritrix crawler, which identifies itself as ‘bl.uk_lddc_bot’. To provide full access to the UKWA crawler, include the following two lines in your robots.txt file (a fuller example is sketched after this list):
      User-agent: bl.uk_lddc_bot
      Disallow:
  • Keep URLs stable, clean and operational. Keeping URLs consistent (using redirects where necessary – see the sketch after this list) helps to minimise ‘link rot’ and allows users to see the evolution of your site over time. Avoid variable characters (e.g. “?”, “=” and “&”) and unsafe characters (such as spaces or “#”) in your URLs – these can prevent the crawler from properly accessing all pages. Finally, make sure the links on your site are up to date and working – if your website contains broken links, the archived copy will too!
  • Stick to one domain. By default, the crawler operates on a domain-name basis. If a link takes the crawler to a different domain, it will assume those pages are out of scope and stop crawling. This is also true for images and other ‘secondary’ content – where possible, host objects on the same domain as the pages that use them.
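
A minimal XML sitemap might look like the sketch below; the example.com addresses are placeholders for your own pages:

    <?xml version="1.0" encoding="UTF-8"?>
    <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
      <!-- List every page you want the crawler to find -->
      <url>
        <loc>http://www.example.com/</loc>
      </url>
      <url>
        <loc>http://www.example.com/about/</loc>
      </url>
    </urlset>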
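
If a page relies on JavaScript – for example a script-generated menu – the same links can be repeated in a <noscript> element so that crawlers (and visitors without JavaScript) can still follow them. The file names below are purely illustrative:

    <script src="menu.js"></script>
    <noscript>
      <!-- Plain HTML fallback links that the crawler can follow -->
      <a href="about.html">About</a>
      <a href="contact.html">Contact</a>
    </noscript>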
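
A robots.txt file combining a crawler-trap exclusion with full access for the UK Web Archive crawler might look something like this sketch (the /calendar/ path is only an example):

    # Keep all crawlers out of an endlessly-paginated calendar
    User-agent: *
    Disallow: /calendar/

    # Give the UK Web Archive crawler full access
    User-agent: bl.uk_lddc_bot
    Disallow: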
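
If a page does have to move, a server-side redirect keeps the old URL working. As a sketch, assuming an Apache server and purely illustrative paths, a rule in your site configuration or .htaccess file could read:

    # Permanently redirect the old address to the new one
    Redirect 301 /old-page/ http://www.example.com/new-page/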

Audio-visual content
  • Use explicit links for audio and video content where possible. Crawlers can download common audio-visual file formats, but only if they can find them – see the example after this list.
  • Avoid embedding content if you can. Third-party services (such as YouTube, Flickr, SoundCloud etc.) are useful, but they effectively hide content from the crawler.
  • Use open file formats. Proprietary file formats are susceptible to upgrade issues and obsolescence, meaning that even where content has been saved for the long term, it may not be possible to open it. Using formats that can be read by open-source software improves the chances that a site and its contents can be properly accessed in the future.
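
For instance, a podcast episode could be linked directly as well as being embedded in a player, so that the crawler can download the file itself; the file name here is purely illustrative:

    <!-- An explicit link the crawler can follow and download -->
    <a href="audio/episode-01.mp3">Download episode 1 (MP3)</a>
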
Be aware of limitations

There are limits to what a crawler is able to access and copy. Being aware of these limitations can help you to assess the archivability of your site and identify any areas that might need further attention.

Crawlers can’t interact with websites. They can’t fill out a form, search a database, or scroll down a page to load more content. Anything that is generated through visitor interaction like this will not be accessible to the crawler – making sure these pages are listed in your sitemap can overcome this issue. Similarly, crawlers can’t input passwords, so anything behind a login is out of reach to them.

Useful tools and resources

Validator – Use this tool to check that your HTML and CSS are valid and compliant with current standards.

Link Checker – Use this tool to check for any broken links on your site. If the links between pages aren’t working, the crawler won’t be able to move around your site properly.