April 22, 2022, by Digital Research

Archiving a website

In this guide, we show you how to download and package a website for archiving.

If you have produced a website for your research project, you may wish to archive it (or be obliged by your funder to do so).

Here, we show you how to do two things:

(a) convert a website to a static version for continued hosting on, for example, GitHub Pages

(b) download and package a website in WARC format, an international standard for archiving digital assets.

We will use a free tool called WGET. We will focus on Windows users, but if you have a Mac, an Internet search will show you how to install WGET (on most Linux distributions, WGET comes pre-installed). For Windows users, you can download the latest version of WGET here.

Download the EXE file that matches your system (it’s only about 5MB), but don’t open it.

(To find out if your computer is 32-bit or 64-bit, type ‘System Information’ in the Windows search bar and look at ‘System Type’.)
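If you prefer, you can also check this from the Command Prompt (opened in the next step) using the built-in systeminfo and findstr commands; the following should print a line such as ‘System Type: x64-based PC’:

systeminfo | findstr /C:"System Type"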


Now, in the Windows search bar, type ‘cmd’ and open the Command Prompt.

When you have opened the Command Prompt, type ‘path’ and hit ENTER. You will see a list of the directories, separated by semicolons, in which Windows looks for programs.
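As an illustration only (your own output will differ), the PATH output looks something like this:

PATH=C:\Windows\system32;C:\Windows;C:\Windows\System32\Wbem;...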

One of these directories is where you need to move the WGET.exe file that you have just downloaded (C:\Windows\System32, for example, is normally included in the list). Move the file there now, then restart the Command Prompt. (Please note that you will need admin rights on your computer to do this – if you are working on a University computer without admin rights, please request these temporarily via the UoN IT Self-Service Portal.)

To check that WGET is working, type:

wget -h

Hit ENTER and you should see WGET’s help text, listing the options it accepts.
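You can also confirm which version is installed by typing the following, which prints the version number and build information:

wget --version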

It is helpful to create a new directory (a new folder) for your downloaded website(s). For our purposes, we will call this new folder ‘archivedSite’. In the Command Prompt, type the following and hit ENTER to create the folder:

md archivedSite

Then type the following and hit ENTER:

cd archivedSite
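If you prefer, the two steps can be combined on a single line, since the Command Prompt lets you chain commands with &&:

md archivedSite && cd archivedSite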

Now we are ready to download your website(s) into the new folder. You should only do this for websites where you own the content.

We’ll use the following website as an example: https://www.resilient-decarbonised-energy-dtc.ac.uk/

Static version of the website

To get a static version of the website, type (or copy/paste) the following into the Command Prompt (all in one line) and hit ENTER:

wget --mirror --recursive --convert-links --adjust-extension --random-wait --page-requisites --local-encoding=UTF-8 --no-parent -R "*.php,*.xml" https://www.resilient-decarbonised-energy-dtc.ac.uk

This will download a folder containing HTML and CSS files for every page of the website, along with accompanying media such as images. All of the hyperlinks in the downloaded copy will work, including those that point to external websites. The downloaded files can be archived, or served for free as a live static website on, for example, GitHub Pages.

NB: the download can take several minutes, depending on the size of the website.

WARNING: static websites cannot handle interactive elements (such as drop-down menus, submission forms, and other plug-in functionality), nor can they connect to underlying databases or provide search/filter functionality. If possible, you should remove these elements from your website before downloading it.
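Once the download finishes, you can check how the static copy looks by opening it in your browser. WGET saves the site into a folder named after the website’s domain, so for the example site above the following should open the downloaded homepage (adjust the folder name to match the website you downloaded):

start www.resilient-decarbonised-energy-dtc.ac.uk\index.html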


WARC version of the website

To get a WARC version of the website for archiving, type (or copy/paste) the following into the Command Prompt (all in one line) and hit ENTER:

wget --mirror --recursive --convert-links --adjust-extension --random-wait --page-requisites --no-parent -R "*.php,*.xml" --warc-cdx --warc-file=TYPE_YOUR_FILENAME_HERE https://www.resilient-decarbonised-energy-dtc.ac.uk

This will download a compressed WARC file (with a .warc.gz extension) and an accompanying CDX file (an index of the downloaded material). Both can be uploaded to a repository, such as the University’s Research Data Management Repository, for permanent preservation.

NB: the download can take several minutes, depending on the size of the website.
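To check that both files have been created, you can list them in the Command Prompt; assuming you typed ‘mysite’ as the filename (an example name only), you should see mysite.warc.gz and mysite.cdx:

dir *.warc.gz *.cdx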


And there you have it!

If you have any questions about the above, please do not hesitate to get in touch with a member of the team.


Further reading and resources

The full WGET user manual: https://www.gnu.org/software/wget/manual/wget.html

An overview of the most common WGET commands/options: https://www.computerhope.com/unix/wget.htm

Conifer: a US-based web archiving service, which includes a click-through version of the above process

The UK Web Archive, a partner of the UK Legal Deposit Libraries and responsible for preserving UK web content for future generations. You can ask the UK Web Archive to Save a UK Website for you.

For the University of Nottingham’s own digital preservation activity, notably around preserving elements of the main website, see this blog.

Posted in Advice and Guidance, Research Data Management, Websites