
Internet Archive or Archive.org: how it works and how to recover a website

07/03/2022
Elizabeth De Leon

The Internet Archive is an archive that preserves copies of websites, including many that are obsolete or no longer exist.

What is Archive.org and what it contains

The Internet Archive, or Archive.org, plays the role of a huge non-profit online library tasked with preserving digital books, videos, movies, songs, images and entire websites from around the world. Every day millions of Internet users visit this site, one of the 300 most visited in the world, which since 1996 has been saving copies of online content and making them available to everyone for free.

Behind Archive.org (which is another name for this powerful virtual library) there is a real organization whose administrative offices are located in San Francisco.

The purpose of this organization is to preserve knowledge in all its forms, much as a library does, except that it covers all types of content, from books to movies, from music to software.

Archive.org is built around the Wayback Machine, an application introduced in 2001 that automatically stores scans of websites and makes them available on the portal as "still images."

Pages are saved on Archive.org's servers, which return them as they were at the time of the scan, even if years have passed since then.

Websites are recorded as if they were photographs. This also applies to dynamic sites, which are "frozen" and stored as they were at that moment, including the links within them. The screen shows "calendars" from which we can select the version of the site we want to view: for example, we could open the version of May 5, 2015 or September 10, 2019. Each "scan" of the site is archived under a precise date and time, so it is very easy to choose the version that interests us.
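To make the idea of "a precise date and time" concrete, here is a minimal sketch of how an address for a given scan can be put together. It assumes the usual Wayback Machine address layout, in which a 14-digit timestamp (YYYYMMDDhhmmss) sits between the `web.archive.org/web/` prefix and the original address. The function name `snapshot_url` and the example site are our own illustration, not part of any official library.

```python
from datetime import datetime

# Assumed address prefix of the Wayback Machine.
WAYBACK_PREFIX = "https://web.archive.org/web"


def snapshot_url(site: str, when: datetime) -> str:
    """Build a Wayback Machine address for `site` as scanned around `when`."""
    # Scans are addressed by a 14-digit timestamp: YYYYMMDDhhmmss.
    stamp = when.strftime("%Y%m%d%H%M%S")
    return f"{WAYBACK_PREFIX}/{stamp}/{site}"


# The two dates used as examples in the text (the site is a placeholder).
print(snapshot_url("http://example.com/", datetime(2015, 5, 5)))
print(snapshot_url("http://example.com/", datetime(2019, 9, 10)))
```

If no scan exists for exactly that moment, the Wayback Machine normally sends us to the closest one it has, so the date does not need to be exact.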

What can you find in the Internet Archive?

Archive.org contains 14 billion items of textual content and 35 billion other materials, something like 400 billion, and acts as a "backup" of the contents of the web from 1996 to the present. It is an immense database of multimedia works from the four corners of the globe, and an immense help in preserving the historical memory of websites and of culture in general.

The archived works are organized chronologically and can be consulted as replicas of the sites as they were in a given period. In fact, several copies of each website, corresponding to different points in time, are saved on the servers of the powerful platform.

Of course, we cannot be sure that all elements on all sites in the world are present and 100% complete: graphic files or attachments may not be available. Additionally, navigation can be unintuitive and loading quite slow.

However, compared to the huge amount of content it offers for free, this is a very minor limitation!

Try it with a site you know has been offline for years: it's probably there, ready to be consulted!

How to Find Older Sites with the Wayback Machine

The Wayback Machine indexes sites that can be seen by search engines, but it also allows you to submit specific sites to be included in its archive. The platform periodically crawls the site in question to add subsequent versions of the same portal to its archive.
Therefore, we have a long history that we can access to see the versions of different sites at different moments in time. On the site we find a search form in which we can type the keyword that interests us and consult all the sites returned for it.
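For those who prefer to explore this history from a script rather than from the search form, the following is a rough sketch based on the Wayback Machine's CDX query interface. We assume the endpoint and the parameters `url`, `output`, `from`, `to` and `collapse` behave as publicly documented. The sample response below is invented for illustration, and the helper names are our own.

```python
import json
from urllib.parse import urlencode

# Assumed endpoint of the Wayback Machine CDX query interface.
CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"


def cdx_query_url(site, year_from, year_to):
    """Build a query listing the scans of `site` between two years."""
    params = {
        "url": site,
        "output": "json",
        "from": str(year_from),
        "to": str(year_to),
        "collapse": "timestamp:8",  # keep at most one scan per day
    }
    return f"{CDX_ENDPOINT}?{urlencode(params)}"


def parse_cdx(body):
    """Turn a JSON CDX response (header row + data rows) into dicts."""
    rows = json.loads(body)
    if not rows:
        return []
    header, *captures = rows
    return [dict(zip(header, row)) for row in captures]


# Invented sample response, shaped like the JSON output of the service.
SAMPLE_RESPONSE = """[
 ["urlkey", "timestamp", "original", "mimetype", "statuscode", "digest", "length"],
 ["com,example)/", "20150505101500", "http://example.com/", "text/html", "200", "AAAA", "1200"],
 ["com,example)/", "20190910081000", "http://example.com/", "text/html", "301", "BBBB", "300"]
]"""

print(cdx_query_url("example.com", 2015, 2019))
for capture in parse_cdx(SAMPLE_RESPONSE):
    print(capture["timestamp"], capture["statuscode"], capture["original"])
```

The sketch only builds the query and reads a response; actually downloading it (for example with `urllib.request`) is left out so that the example runs without a network connection.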

What is present in Internet Archive?

At Archive.org we can find many kinds of content, such as:

  • Creative writing
  • Old films
  • Works of art
  • Video games
  • Songs

It is estimated to contain 11 million texts, 1 million images and more than 100,000 software titles. The contents of the site are divided into different collections, such as communities related to audio, video and text files, American libraries, universities, etc., which makes it even easier to find your way around.

The site includes countless items, such as period films and old books whose copyright has expired. The video section, for example, includes countless examples of visual arts such as war short films, period films and historical television programs, which without this portal would be very difficult to find.

There are also photos

The images category provides us with illustrations released under a Creative Commons license or in the public domain. For example, we can find collections of photographs and illustrations made available by universities and libraries around the world, which can be reused under the terms of those licenses.

The Wayback Machine automatically catalogs and includes materials found on the web. However, the Wayback Machine cannot list a site that blocks indexing via robots.txt. Sites excluded from indexing can also become retroactively unavailable, with their earlier scans removed from view on the Wayback Machine.
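To see how such a robots.txt exclusion looks in practice, here is a small sketch that uses Python's standard `urllib.robotparser` module. The robots.txt content is an invented example, and the crawler name `ia_archiver` is an assumption on our part (it is the name historically associated with archive crawling); check the current Internet Archive documentation before relying on it.

```python
from urllib.robotparser import RobotFileParser

# Invented robots.txt: blocks the (assumed) archive crawler entirely,
# and blocks every other crawler only from /private/.
SAMPLE_ROBOTS_TXT = """\
User-agent: ia_archiver
Disallow: /

User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if `user_agent` may fetch `url` under these rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


page = "http://example.com/index.html"
print("archive crawler allowed:", is_allowed(SAMPLE_ROBOTS_TXT, "ia_archiver", page))
print("other crawler allowed:", is_allowed(SAMPLE_ROBOTS_TXT, "SomeOtherBot", page))
```

With rules like these, the archive crawler is turned away from the whole site, while other crawlers can still read the public pages.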

The Wayback Machine is an excellent platform for studying the evolution of a website over time, as well as for finding copies of multimedia materials that would otherwise be lost to oblivion.

It is a formidable site for anyone who, for example, wants to find videos and games that are otherwise unrecoverable, old movies that cannot be found elsewhere, or content from websites they liked, and for scholars who want to see how a site has changed over the course of time.

Within Archive.org we can consult the different items in the categories Books, Audio and Video, and each category has further subdivisions. We can also search for specific kinds of material, such as television programs, textual content or websites.

How can you recover a website with Archive.org?

The archive of a website can be found on the archive.org website, with the same web design that it had at the time.

On the main page, we enter the domain of the website in the search field. In our case it will be Repubblica.it.

After entering the address of the website, we see the calendar of dates on which the HTML code of the page was saved.
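The same lookup can be sketched in code. The example below assumes the public availability service at `archive.org/wayback/available`, which takes a `url` and an optional `timestamp` and answers with JSON describing the closest saved copy. The sample answer is invented for illustration (including its timestamp), and the function names are ours.

```python
import json
from urllib.parse import urlencode

# Assumed endpoint of the Wayback Machine availability service.
AVAILABILITY_ENDPOINT = "https://archive.org/wayback/available"


def availability_url(domain, timestamp=""):
    """Build the lookup address for `domain`, optionally near `timestamp`."""
    params = {"url": domain}
    if timestamp:
        params["timestamp"] = timestamp  # YYYYMMDD or longer
    return f"{AVAILABILITY_ENDPOINT}?{urlencode(params)}"


def closest_snapshot(body):
    """Return the address of the closest saved copy, or None if absent."""
    data = json.loads(body)
    closest = data.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest["url"]
    return None


# Invented sample answer, shaped like the service's JSON output.
SAMPLE_ANSWER = json.dumps({
    "url": "repubblica.it",
    "archived_snapshots": {
        "closest": {
            "status": "200",
            "available": True,
            "url": "http://web.archive.org/web/20011108000000/http://www.repubblica.it/",
            "timestamp": "20011108000000",
        }
    },
})

print(availability_url("repubblica.it", "20011108"))
print(closest_snapshot(SAMPLE_ANSWER))
print(closest_snapshot('{"url": "unknown.example", "archived_snapshots": {}}'))
```

When the archive has no copy at all, the `archived_snapshots` object comes back empty, which is why the helper returns `None` in that case.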

Blue means the server returned a valid 200 response (no server error).

Red (which may appear yellow or orange, depending on the browser and operating system of your PC) means a 404 or 403 error, which is of no use when restoring. Green means a page redirect (301 or 302).
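The color legend above can be written down as a tiny lookup. This is only a sketch of the scheme as described in this article; extending it from the individual codes mentioned (200, 301, 302, 403, 404) to the whole 2xx, 3xx and 4xx ranges is our own assumption, and the function names are hypothetical.

```python
def calendar_color(status):
    """Map an HTTP status code to the calendar color described above."""
    if 200 <= status < 300:
        return "blue"      # valid response, e.g. 200
    if 300 <= status < 400:
        return "green"     # redirect, e.g. 301 or 302
    if 400 <= status < 500:
        return "red"       # error, e.g. 403 or 404 (may show as yellow/orange)
    return "unknown"       # anything the article does not cover


def worth_restoring(status):
    """Only blue dates are candidates when recovering a page."""
    return calendar_color(status) == "blue"


for code in (200, 301, 302, 403, 404):
    print(code, calendar_color(code), worth_restoring(code))
```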

The calendar colors are not 100% reliable: a scan on a blue date may still redirect, not at the HTTP header level but, for example, in the HTML code of the page itself, through a meta refresh tag or through JavaScript.
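Since an in-page redirect of this kind is not visible in the calendar, it can help to check the saved HTML directly. The following is a minimal sketch using Python's standard `html.parser`; it looks only for `<meta http-equiv="refresh">` tags and does not attempt to detect redirects written in JavaScript. The class and function names are our own.

```python
from html.parser import HTMLParser


class MetaRefreshFinder(HTMLParser):
    """Collect the target addresses of <meta http-equiv="refresh"> tags."""

    def __init__(self):
        super().__init__()
        self.targets = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        # Attribute names arrive lowercased; values may be None.
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("http-equiv", "").lower() != "refresh":
            return
        # The content looks like "5; url=http://example.com/new.html".
        _, _, rest = attr.get("content", "").partition(";")
        rest = rest.strip()
        if rest.lower().startswith("url="):
            self.targets.append(rest[4:].strip("'\" "))


def find_meta_refresh(html):
    """Return every redirect target declared through a meta refresh tag."""
    finder = MetaRefreshFinder()
    finder.feed(html)
    finder.close()
    return finder.targets


SAMPLE_PAGE = (
    '<html><head><meta http-equiv="Refresh" '
    'content="0; URL=http://example.com/new.html"></head><body></body></html>'
)
print(find_meta_refresh(SAMPLE_PAGE))
print(find_meta_refresh("<html><head><title>No redirect</title></head></html>"))
```

A page that returns a non-empty list here is probably not the version we want to restore, even if its date is shown in blue.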

Now let's pick a random date, such as November 8, 2001, and we will see the beautiful Repubblica page of that day. Seems like a century has passed, huh?
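When we save a page like this one in order to restore it, the links inside it usually point back to the archive rather than to the original site. Below is a small sketch that strips the archive prefix from such links. It assumes the common `web.archive.org/web/<timestamp>/` layout, optionally followed by a two-letter modifier such as `im_` for images; the timestamp in the example is a placeholder, not the real time of the scan, and relative links are not handled.

```python
import re

# Assumed layout: https://web.archive.org/web/<1-14 digits>[xx_]/<original>
WAYBACK_LINK = re.compile(
    r"https?://web\.archive\.org/web/\d{1,14}(?:[a-z]{2}_)?/(.+)",
    re.IGNORECASE,
)


def original_url(link):
    """Return the original address hidden inside an archive link."""
    match = WAYBACK_LINK.match(link)
    # Links that do not point at the archive are returned untouched.
    return match.group(1) if match else link


examples = [
    "https://web.archive.org/web/20011108000000/http://www.repubblica.it/",
    "https://web.archive.org/web/20011108000000im_/http://www.repubblica.it/logo.gif",
    "http://www.repubblica.it/",
]
for link in examples:
    print(original_url(link))
```

Running every link of the saved page through a helper like this gives us HTML that points to our own domain again, which is what we need once the site is back online.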

What do you think of this online library? Have you already used it or are you planning to do so? Let's talk about it below!

Do you need to update your website?

Do you need any of our web design services? At IndianWebs we have extensive experience and a team of programmers and web designers with different specialties, so we can offer a wide range of services in the creation of custom web pages. Whatever your project is, we will tackle it.