Pew: Quarter of web pages vanished in past decade

The web is melting away like so many glaciers these days. A report published by the Pew Research Center finds that digital decay is erasing online news, Twitter/X posts, and web links. The Pew team - Athena Chapekis, Samuel Bestvater, Emma Remy, and Gonzalo Rivero - found that 25 percent of web pages that existed at some …

  1. heyrick Silver badge

    this content sometimes disappears from view

    I've noticed that a fair number of the "official" reviews of films given on IMDb now point to parked domains.

    The problem is that while big companies often keep their content around, smaller outfits have hosting to pay, domains to register, and when something runs to its end then everything lapses and all of that content vanishes. More so if it was, say, a hobby site and the site maintainer went on to do something else, got married, whatever, and no longer has the time or inclination to continue. I also think a number of smaller sites got abandoned as simply having a page on Facebook was so much simpler (and cheaper) than dealing with HTML and hosting and uploading and...

    In the case of big companies, it could be mergers or cutbacks or simply abandoning something seen as generating loss. A large number (at the time) of sites vanished when Geocities shut down (Yahoo! bought it, screwed up, then eventually Googled it into a shallow grave).

    Ultimately, keeping stuff on the web costs somebody, somewhere, money. When the money goes away, so does all that data.

    1. captain veg Silver badge

      Re: this content sometimes disappears from view

      Yes, but...

      All too many commercial web sites periodically and arbitrarily change the URIs to access precisely the same content. I suppose they assume that there's no problem since you can still get at it by entering at the home page, and they don't have the imagination to consider any other way of getting there, but wilfully breaking links is totally contrary to the foundational concept of the WorldWide Web and rather anti-social. If the content was worth putting up, you should maintain the URIs for ever. It's not like it's even hard with modern web server platforms.

      On the other hand, a vast amount of web content is ephemeral spam, of no positive value whatsoever. These should indeed disappear completely as soon as the bastards are rumbled.

      -A.

      1. doublelayer Silver badge

        Re: this content sometimes disappears from view

        It isn't that hard to keep every link together when you've designed the site, but it becomes much more of a pain when you use intermediate software. I've helped a few websites switch their sites from one backend CMS to another. They never want me or anyone else to hand-code HTML and maintain their structure and I don't want to do it either. In each case, they're looking for something where an untrained employee can log in, click some buttons, type some text, and their site changes. I take a very different approach on sites that I run, but those are easy because nobody except me needs to touch them.
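
        The link-preservation point in this sub-thread can be sketched in a few lines. This is only an illustration, not any particular CMS's mechanism: the table name `LEGACY_REDIRECTS`, the function `resolve`, and every path shown are hypothetical. The idea is that a migration exports a map from old paths to new ones, and the server answers old links with a permanent redirect instead of a 404.

```python
from urllib.parse import urlsplit

# Hypothetical table built during a CMS migration: old path -> new path.
LEGACY_REDIRECTS = {
    "/news/2019/web-decay.html": "/articles/web-decay",
    "/about.php": "/about",
}


def resolve(request_target, current_paths):
    """Return (status, path) for a request, keeping old links alive.

    current_paths is the set of paths the new site actually serves.
    """
    # Ignore any query string or fragment when matching the path.
    path = urlsplit(request_target).path
    if path in current_paths:
        return 200, path
    new_path = LEGACY_REDIRECTS.get(path)
    if new_path is not None:
        # 301 tells browsers and crawlers that the move is permanent.
        return 301, new_path
    return 404, path


site = {"/about", "/articles/web-decay"}
print(resolve("/about.php?ref=old", site))
print(resolve("/about", site))
print(resolve("/never-existed", site))
```

        The table still has to be produced by whoever does the migration, which is where the effort described above actually goes.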

        1. Jonathan Richards 1 Silver badge

          Re: this content sometimes disappears from view

          >> wilfully breaking links is totally contrary to the foundational concept of the WorldWide Web and rather anti-social

          > they're looking for something where an untrained employee can log in, click some buttons, type some text, and their site changes

          Nicely encapsulated seed for further discussion, an article, a podcast, and perhaps a PhD thesis. Next Year A Major Action Movie!!

          What we now see on port 80 is so very far removed from what I believe and hope was in Sir Tim Berners-Lee's mind when he conceived of a World-Wide Web. We've both transferred our dependence for information from durable media (books, articles, film, tapes), and seemingly accepted the ephemeral nature of the digital stores that have replaced them. I worry that the Geocities effect will creep up on information owners and that we shall shortly be in a situation where information (and hence knowledge) will be lost at an increasing and unacceptable rate. One of the lessons I learned when dealing with management of those durable media: if you can't find it (e.g. because the book is wrongly shelved, or the photo collection is not indexed) then you haven't got it.

    2. Martin Summers

      Re: this content sometimes disappears from view

      The digital age could well become the new dark ages, with the only remaining information on this period being from those who wrote and published physical books. I dread to think what humanity's collective knowledge of the digital age would look like if it were represented by those books alone.

      We could well as a race completely forget that certain things existed, although with things like TikTok that would be a bonus. Please don't tell me anyone's written a physical book about TikTok!

    3. Anonymous Coward

      Re: this content sometimes disappears from view

      I had a private web server for years. Nothing particularly important on it. But when I switched ISPs to one that doesn't provide a public-facing IP for each user, it was impossible to run the server anymore, so it's offline. Perhaps someone, somewhere, will miss the tiny collection of Palm utilities, and likely a couple will miss the simple instructions for how to build a (proof-of-concept only) hovercraft.

      Maybe it's time to fund the Wayback Machine with taxpayer money.

  2. Anonymous Coward

    Deleting old product data

    Content is often made to disappear maliciously (IMHO). Anyone who, like me, regularly looks for help with problems on the Microsoft website will have seen them delete whole sub-sites on a regular basis. It’s never done with proper redirects or signposts to alternative sources for the information. They just decide one day to take out 10,000 pages and they are gone. The reason is almost always that offering support to keep customers ticking over on an old version is not good business. So much better to force them to pay for the latest by removing the documentation and user forums. Removing the documentation is pretty bad, but removing the user forums, where desperate customers post their issues and other customers solve their problems for free with no input from Microsoft, is just rude. I’d like to see a law requiring companies to maintain copies of manuals for products they sell for as long as their website exists. The additional costs for disk space would be trivial.

    1. gandalfcn Silver badge

      Re: Deleting old product data

      "done maliciously" Indeed. A notable example of removal with intent was the Daily Express (aka Daily Brexit) deleting most of its claims about the 'known' benefits of leaving the EU.

  3. Anonymous Coward
    Big Brother

    Not only that ..

    Some web pages with negative reviews of the $product are retrospectively replaced with positive adverts for said $product. Down the electronic memory hole we go /s

  4. W.S.Gosset Silver badge
    Alert

    Important Safety Tip

    Wayback Machine (web.archive.org) honours deletion requests. (The Wayback Machine is the subset of the Internet Archive which snapshots pages.) Revisionists (eg activists, virtue-memesters, corp.PR, etc) routinely wipe the historical record of "awkward" facts.

    archive.today does NOT honour deletion requests. If it's archived, it stays.

    If you're tracing & documenting frauds, manipulations, propaganda, misinformation, etc, get into the habit of pasting URLs into BOTH archives.

    1. W.S.Gosset Silver badge
      Linux

      Re: Important Safety Tip

      Tech.notes re archive.today:

      * manual not automatic: it will not revisit URLs to update snapshots. DIY only.

      * screenshots are also taken; v.useful for some code-generated nasties that defeat archiving.

      * the .today throws to various country servers. Eg, .md, .ph, .is. No apparent pattern as to which. They appear to all be seamlessly replicated.

      * if you're in a hurry while a complex page is being archived, you can immediately record/send the final short-code URL by simply deleting the WIP/ segment.

      * Bonus: it defeats some paywalls. Can be useful just for timely reading of key articles.

    2. This is my handle

      Re: Important Safety Tip

      I don't know where archive.today is located, but this seems inconsistent with the (slowly) growing number of legal jurisdictions that have "right to be forgotten" laws.

  5. NapTime ForTruth

    It was ever thus...

    I think we're looking at this with eyes too modern. Death, abandonment, and decay are part of the natural order of things. Everything that lives, dies, dissolves, is lost or forgotten - only to be repurposed either as fertilizer or a new world's fossilized curiosity. It has to, lest all space become occupied to capacity with dusty cruft and remnants (much of which we call "the ground").

    We might mourn the passing of the Tyrannosaurus Rex, but we surely don't wish they were all still alive, generations of them, to entertain us with their rapacious killing...possibly including us, not that we're delicious. ( <--- Perhaps Jurassic Park was the answer to "why shouldn't I wish dinosaurs back", if the wisher was modeled on Jerome Bixby's "It's a Good Life", or Rod Serling's Twilight Zone teleplay of that work. Quick, somebody write up a treatment and we'll turn it into a fortune!)

    The public Internet was built on the back of sharecropper's hope, fertilized with dreamy little lies about ethereal eternities of connection - "We Can Remember It for You Wholesale" - sold to the rubes for the low, low price of a shiny coin each...and all of your data.

    Let the dead leaves fall, get plowed under, be forgotten, making room for the next dead leaves to fall, etc.

    Ask not for whom the bell tolls.

    [If the references escape you, perhaps they, too, fell to dust]

    1. gandalfcn Silver badge

      Re: It was ever thus...

      "Death, abandonment, and decay are part of the natural order of things" as too is wilful destruction it seems, which is present not only throughout recorded history but also in archaeological records.

      Deletion / censoring of facts and valid criticism is routine throughout the media, from Musk's Twitter to Reach plc.

  6. Gene Cash Silver badge

    Try finding out of print books

    Isaac Asimov supposedly wrote over 500 books. Try finding 1% of them today.

    James Blish wrote a ton of books, far more than just the Star Trek novelizations. Try finding any of those. Or Alan Dean Foster, Robert L. Forward, Fred Saberhagen, Poul Anderson, James White, James P. Hogan... all gone.

    Try finding classics like "Silverlock" or "Earth Abides".

    I could go on with dozens of authors and titles that are impossible to find, but were excellent books and sometimes entire excellent series.

    Eventually Terry Pratchett's Discworld stuff is going to disappear too. Think about that for a second.

    Edit: and there's even an XKCD about it: https://xkcd.com/1262/

    1. Winkypop Silver badge

      Re: Try finding out of print books

      Exactly.

      I’ve been collecting old SF novels for years. Still have quite a few gaps. I haven’t found any of these books in a long time.

      They used to be available second hand for less than a dollar.

      1. Jonathan Richards 1 Silver badge

        Re: Try finding out of print books

        This is true; you can't easily find a copy to own. However, the words that Asimov wrote haven't disappeared. There will be multiple copies in the Library of Congress, maybe in university libraries too, and certainly within the copyright deposit libraries in the UK (British Library at a minimum). There is nothing analogous to the Legal Deposit Libraries Act 2003 for Internet or WWW published material.

    2. AndrueC Silver badge

      Re: Try finding out of print books

      Eh?

      ..Alan Dean Foster.. I just did an Amazon search for his work and it returns 19 pages of results. The Ice Rigger series. Commonwealth series. Pip and Flinx series. From a look through, everything appears to be there, Kindle and paper.

      Isaac Asimov shows 20 pages of results. The complete Foundation and Robot series is the first result for £40.

      James Blish shows 11 pages of results but I'm not familiar enough with his work to comment on significant works.

      Earth Abides is available on Kindle and £10 for the paperback from the same site. Silverlock is likewise available.

      Can you give me a specific book that you can't find? I can find even obscure things like:

      Not the Knight Rider you might be thinking of or his particularly dark and depressing Firelance.

  7. Andy 73 Silver badge

    It doesn't help...

    ..that Google is now actively replacing search that takes you to a third party website with AI that gives an approximate summary of the content you might have read.

    The main "portal to the internet" doesn't want you to leave them and visit the internet. Especially that part of the internet that isn't monetised and stuffed full of (Google run) adverts.

    The incentive to run a website and the audience needed to make it worthwhile is evaporating as well.

  8. Anonymous Coward

    The Great HTTP Data Loss

    Some of this is from what I call the Great HTTP Data Loss.

    HTTP sites are now never routinely shown by search engines, so what still remains on these sites is effectively hidden from view.

    HTTPS does greatly help security, but has become a barrier to accessing old sites where for whatever reason they have not been upgraded. It is not unusual for me when doing research to simply not find information I know I have seen before, but which I remember came from a HTTP source.

    RIP our history.

    1. heyrick Silver badge

      Re: The Great HTTP Data Loss

      "HTTP sites are now never routinely shown by search engines"

      And the few times they are (like PDF datasheets), Google will absolutely refuse to forward you, so you end up with a blank window and a Google URL that is a lot of gibberish.

      Edit the gibberish to extract the actual URL from the Google junk and press Enter and magically it works.

      Who the hell elected Google as gatekeepers?
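
      The manual fix described above, pulling the real address out of the redirect link, can be sketched as follows. This is a hedged illustration only: it assumes the redirect carries the destination percent-encoded in a `url` or `q` query parameter, the function name `extract_target` is hypothetical, and the sample link is made up.

```python
from urllib.parse import parse_qs, urlsplit


def extract_target(redirect_url):
    """Return the destination embedded in a redirect link, or None.

    Assumption: the destination sits in a 'url' or 'q' query parameter.
    """
    # parse_qs also undoes the percent-encoding of the embedded address.
    params = parse_qs(urlsplit(redirect_url).query)
    for key in ("url", "q"):
        values = params.get(key)
        if values and values[0].startswith(("http://", "https://")):
            return values[0]
    return None


# Made-up example of a redirect link wrapping a plain-HTTP datasheet.
link = ("https://www.google.com/url?sa=t"
        "&url=http%3A%2F%2Fexample.com%2Fdocs%2Fdatasheet.pdf&usg=abc123")
print(extract_target(link))
```

      If the parameter names differ from the assumed ones, the function returns None rather than guessing.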

  9. Bebu
    Windows

    Two minds...

    So much of online content (perhaps 99%) would have done the world a great service by not being created in the first instance, but our having suffered the indignity of its publication, this material should be permitted, nay obliged, to dissolve into an utterly deserved oblivion.

    Unfortunately the tiny residue that was worthwhile from the outset seems to be disappearing at a faster rate than the above detritus often without leaving a trace.

    I am profoundly grateful for sites like Gutenberg and Canada's Faded Page for preserving titles that have long been out of print and absent from publishers' catalogues. Even these sites and their corps of volunteers aren't scratching the surface.

    I was just thinking of a series of Australian SciFi anthologies published by Paul Collins's long defunct "Void Publications" - how many of those collections have been preserved? I sent my collection of about a dozen to a charity shop in the hope someone might read them.

    Translations of foreign works are even harder to find: Capek was quite difficult to find and some of the translations I possess date from the 1920s (non scifi), his scifi works seem to have fared better in recent times; Lem is safe so far; the Strugatsky brothers were hard to find (now?); years ago it took ages to locate a copy of Zamyatin's "We."

  10. Fred Daggy
    Unhappy

    Knowledge just lost ...

    Two web sites I perused for 15 and more than 20 years just disappeared at the start of the year. Tens of thousands of posts by hundreds of people, mostly good content. Gone, not even photons left.

    In both cases, the owners decided to call it a day. One just disappeared and the other gave a few days' notice. Used the Wayback Machine to track the last few posts and find out what happened.

    If something like NNTP had been used, the posts could have been archived, indexed, re-posted. But in both cases they were Web Boards.

    And let us not forget the great Technet deletion by Microsoft. Plenty of knowledge that just disappeared overnight. Thousands of links on their remaining online pages now return 404.

  11. Doctor Syntax Silver badge

    OTOH I, like many others, maintain a web page which is a programme of planned events, in this case a local Civic Society's talks programme. If we're lucky* we start off each September with a full programme up to May. Every month the list gets a bit shorter and the poster image changes. There's no point in keeping that page unchanged in perpetuity; if you want to check up on last month's talk it's too late already.

    Some stuff really is ephemeral.

    * Since Covid finding suitable speakers has become a bit more fraught than it used to be.

  12. _Elvi_

    .. Is there some way we can accelerate dissolving "X-hitter" altogether?

    .. It would be really cool if we could ..

  13. Anonymous Coward

    Been saying this for years. Nothing beats a box of paper documents. Recently been passed some documents from the 1950s, they are now scanned and online, but for how long? Those documents - race results - have lasted for 70 years, the latest results are only ever digital.

    1. doublelayer Silver badge

      How many boxes of paper have been tossed out because they're big, heavy, and easily recycled? How many more boxes of paper have been destroyed because they were put in a place where they could burn, rot, or be eaten by something? The existence of old paper does not prove that it is better than alternatives. It only proves that it is older than alternatives.

      You can preserve information in lots of formats if you try to do it. Compactness and ease of reproduction are advantages of digital data, but it will not continue to exist if nobody goes to the effort of preserving it. Paper does not remove the need for that effort from the archival process, and it makes some tasks more difficult. For instance, if I had found that box of race results in my house, I wouldn't have scanned them. I wouldn't have retained them. I would probably have asked the one person I know who cares about races if he wanted them, and when he said no, they would have gone into the recycling bin. Unless I could easily find a race archive that wanted paper and was willing to come and get it and process it, it wouldn't have been preserved.
