It'd be another completely pointless use of taxpayers' money
http://www.archive.org/
British Library wants to archive the UK web, creating an invaluable national treasure trove of porn, celebrity trivia gossip and Daily Mail comments. But it admits it can't put a figure on the project - which looks like becoming a huge, open-ended commitment for the taxpayer. Today the Library stepped up the pressure for the …
unfortunately. Coverage is very spotty, and I'd say that of the last 10 times I've turned to the Wayback Machine for help, it's only been useful 2 or 3 times. Either they didn't have anything at all for the site I was looking at, or nothing from the last 3 years.
It's a pity, because it's something that I find myself wanting to use every couple of weeks.
If only archive.org were usable. But it isn't. It is agonizingly slow; it often serves up nothing, which in itself proves nothing, because it holds more data than it knows about; and squatters can easily make a previous site at the same address vanish from the archive via robots.txt or some JavaScript. The archive's metadata handling is atrocious, its choice of front-end technology worse (``I'm going to make it PHP so I can learn PHP''), and management mostly manages to drive away competent people (``Brewster doesn't like you''). All in all, the archive has made itself very much a station of last resort. You are much better off archiving the stuff you like yourself.
Even so, it makes perfect sense to want to archive (some of) the 'web, acknowledging it isn't all of the internet by a long shot. Thing is, as noted, there is no content filter. So for the time being it would be a better idea to scour the web and put the good bits through some sort of publishing or curating effort. Whether national libraries should do this, or whether the original website or perhaps some third party should do the publishing into some sort of "snapshot format" somewhat akin to a digital book or publication, is another matter entirely.
The Germans tried to force websites to submit an "archived version" to their national library on pain of a hefty fine, but that didn't go over too well. It's a perfectly sensible, thoroughly Teutonic take on things from a lawmaker's perspective, but it doesn't go down well with the 'web crowd.
So we are left with this open research question: How to organise this flood of mostly crap but with some useful things in it, for posterity? It's a good question, even if the answer seems far away.
Not disputing your facts, but Archive.org IS the way to do it. Yes, it's slow and relatively backward, but that's because no-one's attempted to compete with it. It would certainly drive progress if a rival internet archive appeared that was far more usable than the clunky service they're currently offering.
So the BL is trying to make itself more hip and relevant?
Get lost, bookworms. Let's keep the technology to the competent who are already doing it and leave the dewey decimal system to the dinosaurs. This is like a bunch of filing clerks in the NHS saying they want to implement the NHS IT project, probably the only way it could have been *more* of a failure than it currently is.
By all means pull together a list of a few hundred websites with genuinely useful text based content that you want to archive (and that does NOT include the 10 Downing Street website BTW). The cost of doing that would be manageable and it may even turn out to be a decent resource in the future.
But snagging a copy of every website in Britain is going to cost huge amounts of money and end up archiving millions of pages that no one cared about when they were new, let alone at any point after that.
And it raises another problem: how do they define a British website? Anything that ends in .co.uk? What about the small number of British companies and people who actually managed to snag a .com domain before they were all bought up by spammers?
The current law on books is that every book or periodical that gets published commercially in the UK must be supplied to 5 libraries that hold copies in perpetuity. There is no judgement on suitability. If it's published, it's in. They are just trying to maintain the status quo, and I think that's a good thing. I have seen many websites vanish with only a partial mirror at archive.org. Among the legions of dross at Geocities there were several gems, including one of the two best internet libraries of Scottish Gaelic song lyrics, now lost.
Then there's the idea of corpus research. Having access to all these tweets and comments would allow language researchers to examine questions like how the internet is changing literacy, and that is a genuinely interesting and important topic.
"Then the Library told us that the private sector couldn't be trusted to do the job, because future funding couldn't be assured"
Given the way this British Government (and, to be fair, several previous British Governments) have behaved, what makes you think that source of funding is any more secure?
I sent a request to ask the BL whether they could archive some of my online work several years ago, for copyright purposes. I suppose this is an answer of sorts.
I know they were having extended discussions about how to archive the data, since digital degrades horribly -- is there any word on that?
It's pretty neanderthal for people to be worrying about the trivial cost of this. I use the BL quite a lot and am thankful that it has archived stuff that a previous commentard would think "irrelevant" from the 16th Century, at far greater expense I might add.
So it's fair to archive the web using an objective agent (in this case a mindless bot) to ensure no preferential revisionism or whatever. Fine, I've got that. But how is it fair to introduce yet another special interest tax on everyone, to archive this [mostly] crap for the few people who do want it. If people do want this stuff then it should be privately funded by donors or trusts and--- oh look! what's this? It's www.archive.org, what does that do?
And no, it's true that it isn't a request for funding, but er... yes it is, because where the hell do they think the cash to cover this is going to come from? Muggins here, that's who!
No consistency, and no bloody thought gone into this. BL fail.
... that dumped mountains of old newspapers and other rare and impossible-to-replace periodicals from its collection because they were hard to store and they'd microfilmed the lot? The microfilms are often dodgy and the film isn't stable, yet paper, if cared for, lasts pretty much forever, at least centuries; let's see a strip of film or a CD manage that.
Periodicals weren't the only thing the BL got rid of. This is a scandal. And we would give librarians who want to be 'with it', but apparently have ceased to be professionals for whom the printed word is a sacred duty to protect, huge resources to 'store' websites? Aren't websites much like conversations -- ever-changing? Why not suggest the Powers That Be simply record all of us everywhere?
Oh, wait a minute...
The British Library is perfectly happy to do this, and is doing it, within its current budget - but it's unreasonable to expect it to ask every copyright holder for permission to archive their work.
They just want the law updating to give parity with printed works.
They are right to say a private company would not be ideal, because if whatever forms of funding it relied on (donations, advertising, a generous founder) ended, the archive would also end, and once it is gone it is gone for good.
You're being exceptionally naive.
This is classic empire building. They want our money to do something that will ultimately be very expensive, for something nobody wants. As many commenters here point out, it's pointless. A blank cheque is being requested.
When they can offer more than platitudes and tell us how much it will cost, then we can have a public debate on whether we need it.
Outside the kind of budget the Pentagon has at its disposal, this project can only be an epic fail. I don't even know if this is technologically feasible: given the resources Google uses just to index the web, copying the contents for permanent storage would be an even bigger task. And given that many UK websites don't use UK domains or aren't hosted here, huge amounts of content will be missed. It all just sounds an impossible and impractical task. I bet no one even asks the most basic question: is half the crap on the web even worth hoarding?
I block Archive.org from my websites, as do many other webmasters, because it's horribly abused by scrapers. Basically it's used as a source of content for webspam, scraped-content directories, email harvesting, and all sorts of other junk that ought to be blocked. If it's in the archive the originating website can't prevent this activity.
I'm not keen on the idea, it seems redundant. But if it goes ahead there had better be a way to block its spiders.
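There already is, for what it's worth: the Internet Archive's crawler identifies itself as `ia_archiver` and has historically honoured robots.txt (which is also how the squatter trick mentioned above works). A minimal opt-out entry looks like this:

```
# robots.txt at the site root -- block the Internet Archive's crawler
User-agent: ia_archiver
Disallow: /
```

Whether any new BL spider would honour the same convention, and under what user-agent, would be up to the Library to document.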
I've been wondering how this could be done for a while. Financially speaking. Google have shown us how to do it but their money making ideas have sent them mad.
The BL could fund a project to create a distributed (ala folding@home) search engine/store. Data could be replicated across domestic PCs so individual machines don't need to be on all the time.
The carrot for the domestic user? No more Google Gorg.
Set it up in the UK, store UK material near its users. Then let anyone have the software and they store their (local) data near them.
Is it a bad thing that I hadn't heard about Cheryl and Ashley Cole getting divorced until this argument?
In fact, I didn't even know Ashley Cole had been cheating.
In fact, I didn't even know that Cheryl Cole and Cheryl Tweedy were the same person.
...Least I knew of their (her?) existence though!
Apparently. A charity I am involved with got a circular from the BL asking for permission to archive the contents of our website. We said yes.
Is half the stuff on the web worth archiving? Of course not. But some of it will be valuable to future researchers and it's not possible to tell in advance in an affordable way what stuff will be valuable.
I would guess that there is a level of archiving that is affordable and useful. For example, I don't believe it would cost much to archive all the static text content, and it's probably possible to identify such material fairly accurately. It would be good to archive some of the dynamic stuff as well, but it's less obvious how to do that.
Would a historian today like to have the thoughts of the ordinary people at the Peterloo massacre rather than just the official reports of the general?
More up to date: the cabinet papers of the miners' strike are going to be released in a few years, and they are not going to tell you a lot. The postings of people in Barnsley on local websites and live tweets from Orgreave might tell you a bit more.
Or, in the future, historians are going to think, from official papers, that the NHS IT system, Nimrod and the Eurofighter were all good ideas and everyone supported them. If they also had the comments from El Reg they might think differently.
This is why serious libraries archive pulp novels alongside literature: you don't know when pulp fiction, like Dickens', might become literature.
To ensure that their website content is archived for the future, organisations can automatically save daily screen-shots of all their web pages, which are then kept for compliance, legal or just general interest purposes.
Cloud Testing, a UK company, has just launched its service Website-Archive, available at http://www.website-archive.com/ - because this is a self-selected archive of people's and companies' own sites, it gets round the copyright issue. Or does it?
We get confirmation from customers that they are permitted to archive the content they ask us to, but in the days of multiple content streams, people often don't know what is actually being delivered via their website in terms of RSS feeds, Twitter searches/feeds, Adverts, news feeds etc. etc.
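A true daily screen-shot needs a headless browser, but the basic idea is easy enough to roll yourself. Here's a minimal sketch (hypothetical function names, saving the raw HTML rather than an image) of a self-archiving job you'd run once a day from cron:

```python
import datetime
import pathlib
import urllib.parse
import urllib.request

def snapshot_filename(url: str, when: datetime.date) -> str:
    """Build a dated filename such as 'www.example.com-2010-03-01.html'."""
    host = urllib.parse.urlparse(url).netloc
    return f"{host}-{when.isoformat()}.html"

def save_snapshot(url: str, archive_dir: str = "web-archive") -> pathlib.Path:
    """Fetch the page's HTML and store it under a dated name.

    Run daily to build up a simple snapshot history of your own site.
    """
    target = pathlib.Path(archive_dir)
    target.mkdir(exist_ok=True)
    path = target / snapshot_filename(url, datetime.date.today())
    with urllib.request.urlopen(url) as response:
        path.write_bytes(response.read())
    return path
```

Of course, as noted above, this only captures what the server sends at that instant; embedded RSS feeds, Twitter widgets and adverts will be frozen as whatever happened to be delivered that day.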
This is a really good initiative which will cost taxpayers a miniscule amount of money. Imagine if there had been twitter and facebook etc. around in WW1 or WW2, and we could browse all that data for free - what a goldmine of information that would be, what an insight into the past.
If this doesn't get the green light, future generations will despair that all this information was lost for the sake of a few thousand pounds. And all they're asking for is the right to archive information that is freely available on the interweb, without having to ask each website for permission.
How ironic that the one instance where something SHOULD be opt-out rather than opt-in, is the wrong way round.