"I'm not blaming Microsoft..."
Why the hell not? It's their fault.
Big data may turn out to be a big mystery to future generations, godfather of the internet Vint Cerf has warned. The pioneering computer scientist, who helped design the TCP/IP protocol (along with Robert Kahn) before going on to work as chief internet evangelist for Google, has claimed that spreadsheets, documents and various …
It would be nice/useful if Microsoft provided a function within each Office application to "update all files in selected folder to the latest release of the application", with a tickbox for searching sub-folders and a tickbox for preserving the old document.
I'm sure there are a lot of ways this can break - e.g. I seem to recall that the implementation of pivot tables in Excel changed in moving to Office 2003, so it might be nice to know about this sort of thing...
Perhaps a first pass to do a scan to see which documents are there, which have simple conversions and which will break. Then for the ones that break, a dialogue that walks you through what you will lose if the conversion process continues, allowing you to decide if you want to convert or not.
My Word 2007 constantly nags me about compatibility and losing information if I have the audacity to save to .doc rather than .docx. Ditto Excel (my Python scripts don't handle xlsx, which is why I do xls).
Despite their specific assurance that "3 items are being de-formatted, please see help for details", I have never ever found a way to list what is being broken. In fact, considering that many of those Excels come out of report generators that start them out as xls, I rather doubt anything serious is being lost.
And you are asking for a universal converter, with running conversion details? From Steve "we listen to our customers" Ballmer?
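For what it's worth, telling the legacy binary Office containers apart from the newer zipped-XML ones is at least trivial: the first few bytes of the file give it away. A minimal sketch (the helper name is mine, not from any real library):

```python
# Sketch: distinguish the legacy binary Office container (OLE2 compound file,
# used by .doc/.xls/.ppt) from the Office 2007+ container (a zip archive,
# used by .docx/.xlsx/.pptx) by their magic bytes.

OLE2_MAGIC = b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1"   # .doc / .xls / .ppt
ZIP_MAGIC = b"PK\x03\x04"                          # .docx / .xlsx / .pptx

def sniff_office_container(first_bytes):
    """Guess the container format from the first bytes of a file."""
    if first_bytes.startswith(OLE2_MAGIC):
        return "ole2"       # legacy binary format
    if first_bytes.startswith(ZIP_MAGIC):
        return "zip"        # Office Open XML (2007 and later)
    return "unknown"
```

Knowing the container is of course only the first step; the stream inside still needs a real parser.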
Like roff/troff? These were quite well documented, as is TeX. And ODF, whatever its claimed limits, probably is decently documented. I've seen some old (pre-GUI) WordStar files that I recall looking much like those of roff, although there probably were differences and the formatting might not have been well documented.
I don't think it is unreasonable to assign Microsoft a part of the blame, along with other less successful vendors who probably practiced the same type of obfuscation.
Like roff/troff? These were quite well documented, as is TeX.
Indeed. More generally, plain-text-plus-markup document formats are much older than horrible-proprietary-binary ones, and far, far more robust. The roff family dates back to CTSS RUNOFF (1964). SGML-based markup languages, including HTML, go back to IBM SCRIPT and GML, from a few years later. TeX[1] didn't show up for another ten years, but that's still five years before the first version of Microsoft Word. (LaTeX is roughly contemporaneous with the initial versions of Word, and of course long predates Word for Windows, much less Microsoft Office.)
So the greatly-superior alternative of using plain text with markup was well-known when Word appeared. Of course Word was also not the first document editor to use a binary format. WordStar had been doing it for a few years (though it also offered "non-document mode"); the Wang word processing software was adapted into MultiMate for the IBM PC; WordPerfect had been around since '79 (on Data General machines).[2] But the Word developers still made the wrong choice, when both options were well-established.
[1] Unfortunately, the Reg's subset of permitted HTML tags won't let me format that correctly. Oh well.
[2] WordPerfect used a markup system internally, but the tags were formed with non-printable code points, so it wasn't a plain-text-markup design.
Look, we have exactly the same problem understanding ancient English,
Well, we would, since there is no such language.
Old English is very different from Modern English, true. It also fell out of use many centuries ago, unlike Office '97.
Middle English (Chaucer, for example) can be easily picked up by anyone highly literate in Modern English. You'll need a glossary for some archaic diction and usage, but most of it is readily obvious from context.
even Shakespeare is a foreign country to most people.
Early Modern (Elizabethan) English shouldn't give any competent reader of Modern English any trouble. Again, a glossary or the occasional footnote helps, but that's often true with Modern English as well, as the language has an enormous vocabulary and is highly irregular.
Even with those irregularities, though, natural languages have sufficient consistency and redundancy that they can be decoded even with very small samples. We figured out how to read cuneiform, for heaven's sake - that's a unique writing system from a linguistic isolate that went out of use thousands of years ago. Binary file formats, on the other hand, tend to be riddled with arbitrary signals, often contain insufficient redundancy, and in many cases are too rare to provide a decent corpus for analysis.
And if the presentation was created in 1997 using a version of PowerPoint earlier than 97?
Word file formats are different for Word/DOS, Word/Win1.0, Word 2.0, Word 6.0-95, Word 97-2003 and Word 2007-2013. Recent versions of Word can't read the Word 6/95 format, much less the three earlier ones; I'm sure PowerPoint is the same.
I'm afraid this is just a case of poor quoting by the Reg. Here's the original text as it appeared in Computerworld.
Cerf illustrated the problem in a simple way. He runs Microsoft Office 2011 on Macintosh, but it cannot read a 1997 PowerPoint file. "It doesn't know what it is," he said.
The problems of being able to access old electronic documents were identified years ago, and various strategies have been suggested.
One approach is to create multiple copies of a document in different formats:
a) the original file, a stream of bits;
b) an updated version of the original file, created as part of a regular migration/update strategy - i.e. use Office 2010 to read an Office 1997 doc while it can, and save it in Office 2010 format,
and c) create a simple version that concentrates on preserving the important content, not worrying too much about precise layout and formatting and clever stuff (but include the metadata), and save that in a fairly plain vanilla format that is likely to be readable for a reasonable time.
None of the above is perfect, and b) in particular means people have to do a lot of work regularly migrating old documents to a new format.
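Strategy c) barely needs tooling at all - a plain-text copy plus a small metadata sidecar in something boring like JSON will do. A sketch (the file layout, helper name and field names are mine, purely illustrative):

```python
import json
from pathlib import Path

def archive_plain_copy(text, metadata, dest):
    """Write a plain-vanilla text copy of a document, plus a JSON sidecar
    (same name with '.meta.json' appended) holding the metadata.
    Returns the sidecar path."""
    dest = Path(dest)
    dest.write_text(text, encoding="utf-8")
    sidecar = dest.parent / (dest.name + ".meta.json")
    sidecar.write_text(json.dumps(metadata, indent=2, sort_keys=True),
                       encoding="utf-8")
    return sidecar
```

Both files are UTF-8 text, so even if JSON itself is forgotten, the sidecar remains human-readable next to the content it describes.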
Of course, this all assumes that you can read the file in the first place. How many County Record Offices have an old Amstrad 8256 word processor handy, with the ability to read the old 3" floppies and export them to some sort of network?
I'm a great fan of the fallback method: print every document out on acid-free paper and store it in a nitrogen-filled vault, or microfilm the printouts - or just print straight to archive-grade microfilm, then all you need to read the document is a magnifying glass and a torch!
"The problems of being able to access old electronic documents were identified years ago, and various strategies have been suggested."
I think Vint starts from the Google perspective of wanting to mine that data. In the real world, businesses lose and forget anything electronic that is older than eighteen months, and only the legal/property people have any concept of archiving and retrieving documents, and they usually stick to the physical. Give those archive capabilities to the rest of the business, and you find yourself paying Iron Mountain year after year to store Christmas decorations, the unindexed contents of retired employees' desks, or the IT department's original install 3.5-inch floppies for Borland and manuals for applications and operating systems long since gone.
Printed books, documents, and a whole lot of original data have a half-life (and always have had), and with the accelerating creation of more and more new electronic documents, losing them is rarely going to be that much of a loss. Where data is important and it is used, it will be refreshed, preserved or updated; indeed, the point of vellum was to preserve the important, not the routine. For the rest (including much of my own output) it doesn't really matter if it becomes unreadable in five or ten years' time.
Yes and no. Documents regarded as routine or ephemera today may turn out to be valuable, or of historical importance, tomorrow. The point is we can't make the same judgements about the importance or value of documents in advance that hindsight would lead us to in future. Original letters by famous people are some of the most valuable artefacts to appear at auction. Even if it's just Karl Marx's laundry bill, someone will pay big money for it.
But MS and others should take some of the blame: their document formats are clearly and simply under-engineered for longevity. Open formats go a long way to solving this problem, but they are eschewed for many other reasons. Potential longevity is not a selling point, unfortunately.
> Of course, this all assumes that you can read the file in the first place.
That is surely the real problem? Reverse engineering a document or file format is easy*, compared to reinventing a 5¼" floppy drive when the only storage you've ever seen is flash memory cards.
* For certain small values of easy, but even so...
Assuming people are still making tools that use the format. Just because it's open doesn't necessarily mean anyone will want to write software supporting it. Sure, in theory that makes it easier to write your -own-, but the number of people for whom that's practical is vanishingly small.
FOSS isn't a panacea; it may not be subject to the same market forces as normal software, but it's still subject to 'market forces' of a sort.
Just because it's open doesn't necessarily mean anyone will want to write software supporting it
But if they do want to, then they can. Compare and contrast with the difficulties involved in dissecting an old binary format which may never have been documented outside of the company that created it, which might not even exist anymore. It is effectively cryptanalysis - and in some cases literally so, thanks to deliberate efforts by the vendor to obfuscate data, or due to the presence of some sort of DRM.
in theory that makes it easier to write your -own-, but the number of people for whom that's practical is vanishingly small
If the data in the problem format is valuable to someone, then there is an incentive for someone to write a suitable transcoder. If the data is not valuable, then who cares? The difference is that dealing with open formats is a comparatively cheap job, as the number of people who could write a suitable transcoder is vastly higher than the number of people capable of reverse engineering an undocumented proprietary format.
"If the data in the problem format is valuable to someone, then there is an incentive for someone to write a suitable transcoder."
Not necessarily - the person who values the data doesn't necessarily have anything of value to the person who's capable of writing the transcoder.
For example, I have a whole ton of music I wrote in an old tracker format. It's extraordinarily important to me. But that doesn't mean that, should I run out of ways to load the files, someone who wants to write a program to read it will spring from the ground, willing to work on terms I could afford.
Just because something is valuable doesn't mean that there are resources to take advantage of that value.
Agreed, though, that better (or at-all) documented specs and setup of file formats do help quite a bit. I just have a problem with the, "If it's valued someone will write a decoder" argument; it belies a fundamental misunderstanding of basic economics.
I don't think a new "decoder", as you call it, would even have to be made.
The difference between, say, ODF and DOC is that the former has many working, open-source implementations, while the latter has only the Microsoft implementation, which in turn only works on their Windows platform.
If Microsoft were to go out of business or people stop using Windows, it gets hard to read these files. This will not happen to ODF. Even if we move to an entirely new platform it would just be a matter of cross-compiling an existing interpreter.
and the connection between a file format and open source is?
ODF helps - it's a freely published standard that anyone can expect access to, has no orphan-licensing issues, and is in the worst case human-interpretable (ish) XML.
It's a lot more accessible and possible to implement than the competition.
Metadata - particularly on any dependencies in the document - would help too.
But if the data is important enough, the free-ness will enable someone with enough intelligence and motivation to write it from scratch, as you said. With properly open formats, the media deterioration and lack of appropriate devices is likely to be the difficult part, something that may not be true with proprietary formats.
"It may be that the cloud computing environment will help a lot. It may be able to emulate older hardware on which we can run operating systems and applications,"
We don't need the cloud for that. I've got an 8-year-old desktop that can emulate 20-year-old hardware perfectly fine. Heck, I remember running a Z80 emulator on a Z80.
Agreed. Emulation-as-a-service might be useful for some people (for convenience and automatic translation), but utility provisioning of IT resources is just an implementation issue. There's no need to invoke the magic "cloud" here, and I have no idea why Cerf did so.
> his up-to-date version of Microsoft Word can't read Powerpoint files created in 1997
... nothing of any consequence has ever appeared on a PP presentation.
Unlike present day archaeology, where making a "find" is a rare event due to the scarcity of old artefacts, I expect the researchers of tomorrow will have the opposite problem: trying to work out which is THE ONE significant piece of work amongst the hundreds of billions of pieces of crap, spam, tweets and pr0n. After that, decoding the format (surely just stripping out all the non-ASCII is 99% of the job) will be a trivial matter.
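That "strip out all the non-ASCII" first pass is essentially what the Unix `strings` tool does, and it's a one-liner to sketch (function name is mine):

```python
import re

def extract_ascii_strings(blob, min_len=4):
    """Pull runs of printable ASCII out of an otherwise opaque binary blob,
    much like the Unix `strings` tool - a crude first pass at recovering
    text from an undocumented format."""
    runs = re.findall(rb"[\x20-\x7e]{%d,}" % min_len, blob)
    return [r.decode("ascii") for r in runs]
```

It recovers the words but none of the structure, which is rather the point of the whole debate: the remaining 1% (formulas, formatting, relationships between cells) is where the real work is.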
Not the researchers of tomorrow, but those of today.
Out of personal interest, I've been cataloging YouTube videos of the March 11, 2011 Tohoku tsunami. The extremes are (a) those videos that have been watched by hundreds of thousands of people and reposted to YouTube by a good many of them; and (b) those that have been watched by very few and exist on YouTube only in one version. The object is to identify the best version of each significant video, best meaning most complete, preferably with a good deshaker applied.
Of course this is a hopeless task, as there are something on the order of 100,000 tsunami videos, far too large a number to catalog by hand. But even disregarding that minor issue, trying to figure out which version is original and complete is like trying to find a needle in a haystack. Today.
"Spreadsheets, documents and various collections of data will be unreadable by future generations."
And nothing of value was lost.
"What I'm saying is that backward compatibility is very hard to preserve over very long periods of time."
No it's not. It's hard when it is made by a whole bunker of losers who foist negative externalities created by clueless primadonna uberdevelopers on the unsuspecting world because "muh bottom line!". Akin to dumping radioactive crap into the nearest river (you hear, government nuke "labs"?)
The article was written in quite a reasonable way, I thought. Compare and contrast that with what you've written. If I were looking at a way forward, I don't think I'd be paying much attention to you - it's obvious you have one opinion and by God you won't consider anything else ...
Except when it's UTF-8 - been burned by the two-char space that looks fine in all editors but doesn't render properly when you turn it into a book.
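That two-char space is usually a no-break space (U+00A0, two bytes in UTF-8). Spotting the lookalikes before they hit the typesetter is easy enough; a sketch (function name mine):

```python
import unicodedata

def find_odd_spaces(text):
    """Locate characters that render like an ordinary space but aren't
    U+0020 - e.g. the no-break space U+00A0, which is two bytes in UTF-8.
    Returns (index, description) pairs."""
    suspects = []
    for i, ch in enumerate(text):
        # Category "Zs" = space separators; skip the plain ASCII space.
        if ch != " " and unicodedata.category(ch) == "Zs":
            suspects.append((i, "U+%04X %s" % (ord(ch), unicodedata.name(ch, "?"))))
    return suspects
```

Run it over a manuscript before export and the invisible culprits stop being invisible.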
More seriously - only geeks like us use text. This doesn't help at all. I think the point other people are making about OpenOffice is valid. I've personally had far more success with it opening documents than with MS's own software. I even managed to rescue most of the content from a document that Word had completely broken.
only geeks like us use text
Quite a few people use HTML, I think you'll find. And more importantly, non-technical users will happily use plain-text-plus-markup if they don't have to know about it. There's no reason why a WYSISVSTWYG[1] GUI can't be slapped on top of a markup file format. That's what WordPerfect did; though its format was not, alas, plain text, there's no reason why it couldn't have been. That's close to what LyX does with LaTeX, except that LyX doesn't pretend to be WYSIWYG and exposes too many technical features for some users' comfort. But there is nothing inherently "geeky" or specialized about plain-text-plus-markup as a file format.
[1] What You See Is Vaguely Similar To What You Get
Truth is, the number of competing formats has if anything decreased over the years - which should make things easier as time goes on.
[I was once given the task of recovering data from a set of magtapes, with no information at all as to what system they came from, how they had been written, or how they had 'happened' to come into the particular organisation's possession. Even the brand-name of tape manufacturer had been expunged from the spools. After doing a physical block-dump to disk and a week or so of frobbing-about I determined the format to be an Eastern-European-ICL-1900-series-clone native binary. There was much ensuing happiness. Tapes apparently from the same source would arrive intermittently and unpredictably. If I told you anything more I'd probably still have to kill you].
As has already been said, backwards compatibility and data death have been a perceived problem for years. Perhaps the Domesday Project was the first 'popular news' item, with the death of the hardware (laser disc) making the media unreadable.
Not having the software to decode the data is a second problem.
"Use ODF", "Use XML"? You're missing the entire point, folks. However open the format is, just because the data structures exist with software to decode them today doesn't mean that software will exist in 50 ... or even 5 years' time. Pick some of the 'totally compatible' formats from history and try to read them - for instance WordStar format on a 360kb disk, used all over the place ... until Word's .doc replaced it. I might be able to read it at the moment, but that's only a 30-year-old data set ...
So if you can a) find a drive that will take the disk, and b) find an OS that will read the disk format, and c) find software that will decode the file, you're fine ...
Perhaps there should be a relatively small, open library of data formats established somewhere - disk structures, storage formats, software encoding techniques et al? These can then be applied to data as required rather than having to preserve the tools themselves...
I'm now off to read a book - which I just have to open to decode it ...
"Use ODF", "Use XML"? You're missing the entire point, folks. However open the format is, just because the data structures exist with software to decode them today doesn't mean that software will exist in 50 ... or even 5 years' time.
No, you missed the point. The open formats have public documents describing the data structures, and XML is designed to be semantically self-consistent. In order for XML and ODF to be completely unreadable in 50 years time, we'd have to destroy everything that uses our current binary model of computing and burn down a few hundred warehouses full of books and paper documents as well.
Your only significant point is the 360kb disk your Wordstar doc is stored on -- but in a competent modern IT department, that document would have been transferred to current media when the old media were retired.
Perhaps there should be a relatively small, open library of data formats established somewhere - disk structures, storage formats, software encoding techniques et al?
IEEE, ECMA, ANSI, W3C, et al. They don't exactly meet your "small" requirement, though.
"Use ODF", "Use XML"? You're missing the entire point, folks. However open the format is, just because the data structures exist with software to decode them today doesn't mean that software will exist in 50 ... or even 5 years' time.
This is a wild misapprehension of the problem. Try reading some of the actual research in file-format preservation and recovery. There are orders of magnitude differences in the work factor between documented and undocumented formats; between formats that have multiple, open implementations and those that have a single, closed one; between formats that use a handful of straightforward, widely-applied technologies (e.g. XML and zip, in the case of ODF) and those that use one-off encodings; between formats that have high redundancy and human-readable elements and those that are binary gobbledygook. Claiming that all formats are equivalently vulnerable to this problem is like claiming that since all people die, the length of life and manner of death are irrelevant. It's a fallacy of composition of the first water, a sophomoric generalization to the point of absurdity.
Would that this were so. Most big lab instrumentation (e.g. SEMs, TEMs, XRD etc.) uses proprietary software, and proprietary formats for the analysis, storage and presentation of data. This is annoying at least, and sometimes a big problem (you may not know exactly how data have been processed). The distinction between "raw" and "processed" data from a complex piece of equipment is almost always fuzzy.
Obviously, most data are then exportable (if you're lucky), or can be hand-retyped (yes - still sometimes necessary - aargh!) into csv, spreadsheet, or other standard or open(ish) formats, but these normally don't have the functionality of the proprietary formats and tools.
I'm afraid Powerpoint is often used as a cheap n cheerful way of assembling screen dumps, and copied data from proprietary software on instruments. Not good, but sometimes the best option...
PNG and TIFF (caveat: use a standardized codec) are decent for long-term storage. Most of the public records I've seen are black-and-white TIFFs with CCITT G4 compression (like faxes). These are things like property deeds and engineering drawings, stamped and signed by a bunch of people then scanned. Everything else gets thrown away after a few years, especially the Word docs and CAD drawings. Pretty decent system.
a drawing database that started life in Paradox, moved on to MS Access and currently resides in MS Excel. Twenty-two years that database has been alive in one form or another. It's older than my kids.
Paris, cos her most famous work has probably been converted into dozens of different formats, ensuring her place in cinematic history.
Yes ASCII is a "format", but it's a simple one which is likely to survive a long time. I'm still trying to understand what all this other stuff is really good for. Why is everybody so bone headed about needing all that word processor stuff? I think the answer is that there is much less interest in what a document says than what it looks like.
Biggest problem with text is that it doesn't do graphics. So if you want to see charts, diagrams, or relevant photos it's a constant "refer to ven1a7.gif for breakdown showing current trends" or "as detailed in pic049.png" littering the text. Really breaks the flow of reading.
Bummer if the referred-to files are missing or weren't included.
If anyone wanted to read a non-encrypted document in some old format or other, I doubt that they'd need to be qualified to work at GCHQ in order to "break" the (non-)code. It's just reverse-engineering something that actually isn't designed to conceal.
Strong encryption, on the other hand, means that once the decryption key is gone, so is the document.
If there's a document worth preserving, I'm pretty sure it'll get preserved in a way that future generations can read it.
The problem is that 99.999% of the stuff we generate today is worthless. Hell, even the design notes of big-hitting games like Prince of Persia on the Atari (I believe) are pretty useless and boring, even to those interested. Same for the Mac paint app that had its original source released recently. There probably exists more useful information in a reverse-engineering of the program itself than anything contained in the design documents that could ever be found and preserved. And this is from an industry where the whole product is digital, not just a document or two about it.
The fact is that I have email in front of me now going back to 1999. I have archives for a few years before that. I have code I wrote when I was a teenager. I have huge essays and articles and documents I've written over all that time. The number of times I have to refer to anything older than a year? So small as to be worthless, and usually just because of poor organisation or convenience rather than being a vital requirement. I can't imagine that most of what goes through a computer database needs to stay around for that long, and the stuff that does ends up being on paper in archives for a few decades at most. There's just no need for it.
As Terry Pratchett says: Digital archeologists of the future? Get a real job. DELETE.
I went looking through some files from 2000 for some marketing materials relating to our business to see if they could help our current efforts. MS Word 97 docs. Opened easily. Weren't worth bothering with - unless to show how not to do it. The market and situation had moved on apace and they were just so terribly dated.
The only stuff from that time that was useful was printed materials - a little dog-eared but readable.
I should think there's a lot more files in our system from a decade or so ago that aren't much use. Getting rid of them wouldn't save much room but would make it quicker to look through the stuff that was left.
It must be a human instinct to hoard, on a personal front, I have 9999+ emails in my webmail. I'm certain I don't need past acknowledgements of internet orders, or a conversation with a friend in '04 organising a get together.
A good clean out is useful every once in a while.
Amongst the reference material I gathered while learning x86 assembly is a file called "386intel.txt", circa 1986, which is just as readable now as it was then. And, if it for some reason weren't, it would be trivial to write software to make it readable.
Not that I much care about future generations knowing the inner workings of ancient CPUs, but I guess the moral of the story is: create files in proprietary formats at your own risk - if you care about them being readable in the future, that is.
Relax V. Most species that have ever existed are extinct. The amount of historical data we have prior to the printing press is paltry compared to what was written but we still seem to make up pretty good stories about the past and presume that what we have is more important than what we've lost (which can't be known).
The premise of the philosophy of progress is that the past is well, outdated and of little value. We want to know the future not what has already happened. Knowing the past doesn't prepare us for a future that is always new, always changing. So if you are a progressive type, it's a non-issue.
This is really a function of human development. As humans get older their attention always turns more to the past than the future primarily because they have little future left. Older people (and I am one) always think that the way it was was better than the way it is or is going to be. Not true. It will suck for people in the future just as it did for us in the past. It'll just suck in different ways.
Holding on to all this big data just means one ends up with a bigger haystack in which one has to search for the needle.
As for Doris Goodwin's book on Lincoln, we needed another book on Lincoln like a hole in the head. It's like getting another book on Jesus or Plato. Studying the past (I was a history major) is a great way to escape the present but it doesn't mean beans for the future anymore. When things changed very little for a thousand years, the past may have had some actual use, but now, not really.
>As for Doris Goodwin's book on Lincoln, we needed another book on Lincoln like a hole in the head.
Actually, Obama used Goodwin's book as a guide to forming his own administration - so Doris's work was a damn sight more valuable than anything appearing on this site (including your message and mine).
He actually gave good example: "So years from now, when you have a new theory, you won't be able to go back and look at the older data"
There is a stupendous mountain of scientific data collected every day (e.g. CERN alone collects 15PB per annum), and yet there is little guarantee that this information will be of any use to future generations. Since referring to old experiments to verify a new theory is an established, and very useful, practice, this is actually important.
Due to a lack of decipherable information (Files in state-sanctioned format Ab19-0-7 are unavailable prior to 2025), we are unable to validate your date of birth, nationality, and any work history prior to 2025.
Therefore your retirement/disability claim is rejected.
Failure to produce the relevant files (In state-sanctioned format Ab19-0-7) within 30 days will result in your reclassification as Non-person, and any files on you dated from 2025 to present will be deleted.
Have a nice day, and remember, The Council Loves And Cares For You.
... I have the email you & I swapped, discussing bits of TCP/IP (and the code involved) from back in 1975. It's still perfectly legible. The code still compiles, too ;-)
Insert something about "two cans & a string" here ... IOW, KISS!
Beer, for the memories ... gawd/ess but I was young & dumb back then!
There are really two problems here, one is the problem of retrieval over a time span of about one human, say 80 years, and the other is about designing systems that permit information retrieval over really long spans, like 10,000 years.
We are barely getting to grips with the first one, but we have at least learnt some lessons. For example, binary blobs are really bad, whereas text formats are pretty good. Not perfect (how many people can still read EBCDIC?) but fairly trivially decodable.
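As it happens, EBCDIC is a good example of "fairly trivially decodable": Python ships EBCDIC code pages as standard codecs (cp037 is the common US/Canada variant), so reading it is a one-liner today.

```python
# "Hello" encoded in EBCDIC code page 037, decoded with Python's
# built-in cp037 codec - no reverse engineering required.
ebcdic_bytes = b"\xc8\x85\x93\x93\x96"
text = ebcdic_bytes.decode("cp037")    # -> "Hello"
round_trip = text.encode("cp037")      # back to the original bytes
```

The mapping survived because it was documented and widely implemented - exactly the property binary document formats tend to lack.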
The other problem fairly rapidly descends into a spiral argument even if you assume the existence of a long-lived storage medium. (Aside: clay tablets seem to do remarkably well!!) Let's say you write something in French or English. Who's to say in 10K years anyone will understand that language? And if you write a decoder, what do you write it in? There is a project called Rosetta (see the WP page) that tries to tackle this issue, but it's not easy.
Finally, and then I'll shut up: the quality *today* of content or items has no bearing on the importance in the *far future* of the same. Think how much we have learnt through archaeologists digging through middens (i.e. sh**-heaps) and turning up bits of junk that tell us so much. Who's to say that a future digital archaeologist won't unearth "Charlie Bit Me" and explain how late 20th century families worked?
You don't "read" EBCDIC, per se. Nor ASCII, for that matter. Rather, you read the words that the output device translates for you.
With that said, I can still read text/code on cards and tape, and sometimes "think" in octal and hex. A partially sighted friend of mine can read punched paper with her fingers, similar to braille. She's one of the best "big iron" debuggers I've ever known ...
As a side-note, when my daughter was learning to count (age 4ish), I taught her to count to 15 on four fingers. She added the thumb, and then the other hand, on her own. In high school, she "invented" three extra digits on each extremity, for full 32-bit compatibility ... with her right eye as a carry-bit ;-)
She's a programmer today ... and Sr. Member of the Technical Staff for a Fortune 250.
Teach your kids alternatives to decimal numbers early and often ...
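For anyone who wants to try the four-finger trick at home: each finger is one bit, so four fingers cover 0 through 15. A toy sketch (the function name `fingers` is just for illustration):

```python
# Binary counting on fingers: each raised finger is one bit.
def fingers(n, width=4):
    """Return n as a string of raised (1) / lowered (0) fingers,
    most significant finger first. Four fingers reach 0..15."""
    assert 0 <= n < 2 ** width
    return format(n, f"0{width}b")

for n in (5, 15):
    print(n, "=", fingers(n))
# 5 = 0101, 15 = 1111

# Ten fingers already count to 1023; add toes and you pass a million.
print(fingers(1023, width=10))  # 1111111111
```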
But the format must be as trivial as possible, which is why XML isn't a particularly good solution. It's still somewhat better than binary blobs, but if what you have is just a table, storing it in XML is a bad fit.
As for different character encodings, that's usually not a problem in long term storage. Just dump it out to microfiche as text and OCR it with the next system. That's what banks are currently doing.
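The "just a table" point is easy to demonstrate. A hypothetical two-row table stored both ways, using only Python's standard library; the data is identical, but most of the XML bytes are markup rather than content:

```python
# The same two-row table stored as CSV text and as XML.
import csv, io
import xml.etree.ElementTree as ET

rows = [("alice", "1970-01-01"), ("bob", "1980-06-15")]

# Plain text: one line per record, trivially readable in any era.
buf = io.StringIO()
csv.writer(buf).writerows(rows)
csv_text = buf.getvalue()

# XML: the same data wrapped in element markup.
root = ET.Element("table")
for name, born in rows:
    rec = ET.SubElement(root, "row")
    ET.SubElement(rec, "name").text = name
    ET.SubElement(rec, "born").text = born
xml_text = ET.tostring(root, encoding="unicode")

print(len(csv_text), len(xml_text))  # the XML version is several times larger
```

Both survive as text, so both beat a binary blob; the CSV just carries a far better signal-to-markup ratio for tabular data.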
>clay tablets seem to do remarkably well
Indeed. I've just been reading a piece on the deciphering of Linear B, from clay tablets. Mind you, these tablets had been baked in a major fire. Unfired tablets would have crumbled to dust, long ago.
But the point is that these were little better than laundry lists - eminently disposable - yet the ability to read them now gives us insight into life in Knossos that we wouldn't otherwise have had.
"the quality *today* of content or items has no bearing on the importance in the *far future* of the same"
On a personal level: forty years ago I shot a throwaway photo out the kitchen window of the family home, just to make sure the bulk-loaded film I'd put in the camera was clean and unfogged. Decades later that photo became valuable. It was what you saw standing at the kitchen sink, but who would think to take a picture of that? Time changes values.
Office file formats, no matter what office suite or version, were never meant to be archival formats. They were more like saved games, little "memory dumps" allowing you to continue where you left off, no more, no less. In fact, some early systems simply dumped memory onto diskette (e.g. the Canon Cat). That's why such formats have non-portable features like OLE objects, which are nearly impossible to open on another computer. If such a file ever moves from one computer to another, you are screwed.
If you want something you can still read in a few years, or send to someone else, you must use archival formats. Those formats must be as trivially simple as possible. Possible candidates for archiving "printed" documents are TIFF (a bitmap format that supports multiple pages) and archival-grade PDF (PDF/A, a restricted profile without all of those risky extra features). Be sure to include a dump of the text in a separate text file so it's trivial to search. You don't need to change things in your archive; if you want a newer version, re-create it again.
Never ever ever store data in file formats you cannot read yourself. Complex (binary) file formats are acceptable only as long as they don't have to be backed up. That's why SQL servers tend to store their dumps as simple text files.
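The SQL-dump point is easy to see with Python's built-in sqlite3 module; the `iterdump()` method serialises the whole database as ordinary SQL statements, much as `pg_dump` or `mysqldump` do for the big servers:

```python
# Dumping a database as plain SQL text, the way server backups usually work.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE notes (id INTEGER PRIMARY KEY, body TEXT)")
con.execute("INSERT INTO notes (body) VALUES ('remember the milk')")
con.commit()

# iterdump() yields the whole database as executable SQL statements.
dump = "\n".join(con.iterdump())
print(dump)
```

The resulting dump is readable, diffable, and restorable on any future system that understands SQL, with no knowledge of the binary file format required.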
I'm quite surprised that nobody mentions HTML? I'm pretty sure I can open every one of these files/pages since the creation of the web. OK the formatting might not be that pretty, but the content will be there and will be structured in some manner that makes sense (P, H1-H5, etc.).
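Part of why old HTML survives is that parsers were always required to be forgiving. A small sketch with Python's built-in permissive parser, fed a deliberately sloppy 1990s-style page (uppercase tags, no closing tags); the structural outline still comes through:

```python
# Parsing 1990s-style tag soup with Python's permissive built-in parser.
from html.parser import HTMLParser

old_page = "<HTML><BODY><H1>My Homepage<P>Welcome to my page!<P>Updated 1996"

class Outline(HTMLParser):
    """Collect the structural tags (headings, paragraphs) as they appear."""
    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        # The parser normalises tag names to lower case.
        self.tags.append(tag)

p = Outline()
p.feed(old_page)
print(p.tags)  # ['html', 'body', 'h1', 'p', 'p']
```

Even with no DOCTYPE and nothing closed, the H1/P structure - and all of the text - is recoverable.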
Did I read that right? "One day?" Tried opening any of these lately?
Apple Pie Editor and Formatter (Hayden Book Company)
Apple Writer III (Apple Computer, Inc.)
Comprehensive Electronic Office (Data General Corporation)
DisplayWrite 2 (IBM Corporation)
Easywriter Professional & II (Information Unlimited Software)
Executive Secretary (Sofsys)
FinalWord (Mark of the Unicorn)
Lazywriter (ABC Sales)
Leading Edge (Leading Edge Products, Inc.)
Microsoft Word (Microsoft Corporation)
MultiMate (MultiMate International)
Omniword (Northern Telecom)
Palantir Tier I & Tier 2 (Designer Software)
Para Text (Para Research)
Peachtext, formerly Magic Wand (Peachtree Software)
Perfect Writer (Perfect Software)
Samna Word II & III (Samna Corporation)
SCRIPSIT 2.0 (Radio Shack)
Select Word-Processing (Select Information Systems)
Text Wizard (Datasoft)
VisiWord Plus (VisiCorp)
Volkswriter (Lifetime Software, Inc.)
Word-11 (Data Processing Design, Inc.)
WordPerfect (Satellite Software Intl.)
WordStar (MicroPro International)
WordVision (Bruce & James Program Pubs.)
I can open and read all of those.
Formatting might suffer, depending on the version, but the gist of the subject matter will be immediately obvious. Not all of us throw away all hardware older than two years and all code older than nine months. Or maybe I'm just a packrat.
 There is no such thing as "software", so-called "software" is merely the current state of the hardware.
Biting the hand that feeds IT © 1998–2020