Wednesday, December 24, 2008

On the preservation of knowledge

The oldest multiple-page book is supposedly a golden Etruscan codex, dated to about 2500 years ago. The Etruscan language is largely lost, though there is hope of fuller translation pending further discoveries. The Dead Sea Scrolls were written over 2000 years ago, mostly in Hebrew, on parchment and papyrus. Inscriptions on stone walls and clay tablets date back much further, to the Sumerian and Akkadian cuneiform script of about 5000 years ago; the script has been deciphered, but most of the surviving tablets remain untranslated. What about today's knowledge? How well can it be preserved? There are three main problems with preservation: survival of the medium, retrieval of the data, and interpretation of the data.

How long will today's knowledge survive? Inscription on stone is still the best for longevity, but the storage capacity of rocks is quite low (1 KB per pound?). Magnetic media like hard drives have a life expectancy of 20 years or so. Pressed optical media like DVDs are supposed to last over 200 years, and much longer if preserved in the right environment. So far, optical media seem like the most reliable mass storage option - no moving parts or volatile chemicals. There is a vault in arctic Norway that's designed to store plant seeds and preserve biodiversity for thousands of years (it's supported in part by the Bill & Melinda Gates Foundation!). We could do the same for data. Imagine future archaeologists uncovering this vault of treasures!

Some people have pointed out that the main problem with digitized data will be finding equipment that can read it (computer drives, compatible file formats). I remember an episode of Cowboy Bebop where Spike risked his life to find a working video tape player in the forgotten bowels of a museum. I believe the problem of reading today's media is less significant than commonly believed. Let's suppose that 2000 years in the future, people wish to access information from a Blu-ray disc that somehow survived. It's not unreasonable to assume that they will have the technology to scan the binary data from the disc, even without the relic Blu-ray player. All that's needed is a laser of the right wavelength and sufficiently precise mechanical control. Look at the example of ROM images today. You don't need a Nintendo 64 or PlayStation to play old console games. All the information on the physical cartridge or disc can be dumped to a file and played on a simulated platform - an emulator. We can assume that future technology will be more than adequate to recover information from any surviving digital medium.
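
To make the emulator idea concrete, here is a minimal sketch in Python. The four-instruction machine below is entirely made up for illustration (it is not the Nintendo 64, the PlayStation, or any real console); the point is only that a dumped "ROM" is a sequence of bytes, and an emulator is a program that knows what each byte means.

```python
# Toy emulator for a hypothetical stack machine (invented for this example).
# Opcodes (assumed, not from any real platform):
#   0x00        halt
#   0x01 <n>    push the byte n onto the stack
#   0x02        pop two values, push their sum
#   0x03        pop one value and append it to the output

def run(rom: bytes) -> list[int]:
    stack: list[int] = []
    output: list[int] = []
    pc = 0  # program counter: index of the next byte to interpret
    while pc < len(rom):
        op = rom[pc]
        if op == 0x00:
            break
        elif op == 0x01:
            stack.append(rom[pc + 1])
            pc += 2
        elif op == 0x02:
            b, a = stack.pop(), stack.pop()
            stack.append(a + b)
            pc += 1
        elif op == 0x03:
            output.append(stack.pop())
            pc += 1
        else:
            raise ValueError(f"unknown opcode {op:#04x} at offset {pc}")
    return output

# A "cartridge dump": push 2, push 3, add, output, halt.
rom_image = bytes([0x01, 2, 0x01, 3, 0x02, 0x03, 0x00])
print(run(rom_image))
```

Without the opcode table in the comments, the same seven bytes would be meaningless, which is why the documentation matters as much as the data.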

What about interpreting the data? How can future historians build emulators of today's technology? Ultimately, it reduces to the problem of translation - one that historians and linguists are familiar with. Let's assume that people 2000 years in the future still use some form of English (as modern Greek is related to ancient Greek). The standard ASCII binary encoding for plain text amounts to a simple substitution code, which is fairly easy to crack with basic cryptanalysis such as counting symbol frequencies. If we want to be completely obvious, the ASCII table is even short enough to engrave into stone (use synthetic diamond and it will last for hundreds of millennia). From there, we need to build and preserve a collection of plain English text documentation for every file format and programming method. This will be our Rosetta stone for the future. Start with the common compression formats and codecs: zip, rar, jpeg, mpeg, pdf. This documentation allows future archivists to build a bridge toward data retrieval and translation. Currently, ISO and IEC publish technical specifications for several of these formats (JPEG, MPEG, and PDF, for example), though not for all of them: RAR is proprietary. The same method of preservation can be applied to computing platforms. If we archive the source code of an emulator, people in the year 4000 AD can rediscover the joys of Tetris.
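
As a rough illustration of how little is needed to read ASCII, here is a short Python sketch. The function names are my own, and the sample text is just an example; it assumes the recovered bits are already grouped into 8-bit bytes. The second function hints at how someone without the table might start cracking it: in ordinary English text, the most frequent byte is usually the space character.

```python
from collections import Counter

def decode_ascii_bits(bits: str) -> str:
    """Turn a string of 0s and 1s (whitespace ignored) into ASCII text."""
    bits = "".join(bits.split())
    if len(bits) % 8 != 0:
        raise ValueError("bit string length must be a multiple of 8")
    # Each group of 8 bits is one byte; each byte is one character.
    raw = bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))
    return raw.decode("ascii")

def most_common_byte(data: bytes) -> int:
    """Return the byte value that occurs most often in the data."""
    return Counter(data).most_common(1)[0][0]

recovered = "01010100 01100101 01110100 01110010 01101001 01110011"
print(decode_ascii_bits(recovered))

sample = b"to be or not to be"
print(most_common_byte(sample))  # 32 is the ASCII code for a space
```

Once the space is identified, word boundaries appear, and the rest of the table can be worked out the way any simple substitution code is.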

To be continued...
