Long term storage of things digital

“Digitalisation” is a buzzword in the humanities, closely connected to the Digital Humanities, to open access, reproducibility, sharing heritage and so forth. I believe it is a very important step towards a better and more open science. There is one point, though, that is important to me: I’ve worked with books that are 115 years old. How can we make sure people will be able to read our digital output in another 115 years?

Digital catalogues

Not just the database project Arachne (which has been working closely with the Research Archive for Ancient Sculpture in Cologne), the Archaeological Data Service and IANUS, but also projects like The Digital Dead Sea Scrolls or museums like the British Museum with its online collection are aspects of the same phenomenon: more and more, we are documenting research data or museum collections digitally and putting them online, sometimes more accessibly, sometimes less. This is great! I am very much in favour of this development, especially if research data is shared in such a way that other researchers are spared the error-prone task of re-digitising printed tables (or even worse: their descriptive catalogues *shudder*). Some publications now take advantage of online services by referring in the printed work to their larger online catalogues (e.g. Danner 2017). This is much more effective, reduces printing costs and may save a few trees. The amazing possibility of searching comprehensively for data (with one search spanning several data repositories at the same time), as ARIADNE proposes, will surely transform how we do research. And by digitising archives from different research institutions, information that used to belong together might be found in one place again, as the Cuneiform Digital Library Initiative does with ancient Mesopotamian archives.

There is just one thing we need to consider: How can we make sure people will be able to read our digital output in 115 years? This is especially important in light of the all-digital documentation of archaeological digs, which some people favour (and I don’t), because otherwise information may be lost forever. Still, this is not just important for data, but for publications as well: we are continuously moving away from printing things, almost all journals offer some sort of PDF download (either open access or behind a paywall) and some journals are even completely print-free (such as Internet Archaeology).

What kinds of data storage have worked so far?

The oldest surviving intentional transfer of knowledge from one person to another without spoken language may be palaeolithic art (as argued by Genevieve von Petzinger). We can argue that information may have been transmitted using these paintings, but we are not able to read and understand them correctly. The same holds for most prehistoric art. Some of it may have been symbolic in nature and meant for a knowledge transfer, some may have had completely different purposes. Without a “translation” we can only speculate.

In the fourth millennium BC, so-called “tokens” in different forms, meant for counting different wares, as well as pictograms were developed in the Middle East. The first true writing systems that we can decipher and which survived to our times were written down around the turn to the 3rd millennium BC in the Middle East and in Egypt. These surviving texts have mostly been inscribed in stone or on clay tablets – hard and very durable materials, but also very heavy and difficult to store. Only very few of the texts surviving from these times can still be read today – and this is the result of decades of research and lucky finds like the Rosetta Stone or the inscription of Behistun, where the same text has been written in several scripts and languages. Other languages and writing systems are lost forever, e.g. if they’ve only been recorded on short-lived materials (we do not even know how many of these there are) or because there is no way for us to translate them (such as the script of the Indus Valley Culture or Linear A).

In conclusion, two things can happen to a piece of information: material failure of the data carrier (the medium) and failure of our ability to translate it. Either the data carrier does not survive (or the data on it is no longer readable), or we can still “see” the data but are no longer able to understand it.

So, first things first: How long do different media survive?

Stone: under good preservation conditions: several tens of thousands of years (so far)
Burnt clay: 5300 years (so far)
Papyrus / paper: 4600 years (oldest so far)
Magnetic tape: under good conditions several decades (advertised as up to 100 years, but that’s theory so far)
Microform / microfiche / microfilm: 20-500 years, depending on the material used and storage facilities
CD / DVD / Blu-ray: … this depends on so much… 2.5-30 years.
HDD: 6-11 years if used
USB: depends on a lot, but 3-5 years are usually a safe bet
SSD: depends on writing cycles and capacity, but not suitable for archiving (needs to be kept in use)
… this list is by no means complete, but have a look at Wikipedia for a larger one or at this exhibition (German only, sorry)

You may notice that the modern data carriers do not really keep up with the older ones. There are quite a number of pages discussing this topic online (e.g. on reddit, c’t, Wikipedia, as well as in scientific papers). So we are quite aware of the problem of storing data safely for the long term, and servers are set up in such a way that they regularly rewrite data to new drives to make sure it is not lost. But as long as we keep documents on our own hard drives / CDs / USB sticks, we need to remember this as well. Who hasn’t got an old CD they cannot read anymore? I for sure have several… whereas my father has decades-old floppy disks that still work.

Maintaining translation ability

On to the second problem then, which is also being discussed by data scientists and archaeologists: How do we guarantee that people in the future will be able to open and modify a digital file we create today? I already talked about “software archaeology” and what software maintenance means for the industry. The same is very important for digital data collections: as software is discontinued, we need to be able to change the data formats in which we save our data. I once worked with WinBASP (Windows Bonner Archaeological Software Package), which has not been supported for years. To be able to use this data set for other things, I either need to transfer it to BaspPast, so I can open it in PAST to continue working on it or save it as an Excel file… or I save it directly in a *.csv-format, which almost every software can read. This kind of problem will continue to pop up, because the development of new programs and new data formats will not slow down. Why should it?
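To make the *.csv route a bit more concrete, here is a minimal sketch in Python using only the standard library. It assumes the data has already been read into a list of records; the field names and values are invented for illustration and have nothing to do with the actual WinBASP format.

```python
import csv
import io


def write_csv(records, fieldnames, handle):
    """Write a list of dicts as CSV, with a header row naming each column."""
    writer = csv.DictWriter(handle, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(records)


# Hypothetical find records, invented for this example only.
finds = [
    {"feature": "pit 12", "type": "bowl", "count": 3},
    {"feature": "pit 14", "type": "beaker", "count": 1},
]

# An in-memory buffer stands in for a real file here.
buffer = io.StringIO(newline="")
write_csv(finds, ["feature", "type", "count"], buffer)
csv_text = buffer.getvalue()
```

For a real file, the same function can be called with `open("finds.csv", "w", newline="", encoding="utf-8")` as the handle. Writing the header row and stating the encoding explicitly are small things, but they are exactly the kind of information a future reader of the file would otherwise have to guess.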

So, if we save the documentation of an archaeological site only digitally or if we have papers only online and not printed anymore, how can we make sure all this is not lost to future generations?

Long term storage: Some solutions

Standards

One very important point here is standards: if people agree to use a manageable number of different data types, I believe chances are better that future people will have developed mechanisms to transfer those standard data types to their new ones. The Archaeological Data Service: ARCHES (English) and IANUS (German) have guidelines on which data formats to use. They usually argue for open data formats which are well known to both commercial and non-commercial software. Using only proprietary file formats will be problematic the moment the company behind them does not exist anymore. We simply cannot expect every piece of software to be maintained forever, and if a company takes the information on how to read a format “to its grave”, people may have a problem. Open file formats, in contrast, can be read by anyone, as there is no black box which might make it impossible to open them.

Access

A further point will be access: So far, paper archives are easily usable by anyone who can go there (and maybe read an old Sütterlin script or similar), but how do you access a digital data archive when APIs and website structures may change over time? The development of solutions to new technical “revolutions” will be an ongoing process. A solution which might be a step in-between, but which, I fear, will not be continued for much longer, is one I’ve seen in use at the Heritage Management of Saxony-Anhalt: They print out their reports and databases and store them in paper format as well as digitally. As long as this is done continuously in both formats, this can be very sensible. Sadly, while working in their archives I already noticed a few discrepancies between their digital geographic information system and the hand-drawn map belonging to each “Ortsakte” (a file recording all information on archaeological finds in a certain area). This was quite unfortunate, because in the beginning I only worked with the analogue material. I’m very thankful I got the *.shp-file later, too.

Nonetheless: Storing the same information in several different places might be a good idea, as long as there is a very well maintained cataloguing system, because the best data doesn’t help anyone if you cannot find it (for digital data this is also relevant on the bit-level).
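On the bit-level, one common way to notice that a copy has silently changed is to record a checksum for every file and compare it again later. The following is only a sketch of that idea in Python, using SHA-256 from the standard library; the function names are my own, and professional repositories will have their own, more thorough procedures.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path, chunk_size=65536):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(folder):
    """Map each file's path (relative to folder) to its checksum."""
    folder = Path(folder)
    return {
        p.relative_to(folder).as_posix(): sha256_of(p)
        for p in sorted(folder.rglob("*"))
        if p.is_file()
    }


def changed_files(folder, manifest):
    """List files that are missing or whose content no longer matches."""
    current = build_manifest(folder)
    return sorted(
        name for name, digest in manifest.items() if current.get(name) != digest
    )


def demo():
    """Create a throwaway file, record it, alter it, and check again."""
    with tempfile.TemporaryDirectory() as tmp:
        sample = Path(tmp) / "finds.csv"
        sample.write_text("feature,count\npit 12,3\n", encoding="utf-8")
        manifest = build_manifest(tmp)
        before = changed_files(tmp, manifest)
        sample.write_text("feature,count\npit 12,4\n", encoding="utf-8")
        after = changed_files(tmp, manifest)
        return before, after
```

Stored next to the data (and ideally with every copy), such a manifest also doubles as a simple catalogue of what should be there. Note that it only detects damage; repairing it still requires a second, intact copy.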

Efficiency

As data storage is expensive (think of servers costing money and of software developers maintaining the servers and the data integrity), the important things should be archived in a way that is efficient. I can think of *.csv-files, which are much smaller than *.xlsx-files AND human as well as machine readable, but surely there are other examples as well.

What can you do?

So, what can we as “normal researchers” do to make sure our data and texts don’t get lost in the next 115 years? I think I can break it down to these five points:

use open data standards
use a consistent file structure and document it
give your data / papers to a trustworthy hosting service (such as ARCHES, Arachne, Zenodo or your university’s repository, NOT just GitHub and “academia.edu”) and
make sure you get a DOI or URN or whatever persistent identifier you prefer!
hope for the best
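For the first two points, a small script can already help to spot files that deserve a second look before handing a project over to a repository. This is a rough Python sketch under loud assumptions: the list of “preferred” extensions below is purely illustrative and not taken from any guideline, so it should be replaced with whatever your archive recommends, and the file names are made up.

```python
from pathlib import PurePosixPath

# Assumption: an illustrative, deliberately short list, NOT an official one.
# Replace it with the formats your repository's guidelines recommend.
PREFERRED_EXTENSIONS = {".csv", ".txt", ".json", ".xml", ".png"}


def review_formats(filenames, preferred=PREFERRED_EXTENSIONS):
    """Split file names into preferred formats and ones needing a closer look."""
    keep, check = [], []
    for name in filenames:
        suffix = PurePosixPath(name).suffix.lower()
        (keep if suffix in preferred else check).append(name)
    return keep, check


# Hypothetical project files, invented for this example.
keep, check = review_formats(
    ["site/finds.csv", "site/plan.SHP", "report.docx", "README.txt"]
)
```

Files that end up in `check` are not necessarily a problem; the point is only to make the decision about each format a conscious one and to write it down in the project’s documentation.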

And if you as a “normal human being” want to make a contribution to ensure humanity’s diversity is not lost with digitalisation, join the Memory of Mankind and let them burn your memories onto a ceramic tablet, to be stored deep in a salt mine for future generations to find, when we have replicators and are colonizing Mars, but have forgotten how to open *.docx-files… or after World War III has destroyed our technology. Whichever future you find more probable.

If you’re not quite as pessimistic, have a look at the Internet Archive. This is an amazing project, which stores all things digital and makes them freely accessible to everyone (“Our mission is to provide Universal Access to All Knowledge”). Gives you hope, no?

To conclude: Long term archiving of digital data is a crucial issue, and not just for the future of archaeology. Let’s keep it in mind while being excited about the cool new possibilities offered by digitalisation!



Sophie Schmidt

Founder & Editor

About the Author

My name is Sophie, I am a prehistoric and computational archaeologist and have been a research associate at the Universities of Bonn and Cologne, as well as for the NFDI4Objects project at the German Archaeological Institute. I teach statistics for archaeologists, work on new methods in settlement archaeology (GIS, geostatistics in R and stuff) and am interested in archaeogaming. I have now started my PhD project on the 5th mill. BC in Brandenburg (that's North-East Germany).
