The Promise and Challenge of Digital Data
Standing on the Shoulders of Giants
Article by Dan Petrovic, http://dejanseo.com.au/
A great evening with the Friends of the Library brings a fascinating discussion of the promises and challenges of digital data. Four industry experts shared their insight and knowledge on information preservation, digitisation, retrieval, storage and continuity:
- Anna Raunik, State Library of Queensland
- Adrian Cunningham, Queensland State Archives
- Michael Blumenstein, Griffith University
- Linda O’Brien, Griffith University
Written materials such as books, articles and other documents reflect not only knowledge but also our cultural history, and digitisation of this content offers new possibilities and greater reach. Information technology evolves rapidly and often without proper standardisation, which leaves much digital information inaccessible to future storage and retrieval platforms. Benign neglect of paper archives has never been a major issue, but in the digital age it could mean a bottleneck to data retrieval and a potential barrier to information preservation.
By 2020, annual data generation is projected to reach 35 zettabytes. That's enough data to fill a stack of DVDs reaching halfway to Mars. Adrian Cunningham brings to our attention the fact that the amount of data being generated exceeds not only our expectations but also our imagination. The rapidly increasing production of new data poses a challenge for archives, which have to worry not only about preservation but also about destruction of data, such as the disposal of selected public records. What to keep and what to discard remains a burning question among those involved in the archival process. One of the continuing issues with digital data is technology change and obsolescence.
Microsoft is an example of a company that uses obsolescence to secure continuous revenue: to keep up with the latest and fastest personal computers, users are forced to buy new versions of its software. Cross-compatibility and backward compatibility are part of the solution; however, it is hardware that can act as a true barrier once it is outdated and phased out. The rapid change of technology and storage media opens the door to a discipline of 'Digital Archaeology', concerned with the study and analysis of obsolete digital platforms, hardware, software and media.
It's interesting to find that paper records typically survive longer than digital records, partly due to accessibility and platform issues and partly because there is no economical way to reliably store digital information on a permanent medium. What is more important: bits, systems or context? It seems that context is of high importance and a priority item to preserve; without context, much of the available data would not be usable. In terms of actual preservation strategies there are many methods (emulation, migration, normalisation), and choosing among them is as challenging as deciding the point at which conversion should take place. Queensland State Archives is already involved in a long-term project with a mission to build and deploy an industrial-scale archive management system.
Author: Queensland State Archives
Date: April 2011
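The evening's speakers described these strategies at a conceptual level only. As a minimal sketch of what one migration/normalisation step might look like in practice, the Python fragment below re-encodes a record from a legacy encoding to UTF-8 while recording a checksum of the original bytes so the migrated copy can be traced back to its source. The encoding choice and metadata fields are illustrative assumptions, not part of any archive's actual system:

```python
# Illustrative normalisation step: convert a legacy-encoded record to
# UTF-8 and keep provenance metadata (checksum of the original bytes).
# The "latin-1" source encoding and the metadata field names are
# hypothetical choices for this sketch.
import hashlib

def normalise(record_bytes, source_encoding="latin-1"):
    """Re-encode legacy bytes as UTF-8 and attach provenance metadata."""
    text = record_bytes.decode(source_encoding)
    return {
        "content": text.encode("utf-8"),
        "original_sha256": hashlib.sha256(record_bytes).hexdigest(),
        "migrated_from": source_encoding,
    }

legacy = "Caf\xe9 records, 1911".encode("latin-1")
result = normalise(legacy)
print(result["content"].decode("utf-8"))  # -> Café records, 1911
```

Keeping the original checksum alongside the migrated content is one simple way to preserve the "context" the panel emphasised: a future reader can verify what the converted record was derived from.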
During his presentation Michael Blumenstein introduced the problem of automated document analysis. What at first sounded like basic optical character recognition (OCR) turned into an interesting presentation on scanning, digitising and understanding large quantities of handwritten documents. The advantages of digitising old documents are numerous (wider reach, searchability, indexability, sorting, preservation, retrieval, portability), but so are the difficulties in recognising non-character handwritten scripts. Artificial intelligence (AI) and machine learning (likely neural networks) have to be employed in order to work with largely unstructured writing styles with various patterns and deviations. The problem of data manipulation is further complicated by the fact that digitised material contains no metadata, so software must rely on content alone to determine relevance.
Layout Analysis of Handwritten Historical Documents for Searching the Archive of the Cabinet of the Dutch Queen
Authors: Marius Bulacu, Rutger van Koert, Lambert Schomaker, Tijn van der Zant
Organisation: Artificial Intelligence Institute, University of Groningen, The Netherlands
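The presentation did not describe a specific algorithm, and production systems rely on trained neural networks, but the core idea of comparing an unknown glyph against known patterns can be illustrated with a toy nearest-neighbour sketch. The 5x5 bitmaps and labels below are invented for illustration:

```python
# Toy template-matching "recognition": classify an unknown glyph by
# finding the stored template with the fewest differing pixels.
# Real handwriting recognisers use trained neural networks; this only
# illustrates the pattern-comparison idea.

TEMPLATES = {
    "A": [
        "..#..",
        ".#.#.",
        "#####",
        "#...#",
        "#...#",
    ],
    "L": [
        "#....",
        "#....",
        "#....",
        "#....",
        "#####",
    ],
}

def hamming(a, b):
    """Count differing pixels between two glyph bitmaps."""
    return sum(pa != pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))

def classify(glyph):
    """Return the template label with the smallest pixel distance."""
    return min(TEMPLATES, key=lambda label: hamming(glyph, TEMPLATES[label]))

# A slightly distorted 'A' (one pixel flipped), since handwriting
# never matches a template exactly.
sample = [
    "..#..",
    ".#.#.",
    "####.",
    "#...#",
    "#...#",
]

print(classify(sample))  # -> A
```

The deviations and stylistic variation mentioned in the talk are exactly why this naive approach breaks down at scale, and why learning-based methods are needed instead.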
Another fascinating concept related to handwriting analysis is the automated identification of authors. The process typically starts with a control sample of the writer's genuine handwriting, which is then compared with various candidate versions and tested for authenticity through various points of correlation (the example used was Herman Melville). Critical to the quality of the analysis are image properties such as resolution, colour and size. The technology involved in handwriting recognition also has practical applications in the creation of accurate digital datasets, signature recognition and authenticity analysis.
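As a rough sketch of that comparison step, suppose each handwriting sample is reduced to a vector of measured features (slant, stroke width, spacing and so on); the candidate whose features correlate most strongly with the control sample is the likely author. The feature values and the Pearson-correlation approach below are illustrative assumptions, not the method described in the talk:

```python
# Hedged sketch of writer identification by feature correlation.
# The feature vectors (slant, stroke width, spacing, loop size, ...)
# are invented numbers standing in for measurements a real system
# would extract from scanned handwriting images.
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two feature vectors."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

control = [0.42, 1.8, 3.1, 0.9, 2.4]  # genuine sample of the writer

candidates = {
    "candidate_1": [0.40, 1.9, 3.0, 1.0, 2.3],  # close match
    "candidate_2": [1.10, 0.6, 1.2, 2.8, 0.4],  # very different hand
}

def best_match(control, candidates):
    """Return the candidate whose features correlate most with the control."""
    return max(candidates, key=lambda name: pearson(control, candidates[name]))

print(best_match(control, candidates))  # -> candidate_1
```

This also hints at why image resolution, colour and size matter so much: noisy or low-resolution scans corrupt the measured features before any correlation is computed.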
Linda O'Brien delivered the final presentation of the evening, leaving the audience with much to think about. We live in a world where more information is being produced than we can manage and more books are being published than ever before. Information comes not only from authors but also from machines (medical and scientific instruments, with data production rates of up to 14 Tb/sec), while search engines such as Google are capable of digitising thousands of book pages (and even scrolls) per hour (Google Books).
Note: It was very interesting to learn that only a fraction of all the documents available are actually in English.
Digitisation of materials brings broader reach and better accessibility to a wider range of audiences. New generations rely on information being accessible in digital format at all times and can hardly imagine the information retrieval process working any other way. Older generations, however, still find it fascinating that information can be found so quickly. (Linda illustrates this with a story about her aunt, who was impressed by the ease and speed with which they found an old World War I song online using only a smartphone.)
Where does it all end? New ways of handling data are emerging all the time, from community-edited encyclopaedias to poets who write scripts that harvest data and create new hybrid works of art. It's certainly great to hear that research benefits from these advancements: from the reuse of Hubble telescope data, to discovering patterns in the spread of diseases through digitised literature analysis, to river flow data analysis, to preserving records of nuclear waste burial sites, where the waste may outlive the data unless continuity strategies keep it alive. The list goes on and the applications are countless, and even though we face challenges today, the future of digital information looks bright. We should all be excited to be part of the digital age, where nearly anything is possible.
The presentations were followed by audience questions directed at various areas, including genealogy, cross-format data transfer and the role of individuals in that process, archiving of short-message formats, and the implications of using cloud computing for the storage and retrieval of digital data.
The evening of thought-provoking concepts and new insights concluded with a social segment, drinks, food and live jazz. We’ll be eagerly awaiting another great event with the Friends of the Library.