News

Data Seal of Approval Conference Held in Florence

On December 10, 2012, the first Data Seal of Approval (DSA) Conference was held in Florence, Italy, in advance of the Cultural Heritage Online: Trusted Digital Repositories & Trusted Professionals Meeting. Over 40 individuals from around the world attended the DSA conference, which featured an overview of the initiative, a discussion of the larger European Framework for certification, case studies from repositories that have earned the Data Seal, and instructions on complying with the 16 DSA guidelines.

Presentations are now available online.

U-M, Sloan Foundation enhance open access to research data

The Alfred P. Sloan Foundation and the Inter-university Consortium for Political and Social Research at U-M’s Institute for Social Research are joining forces to encourage open access to research data and closer links between publications and the data on which they are based.

A grant from the foundation will allow ICPSR to work with editors of peer-reviewed social science journals, leaders of data repositories and research funding agencies to foster new standards in research transparency, data citation and sustainable funding models for open access to data.

“Professional associations, journals, data repositories and funding agencies must work together to make the entire scientific venture more transparent and to encourage broader access to research data,” said ICPSR Director George Alter. “The first step is to give scientists who produce important research data the recognition they deserve.”

A primary aim of the project is to promote the effective, consistent and standardized citation of research data. In the past, journals have been inconsistent in citing data, often providing incomplete information about how and by whom data were created. Clear standards of citation will ensure that research data are recognized as significant contributions, and that data producers are given proper credit.

Such standards also will enhance the transparency of scientific research. To reproduce the findings reported in journal articles, data need to be available along with all of the information necessary for replication. Some disciplines already have implemented such transparency requirements, but the project will encourage journals to adopt common guidelines and procedures.

In light of more stringent demands for data sharing among public and private research funding agencies, the Sloan/ICPSR project also will encourage collaboration across scientific domains on sustainable funding models for data repositories. For example, the National Science Foundation now requires a data management plan for all funded research projects, including specifics on how data will be stored and shared. Bringing leaders of data archives and representatives from research funding agencies together to discuss ways to support this and similar mandates will benefit the wider research community.

Daniel Goroff, program director at the Sloan Foundation, said the project illustrates how Sloan’s dedication to the advancement of scientific, technological and economic research also entails support for infrastructure that can enhance the scholarly communication and impact of such research.

“Effective data citation and data sharing practices are good for science and good for scientists,” he said. “Everyone wants research results that are reliable, reusable and reproducible. ICPSR’s 50 years of experience curating data like this for the social sciences can serve as an inspiration for all disciplines.”

Aligning National Approaches to Digital Preservation

We are pleased to announce the publication of Aligning National Approaches to Digital Preservation, edited by Nancy Y. McGovern (Volume Editor).

On May 23-25, 2011, more than 125 delegates from more than 20 countries gathered in Tallinn, Estonia, for the “Aligning National Approaches to Digital Preservation” conference. At the National Library of Estonia, this group explored how to create and sustain international collaborations to support the preservation of our collective digital cultural memory. Organized and hosted by the Educopia Institute, the National Library of Estonia, the US Library of Congress, the University of North Texas, and Auburn University, this gathering established a strong foundation for future collaborative efforts in digital preservation.

This publication contains a collection of peer-reviewed essays that were developed by conference panels and attendees in the months following ANADP. Rather than simply chronicling the event, the volume deliberately broadens and deepens its impact by reflecting on the ANADP presentations and conversations and establishing a set of starting points for building a greater alignment across digital preservation initiatives. Above all, it highlights the need for strategic international collaborations to support the preservation of our collective cultural memory.

This guide is written with a broad audience in mind that includes librarians, archivists, scholars, curators, technologists, lawyers, researchers, and administrators at many different types of memory organizations.

Aligning National Approaches to Digital Preservation is the second of a series of volumes edited by Katherine Skinner (Series Editor) and published by the Educopia Institute describing successful collaborative strategies and articulating new models that may help memory organizations work together for their mutual benefit.

Readers may access Aligning National Approaches to Digital Preservation as a freely downloadable PDF and/or as a print publication for purchase. Please visit http://www.educopia.org/publications to download or order the book.

Authors include:
Martha Anderson, Inge Angevaare, Dwayne Buttler, Laura Campbell, Sheila Corrall, George Coulbourne, Joy Davidson, Christian Egger, Michelle Gallinger, David Giaretta, Neil Grindley, Martin Halbert, Jan Hutar, President Toomas Hendrik Ilves, Christopher A. Lee, Maurizio Lunghi, Clifford Lynch, Nancy Y. McGovern, Marek Melichar, Wilma Mossink, Adrienne Muir, Andreas Rauber, Adam Rusbridge, Raivo Ruusalepp, Gunnar Sahlin, Sabine Schrimpf, Matt Schultz, Michael Seadle, Katherine Skinner, Bohdana Stoklasova, Aaron Trehub, Bram van der Werf, and Matthew Woollard

--
Katherine Skinner, PhD
Executive Director, Educopia Institute
katherine.skinner@metaarchive.org
404 783 2534

Free one-day Data Seal of Approval conference: Florence, 10 December 2012

The Data Seal of Approval is proud to announce its first conference.

Theme: Data Seal of Approval conference 2012

Date: December 10, 2012

Location: Historical Complex of Santa Apollonia, Florence, Italy

Conference topics will include:

· Information on the Data Seal of Approval, including how to apply for the DSA

· An overview of the European Framework for Audit and Certification of Digital Repositories

· Case studies

Speakers will include experts from the field of digital preservation.

Attendance at the DSA conference 2012 is free of charge. Please register at Registration DSA conference 2012. Log in as a guest; no username or password is required.

The DSA conference is organized in cooperation with the Cultural Heritage Online conference on 11-13 December, which will include a discussion on trusted repositories within research infrastructures.

The detailed programme is available on the DSA conference 2012 page, where all the latest news can be found. 

Henk Harmsen

Chair of the Data Seal of Approval Board

Are you interested in a question and answer site for digital preservation?

A diverse set of digital preservation professionals, many affiliated with National Digital Stewardship Alliance member organizations, has proposed a new question and answer site on digital preservation.

The goal of this project is to create an open forum for sharing knowledge and best practices for ensuring long-term access to digital information. Please consider committing to support the proposed site. You can do that here: http://area51.stackexchange.com/proposals/39787/digital-preservation

The biggest hurdle is getting people who already have experience with other Stack Exchange sites to commit. Here are three ways you can help with that.

1) If you have 200+ rep on any Stack Exchange site, we really need you; please commit.

2) If you don’t have experience with Stack Exchange sites, consider answering, asking and commenting on any one of the 80-some Stack Exchange sites that relate to your other interests. It won’t take long to get 200 rep and you will learn about the system.

3) Please send a link to the proposal out to others in your organization or email lists that you are on. In particular, please share this with groups of folks at your org likely to have participated in Stack Exchange sites, like software developers, system administrators, and folks in the sciences who you think might be interested.

Language and literature services

DANS supports language and literature researchers in various ways. Some of them are general, such as the online archiving system EASY for long-term preservation of research data and the promotion of the Data Seal of Approval (DSA) for digital repositories. Others are domain-specific:

CLARIN Centre

DANS is a member of the European research infrastructure CLARIN (Common LAnguage Resources and Technology Infrastructure). DANS is represented in the Executive Board and participates in several projects:
  • The Infrastructure Implementation Project aims at creating a Dutch Service Provider Organisation, consisting of four CLARIN Centres of which DANS is one. Long-term data preservation, promotion of the Data Seal of Approval, as well as the further development of Persistent Identifiers are some of DANS’s expertise areas.
  • Search & Develop designs and implements federated search functionality to support finding and retrieving resources across institutional boundaries.
  • In Typological Database System Curator a start was made with defining a policy for long-term software preservation. In the actual project this means curation of software that provides access to the typological data.
  • War in Parliament develops an online demonstrator that retrieves references to the Second World War from transcribed parliamentary debates in The Netherlands.


Recommendations from the DANS inventory of language and text databases

DANS has made an inventory of language and text databases in The Netherlands, in collaboration with CLARIN-NL. An important finding is that so far (June 2011) there are just about 90 Dutch resources in the registry, which implies that the registry is incomplete. Moreover, individual researchers take few measures for digital preservation. The inventory leads to the following recommendations for DANS:

  • Archive as many Dutch language and text datasets as possible that are not yet stored in a DSA-certified trusted digital repository.
  • Urgently carry out inventory projects that lead to retro-archiving, preferably in cooperation with the new CLARIN-NL curation service.
  • Encourage all leading literature and/or linguistics institutes to become DSA-certified trusted digital repositories by 2016 at the latest.
  • Take a closer look at the desirability and the nature of archiving software tools for literature studies and linguistics.
  • Promote DANS’s expertise and services in these disciplines, e.g.: long-term preservation, the Data Seal of Approval, specific projects for archiving and disseminating language and text data.

It should be noted in this context that it is not the mission of DANS to carry out the still much-needed digitisation of resources. The inventory is published as DANS Studies in Digital Archiving 7 (in Dutch).

Oral history data
In oral history projects such as Witness Reports and Veteran Tapes large numbers of interviews are being made available by DANS. These audiovisual data can be re-used in various disciplines such as discourse analysis and speech analysis. Oral history projects have also resulted in an online annotation tool.

Major ongoing projects
Text mining and information visualisation are valuable instruments for researchers. CKCC Geleerdenbrieven/ Circulation of Knowledge and Learned Practices in the 17th-century Dutch Republic - A Web-based Humanities’ Collaboratory on Correspondences applies them to large corpora of letters exchanged by 17th-century scholars. DANS is involved in archiving letters, annotations, and analyses in a meaningful context.

Alfalab is an initiative of the Royal Netherlands Academy of Arts and Sciences (KNAW), where four scientific institutes cooperate to promote the use of digital methods within humanities research. DANS designs and implements a portal to e.g. the Tekstlab for textual sources.

Small data projects
Small data projects focus on curating (old) data sets or exploring new ways to make data accessible. DANS awards grants for small data projects every year. Some examples:

  • a website containing a query interface to the WIVU database. This linguistically annotated and segmented database of the Old Testament holds the complete text of the Hebrew Bible in the original languages of Classical Hebrew and Aramaic.
  • Soundbites from the past: a large collection of audio files of Dutch dialect speakers has been preserved and archived.

Establishment of The Language Archive

1-12-2011 Paul Trilsbeek

One important aspect of long-term preservation of data is the stability of the organization that carries out this task. National libraries, national archives and national sound and film archives for example are typically organizations that receive structural funds to carry out the task of digitizing and preserving their collections for an indefinite period of time. Many research data archives on the other hand are located in institutions and research groups that have much less security about their medium- and long-term future. Research budgets are under stress at times of economic crisis and research groups or even entire institutions are being shut down or are forced to drastically change focus if they don’t manage to convince policy makers of their added value.

It is for this reason that the archive of language data at the Max Planck Institute for Psycholinguistics has tried to attract structural funds from the large research funding organizations in Germany and the Netherlands. Even though the Max Planck Institute for Psycholinguistics is doing very well and has expanded enormously in recent years, having structural funds from a number of different national research organizations gives the archive a much better position for the long term.

The Max Planck Society, the Berlin-Brandenburg Academy of Sciences and the Royal Netherlands Academy of Arts and Sciences are all contributing to what is now called The Language Archive for the coming 5 years, with the clear intention to sustain their funding if the archive continues to be successful in the coming years.

The Language Archive was officially opened on the 11th of October at the Berlin-Brandenburg Academy of Sciences in Berlin:

http://www.mpi.nl/news/official-opening-of-language-archive-in-berlin

Abstracts - Digital Methods and Tools...

Wednesday, 16 November 2011

Peter Doorn (Data Archiving and Networked Services, the Netherlands), Computational history among e-science, digital humanities and research infrastructures: accomplishments and challenges
This presentation will focus on the following subjects: first I will briefly introduce DANS; after that I will place the developments in computational history in the context of the developments in e-Science and the digital humanities. Over the years we see a gradual increase in the scale of projects, partly brought about by computation itself and the specialization it requires. As a result, we see increased attention to digital data and research infrastructures, both at the national and at the European level.
About DANS:
DANS is an institute of the Royal Netherlands Academy of Arts and Sciences (KNAW) and the Dutch Research Funding Organisation (NWO) and was founded in 2005 (www.dans.knaw.nl). It builds on the work of predecessors, the first of which dates back to 1964 (Steinmetz Foundation and Archive for the social sciences). The Netherlands Historical Data Archive (NHDA) was created in 1989, inspired by the needs of historians and the creation of numerous historical databases, which needed to be archived and kept accessible for later use. The central task of DANS is to provide permanent access to digital data in the humanities and social sciences, although we recently started to gradually expand our services to other domains as well.
DANS maintains a digital archive with substantial data collections in history, social sciences, and archaeology. We also carry out data projects in collaboration with research communities and partner organizations. Moreover, we give advice and support; for example, we developed a Data Seal of Approval (see: http://www.datasealofapproval.org/), aiming at quality control of data and repositories, and we maintain a Persistent Identifier Infrastructure based on the URN (see: http://www.persid.org/index.html).
In short, DANS promotes permanent access to digital research data; it encourages scientific researchers to archive and reuse data by means of our online archiving system EASY; we provide access, through www.narcis.nl, to thousands of scientific datasets, e-publications and other research information in the Netherlands; moreover, DANS provides training and advice, and we perform research into archiving of and access to digital information.
History and computing as e-Science:
It makes sense to place the developments of computational history in the past decade in the context of e-science, which was defined back in 2001 as “Science increasingly done through distributed global collaborations enabled by the Internet, using very large data collections, tera-scale computing resources and high performance visualisation.” (UK Department of Trade and Industry; Research Council e-Science Core Programme). Jim Gray, Tony Hey and others spoke of a “fourth paradigm” in science, characterized by a high data intensiveness. Increasingly, scientific breakthroughs will be powered by advanced computing capabilities that help researchers manipulate and explore massive datasets. The speed at which any given scientific discipline advances will depend on how well its researchers collaborate with one another, and with technologists, in areas of e-Science such as databases, workflow management, visualization, and cloud-computing technologies.
Although the scale of humanities research, including the work of historians, is much smaller than that in astronomy or particle physics, most specialists agree that the tendencies and needs of e-science and e-humanities are basically similar. Humanities computing was defined by Willard McCarty in 1999 as “an academic field concerned with the application of computing tools to arts and humanities data or to their use in the creation of these data.” Terms such as computational humanities, digital humanities and e-humanities are now also in use, and essentially denote similar things (with nuance I do not intend to get into).
Since the 1990s many people have come up with definitions for or descriptions of computing in historical research (this list can be easily expanded):
•    Charles Harvey: historical computing must be concerned with the creation of models of the past or representations of past realities.
•    Matthew Woollard: History and computing is not only about historical research, but also about historical resource creation.
•    George Welling: Historical Informatics (computational history) is a new field of interdisciplinary specialization dealing with pragmatic and conceptual issues related to the use of information and communication technologies in the teaching, research and public communication of history.
•    Lawrence McCrank (2002): Historical information science integrates equally the subject matter of a historical field of investigation, quantified social science and linguistic research methodologies, computer science and technology, and information science, which is focused on historical information sources, structures, and communications.
•    Boonstra, Breure, Doorn (2004): Historical information science is the discipline that deals with specific information problems in historical research and in the sources that are used for historical research, and tries to solve these information problems in a generic way with the help of computing tools.
In a study on the “Past, Present and Future of Historical Information Science” I published together with Onno Boonstra and Leen Breure, we distinguished four categories of information problems in historical research, which we ordered on what we called the “life cycle of historical information”: information problems of historical sources (representation); of relationships between sources (harmonization, linkage); of historical analysis (qualitative and quantitative); of the presentation of sources or analysis (visualization, edition). The PDF of the book can be found here: http://www.dans.knaw.nl/content/categorieen/publicaties/past-present-and....
Back in 2004, we were a bit wary of the developments of history and computing in the preceding few years. It seemed as if the exciting and formative years of historical computing (roughly the period 1985-2000) were over. Many mainstream historians were just happy to be able to use the computer for text processing, web browsing and emailing.
Probably a degree of specialisation did occur: you simply could not expect every historian to be a programmer, as Le Roy Ladurie once said. The scale of historical research had to go up to get beyond the basic level of computing techniques. Collaboration with professional IT specialists was necessary, and I think we are gradually moving in that direction.
The increase of the scale of digital history projects:
In my presentation I will mention a few examples of big projects we were involved in, and in which computing scientists and historians did work together: the digitization of the Dutch censuses and the project “Life Courses in Context” (the first project in the humanities in the Netherlands to receive an investment grant of a few million Euros; see www.volkstellingen.nl); the project “Climate of the World Oceans”, in which historians, computing scientists and climatologists worked together to retrieve weather observations from historical ships’ logs (www.knmi.nl/cliwoc/); the collaboratory on institutions for collective action (http://www.collective-action.info/); the collaboratory ‘Clio Infrastructure’, building and connecting global data hubs on world inequality, the increasing divergence between rich and poor countries (www.clio-infra.eu); the projects “Telling witnesses” and “Veteran tapes”, in which many hundreds of qualitative interviews have been collected and analysed as “oral histories” of the Second World War and other conflicts (http://getuigenverhalen.nl/ and http://www.watveteranenvertellen.nl); and the project Medieval Memoria Online (MeMO), which aims to help scholars in carrying out research into memoria during the period up to the Reformation (c. 1580) in the area that is the present-day country of the Netherlands (http://memo.hum.uu.nl/). In all these projects, historical researchers and computing experts (and often specialists from other disciplines as well) from several institutes worked or are working together.
The need for research infrastructures:
It is vital that these projects rest on a solid foundation, not only during the course of the project, but also afterwards. If no infrastructure exists that can guarantee sustainability after the project is finished, the results are in danger of disappearing soon after the project’s end, and the investment and effort will be lost. This is exactly why digital infrastructures are necessary: to support and maintain the collaborative efforts. The services developed in the projects need to be sustainable, and they can only be maintained efficiently if they are generic and re-usable. This is why a few years ago, not just in the natural and life sciences, but also in the humanities and social sciences, initiatives were taken to set up infrastructures to support and sustain the investments made in large (and small) projects. The European Strategy Forum for Research Infrastructures (ESFRI) formulated a first “Roadmap” for the creation of such infrastructures (http://ec.europa.eu/research/infrastructures/index_en.cfm?pg=esfri). DARIAH, the emerging Digital Research Infrastructure for the Arts and Humanities, is one of the two infrastructures proposed on the ESFRI Roadmap for the humanities, including history (www.dariah.eu). DARIAH aims to “link and provide access to distributed digital source materials of many kinds”. In the field of linguistics, CLARIN has been set up: Common Language Resources and Technology Infrastructure (www.clarin.eu), and there are also examples in the social sciences. In several countries, among which the Netherlands, it is proposed that CLARIN and DARIAH will work closely together or even merge.
The digitization of cultural heritage material, including archival sources, is of great importance for historians and other humanities researchers. In this field, too, we see the creation of large-scale infrastructures. Europeana enables people to explore the digital resources of Europe's museums, libraries, archives and audio-visual collections (www.europeana.eu). It promotes discovery and networking opportunities in a multilingual space where users can engage, share in and be inspired by the rich diversity of Europe's cultural and scientific heritage. The breadth of the endeavor is at the same time its limitation for researchers: although millions of heritage objects can be “explored”, the content and descriptions are oriented towards consumption by a general audience, not towards the analytical use of specialists. The European Holocaust Research Infrastructure, which is supported by DARIAH for solving the technological challenges of bringing together virtual resources from dispersed archives, is a good example of an infrastructure on the interface of heritage and historical research.
Conclusion:
The intention of the organisers of the Lisbon workshop on Digital Methods and Tools for Historical Research is to discuss the implications of using digital technologies in the production and dissemination of knowledge in History.
Two of the implications I have highlighted are the increase of scale of digital history projects and the need for research infrastructures to sustain the results of digital projects. Multidisciplinary and international collaboration is inevitable for professional results. Computational history is in this sense comparable to (or simply part of) data-driven e-Science.
This conclusion is independent of the type of methodology we look at: whether it is relational databases, geographic information systems, (text) encoding or digitization and preservation of digital memory. Such methodologies rarely stand alone in a digital project, and are rather phases in the cycle that many digital projects go through: after digitization comes the encoding (in textual sources) or the structuring in databases. Analysis is the next phase, for which GISes are very useful when the data have a geospatial component that can be visualised. At the end of the cycle, proper measures need to be taken in order to keep the results accessible for the future.

On governance, trust and certification (iPRES 3)

Wednesday 2 November 2011

I am going to blog about this parallel session in reverse order, because that way it makes more sense to me.

At the end of the session (but at the beginning of this post!) Devan Ray Donaldson of the University of Michigan reminded us what ‘trust’ (as in Trustworthy Digital Repositories, or TDRs) is all about: end users (those that have had no involvement in either production or archiving of a document) need some assurance that the document they are getting from an archive is, in fact, authentic, that it is what it is supposed to be, and has not been tampered with or altered in any way. [BTW: that does not mean that the archive guarantees that the information in the document is reliable. The archive does not know that. The only thing an archive can do is assure that what the end user gets is the same thing that originally came into the archive.]

Archives know that end users care about trust, about authenticity. So Donaldson wants to study how we communicate with the end user about that authenticity. If we put some seal of approval on a document, will the end user trust it more than if we do not put any seal on it? That is an interesting question. Donaldson intends to use HathiTrust documents to test this, and, to me, that is the only ‘flaw’ in his plan – if such is the word, that is. HathiTrust contains digitized book pages, and that type of document is a lot easier to trust and be regarded as “authentic” than, e.g., e-mail. Donaldson agreed, but, as he said: you’ve got to start somewhere.

Next (in whichever order) came Olivier Rouchon of CINES, a large data centre in France (photo right, “Cannot I even have lunch without being photographed?”). CINES finds itself in a strange political situation: as an organization CINES has a remit for only four years, but it also has the express mandate to do long-term preservation and its clients ask for 30-year guarantees. This is a strange dichotomy and CINES has decided to seek certification as a trusted repository to a) lock the mission, and b) attract larger volumes of data to be preserved.

CINES went through various (self-) audits to attain ever higher levels of certification. That took a lot of work. Rouchon estimates that 1 FTE of his 11 FTEs is constantly busy with audits. But, says Rouchon, ‘that should not stop you from doing it.’ First of all, it is mostly a lot of work the first time around. Once you have a good system in place, the next audits become business as usual. Secondly, CINES is using the audit system as an internal quality assessment instrument to keep improving the quality of the service. By comparing the outcome of audits over time the organization can measure its progress.

The EU is now building a three-tiered certification system: the first level is the relatively lightweight Data Seal of Approval, then comes a self-audit, and the highest level of certification is awarded by an external audit. The APARSEN project recently did a number of test audits, among others at CINES, and will publish the results shortly.

Steve Knight from the National Library of New Zealand enquired how we know that we can trust the auditors doing the auditing. Rouchon trusts his own (internal) auditors and part of the aim of the APARSEN test audits was to train auditors.

Having talked about trust, and about auditing trust, I now come to the last (first) presentation. Basically, it was about building all the capabilities you need to assure trust and prove trustworthiness into your system. It was also about not dealing with digital preservation as an issue (and a system!) that stands apart from the rest of your organization, but building an information system for your organization that integrates digital preservation requirements and makes them ‘ubiquitous’. Christoph Becker of TU Wien (photo right) told his audience that we have lots of models and concepts and frameworks (OAIS, TRAC, RAC, Drambora, Platter, etc. etc.), but ‘we still lack a holistic view.’ His team takes its cue from frameworks from the IT industry, such as ‘enterprise architecture’, and COBIT (goal-oriented, process-oriented, control-based) to build a Maturity Model based on CMM: you measure your maturity by a set of criteria to identify places for improvement … and then I lost the story. My mind tends to switch off when the discussion becomes abstract and high-level. It is a flaw, I know, but one I have to learn to live with. The basic idea, however, integrating digital preservation, is a good one, and so is using existing industry frameworks, so for those of you who are better at high-level discussions, do check out Becker’s paper in the proceedings, which will come online soon. The paper is called “A Capability Model for Digital Preservation: Analysing Concerns, Drivers, Constraints, Capabilities and Maturities”.

Parallel session ‘Governance’

D is for data

Newspapers across the U.S.: The Rural American West Initiative

Part of a continuing series of alphabetically chosen digital preservation topics.

I believe a “picture is worth a thousand words,” especially when masses of digits form a new shape that presents fresh insights. The Library and the National Endowment for the Humanities have been working with partners for several years to build a digital archive of historic newspapers. The project has worked hard to enable text searching and article viewing of these newspapers. But recently I was delighted to see an interactive map showing the spread of newspapers across the American continent over the last 300 years. The Rural American West Initiative created the visualization from the data about 140,000 newspapers embodied in Chronicling America.

The fresh insight is that data sets are not just scientific and business tables and spreadsheets; our cultural heritage collections are now considered data too. They are the digital building blocks for interpretation and new discoveries that transform them into entities that we may not recognize as cultural heritage information. Researchers use algorithms to mine the rich information and tools to create pictures that translate that information into knowledge.

Until a few years ago, I did not think of digital collections as data. I thought data were gathered from satellites or collected during scientific experiments. On some level, I had the idea that digital libraries would be used online much as they were used in their analog forms. I did not think of them being used as data. Yet we encounter more and more researchers who want to use collections as a whole, mining and organizing the information in novel ways. When we began archiving election web sites, we imagined users browsing through the web pages, studying the graphics or use of phrases or links. But when our first researchers came to the Library, they wanted to know about all those topics, yet they used the computer and scripts to look for them and sort them into categories. They were not much interested in reading web pages.

If you need some more evidence of this trend toward data, check out the Digging into Data Challenge. The repositories available for research include not only scientific information (astronomy, geology, physics, biology, social science surveys) but also images, film, sound, newspapers, maps, art, archaeology, architecture and government records. The second round of awards, sponsored by eight international research funders representing Canada, the Netherlands, the United Kingdom and the United States, will be announced in December.

Guidelines for Data Seal of Approval: Data Archiving and Networked Services

So in terms of digital preservation practice, cultural heritage collections benefit from being thought of as data. In 2005, two Dutch science organizations joined together to form Data Archiving and Networked Services (DANS). Their work, although directed at scientific communities, is applicable to cultural heritage archives. The Data Seal of Approval distills many methods, practices and standards into a manageable set of guidelines that address A) the quality of the data, B) the quality of the data repository and C) the quality of access to and use of the data. The brief guidelines document provides a clear roadmap for preserving digital information.

I think that we, as digital preservationists, can use the Data Seal of Approval guidelines to A) assess our stewardship of digital information, B) engage with researchers to learn more about how they think of using our digital libraries and C) engage with producers to foster good practices around the creation of the data.