Digital Preservation Looks Forward
by Amy Friedlander
Amy Friedlander, Ph.D., is from the Council on Library and Information Resources.
What We're Learning at the Library of Congress
A team led by Hal R. Varian and Peter Lyman at the University of California at Berkeley recently used data from the late 1990s to estimate that "the world's total yearly production of print, film, optical and magnetic content would require roughly 1.5 billion gigabytes of storage. This is the equivalent of 250 megabytes per person for each man, woman and child on earth" (www.sims.berkeley.edu/research/projects/how-much-info/). Although these numbers sound somewhat terrifying, 250 MB, says Victor R. McCrary of the National Institute of Standards and Technology (NIST), represents about five percent of the storage capacity of a recordable DVD-5 disc. "But for preservation, the other important aspect, besides capturing the data on the media, is the space to store the media. A three-foot shelf holds 90 CDs in jewel cases; with DVD-5, this only takes up 5.2 inches."
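The per-person figure can be checked with simple arithmetic. The sketch below is only a back-of-the-envelope check, and it rests on assumptions that the text does not state: a world population of roughly 6 billion, decimal units (1 GB = 1,000 MB) and the nominal 4.7 GB capacity of a single-layer DVD-5 disc.

```python
# Back-of-the-envelope check of the storage figures quoted above.
TOTAL_GB_PER_YEAR = 1.5e9   # from the Berkeley estimate
WORLD_POPULATION = 6.0e9    # assumption: roughly 6 billion people
DVD5_CAPACITY_MB = 4700     # assumption: nominal 4.7 GB, decimal units
MB_PER_GB = 1000            # assumption: decimal megabytes per gigabyte


def megabytes_per_person(total_gb, population):
    """Spread the yearly total evenly over the population."""
    return total_gb * MB_PER_GB / population


def fraction_of_disc(megabytes, disc_capacity_mb):
    """Share of one disc that the given amount of data would occupy."""
    return megabytes / disc_capacity_mb


per_person = megabytes_per_person(TOTAL_GB_PER_YEAR, WORLD_POPULATION)
share = fraction_of_disc(per_person, DVD5_CAPACITY_MB)
print(f"{per_person:.0f} MB per person, {share:.1%} of a DVD-5")
```

Under these assumptions the result agrees with both quoted figures: 250 MB per person, or a little over five percent of one disc.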
Saving the data does not mean that our children will be able to find what they may be looking for in that sea of bits, nor does it mean they will even be able to display it. Storage media degrade, and files can no longer be read because the hardware and software environments that supported them have vanished. In December 2000, the Library of Congress (LC) successfully requested a special appropriation of $100 million (later reduced by rescission to $99.8 million) from the U.S. Congress for the National Digital Information Infrastructure and Preservation Program (NDIIPP) (PL 106-554). This initiative sets out to evolve a national strategy for the long-term preservation of digital content in collaboration with representatives of other federal, research, library and business organizations. Not all libraries have the same priorities; indeed, weeding a collection may be more important than preserving it for some libraries. But any library or archive that maintains an online catalog probably has at least one important resource to protect.
The Program
LC's efforts to digitize and distribute materials began with the American Memory program in 1990. Originally an educational CD-ROM project, the popular program, based on materials in the library's collections, was folded into an equally popular Web-based initiative. In 1998, senior managers, including Associate Librarian for Strategic Initiatives Laura E. Campbell, who leads NDIIPP; Associate Librarian for Library Services Winston Tabb; and Register of Copyrights Marybeth Peters, began to formulate a strategy for managing digital materials. A report by the Computer Science and Telecommunications Board of the National Research Council, commissioned by Librarian of Congress James H. Billington, supported meeting the nationwide challenges of preserving electronic resources, in particular, the so-called "born digital" materials: those that are created or distributed primarily, if not exclusively, in digital form, for which there is no analog equivalent and which are particularly vulnerable to complete or partial loss.
NDIIPP funds are to be released in stages: $5 million was immediately authorized to support planning; $20 million is to be made available after the submission of the plan (which is scheduled for delivery to Congress later this year); and the final $75 million will be contingent upon raising as much as $75 million in matching funds. The legislation directed LC to develop the plan in consultation with concerned industries, other major federal libraries and research institutions and not-for-profit organizations, including the Council on Library and Information Resources (CLIR), OCLC and others. Since spring 2001, LC has undertaken a planning process that involves four dimensions: stakeholder meetings from which to listen and learn; collaborative research with other federal agencies; a conceptual framework for organizing relationships and technologies; and scenario planning to surface tacit assumptions and to explore unusual options and alternatives. A 27-member advisory board has been organized with representatives from a cross-section of industries, concerned federal agencies (Department of Commerce, NIST, National Science Foundation (NSF), Office of Science and Technology Policy), not-for-profit institutions and major libraries, with representatives from the National Library of Medicine (NLM), National Archives and Records Administration (NARA), National Agricultural Library (NAL) and the British Library.
Consultations with representatives from publishing, entertainment, broadcasting, libraries and non-profit organizations in a series of three invitation-only sessions in November 2001, and a second set of more focused scenario planning workshops in the winter and spring of 2002, elicited a sense of what was happening outside of the traditional library and archival communities. There was surprising (and welcome) support across industries, where, for example, the recording industry has already made progress in thinking about the technical issues.
A number of themes surfaced across the sessions. Digital assets are degrading at a rapid rate; proprietary and/or obsolete formats inhibit the ability to recover or reuse these valuable materials (known as the "playback" problem). Some sort of decentralized or distributed solution around a common core of standards or best practices is desired, but several technical experts warned planners not to underestimate the complexities of the different formats. Managing copyrighted materials is challenging but intellectual property protection is not the only or even the most difficult challenge. A balance between the economic rights of the rights holders and the legitimate interests of the public, particularly to support education and scholarship, would be required. Copyright in the digital world is a thorny topic. LC has commissioned a white paper, which discusses the key issues and will be available via the NDIIPP Web site, launched last spring (www.digitalpreservation.gov).
The challenges sketched in the preceding paragraph will probably not surprise most librarians and archivists. As many information professionals know, digital content is diverse, spanning public records, business records, databases of scientific and social scientific information, scholarly journals, Web sites, multimedia works and so on. Since it is probably neither necessary nor prudent to save everything for everyone all of the time, the process of selection raises a couple of questions: How to save what for whom? Who shall be responsible for what?
Over time, the library and archives system in the U.S. and internationally has worked out, sometimes informally, methods of cooperation, information exchange and redundancy. This network of relationships will be reworked in the digital era, and any library or archive that acquires or uses digital resources will also face decisions about them. Here is some of what LC has learned.
Technology and Media: How To Save What?
Technological obsolescence is a core issue and managing it challenges traditional notions of preservation. According to Professor Margaret Hedstrom of the University of Michigan, "Basically, unlike preservation of traditional materials, where we try to create a stable surrogate or stabilize the original material, we just have to accept that as technology changes, we need to move the materials forward through time. That involves varying degrees of changing them." At a minimum, data must be periodically copied from one medium to another (even though the process, known as "refreshing the data," does potentially introduce errors and raise questions of authenticity) because the storage medium degrades. Although it is theoretically possible to maintain a laboratory or museum of any conceivable combination of hardware and software, it is, in fact, a daunting prospect. Mike Williams, chief curator at the Computer History Museum at Moffett Field in Mountain View, California, says it is "virtually impossible to keep everything working." All of the components of the equipment degrade, replacement parts are unobtainable and the power supply is particularly fragile. Plug in a large machine from about 1960, and "you're likely to get a flash and a bang."
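One common way to catch errors introduced while refreshing data is to compare a checksum of the new copy with a checksum of the original. The sketch below illustrates that general idea and is not a procedure attributed to any institution named here. The function names `sha256_of` and `refresh` are hypothetical, SHA-256 is an assumed choice of digest, and the two temporary files merely stand in for an old and a new storage medium.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def refresh(source, destination):
    """Copy source to destination and confirm the bits arrived unchanged."""
    expected = sha256_of(source)
    shutil.copyfile(source, destination)
    actual = sha256_of(destination)
    if actual != expected:
        raise IOError(f"checksum mismatch after copying {source}")
    return actual


# Demonstration with throwaway files standing in for two storage media.
with tempfile.TemporaryDirectory() as workdir:
    old_medium = Path(workdir) / "old_medium.bin"
    new_medium = Path(workdir) / "new_medium.bin"
    old_medium.write_bytes(b"sample record\n" * 1000)
    fixity = refresh(old_medium, new_medium)
    copies_match = new_medium.read_bytes() == old_medium.read_bytes()

print(copies_match, len(fixity))
```

A matching checksum shows only that the copy is bit-for-bit identical to what was read from the old medium. It says nothing about whether the original was already damaged, and it does not address the playback problem.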
Since physically preserving the environment is not practical, digital preservationists have debated the merits of different strategies (e.g., documentation, migration, emulation, encapsulation) as well as the criteria for preservation metadata, that is, what information about the object is associated with the object to enable the system to interpret and display it. Where you come down depends to a large degree on how important it is to retain the technical environment within which the item was created and how the metadata are defined.
Indeed, different strategies may be appropriate for different types of data, a point that McCrary and his colleagues at NIST elucidate in their literature review for the Journal of Research of the National Institute of Standards and Technology (January-February 2002). They argue that emulation, a strategy that enables one computer to mimic the behavior of another, thus preserving the "look and feel" as well as the functionality, can be viable for resources for which future value and use may be unknown and for which preserving the look and feel is important. Games, visualization and computer-enabled scientific experiments common in biotechnology are examples of these kinds of resources. On the other hand, migration, which transforms a resource in one format to another, will be more appropriate for resources that are actively accessed and where content is primary. For example, a team in LC's Information Technology Services (ITS) reports that since the 1960s, ITS has been responsible for maintenance of the catalog that now consists of more than 12 million bibliographic records and 4 million authority records in electronic form. Since then, these data have migrated through three internal software formats, three types of tape formats and six types of hard disk drives. All the while LC has continued to exchange records seamlessly with other libraries in the common interchange format, MARC, which was developed under LC's leadership for this purpose.
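Migration, in the sense used above, can be pictured as a format conversion followed by a check that the content survived. The sketch below is a toy illustration only. The pipe-delimited "legacy" layout, the JSON target and every function name are hypothetical, and none of it reflects MARC or LC's internal formats.

```python
import json


def parse_legacy(line):
    """Parse a hypothetical legacy record of the form 'id|title|year'."""
    record_id, title, year = line.rstrip("\n").split("|")
    return {"id": record_id, "title": title, "year": int(year)}


def migrate(line):
    """Transform a legacy record into JSON, the assumed target format."""
    return json.dumps(parse_legacy(line), sort_keys=True)


def to_legacy(migrated):
    """Convert a migrated record back, to test that no content was lost."""
    record = json.loads(migrated)
    return "|".join([record["id"], record["title"], str(record["year"])])


legacy_record = "0001|Sample title|1999"
migrated_record = migrate(legacy_record)
round_trip_ok = to_legacy(migrated_record) == legacy_record
print(migrated_record, round_trip_ok)
```

The round-trip comparison is one simple test that content is preserved. It would not catch a loss of look and feel, which is the case where the NIST authors suggest emulation instead.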
Recent discussions among researchers have turned on the conceptual technical architecture, a term defined by whatis.com as "both the process and the outcome of thinking out and specifying the overall structure, logical components and the logical interrelationships of a computer, its operating system, a network, or other conception" (whatis.techtarget.com/). Rather than debating the merits of individual strategies, many investigators at NSF, NIST, NARA, the San Diego Supercomputer Center and MIT are asking a higher-level question: How do the organizational and technological pieces fit together? Then, within that framework, what are the functional components of a system and what are the roles of the participating agencies and institutions? Viewed from that perspective, LC is both a node (a collection or set of collections) and a convener and facilitator of the larger process.
A small team of experts convened by LC in early April 2002 is exploring the viability of a four-tier conceptual technical architecture. The lowest level is the "repository," which is essentially a safe-deposit box for bits, the 0s and 1s of digital data. The repository only needs to know that it is storing bits, and to provide a way of accessing them. It does not need to know about particular file formats, data encoding schemas or DRM mechanisms. At the next level, the "gateway," control is exercised over which requests may access data stored in the repository. The third level, "collections," supports many of the functions and decisions associated with professional librarians and archivists, including the technical information that provides context to the bits from the repository, such as whether it is a photo or a film and what software is needed to render it properly. The collection is also where the terms and restrictions are recorded for access and use of digital items. Finally, an "interface layer" exists where patrons can access information.
Here's how it might work. The string of bits (0s and 1s) associated with a scholarly journal article might be stored in a repository operated by a trusted third party with expertise in data warehousing. The "gateway" level knows the address of the data that makes up the article, and also knows to return the set of bits when it receives requests from an authorized Collection. Three independent institutions, A, B and C, are authorized as Collections: a hospital, a corporate biotech research laboratory and a nonprofit foundation that does public health and environmental studies. A, B and C each produce different metadata associated with the article itself, augmenting the metadata that they receive from the publisher according to the priorities of their separate institutions; they also support different interfaces for the scientists who can access the system from their offices as well as from the libraries. The divisions mean that the individual libraries can customize services for their patrons and can restrict access, if necessary, according to contracts they have independently negotiated. Perhaps staff physicians at the hospital (Collection A) can access the research literature but access within that institution is not extended to administrative employees. Tasks associated with maintaining the stored data can be delegated to other parties.
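The four tiers and the hospital scenario can be sketched in a few lines of code. This is a minimal illustration of the division of responsibilities described above, not a design proposed by LC or the expert team. All class names, method names, addresses and roles are hypothetical, and an in-memory dictionary stands in for real storage.

```python
class Repository:
    """Lowest tier: stores opaque bits by address and knows nothing of formats."""

    def __init__(self):
        self._store = {}

    def put(self, address, bits):
        self._store[address] = bytes(bits)

    def get(self, address):
        return self._store[address]


class Gateway:
    """Second tier: decides which collections may retrieve stored bits."""

    def __init__(self, repository):
        self._repository = repository
        self._authorized = set()

    def authorize(self, collection_name):
        self._authorized.add(collection_name)

    def fetch(self, collection_name, address):
        if collection_name not in self._authorized:
            raise PermissionError(f"{collection_name} is not an authorized collection")
        return self._repository.get(address)


class Collection:
    """Third tier: adds local metadata and access terms to the bits."""

    def __init__(self, name, gateway, allowed_roles):
        self.name = name
        self._gateway = gateway
        self._allowed_roles = set(allowed_roles)
        self._metadata = {}

    def describe(self, address, **metadata):
        self._metadata[address] = metadata

    def retrieve(self, address, role):
        if role not in self._allowed_roles:
            raise PermissionError(f"role {role!r} may not access {self.name}")
        bits = self._gateway.fetch(self.name, address)
        return self._metadata.get(address, {}), bits


def interface_request(collection, address, role):
    """Fourth tier: what a patron sees, reduced here to a one-line summary."""
    try:
        metadata, bits = collection.retrieve(address, role)
    except PermissionError as error:
        return f"denied: {error}"
    return f"{metadata.get('title', 'untitled')} ({len(bits)} bytes)"


# Scenario: one stored article, one authorized collection, one unauthorized.
repository = Repository()
repository.put("article-001", b"%PDF-1.3 placeholder bits")
gateway = Gateway(repository)
gateway.authorize("Hospital A")

hospital = Collection("Hospital A", gateway, allowed_roles={"physician"})
hospital.describe("article-001", title="Sample journal article", format="PDF")
outsider = Collection("Unlisted D", gateway, allowed_roles={"researcher"})

physician_view = interface_request(hospital, "article-001", "physician")
admin_view = interface_request(hospital, "article-001", "administrator")
outsider_view = interface_request(outsider, "article-001", "researcher")
print(physician_view, admin_view, outsider_view, sep="\n")
```

In this sketch the physician's request succeeds, the administrator is refused by the collection's own access terms, and the unlisted collection is refused at the gateway. Keeping the two checks in separate tiers mirrors the point that each institution can set its own terms without the repository knowing anything about them.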
This is, of course, a conceptual architecture. There remain many issues to be worked out and many ways in which the architecture could be implemented. For example, it might be easy for one collection to "know" about the contents of a second collection managed by another institution. Or it might be made difficult to see from one collection to another if their contents were very sensitive, perhaps containing proprietary or copyrighted materials. Similarly, the different preservation strategies outlined in the preceding paragraphs might be deployed in different ways depending on the attributes of the data.
Collections: Why, What, For How Long and For Whom?
Deciding which attributes of the data are important requires some assumptions about the value of the data to future users. Libraries, Hedstrom says, "are explicitly in the business of assembling and organizing materials for a known community." She adds that differences among different types of libraries are likely to become obvious when making decisions about the scope of the collections and their management, including long-term preservation. The Web, Hedstrom continues, is forcing librarians to rethink their assumptions about collection development and management, in part because the user community really cannot be known with any certainty. On the other hand, we can know quite a bit about the producer communities: how the digital resources were created and for whom, whether it is scientific research, a dynamic database or a dissertation in the performing arts.
Libraries license certain types of resources from publishers, notably the electronic journals; the content itself is not acquired. Thus, the long runs of serials that have evolved into valuable resources at many research libraries are no longer created as a by-product of acquisitions policies. The significance of leased versus owned content has not gone unnoticed. For example, the Andrew W. Mellon Foundation, NSF, Stanford University Libraries and Sun Microsystems, Inc., support the LOCKSS (Lots of Copies Keeps Stuff Safe) program (http://lockss.stanford.edu/), which is working to build a distributed e-journal archive. Victoria Reich, the program director, says this approach "has the potential to restore to librarians their ability to build and retain for the long-term local e-journal collections." Other important projects in digital archiving that consider a broad range of materials (including but not focused on electronic journals) are under way at the U.S. National Archives and Records Administration, U.S. National Library of Medicine, MIT and other major research universities, as well as in the private sector.
Early in its planning, LC decided to focus the first phase on so-called "born digital" information, which was more vulnerable to loss than materials that had been converted from another format, and to organize the initial efforts around formats in which LC's collections are strong or where the digital materials are aligned with LC's traditional, broad mission: Web sites, electronic journals, electronic books, digitally recorded sound, digital television and digital moving images. Six environmental scans, which assess the baseline for these formats, were commissioned and are available at the program's Web site (www.digitalpreservation.gov).
One way to generalize the decision-making process, says Hedstrom, is to consider the producer and user communities within the framework set forth in the Open Archival Information System (OAIS) Reference Model, which was initially developed by the space science data community and has been approved as an ISO standard. Based on relatively limited experience, Hedstrom argues, successful digital collections have been those in which the producer and user communities are tightly integrated and fundamentally agree on the content and value of the archive, that is, the institutional mission. "The National Archives' most important customers," says Kenneth Thibodeau, director of NARA's Electronic Records Archives, "will not be born for a hundred years or more." So, the agency, which collects federal records that enable Americans to "understand our national experience," is working "to develop an electronic records archives that is capable of taking collections of virtually any type of electronic records produced anywhere in the federal government, and making them available to Americans now and in the future."
Betsy Humphreys, the associate director of NLM, says it is safe to assume users will continue to have general expectations about what is likely to be in a particular digital collection, just as they do for a particular analog collection. Thus, NLM's collections will continue to document the scholarly record in biomedicine. People won't approach NLM's digital repository expecting "to find a comprehensive collection of cartoons with doctors in them."
Although NDIIPP focuses on digital information, this does not mean that older research results necessarily lose their value. For example, the definitive works for both smallpox and tuberculosis predate the electronic era. From a public health point of view, these texts have become extremely important to providers who are accustomed to working in the digital medium. NLM approached the rights holders of these key works and learned that, in most cases, either the work had already been posted, was in the process of being scanned or permission was given to NLM to scan it, according to Humphreys.
Redundancy is one of the organizational strengths of the library system that should be retained in digital archiving systems, Humphreys says. This view is shared by LOCKSS. The risk is that the expense of maintaining collections, coupled with the ease of sharing information across the networks and overspecialization among libraries and archives, will eliminate the duplication that has resulted in a robust national and international system of libraries and archives. For example, the collections at the Law Library at LC cover materials from all existing nations and dependencies, as well as many former nations and colonies.
As LC and other organizations begin to evolve a national strategy of shared responsibilities, redundancy among collections matters. Is an institution that accepts the archival responsibility going to be willing and able to sustain that commitment? Humphreys asks rhetorically. Is preservation aligned with the business plan of the institution? Is there a failsafe or a collector of last resort?
Whither?
Possibly the most important lessons that LC has learned since initiating the work are: (1) technology is important but will not solve "the problem"; (2) there is no single problem and management and organizational missions matter; and (3) LC can facilitate development of a system to achieve long-term digital preservation but cannot do it alone.
Many of those distinctions that have evolved among libraries, records centers and archives will probably persist into the future depending on who collects what, for how long and for whom. Within a broad consensus of technological architectures, best practices and professional codes of conduct, libraries and archives may be expected to pursue their missions in a loosely coordinated way much as they do now. No one expects a rock-solid system to function tomorrow or even the day after. But as the workshop participants put it (with acknowledgement to Nike, Inc.), sometimes you just have to start doing it.
The views and opinions expressed herein are those of the author and do not necessarily reflect those of the Council on Library and Information Resources, the Library of Congress or the Government. I am indebted to the following people for their assistance: Martha Anderson, Lewis Bellardo, Laura Campbell, Robert Dizard, John Garrett, Margaret Hedstrom, Betsy Humphreys, Molly Ho Hughes Johnson, Guy Lamolinara, Clifford Lynch, Jane Mandelbaum, Deanna B. Marcum, Victor R. McCrary, Victoria Reich, Clay Shirky, Kenneth Thibodeau, Kevin Vest and Mike Williams. Remaining errors are the responsibility of the author.


