National leadership grants 9781845441883, 9781845440244


Volume 22 Number 3 2004

ISBN 1-84544-024-2

ISSN 0737-8831

Library Hi Tech

National leadership grants
Guest Editors: Timothy W. Cole and Sarah L. Shreeves

Library Link www.emeraldinsight.com/librarylink

www.emeraldinsight.com

Library Hi Tech Volume 22, Number 3, 2004

ISSN 0737-8831

National leadership grants Guest Editors: Timothy W. Cole and Sarah L. Shreeves

Contents

242 Access this journal online
243 Abstracts & keywords

GUEST EDITORIAL
246 The IMLS NLG program: fostering collaboration
Timothy W. Cole and Sarah L. Shreeves

Theme articles
249 Connecting people and resources: digital programs at the institute of museum and library services
Joyce Ray
254 The Colorado digitization program: a collaboration success story
Brenda Bailey-Hainer and Richard Urban
263 Metadata rematrixed: merging museum and library boundaries
Priscilla Caplan and Stephanie Haas
270 Online multimedia museum exhibits: a case study in technology and collaboration
Matthew F. Nickerson
277 An online guide to Walt Whitman’s dispersed manuscripts
Katherine L. Walter and Kenneth M. Price
283 The Maine music box: a pilot project to create a digital music library
Marilyn Lutz
295 Enabling technologies and service designs for collaborative Internet collection building
Steve Mitchell, Julie Mason and Lori Pender
307 Search and discovery across collections: the IMLS digital collections and content project
Timothy W. Cole and Sarah L. Shreeves

Columns
323 ARCHITECTURAL
The way ahead: learning cafés in the academic marketplace
Morell D. Boone
328 ON COPYRIGHT
Copyright in the networked world: interlibrary services
Michael Seadle

Access this journal electronically
The current and past volumes of this journal are available at:
www.emeraldinsight.com/0737-8831.htm
You can also search over 100 additional Emerald journals in Emerald Fulltext:
www.emeraldinsight.com/ft
See the page following the contents for full details of what your access includes.

www.emeraldinsight.com/lht.htm
As a subscriber to this journal, you can benefit from instant, electronic access to this title via Emerald Fulltext. Your access includes a variety of features that increase the value of your journal subscription.

How to access this journal electronically
To benefit from electronic access to this journal you first need to register via the Internet. Registration is simple and full instructions are available online at www.emeraldinsight.com/rpsv/librariantoolkit/emeraldadmin
Once registration is completed, your institution will have instant access to all articles through the journal’s Table of Contents page at www.emeraldinsight.com/0737-8831.htm
More information about the journal is also available at www.emeraldinsight.com/lht.htm
Our liberal institution-wide licence allows everyone within your institution to access your journal electronically, making your subscription more cost effective. Our Web site has been designed to provide you with a comprehensive, simple system that needs only minimum administration. Access is available via IP authentication or username and password.

E-mail alert services
These services allow you to be kept up to date with the latest additions to the journal via e-mail, as soon as new material enters the database. Further information about the services available can be found at www.emeraldinsight.com/usertoolkit/emailalerts

Emerald WIRE (World Independent Reviews)
A fully searchable subject specific database, brought to you by Emerald Management Reviews, providing article reviews from the world’s top management journals.

Research register
A web-based research forum that provides insider information on research activity world-wide, located at www.emeraldinsight.com/researchregister
You can also register your research activity here.

User services
Comprehensive librarian and user toolkits have been created to help you get the most from your journal subscription. For further information about what is available visit www.emeraldinsight.com/usagetoolkit

Choice of access

Key features of Emerald electronic journals

Automatic permission to make up to 25 copies of individual articles
This facility can be used for training purposes, course notes, seminars etc. This only applies to articles of which Emerald owns copyright. For further details visit www.emeraldinsight.com/copyright

Online publishing and archiving
As well as current volumes of the journal, you can also gain access to past volumes on the internet via Emerald Fulltext. Archives go back to 1994 and abstracts back to 1989. You can browse or search the database for relevant articles.

Key readings
This feature provides abstracts of related articles chosen by the journal editor, selected to provide readers with current awareness of interesting articles from other publications in the field.

Reference linking
Direct links from the journal article references to abstracts of the most influential articles cited. Where possible, this link is to the full text of the article.

E-mail an article
Allows users to e-mail links to relevant and interesting articles to another computer for later use, reference or printing purposes.

Additional complementary services available Your access includes a variety of features that add to the functionality and value of your journal subscription:

Electronic access to this journal is available via a number of channels. Our Web site www.emeraldinsight.com is the recommended means of electronic access, as it provides fully searchable and value added access to the complete content of the journal. However, you can also access and search the article content of this journal through the following journal delivery services:

EBSCOHost Electronic Journals Service: ejournals.ebsco.com
Huber E-Journals: e-journals.hanshuber.com/english/index.htm
Informatics J-Gate: www.j-gate.informindia.co.in
Ingenta: www.ingenta.com
Minerva Electronic Online Services: www.minerva.at
OCLC FirstSearch: www.oclc.org/firstsearch
SilverLinker: www.ovid.com
SwetsWise: www.swetswise.com
TDnet: www.tdnet.com

Emerald Customer Support
For customer support and technical help contact:
E-mail: [email protected]
Web: www.emeraldinsight.com/customercharter
Tel: +44 (0) 1274 785278
Fax: +44 (0) 1274 785204

Abstracts & keywords

The IMLS NLG program: fostering collaboration
Timothy W. Cole and Sarah L. Shreeves
Keywords: Grants, Museums, Archives, United States of America

Its National Leadership Grant (NLG) program is one of the many ways in which the Institute of Museum and Library Services (IMLS) supports the development of innovative new projects and services by the museum, library, and archival community in the USA. Over the course of the NLG program, collaboration has emerged as one of the several strategic approaches that engender success. Digital projects, which can be complex in execution and which often require a diverse range of skills and resources, benefit especially from collaborative approaches. The IMLS NLG program has encouraged a wide range of collaborations, across a diversity of organization types and at a diversity of levels.

Connecting people and resources: digital programs at the institute of museum and library services
Joyce Ray
Keywords: Digital libraries, Museums, Libraries, Education, Learning

As a federally-funded independent granting agency, the Institute of Museum and Library Services (IMLS) became involved in digitization in the late 1990s when Congress gave it statutory authority to fund digitization of library and museum collections. Since that time, IMLS has funded more than 100 exemplary digitization projects through its National Leadership Grant program. Collectively, these projects have helped to identify best practices for the creation, management, preservation and use of digital content. Most importantly, they demonstrate the important role that museums and libraries can play in supporting both formal education and lifelong learning. Ultimately, this work will help libraries and museums to fulfill their roles as educational institutions. IMLS grants support the spectrum of learning from independent inquiry through formal education to the development of “learning communities.”

The Colorado digitization program: a collaboration success story
Brenda Bailey-Hainer and Richard Urban
Keywords: Libraries, Museums, Cultural synergy, Heritage

The Colorado Digitization Program has received several IMLS Leadership Grants. The Heritage Colorado and Western Trails grant projects both involved extensive collaboration between libraries, museums, historical societies and archives. Successful collaborative activities included creating best practices, metadata and scanning standards, training, metadata input tools, technological interoperability, and funding strategies.

Metadata rematrixed: merging museum and library boundaries
Priscilla Caplan and Stephanie Haas
Keywords: Museums, Taxonomy, Z39.50, Geographic information systems

Linking Florida’s Natural History uses species information as the nexus for pulling together scientific data from museum specimen databases and library catalogs of scientific literature. The goals of the IMLS funded project were to integrate specimen records and bibliographic records about the same species; to create an interface equally easy for scientists, students and laymen to use; and to enhance bibliographic description to make it more usable in a taxonomic and environmental context. Although some development was required to enable Z39.50-based broadcast search across bibliographic and specimen collections, the bulk of the work was devoted to identifying and overcoming inconsistencies between the resource description practices of libraries and museums. Enriching records with taxonomic and geographic information was also a challenge.
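The record-reconciliation step this abstract describes can be illustrated with a short sketch. This is not the project’s code: the records, field names, and normalization rule below are hypothetical, standing in for the far messier inconsistencies between museum and library description practices.

```python
# Illustrative sketch only: merging museum specimen records with library
# bibliographic records keyed by taxonomic name, the kind of reconciliation
# Linking Florida's Natural History performed. All data and field names
# here are hypothetical.

from collections import defaultdict

# Hypothetical museum specimen records (taxonomic name + locality).
specimens = [
    {"taxon": "Alligator mississippiensis", "locality": "Paynes Prairie"},
    {"taxon": "Gopherus polyphemus", "locality": "Ocala National Forest"},
]

# Hypothetical library bibliographic records about the same species.
bibliography = [
    {"taxon": "Alligator mississippiensis", "title": "Ecology of the American Alligator"},
    {"taxon": "gopherus polyphemus", "title": "Gopher Tortoise Burrow Surveys"},
]

def normalize(taxon):
    # Case and whitespace differences are a toy stand-in for the real
    # inconsistencies between library and museum description practices.
    return " ".join(taxon.split()).lower()

def merge_by_taxon(specimens, bibliography):
    # Group both record types under one normalized taxonomic key.
    merged = defaultdict(lambda: {"specimens": [], "literature": []})
    for rec in specimens:
        merged[normalize(rec["taxon"])]["specimens"].append(rec["locality"])
    for rec in bibliography:
        merged[normalize(rec["taxon"])]["literature"].append(rec["title"])
    return dict(merged)

merged = merge_by_taxon(specimens, bibliography)
```

Note that the bibliographic record with the lowercased name still lands under the same species as its specimens; in practice this matching step is where most of the project’s effort went.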


Library Hi Tech Volume 22 · Number 3 · 2004 · 243–245

Online multimedia museum exhibits: a case study in technology and collaboration
Matthew F. Nickerson
Keywords: Grants, Libraries, Museums, Multimedia, Online operations

Eight partners including three university libraries and five regional museums worked together to create the Voices of the Colorado Plateau online exhibit. This site features multimedia exhibits that combine oral history recordings and historic photographs to create a new and engaging online museum experience. Computer and telecommunication technologies were vital in the collaboration, creation and dissemination processes. Collaborative projects among libraries and museums can capitalize on both similarities and differences between these cultural heritage institutions. Working in consortia can produce results that cannot be achieved alone. Both the number and the geographical separation of the partners in this project represent a unique level of cooperation and integration. The extensive use of oral history in a multimedia museum exhibit is also unique to this project.

An online guide to Walt Whitman’s dispersed manuscripts
Katherine L. Walter and Kenneth M. Price
Keywords: Digital libraries, Archives

In November 2002, with funding from the Institute of Museum and Library Services, the University of Nebraska-Lincoln and the University of Virginia embarked on a project to create a unified finding aid to Walt Whitman manuscript collections held in many different institutions. By working collaboratively, the project team is developing a finding aid that is tailored to the needs of Whitman scholars while following a standard developed in the archival community, encoded archival description (EAD). XSLT stylesheets are used to harvest information from various repositories’ finding aids and to create an integrated finding aid with links back to the original versions. Digital images of poetry manuscripts and descriptive information contribute to an ambitious thematic research collection. The authors describe the National Leadership Grant project, identify key technical issues being addressed, and discuss collaborative aspects of the project.

The Maine music box: a pilot project to create a digital music library
Marilyn Lutz
Keywords: Music, Audiovisual media, Digital libraries

The Maine Music Box is an interactive, multimedia digital music library that enables users to view images of sheet music, scores and cover art, play back audio and video renditions, and manipulate the arrangement of selected pieces by changing the key and instrumentation. In this pilot project the partners are exploring the feasibility and obstacles of combining collections, digital library infrastructure, and technical and pedagogical expertise from different institutions to implement a digital music library and integrate it into Maine’s classrooms. This paper describes the methodology for digitizing, processing and providing access to electronic resources owned by two libraries and hosted by another, and the use of those collections to develop an instructional tool keyed to the digital library.

Enabling technologies and service designs for collaborative Internet collection building
Steve Mitchell, Julie Mason and Lori Pender
Keywords: Collecting, Internet, Classification, Portals

The following describes a number of technologies and exemplary service designs that foster better Internet finding tools in libraries and more cooperative and efficient effort in Internet resource collection building. Our library and partner institutions have been involved in this work for over a decade. The open source software and projects discussed represent appropriate technologies and sustainable strategies that will help Internet portals, digital libraries, virtual libraries and library catalogs-with-portal-like-capabilities (IPDVLCs) to scale better and to anticipate and meet the needs of scholarly and educational users.
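The Walt Whitman guide project harvests repositories’ EAD finding aids with XSLT stylesheets and merges them into one integrated guide with links back to the originals. A rough illustration of that harvest-and-merge idea, in Python rather than XSLT and with simplified element names, hypothetical URLs, and hand-written sample XML:

```python
# Illustrative sketch only: extract entries from several repositories' EAD
# finding aids and combine them into one guide that links back to each
# source. The real project uses XSLT stylesheets over full EAD documents;
# the fragments and URLs below are simplified and hypothetical.

import xml.etree.ElementTree as ET

# Two hypothetical fragments standing in for finding aids exposed by
# different repositories.
FINDING_AIDS = {
    "https://example.edu/whitman/ead1.xml": """
        <ead><archdesc><dsc>
          <c01><did><unittitle>Leaves of Grass draft</unittitle></did></c01>
          <c01><did><unittitle>Letter to W. D. O'Connor</unittitle></did></c01>
        </dsc></archdesc></ead>""",
    "https://example.org/whitman/ead2.xml": """
        <ead><archdesc><dsc>
          <c01><did><unittitle>Notebook, ca. 1860</unittitle></did></c01>
        </dsc></archdesc></ead>""",
}

def build_unified_guide(finding_aids):
    # Harvest <unittitle> entries from each EAD document, keeping a link
    # back to the original finding aid, as the integrated guide does.
    guide = []
    for url, xml_text in finding_aids.items():
        root = ET.fromstring(xml_text)
        for title in root.iter("unittitle"):
            guide.append({"title": title.text, "source": url})
    return sorted(guide, key=lambda entry: entry["title"])

guide = build_unified_guide(FINDING_AIDS)
```

The point of the sketch is the shape of the merge: one pass per repository, one combined, sorted guide, and a back-link on every entry.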

Search and discovery across collections: the IMLS digital collections and content project
Timothy W. Cole and Sarah L. Shreeves
Keywords: Collecting, Information retrieval, Digital storage, Grants

In the fall of 2002, the University of Illinois Library at Urbana-Champaign received a grant from the Institute of Museum and Library Services (IMLS) to implement a collection registry and item-level metadata repository for digital collections and content created by or associated with projects funded under the IMLS National Leadership Grant (NLG) program. When built, the registry and metadata repository will facilitate retrieval of information about digital content related to past and present NLG projects. The process of creating these services is also allowing us to research and gain insight into the many issues associated with implementing such services, and into the magnitude of the potential benefit and utility of such services as a way to connect, bring together, and make more visible a broad range of heterogeneous digital content. This paper describes the genesis of the project, the rationale for architectural design decisions, challenges faced, and our progress to date.

The way ahead: learning cafés in the academic marketplace
Morell D. Boone
Keywords: Learning, Architecture, Library facilities

Libraries, like the universities they serve, are faced with the daunting task of reconciling their traditional role as repository and provider of information with the increasing demands of a market-driven society. Learning cafés can provide a place where these two divergent demands are potentially reconciled. By providing sophisticated technologies within a sociable environment, learning cafés seek to enhance the potential for interactive learning among their users. They have the potential to be hosts for an increasingly diverse array of emerging library services. Before incorporating a learning café within new or existing libraries, however, planners must keep in mind the types of learning best suited for this type of area and maintain a flexible design model so that the café can be adapted to future needs.

Copyright in the networked world: interlibrary services
Michael Seadle
Keywords: Copyright law, Interlending, Canada, Germany, United States of America

Interlibrary lending and document delivery have become an integral part of the services that contemporary libraries offer. The copyright laws in most countries authorize this copying within reasonable limits, but tensions with publishers may be growing. For interlibrary services to remain effective, libraries must continue to lobby politicians to defend their legal basis. Libraries must also continue to work with publishers to address legitimate economic concerns. This paper looks at the legal basis for interlibrary services, particularly document delivery, in US, Canadian, and German law.

245

Guest editorial

The IMLS NLG program: fostering collaboration

Timothy W. Cole and Sarah L. Shreeves

The authors
Timothy W. Cole and Sarah L. Shreeves are both based at the University of Illinois at Urbana-Champaign, Urbana, IL, USA.

Keywords: Grants, Museums, Archives, United States of America

Abstract
Its National Leadership Grant (NLG) program is one of the many ways in which the Institute of Museum and Library Services (IMLS) supports the development of innovative new projects and services by the museum, library, and archival community in the USA. Over the course of the NLG program, collaboration has emerged as one of the several strategic approaches that engender success. Digital projects, which can be complex in execution and which often require a diverse range of skills and resources, benefit especially from collaborative approaches. The IMLS NLG program has encouraged a wide range of collaborations, across a diversity of organization types and at a diversity of levels.

Electronic access
The Emerald Research Register for this journal is available at www.emeraldinsight.com/researchregister
The current issue and full text archive of this journal is available at www.emeraldinsight.com/0737-8831.htm

Library Hi Tech Volume 22 · Number 3 · 2004 · pp. 246–248 · © Emerald Group Publishing Limited · ISSN 0737-8831 · DOI 10.1108/07378830410560035

An independent grant-making federal agency, the Institute of Museum and Library Services (IMLS)[1] helps to support and foster the growth and development of museums, libraries, and archives throughout the US via a number of targeted programs. Notable among these is the Institute’s National Leadership Grant (NLG) program. IMLS has been making NLG awards supporting research, assessment, digitization, and the implementation of innovative services since 1998. Over the years a number of recurring themes and threads have emerged, common approaches and strategies that consistently have made for successful and productive projects. Collaboration, at multiple levels and in various ways, has been and continues to be an important attribute of many of the NLG program’s most successful projects, especially those projects and grants focused on the application of new technologies in museums, libraries, and archives.

Speaking last December in Geneva, Switzerland, at a session on the Role of Science in the Information Society, Robert Martin, Director of IMLS, noted that:

This convergence of resources and assets through the potential of digital technology is also spawning new organizational strategies. A key strategy is collaboration, not only among museums and libraries but with the formal educational structure, public broadcasting stations, the private sector, and civil society. Institutions cannot afford to be islands: to achieve their educational missions, they need to work together (Martin, 2003).

This theme issue of Library Hi Tech highlights a few of the many exemplary NLG-funded projects that have exploited or are exploiting collaboration. Our selection is by no means exhaustive, nor even especially systematic. We touch here only the tip of the iceberg of what’s been accomplished in the far-ranging and very successful, albeit still relatively brief history of the IMLS NLG program. Limits of space and time preclude inclusion of all the NLG program collaborative success stories, but in this issue are articles about seven projects that we feel illustrate admirably the range and scope of collaboration on digital projects funded under NLG auspices. Included are articles featuring state-wide and regional collaborations between multiple types of organizations:
• “The Colorado Digitization Program: A Collaboration Success Story” (Colorado).
• “Metadata Rematrixed: merging museum and library boundaries” (Florida).
• “Online Multimedia Museum Exhibits: A Case Study in Technology and Collaboration” (the Colorado Plateau region of the Western US).

Received 2 June 2004; revised 2 June 2004; accepted 14 June 2004.

Two more articles in this issue feature communities of interest that have coalesced to spawn successful and wide-ranging collaborations between information specialists (librarians, curators, and information technologists) and subject specialist end-users (students, teachers, and scholars):
• “An Online Guide to Walt Whitman’s Dispersed Manuscripts.”
• “The Maine Music Box: a Pilot Project to Create a Digital Music Library.”

The final two projects deal with ongoing research into and demonstrations of key infrastructure components that take advantage of the opportunities afforded by new technologies to facilitate and enable collaboration in digital library building at a high level between experts with diverse skills and backgrounds who are widely dispersed geographically:
• “Enabling Technologies and Service Designs for Collaborative Internet Collection Building.”
• “Search and Discovery across Collections: the IMLS Digital Collections and Content Project.”

Introducing these seven articles and placing each in the context of the overall IMLS NLG program is a paper by Joyce Ray, the Associate Deputy Director for Library Services at IMLS. The projects featured in this theme issue cover the waterfront, not only in terms of the scope and nature of the collaborations described but also in terms of the technologies employed and the media types of the digital content developed and used. They testify to the benefits of collaboration in today’s increasingly digital environment.

In closing, we should note that no discussion of the importance of collaboration to IMLS would be complete without mention of the Web-Wise Conference. Sponsored by IMLS and held annually since 2000, this meeting focuses on the many ways in which libraries, museums, archives, and other related organizations are providing digital content and services. Collaboration has been a recurring theme at all the Web-Wise meetings. One museum staff attendee at the very first Web-Wise described this initial conference as a “best practice clinic for how to share resources” (Web-Wise Conference, 2000). Five years later, in one of the concluding sessions of the most recent Web-Wise Conference earlier this year, Bill Barnett, Vice-President and Chief Information Officer at The Field Museum in Chicago, also spoke about collaborative partnerships and the importance of Web-Wise as a venue for learning about and developing such partnerships:

In terms of infrastructure commitments, it is even more important these days that we decide what we do and do not have the capacity to do; to focus on our core competencies and find partners to match. As a museum, we see ourselves as an organization with rich content that is of great value for lifelong learning, a great public venue for learning, entertainment, and social interaction, and a scientific authority in our areas of specialty. We need partners to fulfill other aspects of actualizing our digital plans, specifically in terms of information distribution. Web-Wise fulfills an important role in our community of bringing together partners that play different roles so that they can build on each other’s strengths (Barnett, 2004).

Collaboration is nothing new for libraries, but as these articles illustrate, and as is clear looking back through past proceedings of Web-Wise, collaboration has an even greater potential to enhance and facilitate the way we provide digitally based content and services to our users. We hope that the papers in this special issue will not only inform Library Hi Tech readers about this important aspect of the IMLS NLG program, but also will engender and encourage future productive collaborations.

Note
1 Institute of Museum and Library Services, available at: www.imls.gov/

References

Barnett, B. (2004), “On my mind: commentary on Web-Wise”, First Monday, Vol. 9 No. 5, available at: www.firstmonday.org/issues/issue9_5/barnett/index.html (accessed 29 May 2004).

Martin, R.S. (2003), “Libraries, museums, and nations of learners: new opportunities for the knowledge society”, paper presented at the Role of Science in the Information Society Education Session, in conjunction with the World Summit on the Information Society, Geneva, Switzerland, 9 December 2003, available at: www.imls.gov/scripts/text.cgi?/whatsnew/current/sp120903.htm (accessed 29 May 2004).

Web-Wise Conference (2000), “A conference on libraries and museums in the digital world. Introduction”, First Monday, Vol. 5 No. 6, available at: www.firstmonday.org/issues/issue5_6/introduction/index.html (accessed 29 May 2004).

About the Guest Editors
Timothy W. Cole is Mathematics Librarian and Professor of Library Administration at the University of Illinois at Urbana-Champaign, where he has been a member of the Library faculty since 1989. He has held prior appointments at Illinois as Assistant Engineering Librarian for Information Services and Systems Librarian for Digital Projects. He is Principal Investigator for the University of Illinois IMLS Digital Collections and Content Project. E-mail: [email protected]

Sarah L. Shreeves is Visiting Assistant Professor of Library Administration and Project Coordinator for the University of Illinois IMLS Digital Collections and Content Project. Previously she was a Project Coordinator for the University of Illinois Open Archives Initiative Metadata Harvesting Project funded by the Andrew W. Mellon Foundation. From 1992 to 2001 she was a member of staff at the Massachusetts Institute of Technology Libraries. E-mail: [email protected]

248

Theme articles

Connecting people and resources: digital programs at the institute of museum and library services

Joyce Ray

The author
Joyce Ray is Associate Deputy Director for Library Services, based at the Institute of Museum and Library Services, Washington, District of Columbia, USA.

Keywords: Digital libraries, Museums, Libraries, Education, Learning

Abstract
As a federally-funded independent granting agency, the Institute of Museum and Library Services (IMLS) became involved in digitization in the late 1990s when Congress gave it statutory authority to fund digitization of library and museum collections. Since that time, IMLS has funded more than 100 exemplary digitization projects through its National Leadership Grant program. Collectively, these projects have helped to identify best practices for the creation, management, preservation and use of digital content. Most importantly, they demonstrate the important role that museums and libraries can play in supporting both formal education and lifelong learning. Ultimately, this work will help libraries and museums to fulfill their roles as educational institutions. IMLS grants support the spectrum of learning from independent inquiry through formal education to the development of “learning communities.”

Electronic access
The Emerald Research Register for this journal is available at www.emeraldinsight.com/researchregister
The current issue and full text archive of this journal is available at www.emeraldinsight.com/0737-8831.htm

Library Hi Tech Volume 22 · Number 3 · 2004 · pp. 249–253 · © Emerald Group Publishing Limited · ISSN 0737-8831

Introduction

The Institute of Museum and Library Services (IMLS) was created by Congress in 1996 to enhance museum and library services nationwide and to provide coordination among and between museums and libraries. IMLS is the only federal funding agency with statutory authority to fund digitization. Other agencies also support digitization, of course, but the statutory authority means that Congress has indicated that IMLS should make increased online public access to library and museum content an agency priority. IMLS uses its National Leadership Grant program not only to support the digitization of significant resources but also to:
• develop and disseminate best practices for digitization;
• develop tools to manage digital content;
• focus research on the creation, management, use and preservation of digital resources; and
• encourage the use of technology to create dynamic learning opportunities.

All of these efforts will ultimately help libraries and museums to fulfill their roles as educational institutions. Having high-quality library and museum resources easily accessible online will enhance formal education at all levels, support independent inquiry, and encourage lifelong learning. IMLS digital programs support the agency’s basic mission to build the capacity of museums and libraries to create and sustain a “nation of learners.”

National leadership grant program

IMLS awarded the first National Leadership Grants in 1998. To date, more than 100 National Leadership projects have been funded to create digital content. From the beginning, IMLS has emphasized the development of quality standards and the dissemination of best practices for digitization, interoperability, information discovery and preservation to ensure that digital content will have maximum value and long-lasting impact. As projects have reached maturity, we have all learned from the experiences and results.

One of the best-known National Leadership projects is the Colorado Digitization Program (CDP), first funded in 1999[1]. Under the direction of Liz Bishoff and Nancy Allen at the University of Denver, this project has been a model of practice for the creation and management of large-scale digital content and for collaboration among multiple institutions. The program currently involves more than 40 Colorado institutions of different sizes and types, including libraries, museums, archives and historical societies. The CDP has expanded beyond Colorado and has had a significant influence on digital practices in the US. More than 15 states have instituted statewide digital programs based on the Colorado model. The CDP has also been a leader in conducting user studies for the better understanding of the needs and expectations of teachers and other primary audiences, involving teachers in content selection, and training teachers to find and use digital resources.

Received 29 January 2004; revised 26 April 2004; accepted 13 June 2004.
© This work was prepared by a US government employee and, therefore, is excluded from copyright by Section 105 of the Copyright Act.

Communities, collaboration, and interoperability
IMLS places a high priority on disseminating best practices and promoting discussion of digital issues within the library and museum communities, and on developing partnerships with the educational and computer science communities to address the spectrum of digital content management and use. In 2000, IMLS, in collaboration with the University of Missouri at Columbia's Department of Computer Science, hosted the first Web-Wise Conference on the theme "Libraries and Museums in the Digital World." Held annually every year since, the Web-Wise Conferences showcase innovative digital projects and explore technical and social issues in the creation, management and use of digital content. The conference attracts a wide range of practitioners and researchers across the professional spectrum. Participants have come from libraries and museums of all types and sizes across the US as well as from Canada, the UK, Europe and Asia. Selected papers from the Web-Wise Conferences are collected and published each year as a special issue of the online journal First Monday[2]. IMLS has also promoted collaboration to address many challenges of connecting content with users. In 2001, IMLS supported a discussion group of experts from the museum and library fields to identify issues in the implementation and management of networked digital libraries. The Digital Library Forum was chaired by Priscilla Caplan of the University of Florida's Florida Center for Library Automation and included Liz Bishoff, Colorado Digitization Program; Tim Cole, University of Illinois at Urbana-Champaign; Anne Craig, Illinois State Library; Dan Greenstein, Digital Library Federation; Doug Holland, Missouri Botanical Garden; Ellen Kabat-Lensch, Eastern Iowa Community College; Tom Moritz, American Museum of Natural History; and John Saylor, Cornell University. The group met with representatives of the National Science Foundation's National Science Digital Library to discuss issues of infrastructure, metadata and content enrichment for educational applications. The IMLS-supported Forum went on to develop "A Framework of Guidance for Building Good Digital Collections" as a guide for digitization projects[3]. The framework established principles of good practice and identified current standards in four areas: collections, objects, metadata, and projects. The framework has been endorsed by the Digital Library Federation and the Chief Officers of State Library Agencies. The National Information Standards Organization has now adopted the framework and assumed responsibility for its ongoing maintenance.

IMLS has also been concerned with making high-quality digital content more easily accessible to users. With an initial National Leadership grant to the University of California at Riverside in 1999, IMLS began supporting the development of INFOMINE to collect and disseminate librarian-vetted scholarly information from the Web[4]. This project, under the direction of Steve Mitchell, has since received additional grants to develop new content discovery and management tools. But with many IMLS-funded digital collections positioned at deeper levels of institutional Web sites and with no centralized search capability across collections, much valuable content could remain unknown to prospective users. To address this issue, IMLS issued a Call for Proposals in 2002 for a project to build a metadata repository of IMLS-funded digital collections using the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH). The award was made to the University of Illinois at Urbana-Champaign in September 2002[5]. This project, under the direction of Tim Cole, is involving IMLS grantees in a process to develop a collection-level registry and an item-level search function across the 100+ collections that have been created with IMLS funding. A steering committee representing the library and museum communities is providing general guidance to the UIUC project staff and helping to define the data elements of the collection registry. The collection registry will be accessible to the public through the IMLS Web site. The item-level search function will initially not be publicly accessible, but could be released in the future after refinement and testing.
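On the harvester's side, the OAI-PMH approach described above reduces to issuing simple HTTP requests such as verb=ListRecords&metadataPrefix=oai_dc and walking the returned XML. A minimal Python sketch follows; the function names are ours, and a production harvester must also follow resumption tokens to page through large repositories:

```python
import urllib.request
import xml.etree.ElementTree as ET

# XML namespaces defined by the OAI-PMH and Dublin Core specifications.
OAI = "{http://www.openarchives.org/OAI/2.0/}"
DC = "{http://purl.org/dc/elements/1.1/}"

def parse_list_records(xml_text):
    """Extract (identifier, title) pairs from an OAI-PMH ListRecords response."""
    root = ET.fromstring(xml_text)
    records = []
    for rec in root.iter(OAI + "record"):
        header = rec.find(OAI + "header")
        identifier = header.findtext(OAI + "identifier")
        # The dc:title lives inside the oai_dc:dc metadata wrapper.
        title = rec.find(".//" + DC + "title")
        records.append((identifier, title.text if title is not None else None))
    return records

def harvest(base_url):
    """Issue a single ListRecords request (no resumption-token paging)."""
    url = base_url + "?verb=ListRecords&metadataPrefix=oai_dc"
    with urllib.request.urlopen(url) as resp:
        return parse_list_records(resp.read())
```

A registry service built on such a harvester would periodically re-harvest each provider and merge the records into a central index.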

250

Connecting people and resources

Library Hi Tech

Joyce Ray

Volume 22 · Number 3 · 2004 · 249–253

IMLS continues to encourage the development and aggregation of large-scale digital resources. Such projects encourage the implementation of quality standards, promote interoperability, and enable small institutions, which often have significant resources that are relatively inaccessible, to make their collections more widely available. Collectively, the digitization projects that IMLS has funded through the National Leadership Grant program, aggregated through metadata harvesting, could serve as a prototype for a new national digital library of cultural heritage, similar to the National Science Digital Library for the sciences[6]. The aggregation of digital collections across different types of institutions and geographically dispersed locations will help users to exploit the full potential of the World Wide Web. Just as putting holdings online has greatly increased access to institutional collections previously available to only a handful of scholars, so aggregation through the construction of large-scale digital libraries and metadata harvesting promises to make vast amounts of digital content easier to discover. This potentially enormous knowledge base can support the needs and interests of students and independent learners on a wide variety of topics and will help educators, from kindergarten teachers to university faculties, find appropriate online content for classroom use. National Leadership grants have supported a number of projects to explore issues of interoperability, including the integration of different media formats (for example, GIS and text data in “Boston Streets: Mapping Directory Data,” directed by Greg Colati at Tufts University[7]; sound, video, and photographic images in “Voices of the Colorado Plateau,” directed by Matt Nickerson at Southern Utah University[8]; sound recordings, scores, cover art and lyrics in “Maine Music Box,” directed by Marilyn Lutz at the University of Maine[9]). 
National Leadership grants have also funded projects to link together related collections held by different institutions (such as "vPlants," a collaboration to develop an online searchable herbarium based on holdings of the Morton Arboretum, the Field Museum, and the Chicago Botanic Garden, under the direction of Christopher Dunn at the Morton Arboretum[10]; the University of Nebraska's "Walt Whitman Archives" project, directed by Katherine Walter at the University of Nebraska, which is linking Whitman holdings from more than 60 institutions[11]; and the "Linking Florida's Natural Heritage" project, under the direction of Priscilla Caplan, which created a virtual library of Florida ecological information from the holdings

of Florida libraries and museums[12]). The directors of all these projects have dealt first-hand with the challenges that inevitably arise in complex undertakings and have found practical ways to overcome them.

Future directions and targeted research
It is through such hands-on work that needs are identified for further research, for tools to manage digital processes better, and for applications that will better serve users. To identify research needed to enhance digital content, IMLS supported a workshop in 2003 on "Research Opportunities on the Creation, Use and Preservation of Digital Resources." The workshop was co-hosted by the University of Florida and organized by a committee composed of Bill Barnett, Field Museum; Liz Bishoff, CDP; Christine Borgman, University of California School of Education and Information; Priscilla Caplan, Florida Center for Library Automation (chair); Ken Hamma, J. Paul Getty Trust; Clifford Lynch, Coalition for Networked Information; and Rob Semper, Exploratorium. The workshop identified three principal areas in which further research is needed by both libraries and museums: knowledge integration, preservation, and the integration of digital and physical experiences. The full workshop report is posted on the IMLS Web site at www.imls.gov/pubs/pdf/digitalopp.pdf. IMLS has also solicited research to gain greater knowledge of users and their needs. In 2003, IMLS issued a Call for Proposals for a national study of the needs and expectations of users and potential users of online information. The award to the University of Pittsburgh, for a study directed by Jose-Marie Griffiths, was announced in September. The study will begin with a survey of the existing literature of information studies and museum audience research and identify common survey elements and assumptions. The literature survey will inform the design of a number of focus groups, and the findings from the focus groups will in turn contribute to the design of a national telephone survey in 2005. This project promises to give museums and libraries a better understanding of the information universe in which they, their users and audiences operate.
The study will identify common survey questions that can inform future research and promote comparability of studies. A complementary project, also awarded in 2003, will provide useful information on specifics of why and how people use electronic information and how system design features affect the use and usability. This project is directed by Brenda Dervin


at Ohio State University, in collaboration with OCLC. To encourage museums and libraries to explore their potential contributions to learning technologies, IMLS issued a Call for Proposals in 2003 for projects to demonstrate the potential of high-speed networks to support the development of innovative learning applications. This initiative called for use of the Internet2 network, with its capacity to deliver high-quality multimedia digital content, and for the development and demonstration of learning applications based on library and museum resources. IMLS made two awards for this initiative in 2003: one to the University of Maine for learning modules in science and Maine history; and the other to the WGBH Educational Foundation, in partnership with Washington University in St Louis and the Birmingham Civil Rights Institute, for a digital resource on the Civil Rights Movement. The Maine project is directed by Marilyn Lutz at the University of Maine and the Civil Rights project by Karen Cariani at WGBH. The Maine project will deliver content to public schools throughout the state, where all public schools have Internet2 connections. The Civil Rights project will draw together content from the holdings of the three partnering institutions, including television footage, photographs, oral histories, and other materials. This resource will support the educational programs at the university level at Washington University, elementary and secondary education curricula at the Birmingham Civil Rights Institute, and will provide content and curriculum materials for teachers through WGBH’s Teachers’ Domain Web site. IMLS is soliciting additional Internet2 projects in 2004. The aim of this initiative is to demonstrate the capacity of museums and libraries to develop dynamic, high-quality educational programming to support formal education and informal learning that can be delivered wherever and whenever it is wanted. 
Another important medium with great potential to deliver library and museum digital content is public broadcasting. Because both television and radio broadcasters now use digital technology, and because today’s audiences seek information in a variety of ways, broadcasters are maximizing the labor invested in programming to adapt and reuse content for delivery in multiple media formats, including the Internet. Broadcasters typically use only a small percentage of the raw material they collect in a finished program, but the unused material can have significant value for researchers and other users if it can be made accessible. Broadcasters can bring access to wider audiences and can help libraries and museums tell

compelling stories to engage audiences and encourage further exploration. Museums and libraries can assist broadcasters by providing nonrestricted digital content from their holdings and by providing subject expertise. Museums and libraries are generally perceived by the public as authoritative information sources, and they have the advantage of possessing physical facilities that often provide group spaces for community-based programs. A good partnership, for example, might involve a public television broadcast on current research in early childhood learning. A local museum or library could follow up with public lectures and discussions, hands-on exhibits and learning opportunities, and information resources for parents. IMLS has supported research on the potential for partnerships among museums, libraries, and public broadcasters through a National Leadership Grant to the Urban Libraries Council in 2000. The published report, “Partnerships for Free Choice Learning: Public Libraries, Museums, and Public Broadcasters Working Together,” is available on the Urban Libraries Council Web site www.urbanlibraries.org. IMLS co-hosted a conference with the Corporation for Public Broadcasting in November 2003 to further explore partnerships among libraries, museums and broadcasters. Recent research has shown that children can learn at much younger ages than has commonly been assumed, and that children younger than two years old have tremendous learning capacity (the same is true of adults over 70). In September 2003, IMLS co-hosted a conference on the 21st Century Learner that addressed these issues, in collaboration with the Association of Children’s Museums, the Association for Library Service to Children, the Civil Society Institute, and the Families and Work Institute.

Conclusion
Technology can support and enhance learning for all age groups, and museums and libraries can play an important role in this emerging field of research and practice. Some educational applications will likely draw on library and museum content for use in homes, classrooms, and workplaces. Learning can also take place in libraries and museums, where specialized facilities may provide group learning spaces and cost-effective technology infrastructures. IMLS continues to explore and promote innovative learning applications, such as computer games and virtual reality environments, that will support the spectrum of learning from independent inquiry through formal


education to the development of “learning communities.” The digital revolution has introduced many challenges for libraries and museums, but it has also brought opportunities. Technology holds the promise that libraries and museums will not only remain essential for their traditional roles but also develop new services and new allies to become more relevant than ever in the Information Age. Libraries, museums, archives, and related institutions are charged with stewardship of the world’s knowledge. They promote the democratization of information and provide opportunities for individual enrichment and community engagement. Cultural heritage institutions are demonstrating that, with digital technology, they will make their resources accessible through new channels, for use by wider audiences, and with greater impact.

Further information
IMLS has created a Digital Corner on its Web site (www.imls.gov) to make information about its digital activities more easily accessible. The Digital Corner provides one-stop access to information about IMLS-funded digital projects including digital content, research and tools, and learning applications. It also provides access to relevant IMLS publications, information about IMLS-supported conferences, and announcements of initiatives relating to libraries, museums, and digital technology.

Notes

1 Colorado Digitization Program Homepage, available at: www.cdpheritage.org
2 See, for example, selected papers from Web-Wise 2003, published in First Monday 8, no. 5 (5 May 2003), available at: www.firstmonday.org/issues/issue8_5/index.html
3 "A Framework of Guidance for Building Good Digital Collections", available at: www.niso.org/framework/forumframework.html
4 INFOMINE Homepage, available at: http://infomine.ucr.edu/
5 IMLS Digital Collections and Content project Homepage, available at: http://imlsdcc.grainger.uiuc.edu/
6 National Science Digital Library Homepage, available at: www.nsdl.org/
7 Homepage for Boston Streets: Mapping Directory Data, available at: http://nils.lib.tufts.edu/bostonstreets/
8 Homepage for Voices of the Colorado Plateau, available at: http://archive.li.suu.edu/voices/
9 The Maine Music Box Homepage, available at: http://mainemusicbox.library.umaine.edu/
10 vPlants Homepage, available at: www.vplants.org/
11 Walt Whitman Archives Homepage, available at: www.whitmanarchive.org/
12 Linking Florida's Natural Heritage, available at: http://palmm.fcla.edu/lfnh/

The Colorado digitization program: a collaboration success story

Brenda Bailey-Hainer and Richard Urban

The authors
Brenda Bailey-Hainer is Director of Networking and Resource Sharing, based at the Colorado State Library, Denver, Colorado, USA. Richard Urban is Operations Coordinator, based at the Colorado Digitization Program, University of Denver, Penrose Library, Denver, Colorado, USA.

Keywords
Libraries, Museums, Cultural synergy, Heritage

Abstract
The Colorado Digitization Program has received several IMLS Leadership Grants. The Heritage Colorado and Western Trails grant projects both involved extensive collaboration between libraries, museums, historical societies and archives. Successful collaborative activities included creating best practices, metadata and scanning standards, training, metadata input tools, technological interoperability, and funding strategies.

Electronic access
The Emerald Research Register for this journal is available at www.emeraldinsight.com/researchregister. The current issue and full text archive of this journal is available at www.emeraldinsight.com/0737-8831.htm

Introduction
The Colorado Digitization Program (CDP) was established in 1998 through a Library Services and Technology Act (LSTA) grant through the Colorado State Library. The vision of the CDP is to provide access to the written and visual record of Colorado’s history, culture, government, and industry. From the very beginning, the CDP was a collaborative organization that embraced membership and participation from all cultural heritage institutions – libraries, museums, historical societies and archives. Although focused originally just on Colorado, the CDP has expanded to work with other Western states. Early on, the CDP embraced participation from Wyoming institutions because of the on-going resource sharing relationship between the two states. In addition, the focus of some projects merited collaboration with other states – everything from simply working together to agree on standards to actively cooperating on projects whose intellectual scope was broader than a single state. Part of the charm of working with the CDP is the challenges and rewards of facilitating the cooperation of such a broad range of types of organizations covering such a broad geographic area. Different types of organizations have different standards and culture, as do different states and regions of the United States. Over the course of the last five years, the CDP received a number of IMLS National Leadership grants that all involved different levels of collaboration with different types of entities. These grants included Heritage Colorado (1999), Western Trails (2001), Teaching with Colorado’s Heritage (2001), and Colorado’s Historic Newspaper Collection (2003). Each of these projects had unique challenges related to collaboration. This paper will focus on the experiences with two of them: Heritage Colorado and Western Trails.

Heritage Colorado
The first project for which CDP received funding was Heritage Colorado, a museum-library collaboration. This project focused on the development of a model of library-museum collaboration for creating digital resources that is still in use in Colorado today. The result of this project was the creation of over 47,000 images, mostly through library-museum partnerships.

Library Hi Tech Volume 22 · Number 3 · 2004 · pp. 254–262 © Emerald Group Publishing Limited · ISSN 0737-8831 · DOI 10.1108/07378830410560044

Received: 10 March 2004 Revised: 15 April 2004 Accepted: 11 June 2004

254

The Colorado digitization program

Library Hi Tech

Brenda Bailey-Hainer and Richard Urban

Volume 22 · Number 3 · 2004 · 254–262

All types of cultural heritage institutions of all sizes were encouraged to participate in the project and to contribute metadata records describing their digital materials to a central database. This database, called Heritage Colorado, was hosted on a server owned and maintained by the Colorado State Library. The grant also established a data archive and paid for technical support from the Colorado Alliance of Research Libraries (Alliance) and for training for cultural heritage staff throughout the state. Dublin Core metadata standards that had been collaboratively developed by Colorado's cultural heritage institution community were tested as part of the project. The project also supported testing of the best practice documents for metadata and digital imaging that had been developed under an earlier Colorado LSTA grant during 1998. Even the funding of the project illustrated the collaborative nature of the endeavor. An LSTA grant from the Colorado State Library covered part of the cost of regional scan centers placed throughout the state; an expansion of the State Library's existing SiteSearch software license used to create the Heritage Colorado database; and technical support through the State Library. Colorado has seven regional library service systems funded through the state's general fund. Contributions from these seven library systems provided money for mini-grants to individual institutions for local digitization. One of the most important outcomes of the project was not just the creation of the Heritage collection but also the partnerships and relationships that were formed and the establishment of the regional scan centers, which expanded the numbers and locations of potential partner institutions by making scanning affordable and geographically accessible.
Through these collaborative efforts, the purpose of the original Colorado LSTA grant was achieved: to increase access to unique resources and special collections of Colorado’s cultural heritage institutions through digitization.

Needs assessment
The first step in the Heritage Colorado grant was an environmental scan of the technology and knowledge at the participant institutions. The results of the environmental scan were used to guide the metadata working group in developing standards. The environmental scan showed an adequate level of automation among those doing digitization. Libraries had some type of integrated library system (ILS), while museums had either Access databases or collection management systems (CMS), such as PastPerfect and Argus, to manage collections. As expected, there were some museums and small historical societies that did not have any type of automation, but the majority of participants had something in place. Most institutions already had a Web presence as well, with either a general informational Web site about their institution or access to their collection. One unexpected finding was that there was no great hue and cry for encoded archival description (EAD); no one surveyed had implemented an EAD program yet. The timing of the project was serendipitous. The group had the luxury of very few legacy records to deal with. Two libraries in Colorado, Denver Public Library and Boulder Public Library, had done some digitization, creating records using a proprietary software product that was available through the CARL Corporation, their ILS software vendor. The State Library was in the process of creating the Colorado Virtual Library (CVL). CVL was being designed to search across multiple library catalogs as well as locally developed databases describing selected Web sites. Since Heritage Colorado was being developed at the same time with some of the same staff assigned to both projects, the indexed and searchable fields were coordinated across the different types of resources from the outset. Although the group was interested in newspaper files and anticipated digitizing newspapers at some future time, after extensive discussion newspapers were not included, since they have special needs and were not expected to be a part of the Heritage Colorado project.

Standards and software
Representatives from all of the different types of cultural heritage institutions met with technical staff from the State Library and the Alliance to look at standards that were in use in early 1999 among the professional communities that were creating metadata for digital objects. The group created a matrix comparing the different systems (Table I), identified the lowest common denominator, and as a result picked Dublin Core as the standard for Colorado. The pre-existing records at Denver Public Library and Boulder Public Library were in MARC format, but the group felt that these records could be incorporated into the project through a cross-database searching interface. Because there was no one common metadata entry or cataloging system in use among all of the institutions involved, the group decided that it was unfair to ask participants to do double entry in two separate cataloging systems. Many of these organizations did not have sufficient funding to change systems, and it was unlikely that they would move to a new system just for digital objects. In fact, many wanted simply to add a URL to the


Table I Title and author sample from matrix of common standards

Title; Object Name:
  MARC: 245 ‡a | CDWA: Titles or Names-Text | DC: Title | FGDC: Title [8.4] | GILS: Title | REACH: Object name; Title [#4] | VRA: Title [W2]

Author; Creator; Originator; Maker:
  MARC: 1xx, 7xx, 1xx ‡e, 7xx ‡e | CDWA: Creation-Creator-Identity-Names [*]; Role | DC: Creator; Other Contributors | FGDC: Originator [8.1]; Data set credit [1.11] | GILS: Author; Originator; Contributor | REACH: Creator; Maker [#10] | VRA: Creator [W6], [W7]

original cataloging or inventory control system. It became clear that a union catalog should be created into which records could be loaded. The decision to use Dublin Core as the standard was reinforced because Dublin Core was the most hospitable for loading records from multiple systems. In addition, the Colorado State Library pledged technical support for the union catalog and offered use of the software it used to create local databases for other projects. As a result, the group decided to leverage the existing software and take advantage of the State Library's offer to use SiteSearch WebZ and Database Builder (originally an OCLC product, now available as open source[1]) as the platform to integrate all of the records. In 1999, this was one of the few available systems that could host both Dublin Core and MARC records. The SiteSearch software came with a component called Record Builder, but the group found it unsatisfactory for the project. The Dublin Core component was not robust enough to support the number of simultaneous users for high-volume, multiple-institution activity, and the interface was oriented more toward library catalogers and not easily understood by staff at the non-library cultural heritage institutions. In addition, it could not accommodate crosswalking or mapping MARC records to Dublin Core. As a result, the Alliance staff developed a prototype Dublin Core system called DC Builder, using mySQL and Macromedia ColdFusion. DC Builder supported the conversion of metadata from a variety of systems and formats to Dublin Core, as well as direct entry through a DC form-based application. After locally created records were imported to DC Builder, staff at the institutions could make modifications to the records using DC Builder. The records created through DC Builder were exported as SGML files, then loaded into the Heritage SiteSearch Database Builder database using the standard SGML loader that comes as part of the software.

Figure 1 shows the record loading and user access process. CDP and the University of Wyoming Coe Library staffs are completing work on a Dublin Core to MARC export that will allow records created in DC Builder to be imported into a library management system. The new export utility will be completed in early 2004. DC Builder originally included both short and long record formats to accommodate the beginning and the advanced user. However, later versions of DC Builder simplified this into a single form. In addition to online entry, there are also crosswalks for loading MARC and non-MARC records coming from a variety of systems. Once records are loaded, they are all available for editing through DC Builder. User logins allow institutions to view all records in the database, but only records created by the institution can be updated or deleted. Users are also able to indicate when records are ready to be exported to the Heritage Colorado database. Before records are loaded into DC Builder, institutions are required to complete a crosswalk to ensure local data is mapped into the appropriate Dublin Core field. While CDP provides a sample template (Table I) for a DC to MARC crosswalk, each template is modified to accommodate local cataloging practices. For MARC records, DC Builder uses a PERL module[2] to convert MARC output into Dublin Core records. The PERL module is able to map MARC records at the tag and subfield level. Due to the labor involved, current practice is to avoid byte counting in MARC records whenever possible (a notable exception is the MARC 008 language code). The PERL script also applies Dublin Core refinements and schemes as specified in the crosswalk or in MARC subfields. DC Builder is able to load non-MARC records in various formats (Microsoft Access, Microsoft Excel, FileMaker Pro, and delimited text files). Museum collection management system records are normally exported in one of these common formats for delivery to CDP, where they are converted to temporary mySQL tables.
A Database to DC crosswalk is created from a template and Macromedia ColdFusion scripts are used to load records into DC Builder’s record format.
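The tag-and-subfield mapping that such a crosswalk specifies can be sketched in a few lines. The following Python fragment is illustrative only; the mapping table and record layout are invented for this sketch, not CDP's actual crosswalk. It includes the one case where byte counting is unavoidable, the 008 language code:

```python
# Hypothetical, much-simplified crosswalk: MARC tag -> Dublin Core element.
CROSSWALK = {
    "245": "title",
    "100": "creator",
    "650": "subject",
    "260": "publisher",
}

def marc_to_dc(marc_record):
    """Map a {tag: [field values]} MARC record into a Dublin Core dict.

    The 008 fixed field is handled specially: bytes 35-37 hold the
    three-letter language code, so slicing is required there.
    """
    dc = {}
    for tag, values in marc_record.items():
        if tag == "008":
            dc.setdefault("language", []).extend(v[35:38] for v in values)
        elif tag in CROSSWALK:
            dc.setdefault(CROSSWALK[tag], []).extend(values)
    return dc
```

A per-institution crosswalk would replace the constant table with a configuration loaded from the completed template, so local cataloging practices can override the defaults.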


Figure 1 System architecture for Heritage Colorado

Table II shows an example of a customized MARC to Dublin Core crosswalk. The CDP Metadata Working Group has created a number of metadata case studies that provide in-depth information about local practices and methods of converting records to the Dublin Core format. The case studies are available on the CDP Web site[3]. One of the advantages of using SiteSearch WebZ for searching is that it allows extensive customization of the interface. CDP worked with the State Library's designer to create a lush, colorful interface. The images in the original design structure were later updated by CDP staff (Figure 2). The flexibility of the software allowed project staff to create a design that accommodated the broad range of audiences targeted by a collaborative project. The interface supports both searching through a simple keyword entry box and pre-coordinated searches that allow highlighting of the strengths of the collection as it grew gradually over time. With such a wide variety of institutions participating, there was naturally concern over the subject headings and descriptors assigned to records. Since no true Colorado thesaurus existed, CDP staff harvested subject headings from the Prospector database, a library

union catalog of large research institutions in Colorado that is operated by the Alliance. This list of headings was incorporated into the DC Builder software to allow cultural heritage institution staff to select standard terms for use in their Dublin Core records.
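Incorporating a harvested heading list into a metadata entry tool is essentially a lookup problem: as a cataloger types, the tool offers controlled terms from the list. A small hypothetical sketch (the matching rule here is ours, not DC Builder's):

```python
def normalize(term):
    """Lowercase and collapse whitespace so comparisons are forgiving."""
    return " ".join(term.lower().split())

def suggest_headings(entry, headings, limit=5):
    """Return controlled headings whose text contains the cataloger's entry."""
    needle = normalize(entry)
    return [h for h in headings if needle in normalize(h)][:limit]
```

A production tool would typically index the heading list for speed and rank candidates, but the principle of steering free-text entry toward a shared vocabulary is the same.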

Interoperability
Z39.50 was selected as the common denominator for searching. It allowed the incorporation of the pre-existing collections by using Z39.50 to search both MARC and Dublin Core records simultaneously. Another reason for choosing Z39.50 was the existing philosophy that was part of the Colorado environment. The Colorado Virtual Library, created and supported by the Colorado State Library, was designed to allow users to search across library catalogs and databases describing Web sites with a single, integrated search. The digital resources in Heritage Colorado became searchable in that same environment. In 2003, the University of Illinois library assisted CDP in making DC Builder an Open Archives Initiative provider as part of the IMLS-funded IMLS Digital Collections and Content (DCC) OAI Repository. DC Builder records are


Table II Customized MARC to Dublin Core crosswalk

Colorado digitization project: MARC to Dublin Core crosswalk template
Institution name:    Institution code:    Date:    Initials:

DC element (refinement): MARC tags and subfields (scheme)
Title: 245 ‡a
Title (Alternative): 440, 246, 130, 730, 740, 242
Creator: 100, 110, 111
Subject: 600, 610, 611, 630, 650, 651, 655 (scheme taken from ‡2)
Description: 500, 520, 521, 530, 545, 300
Publisher: 260 ‡b
Contributor: 700, 710, 711, 720
Digital Date (Created): 533 ‡d
Original Date (Created; Issued; Valid; Available): 260 ‡c, 518, 541, 307
Type: 516
Creation format: 538
Use format: 856 ‡q (IMT)
Identifier: 856 ‡u; 020 (ISBN); 022 (ISSN); 024; 037
Source: 786, 534
Language: 008 bytes 35-37; 041
Relation (Has Format; Is Part Of; Has Part; Has Version; References; Is Referenced By; Requires; Replaces; Is Replaced By): 776, 773, 490, 774, 775, 786, 510, 538, 780, 785, 787
Coverage (Spatial): 522
Rights: 506, 540
Dewey decimal: 082, 092 (Dewey)

Figure 2 Heritage Colorado SiteSearch WebZ interface

now being harvested by the IMLS DCC project, the University of Michigan's OAIster, and OCLC. The CDP working groups continue to discuss metadata issues and are forging ahead with best practices for new formats; the CDP Digital Audio Best Practices, currently in draft, is one example[4]. The plan is to continue expanding the collection in Heritage Colorado by adding more images from current participants, acquiring new participants, and adding new formats.
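The harvesting arrangement described above can be sketched as follows. This is a minimal illustration of pulling Dublin Core elements out of an OAI-PMH ListRecords response; the sample XML is a hypothetical DC Builder record, not real CDP data.

```python
# Sketch of parsing Dublin Core records from an OAI-PMH ListRecords
# response, as a harvester like the IMLS DCC's might do. The sample
# record below is hypothetical.
import xml.etree.ElementTree as ET

OAI = "{http://www.openarchives.org/OAI/2.0/}"
DC = "{http://purl.org/dc/elements/1.1/}"

SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Mining camp, Cripple Creek</dc:title>
          <dc:subject>Gold mines and mining</dc:subject>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""

def harvest(xml_text):
    """Extract (element, value) pairs from each oai_dc record."""
    root = ET.fromstring(xml_text)
    records = []
    for rec in root.iter(OAI + "record"):
        pairs = [(el.tag.replace(DC, "dc:"), el.text)
                 for el in rec.iter() if el.tag.startswith(DC)]
        records.append(pairs)
    return records

print(harvest(SAMPLE))
```

A real harvester would fetch the response over HTTP from the provider's base URL and follow resumption tokens; only the parsing step is shown here.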

Western trails

The Western Trails grant in 2001 went well beyond Colorado's borders and encompassed both multi-state and museum-library collaboration. In particular, one of its main tenets was to test how adaptable the Colorado library-museum collaboration model was to other states.

CDP partnered on the Western Trails grant with Nebraska, Kansas, and Wyoming. The grant had several objectives. The first was to have those states look at the Colorado model and determine how much of it could be adopted. Each of the other states had different demographics, a different museum-library environment, a different population distribution, and a different level of pre-existing collaboration. Some states had multitype regional library systems, while others had systems focused on public libraries. The second objective was to test the adaptability and feasibility of adopting the best practices across multiple states and to make any necessary updates to both the digital imaging standards and the metadata best practices. The third was to address interoperability through the creation of a virtual cultural heritage database, using Z39.50 with distributed images. Finally, the project was to create a single Web exhibit on the topic of Western Trails, drawing content from more than 30 participating institutions.

Best practices and standards

The Western Trails project involved some new challenges. States other than the grant partners were interested in the project, and they were invited to working group meetings to discuss standards and technology. Representatives from Colorado, Kansas, Nebraska, Utah, New Mexico, and Minnesota revised CDP's earlier Dublin Core Guidelines into the Western States Dublin Core Metadata Best Practices (WSDCMBP)[5]. A similar digital imaging working group reviewed current national practices and the CDP's Scanning Guidelines to create the Western States Digital Imaging Best Practices, which provide minimum benchmarks for image resolution and quality[6].

Project management

Project management for Western Trails proved more complex than for earlier CDP projects. Subgrants from CDP to lead state agencies were passed


on to local institutions as mini-incentive grants for digitization activities. The grant also included funds for training local institution staff in digital imaging and metadata best practices and standards. Each state developed its own model for aggregating records into a single database. The role of the Colorado State Library and the CDP was to create a single interface that searched across all four state databases using Z39.50 as the search standard. CDP also received a grant of Colorado LSTA funds to expand the Colorado collaboration and to add infrastructure for storing master digital images and metadata.
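The single-interface idea can be sketched as a broadcast search that queries each state's database and merges the hits into one labeled results list. The stand-in search functions and sample records below are hypothetical, not the actual Z39.50 targets.

```python
# Sketch of the broadcast-search pattern behind the central Western
# Trails interface: query each state's database, merge hits into a
# single results list tagged by source. Data here is hypothetical.

STATE_DATABASES = {
    "Colorado": [{"title": "Overland Trail diary"}, {"title": "Denver depot"}],
    "Kansas":   [{"title": "Santa Fe Trail map"}],
    "Nebraska": [{"title": "Oregon Trail marker"}],
}

def search_state(state, term):
    """Stand-in for a Z39.50 search against one state's database."""
    return [r for r in STATE_DATABASES[state]
            if term.lower() in r["title"].lower()]

def broadcast_search(term):
    """Query every state and merge hits into a single labeled list."""
    results = []
    for state in STATE_DATABASES:
        for record in search_state(state, term):
            results.append({"source": state, **record})
    return results

print(broadcast_search("trail"))
```

In the real system each `search_state` call is a Z39.50 request to a remote target, issued in parallel, with results displayed under common labels.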

Software and interoperability

Each state had its own set of challenges. In Nebraska, the Love Library at the University of Nebraska-Lincoln took the lead by acquiring a server running the Zebra general-purpose structured text indexing and retrieval engine, available freely under the GNU General Public License. Library staff created data entry tools for Zebra and installed a Z39.50 server for remote access to Nebraska records. Over the course of the Western Trails project, the Wyoming State Library was in the process of moving from a Dynix system to a SIRSI system. Dublin Core metadata for Wyoming state projects was created using SIRSI's Hyperion Digital Media Archive. Unfortunately, at this time Hyperion does not support Z39.50 searches from external servers, although records are available to users of the SIRSI ILS. The University of Wyoming contributed its records to the project through DC Builder and the Colorado Western Trails SiteSearch database. Institutions in Kansas and Colorado used the DC Builder application service to create new records and import locally created records. Using existing DC Builder/SiteSearch workflows, records are exported to separate Colorado and Kansas SiteSearch databases hosted by the Colorado State Library. Although Utah was not part of the original Western Trails IMLS National Leadership Grant, the J. Willard Marriott Library at the University of Utah provided access to records created in CONTENTdm through ZContent, an open-source Z39.50 server that it developed using Perl scripts and modules. A Western Trails Interoperability Working Group reviewed the requirements for Z39.50 connectivity and established common indexing and search protocols. The working group also recommended modified SiteSearch record display labels that are more easily understood by general users than the default Dublin Core element names.

Colorado State Library and CDP staff used the working group's recommendations to develop a central Western Trails SiteSearch interface that uses Z39.50 to search all of the state databases and return a single results list. Like the Heritage Colorado interface, Western Trails offers users pre-coordinated searches on common keywords and subject terms in addition to the ability to search individual institution records. The Western Trails project Web site is available online[7]. Based on focus group research conducted by CDP under the 1999 IMLS National Leadership Grant, the Western Trails Web site also includes overview exhibits that give users background context for the aggregated collections (Fry and Lance, n.d.; Loomis and Elias, n.d.). The overview exhibits highlight the themes represented by the digitized collections, including Native American Trails, Pioneer Trails, and Tourist Trails, and offer users intellectual links between physically separated collections.

Collaboration pros and cons

A wealth of materials became available through the projects, particularly by drawing together collections from small and diverse institutions. There is often no other way to combine the collections of these small, remote museums, historical societies, libraries, and archives with collections from larger research institutions. The Western Trails project has demonstrated that collaboration can aggregate and increase access across diverse collections and extend the capabilities of local cultural heritage organizations through shared experience and infrastructure. Metadata about local collections can now be shared at statewide, regional, and national levels. Another positive aspect of the CDP collaborative environment is a growing community committed to other forms of online collection access. The same people who bought into the Heritage dream are now supporting the Colorado's Historic Newspaper Collection (CHNC) project. For example, the Littleton Historical Museum had not collaborated with the library community prior to its participation in CDP's Colorado's Main Streets: Virtual Walking Tours project. Given the opportunity to participate, it has become a strong supporter of CDP's best practices and a major contributor to Colorado's Historic Newspaper Collection. CDP's collaborative approach proved scalable because CDP was always cognizant of the enormous potential in going beyond the


Colorado borders, and projects were designed with technology that could scale to that purpose. The approach allows a great deal of flexibility because it is a distributed system that permits branding and control of images and metadata at the local level, while still allowing exhibits and searches across the collective whole. One of the real challenges of collaborative work among different cultural heritage institutions is that it is time-consuming. There are no shortcuts for building a common understanding of terminology, project priorities, and trust. CDP spent the better part of a year meeting with representatives from different cultural heritage institutions to gather information about each type's practices and to build a collaborative, rather than just a cooperative, environment. Higher priorities at the home institution and a lack of technical support, particularly among smaller institutions, often pushed digitization projects back. Several projects took months or even years longer than expected and required more involvement by CDP staff to provide special assistance in converting records. Despite the training and best practices offered by CDP, local project staff still faced challenges in understanding the impact of local practices in a shared environment and in finding adequate time and project management to overcome technological hurdles. One of the key project management issues was the timing of training relative to actual digitization work. Digitization workshops were offered early in the project, but it frequently took several months for projects to select collections for digitization and develop local workflows before beginning scanning and metadata creation. This lag time often meant that lessons learned in the early training sessions had to be re-learned. Staff turnover also delayed many projects, as it was necessary to bring new staff up to speed on best practices. Poor communication between the project managers who attended training and the staff completing the work also led to several delays, because projects had to re-do work where best practices were ignored or misinterpreted. One challenge that still has not been adequately resolved is the problem of vocabularies used by different types of organizations, particularly for subject terminology. For example, Florissant Fossil Beds National Monument participated in the original Heritage Colorado project. Since paleontologists were the audience for the original database describing its extensive fossil collection, Florissant staff used scientific taxonomies as the primary access point. However, a layperson trying to find images of fossils in Heritage Colorado was likely to search by very

general keywords like "fossil", which Florissant had not assigned to the records. Similar problems occur when importing museum collection management records that lack standardized controlled vocabularies or that use taxonomies with terms unfamiliar to the general public. CDP staff and contract programmers also faced a learning curve in developing accurate Dublin Core crosswalks and communicating requirements to local institutions. As the collaboration matured, CDP developed improved methods of creating records that come closer to meeting the best practice guidelines. For example, some institutions are unable to capture all required Dublin Core elements in local metadata systems. CDP is now able to work with local institutions to identify appropriate values that can be appended to locally generated records.
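The append-a-default-value approach can be sketched as follows. The required-element list and the default values here are illustrative assumptions, not CDP's actual rules.

```python
# Sketch of appending institution-level default values to records that
# lack required Dublin Core elements, as CDP does for some contributors.
# The required-element list and the defaults are illustrative.

REQUIRED = ("title", "creator", "rights")

def append_defaults(record, defaults):
    """Fill in any required element missing from a locally created
    record, leaving locally supplied values untouched."""
    completed = dict(record)
    for element in REQUIRED:
        if element not in completed and element in defaults:
            completed[element] = defaults[element]
    return completed

local_record = {"title": "Ute basket, ca. 1900"}
museum_defaults = {"creator": "Unknown",
                   "rights": "Contact the holding institution"}
print(append_defaults(local_record, museum_defaults))
```

Because the defaults are set once per institution, this keeps locally generated records valid against the shared best practices without requiring changes to each local metadata system.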

Collaboration and the future

While we have conquered some of the original challenges in interoperability and collaboration, with technology there is always a new horizon to pursue. There will continue to be a need to enhance the Dublin Core standard as different material types are converted to digital form and CDP members gain more experience with formats such as audio and video. Other eXtensible Markup Language (XML) standards, such as the Metadata Encoding and Transmission Standard (METS) and Encoded Archival Description (EAD), require additional exploration before they can be integrated into the CDP environment. As commercial vendors expand their product offerings related to Dublin Core, the CDP will need to decide whether to continue maintaining its own custom system. Because the CDP and the Colorado State Library work so closely together on projects, their technology is deeply interwoven; any technology issues are not the CDP's alone to decide, and the two organizations will have to make decisions together. The CDP is already facing new challenges. As the most recent IMLS-funded project, Colorado's Historic Newspaper Collection, takes shape and grows into a major resource, staff must consider how it can be integrated with the Heritage interface. Current plans call for creating high-level metadata records for each newspaper title and including them in the Heritage Colorado database, but access could also be provided by adding article-level records. Since the newspaper collection will include over 1.6 million digitized pages, the sheer volume of individual articles could overwhelm and dwarf the other materials referenced in the database. Should federated searching be explored


using Z39.50 or other means, since the Olive software in use for the newspaper project does not support Z39.50 or OAI harvesting? At the Colorado Digitization Program, we believe we have created a rich collection of digital treasures that will continue to grow, no matter how the technological solutions evolve. More importantly, we have created and nurtured a collaborative environment among cultural heritage institutions throughout Colorado, as well as within and between many other states, that will continue to grow as well. As Joey Rodger of the Urban Libraries Council so aptly put it during her presentation at the IMLS Web-Wise 2004 Conference: "One, two, three, four – who are we building it for? Five, six, seven, eight – if it makes sense, collaborate!" In our minds, it always does.

Notes

1 OCLC SiteSearch open source project, available at: www.sitesearch.org/
2 MARC/Perl Library, available at: http://marcpm.sourceforge.net/
3 Colorado Digitization Program, metadata case studies, available at: www.cdpheritage.org/resource/metadata/
4 Colorado Digitization Program, digital audio best practices guidelines, available at: www.cdpheritage.org/resource/audio/std_audio.htm
5 Colorado Digitization Program, Western States Dublin Core metadata best practices, available at: www.cdpheritage.org/resource/metadata/wsdcmbp/index.html
6 Colorado Digitization Program, Western States digital imaging best practices, available at: www.cdpheritage.org/resource/scanning/index.html
7 Western Trails project Web site, available at: www.cdpheritage.org/westerntrails/

References

Fry, T. and Lance, L. (n.d.), A Comparison of Web-Based Library Catalogs and Museum Exhibits and their Impacts on Actual Visits, unpublished report.

Loomis, R.J. and Elias, S.M. (n.d.), Website Availability and Visitor Motivation: An Evaluation Study for the Colorado Digitization Project, unpublished report.


Metadata rematrixed: merging museum and library boundaries Priscilla Caplan and Stephanie Haas

The authors

Priscilla Caplan is based at the Florida Center for Library Automation, Gainesville, Florida, USA. Stephanie Haas is based at the University of Florida Libraries, Gainesville, Florida, USA.

Keywords

Museums, Taxonomy, Z39.50, Geographic information systems

In 1957, Karl Schmidt, the renowned herpetologist of Chicago's Field Museum, wrote in Curator: "Not everyone realizes how different the use of books may be in a museum from the familiar pattern of reading and note-taking in a public library. In a museum, a book may be . . . tested by reference to a specimen or a series of specimens drawn from the range and laid beside it." Schmidt's comment on the use of scientific literature in natural history museums inspired the IMLS grant-funded project entitled "Linking Florida's Natural Heritage: Science & Citizenry." The Linking project was intended to use the specimen as the nexus for pulling together scientific data from museum specimen databases and library catalogs of scientific literature. The goals of the project were to integrate specimen records and bibliographic records about the same species; to create an interface equally easy for scientists, students, and laymen to use; and to enhance bibliographic description to make it more usable in a taxonomic and environmental context.

Abstract

Linking Florida's Natural Heritage uses species information as the nexus for pulling together scientific data from museum specimen databases and library catalogs of scientific literature. The goals of the IMLS-funded project were to integrate specimen records and bibliographic records about the same species; to create an interface equally easy for scientists, students, and laymen to use; and to enhance bibliographic description to make it more usable in a taxonomic and environmental context. Although some development was required to enable Z39.50-based broadcast search across bibliographic and specimen collections, the bulk of the work was devoted to identifying and overcoming inconsistencies between the resource description practices of libraries and museums. Enriching records with taxonomic and geographic information was also a challenge.

Electronic access

The Emerald Research Register for this journal is available at www.emeraldinsight.com/researchregister
The current issue and full text archive of this journal is available at www.emeraldinsight.com/0737-8831.htm

Library Hi Tech
Volume 22 · Number 3 · 2004 · pp. 263–269
© Emerald Group Publishing Limited · ISSN 0737-8831
DOI 10.1108/07378830410560053

Received: 15 January 2004; Revised: 13 April 2004; Accepted: 13 June 2004

Project deliverables

Linking Florida's Natural Heritage provides a single interface to disparate collections[1]. The original grant focused on linking two specimen collections from the Florida Museum of Natural History, ichthyology and herpetology, and four bibliographic files describing scientific literature. Three of these files were pre-existing: the FORMIS Ant Bibliography, the Archie Carr Sea Turtle Bibliography, and Everglades Online, a database of citations on the Everglades ecosystem produced by the Everglades Information Network. The fourth file, called Florida Environments Online, was created especially for the project. It initially consisted of 13,380 citations from eight research bibliographies compiled by scientists and state agencies in Florida. An online form was created to allow researchers and agency staff to add to the database, and a workshop was held in January 2000 to train contributors in data entry. To facilitate searching across different environmental citation databases, the project also developed a thesaurus of Florida environmental terms in cooperation with the Florida Geographic Board, the Florida Department of Environmental Protection, the South Florida Water Management District, the Florida Marine Research Institute, the US Geological Survey (USGS), and others. Terms were taken from ENVOC (a multilingual


thesaurus of environmental terms), the Fish and Wildlife Reference Service thesaurus, Aquatic Sciences and Fisheries Abstracts, Fire Ecology Thesaurus, California Resources Agency Environmental Resources (CERES) Thesaurus, the Florida Natural Areas Inventory classification, the Florida Department of Transportation’s Florida Land Use, Cover and Forms Classification System, and the Federal Wetlands Classification. Gail Clement, who was at Florida International University during the grant period, compiled and harmonized the thesaurus database that now contains 4,376 terms in seven major subject categories. The thesaurus should be considered a work in progress and is available online[2]. Also as part of the project, a “core collection” of 200 seminal texts on species and ecosystems was selected by scientific experts throughout the state of Florida and these materials were digitized specifically for the Linking project. Catalog records for the materials were added to the Florida Environments Online database. Finally, four educational modules were written and tested in local schools, using the Linking interface for lessons in taxonomy, invasive species, biodiversity, and library and museum collections. Architecturally, the application is straightforward. A Z39.50 gateway product, OCLC’s SiteSearch, was used to create a user interface offering broadcast searching (now called “metasearch”) of the specimen and literature databases. The interface is shown in Figure 1. It is configured so that a user can search all of the databases at once, choose literature or specimens only, or select any combination of databases to search. Search results are aggregated by

the database in which they were found, but displayed in a common labeled format. The selected databases of literature citations were already in MARC format and accessible through library catalog systems, so implementing access to them was a relatively simple matter of establishing a Z39.50 connection. Records that were not in Z39.50-accessible systems were copied into a local library system with a Z39.50 server. The two specimen collections, however, were in Microsoft SQL Server databases at the Florida Museum of Natural History (FLMNH). Linking to these required local development of a Z39.50 server that could translate the SiteSearch client's query into SQL queries and search the relational databases via ODBC. The tables constituting the result set were then mapped to MARC format records for display. Museum specimen databases have some commonalities, but they are not standardized to the extent that libraries have standardized on MARC. To facilitate the integration of additional specimen databases, we decided to specify a standard view that participating museums would be required to provide. (In relational database technology, a "view" is a special organization of data, created as needed from the base tables.) The view was based on a set of metadata elements for specimen collections, known as the "Darwin Core", developed by ZBIG (the Z39.50 Biological Implementers Group out of the Museum of Natural History, University of Kansas). For the Linking project, the elements of interest included the specimen identifier and location, taxonomic information, and the collector, date, and place collected. Although architecturally straightforward, the application is complicated by incompatibilities between the library and museum data. Information elements unique to one category or the other, or elements that were not central to our purpose, were not a problem.
For example, we made no attempt to reconcile the form or content of author names and collector names although we are aware that collectors are often authors of articles related to their taxonomic interests. However, we felt that it was critical for species names and geographical location information to be cross-searchable in all of the included databases. Different practices in library and museum communities made this difficult.
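The "standard view" requirement might look something like the SQL below. The local table and column names are hypothetical, and SQLite stands in for the museums' SQL Server databases; the point is that each museum maps its own schema onto agreed, Darwin Core-style column names.

```python
# Sketch of the "standard view" idea: each museum exposes its local
# specimen tables through a view with agreed Darwin Core-style columns.
# Table, column, and specimen data here are hypothetical; SQLite is
# used in place of the actual SQL Server databases.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE fish_specimens (
    catalog_no TEXT, genus TEXT, species TEXT,
    collector TEXT, collected_on TEXT, county TEXT);
INSERT INTO fish_specimens VALUES
    ('UF 12345', 'Carcharhinus', 'limbatus',
     'J. Smith', '1999-06-29', 'Franklin');
-- The agreed view: local column names mapped to shared element names.
CREATE VIEW darwin_core AS
    SELECT catalog_no AS catalog_number,
           genus || ' ' || species AS scientific_name,
           collector, collected_on AS collection_date,
           county AS locality
    FROM fish_specimens;
""")
row = conn.execute(
    "SELECT scientific_name, locality FROM darwin_core").fetchone()
print(row)
```

The Z39.50-to-SQL server then only has to query the shared view, never each museum's base tables.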

Figure 1 Interface to the specimen and literature databases

Species names

Traditionally, taxonomists who work with museum specimens have used the binomial nomenclature


refined by Linnaeus. It consists of a two-part Latin name formed by appending a specific epithet to the genus. "The advantage of scientific names is less their stability than that they are the same no matter what the user's language and that changes in them are governed, in principle, by an internationally adopted Code" (Walker and Moore, n.d.). Thus, scientists around the world know that Grus americana is the scientific name for the bird commonly known as the Whooping crane. What complicates the picture somewhat is that some species have been reclassified and renamed, which can lead to multiple synonyms (previously valid names). While taxonomists try to keep up with all the scientific name changes, specimen records may or may not be updated as names evolve. It should also be noted that current technological advances in determining the relatedness of species through genetic mapping might eventually disrupt the hierarchical relationships implied in the species epithet. Nonetheless, the names will likely be retained for their value in identification. Unlike museums, libraries follow the Library of Congress and use common names in their cataloging records. This is clearly an attempt to serve a broader audience of library users than taxonomically literate scientists. However, common names differ even more than scientific names, often depending on the language of the speaker and where the specimen was collected. Thus, a species may have multiple common names and multiple scientific names. The first conundrum of the project was how to simultaneously search museum records containing scientific names but no common names, and library records containing common names but no scientific names. The initial approach was to add scientific names to the library citations, using the Integrated Taxonomic Information System (ITIS). ITIS is a multi-agency project intended to provide a taxonomic authority for the US biological species spectrum.
The ITIS database[3] can be searched by common name or scientific name and returns the full taxonomic hierarchy for the species. Common names found in bibliographic records were searched in ITIS, and the taxonomic information was cut-and-pasted into the MARC 754 field (Added Entry – Taxonomic Identification). At that time the 754 was defined to contain a repeatable subfield "a" for the taxonomy and a subfield "2" for the taxonomic authority. The example given in the USMARC documentation was:

754 ## $aPlantae (Kingdom) $aSpermatophyta (Phylum) $aAngiospermae (Class) $aDicotyledoneae (Subclass) $aRosales (Order) $aRosaceae (Family) $aRosa (Genus) $asetigera (Species) $atomentosa (Variety) $2[code for Lyman David Benson's Plant Classification]

The Linking project adapted this to a format more easily cut-and-pasted from ITIS, and added common names and synonyms. A populated 754 field looked like this:

754 ## $aKingdom: Animalia $aPhylum: Chordata $aSubphylum: Vertebrata $aClass: Chondrichthyes $aSubclass: Elasmobranchii $aSuperorder: Selachimorpha $aOrder: Carcharhiniformes $aSuborder: Scyliorhinoidei $aFamily: Carcharhinidae $aGenus: Carcharhinus $aSpecies: Carcharhinus limbatus $aAuthor: (Valenciennes, 1839) $aCommon Name: blacktip shark $aSynonym(s): Carcharhinus natator, Carcharias aethlorus, Carcharias ehrenbergi, Carcharias limbatus, Carcharias microps, Carcharias muelleri, Carcharias phorcys, Carcharias pleurotaenia, Galeolamna pleurotaenia tilsoni, Gymnorhinus abbreviatus, Gymnorrhinus abbreviatus $2Integrated Taxonomic Information System, 6/29/99

Experience with this data exposed problems with the way the MARC field was defined. Because subfields did not map directly to taxonomic levels, it was difficult to construct a search for genus name or species name only. Also, embedding labels in the data removed all flexibility from display. The most common form of display, the genus and species name together (e.g. "Rosa setigera"), required extensive parsing to create. These problems led us to submit a proposal to the USMARC Advisory Group in 2000 that was eventually amended and approved in 2001. The currently defined 754 field has a repeatable subfield "c" used to label the taxonomic level given in the following subfield "a". The example given in the MARC21 documentation is now:

754 ## $ckingdom $aPlantae $cphylum $aSpermatophyta


$cclass $aAngiospermae $csubclass $aDicotyledoneae $corder $aRosales $cfamily $aRosaceae $cgenus $aRosa $cspecies $asetigera $cvariety $atomentosa $2[code for Lyman David Benson's Plant Classification]
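A short sketch of why the revised definition matters: with explicit $c level labels, extracting the genus and species for a "Rosa setigera"-style display becomes a simple pairing of subfields rather than label parsing. The helper below is illustrative, not project code.

```python
# Sketch of consuming the revised 754 field: paired $c/$a subfields
# make each name's taxonomic level explicit, so building a
# "Genus species" display needs no parsing of embedded labels.

def parse_754(subfields):
    """Turn a list of (code, value) subfield pairs into a
    taxonomic-level -> name mapping."""
    levels, pending = {}, None
    for code, value in subfields:
        if code == "c":
            pending = value          # remember the level label
        elif code == "a" and pending:
            levels[pending] = value  # pair it with the next name
            pending = None
    return levels

field = [("c", "kingdom"), ("a", "Plantae"),
         ("c", "genus"), ("a", "Rosa"),
         ("c", "species"), ("a", "setigera")]
levels = parse_754(field)
print(f'{levels["genus"]} {levels["species"]}')
```

Under the original definition the same display required scraping level names out of the subfield values themselves.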

Geographic locations

Potential populations are forecast by matching environmental parameters supporting known populations to geographic areas with similar characteristics. In the case of museum specimens, locations are recorded when the specimen is collected. Today, the widespread use of global positioning system units (GPS) leads to accurate location information. However, historic locations are often open to interpretation and accuracy is usually qualified in terms of reliability. Museums usually record a hierarchical structured location: region, country, state or department, county or province. This may or may not be accompanied by a fuller textual description of the collection site, e.g. “ca. 1/3 mi. w. of NW 98th St., and ca. 2 mi. w. of I-75 and Gainesville, S6, T10S, R19E.” Work being done to convert descriptions into spatial footprints is described under “Future plans reflect current initiatives.” In the case of marine collections, the location can also include drainage, e.g. Lower St Johns River. Thus, geographic data associated with a Gulf sturgeon could be: North America (Region), USA (Country), Florida (State), Putnam (County), Apalachicola (Drainage). In libraries, geographic information is given in the 651 field (Subject Added Entry – Geographic Name), but it is not given in a similar hierarchical format. A work on Gulf sturgeon in the Suwannee was assigned the subject heading “Gulf Sturgeon – Suwannee River (Ga. and Fla.)”. While specificity is high, a search for the state “Florida” would not retrieve this record, nor is the heading related to a county. In order to provide more consistency between the bibliographic and specimen data, a hierarchical form of name was added to the MARC records in the 752 field (Added Entry – Hierarchical Place Name). 
This field is most heavily used for places related to newspaper circulation area, but it is defined more generally to hold any “hierarchical form of place name that is related to a particular attribute of the described item, e.g. the place of publication for a rare book” (MARC21 Concise Bibliographic). In the Linking project, it was used to provide hierarchical geographic information related to the content of the text. For example, the book on Gulf sturgeon in the Apalachicola River was assigned the heading: 752:: ja United States jb Florida jc Franklin jd Apalachicola River

In addition to the scientific name, the collecting site is critical to a specimen's scientific value. The first specimen of a species fixes a locus on the earth's surface for future questions about species distribution, possible genetic lineages, future species viability, and other important vectors.

Metadata rematrixed: merging museum and library boundaries
Priscilla Caplan and Stephanie Haas
Library Hi Tech · Volume 22 · Number 3 · 2004 · 263–269

Although this venture richly enhanced the taxonomic value of the MARC records, it proved too labor intensive to sustain in the long run. Also, although it allowed users to search for scientific names in bibliographic databases, it did not address the inability to search for common names in scientific databases. The museums had no desire to record common names in their specimen databases. Therefore, we turned to a second approach, creating a mapping table between common and scientific names. Dr Wayne King, Curator of Herpetology at the Florida Museum of Natural History, supervised the creation of a mapping table for approximately 6,800 species of mammals, birds, reptiles, amphibians, fish, and mollusks. The information is available in thesaurus form[4]. As the application is currently implemented, the mapping table is invoked only when a user enters a common-name search against a specimen database. The mapping table is joined (merged) with the specimen database view to create a new view in which the user's term is searched. This solves the problem of the absence of common names in scientific data, but does not address the opposite problem, the lack of scientific names in bibliographic data. In addition, the mapping table does not cover all species included in Linking, especially as new specimen databases are added. These problems may be addressed by a third approach in the future, as noted below.

Beyond the place names in the 752, additional geographical information was added to bibliographic records in order to anticipate their use in a GIS environment. We manually looked up the place names in the USGS Geographic Names Information System database to obtain latitude and longitude coordinates, which were entered into the MARC 034 field (Coded Cartographic Mathematical Data). We also entered standardized forms of county names and hydrological names taken from the Federal Information Processing Standard (FIPS) and the USGS Hydrologic Unit Codes (HUC), federal government standards maintained by the USGS. The 651 field (Subject Added Entry – Geographic Name) seemed the most appropriate field for the HUC and FIPS data. The field is defined so that the authority for the data is given in coded form in subfield 2, but only if the source is listed in the MARC Code List. At the beginning of the project, the MARC Code List had codes for neither HUC nor FIPS, so we put the source of the term in a subfield x, with a second indicator "4" indicating, somewhat inaccurately, that the source of the term was unknown:

651 _4 $a Monroe $z 12087 $x FIPS
651 _4 $a Apalachicola $z 3130011 $x HUC
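Generating such 651 fields from code lookups is a simple templating exercise. A sketch under stated assumptions: the lookup tables below are tiny illustrative samples, and the `$` rendering of subfield delimiters is ours:

```python
# Generate 651 (geographic subject) fields carrying FIPS county
# codes and HUC drainage codes, using the project's original
# subfield-x convention for the source of the term.
FIPS_COUNTIES = {"Monroe": "12087"}
HUC_DRAINAGES = {"Apalachicola": "3130011"}

def fips_651(county):
    return f"651 _4 $a {county} $z {FIPS_COUNTIES[county]} $x FIPS"

def huc_651(drainage):
    return f"651 _4 $a {drainage} $z {HUC_DRAINAGES[drainage]} $x HUC"

print(fips_651("Monroe"))        # 651 _4 $a Monroe $z 12087 $x FIPS
print(huc_651("Apalachicola"))   # 651 _4 $a Apalachicola $z 3130011 $x HUC
```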

An advantage of this approach was that the data displayed in our system as a common subject heading (Monroe – 12087 – FIPS), making it clear to any geographically literate researcher that this was a FIPS code available for searching. Later, at our request, the Library of Congress added HUC and FIPS authorities to the MARC Code List. The same data are now coded:

651 _7 $a Monroe $z 12087 $2 ceeus
651 _7 $a Apalachicola $z 3130011 $2 huc

The second indicator "7" designates that the source of the term is given in subfield 2. Unfortunately, because the FIPS county codes are defined in a document titled Counties and equivalent entities of the United States, its possessions, and associated areas, the code assigned to them was "ceeus", a designation that nobody would recognize. A better code would have been "fips55", which uniquely identifies the county codes among the FIPS standards. We are planning to submit a change request to the Library of Congress.

Current status

At this point, nearly seven years later, Florida Environments Online has grown from some 13,140 records at the end of the grant period to over 26,000 records in December 2003. Citations from other databases have been ingested, although no effort has been made to identify or remove duplicates, and the indexing and cataloging idiosyncrasies of contributing agencies have been accommodated. Florida records from the Aquatic, Wetland and Invasive Plant Information Retrieval System (APIRS) (5,000 records) were added in 2003.

The digitized "core collection" has now grown far beyond the original 200 texts. The first 15 volumes of the Bulletin of the Florida Museum of Natural History are also available, as are some 338 publications and 126 maps of the Florida Geological Survey. Projects are underway to digitize the remaining Florida Geological Survey series, and to add the major series written by staff of the University of Florida's Institute of Food and Agricultural Sciences and the Howard T. Odum Center for Wetlands. The number of databases available for broadcast search has also grown, with the addition of one bibliographic database and six museum specimen databases. The new bibliographic file is the South West Florida Environmental Documents Collection, covering Estero Bay, Charlotte Harbor, and the Caloosahatchee River. Specimen databases added include the mammalogy and herbarium collections of the Florida Museum of Natural History, the ichthyology collection from the Florida Marine Research Institute, the bird specimen collection from the Tall Timbers Research Station, and the bryophytes and lichens collection and vascular plants collection from Camp Blanding.

Usage statistics

Project Web site. Use of the project Web site remains fairly constant, with about 4,000 accesses a year from 2000 through 2003. The total number of searches recorded on the Z39.50 server is 20,278. Of particular interest is use by institution. Accesses from the ten public universities in Florida are coded separately; all other users are lumped into an unknown category. Over the four years, the highest number of uses came from the unknown category, followed by use from the University of Florida and, beginning in 2002, from Florida Gulf Coast University. Figure 2 shows usage of the site between the years 2000 and 2004. At this point, we have no data on the querying methods used by searchers using this interface, nor do we know whether the project is fulfilling their research or educational imperatives. As statistics gathering stabilizes, we will probably explore some Web-based surveys to determine how people are using this system and what suggestions they might have for improving it. We do feel that usage could be higher, given the utility of the application, and intend to pursue several avenues of promotion, including placing notices in agency publications, adding links to related Web sites, and publicizing the availability of the educational modules.
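The per-institution coding described above amounts to classifying each request's origin against a known list, with everything else falling into the unknown bucket. A hypothetical sketch; the hostname-to-institution mapping and the `tally` helper are our inventions, not the project's logging code:

```python
from collections import Counter

# Tally searches by originating institution, lumping anything
# unrecognized into "unknown", as the project's statistics did.
KNOWN = {
    "ufl.edu": "University of Florida",
    "fgcu.edu": "Florida Gulf Coast University",
}

def tally(hostnames):
    counts = Counter()
    for host in hostnames:
        label = "unknown"
        for suffix, name in KNOWN.items():
            if host.endswith(suffix):
                label = name
                break
        counts[label] += 1
    return counts

print(tally(["lib.ufl.edu", "dialup.example.net", "www.fgcu.edu"]))
```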

Figure 2 Use of Linking Florida’s Natural Heritage

Core collections. More than 200 digitized full-text documents associated with the project were moved to a new server in 2002. Thus, consistent use statistics are available for 2002 and 2003 only. However, they are of interest because of the phenomenal increase in use in one year. In 2002, tables of contents of core collection documents were displayed 5,425 times; in 2003, this increased to 58,168 times. In 2003, the table of contents from Bulletin #3 of the Florida Geological Survey, "Miocene gastropods and scaphopods of the Choctawhatchee formation of Florida" by W.C. Mansfield, was viewed 1,101 times. In second place was Ida Cresap's The history of Florida agriculture, with 829 viewings.

Florida Environments Online. Cumulative statistics for use of the Florida Environments Online database are available for three years. Transactions involving the database were recorded at 71,660 (2000/2001), 52,030 (2001/2002), and 193,543 (2002/2003).

Future plans reflect current initiatives

Linking Florida's Natural Heritage is a stable application offering unique functionality related to Florida's species and ecosystems. It could remain as it is for some time, with most of our efforts directed to increasing awareness and use. At the same time, there are both technical and functional reasons to re-evaluate the architecture and content.

The mapping table approach to relating common and scientific names requires the maintenance of a static local thesaurus. The way it is implemented, as an "under the covers" process, means that unsuspecting users may get unanticipated or even incorrect results. The implementation also restricts its use to finding common names for specimens. A better approach would make use of new developments such as uBIO, the taxonomic name server[5] being developed by Cathy Norton and Dave Remsen at Woods Hole. This initiative is creating a concordance of common and scientific names that is tied into the major taxonomic work being carried on by ITIS, the Global Biodiversity Information Facility (GBIF), Species 2000, and others. Integrative software is being developed so that individual projects such as Linking Florida's Natural Heritage can make use of the powerful concatenating taxonomic name server to help users formulate their queries. This would obviate the local maintenance problem and add assurance that all identified name variations were included in user queries.

Another architectural problem is that returning results (citations and specimen records) as HTML makes it impossible to use the data in a geographic information systems (GIS) environment. For many researchers, the ability to plot a selected set of specimens on a map by their collection location would greatly enhance the value of the system. To do this, the result set would optimally be formatted in tabular form and made available as a downloadable file. The nature of the SiteSearch application precludes our supplying results in this way. Because of the research utility in applying GIS functionality to bibliographic works as well as natural history data, future enhancements of the project would facilitate spatial analysis of both forms of information.

Two current projects, one from the museum world and one from the library world, are of particular interest. GEOLocate[6], a new site developed at Tulane University, attempts to georeference specimens using textual descriptions. The GEOLocate software is available for free download, or it may be used through the Web interface. The description mentioned earlier in this paper, i.e. "ca. 1/3 mi. w. of NW 98th St., and ca. 2 mi. w. of I-75 and Gainesville, S6, T10S, R19E.", produced a latitude of 29.65139 and a longitude of 282.3749051520174. The Perseus Digital Library[7], being developed at Tufts University, creates spatial visualizations of document texts. While the current collaborations relate to mapping classical works, the functionality could be applied to natural history publications as well. Essentially, place names in a text would be displayed on a map, and the user could shift between the map and the document. Ideally, shifts between maps, texts, and specimen data would all be supported.

Remedying either the limitations of the local taxonomic/scientific name thesaurus or the lack of spatial analysis would require a major architectural redesign. If such a redesign were undertaken, we would probably also reconsider whether the application is best implemented through Z39.50 metasearching or whether another model is more appropriate. The original work of the Z39.50 Biological Implementers Group (ZBIG) also used Z39.50 as the communications protocol for integrating natural history collection data. Limitations they encountered included the following:
. The complicated protocol specification means a very steep learning curve for developers.
. The protocol is not well understood by network administrators, who are hence reluctant to open the necessary network port (even though Z39.50 is far less likely to allow a security breach than HTTP).
. Conceptual schemas are not defined with a formal language such as XML Schema.
. There is limited support for XML and Unicode (although this has improved greatly over the last couple of years) (Relationship of DiGIR to Species Analyst, 2003).
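The downloadable tabular result set called for above could be as simple as delimited text with coordinate columns, which any GIS package can plot. A hypothetical sketch; the record fields and `to_gis_table` helper are invented for illustration:

```python
import csv
import io

# Serialize a merged result set (bibliographic citations plus
# specimen records) to CSV with coordinate columns, so collection
# localities can be mapped in a GIS environment.
def to_gis_table(records):
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=["id", "type", "latitude", "longitude"])
    writer.writeheader()
    writer.writerows(records)
    return buf.getvalue()

rows = [
    {"id": "UF 12345", "type": "specimen", "latitude": 29.65139, "longitude": -82.37491},
    {"id": "FGS Bulletin 3", "type": "citation", "latitude": 30.71, "longitude": -85.95},
]
print(to_gis_table(rows))
```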

One option is to investigate the SRW/SRU services developed as a kind of "Z39.50 Lite" by the Z39.50 International Next Generation (ZING) initiative. Another alternative is a protocol developed specifically for exchanging natural history collection data, Distributed Generic Information Retrieval (DiGIR). This protocol is "based entirely on the use of XML documents for messaging between clients and data providers, with a data transport mechanism that was predominantly based on HTTP. DiGIR is designed from the ground up to offer the same capabilities as Z39.50 except using simpler technologies and a more formal specification for description of information resources" (Relationship of DiGIR to Species Analyst, 2003). DiGIR uses for its transport syntax an XML schema based on the Darwin Core called Access to Biological Collection Data (ABCD). It appears that the DiGIR protocol could be adapted to work with XML metadata for bibliographic works.

The globalization and digitization of natural history information is leading to scientific analysis and research that was previously unimaginable. In the seven years since the Linking project was initiated, the cutting-edge technologies of 1997 have rapidly given way to XML, harvesters, and portals. New technologies in turn raise expectations for functionality. We prophesy that eventually the rematrixing of metadata will lead to infinite integrations of text and other forms of data. One of the greatest challenges now is to be aware of digital project applications that already exist, and to conceive and design a system architecture that allows their integration to provide new functionalities.
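To give a flavor of the "Z39.50 Lite" approach, an SRU 1.1 search is just an HTTP GET carrying a CQL query in plain URL parameters, which avoids both the complicated protocol specification and the closed-port problem noted above. A sketch against a hypothetical endpoint (the base URL and index name are illustrative):

```python
from urllib.parse import urlencode

# Compose an SRU searchRetrieve URL. SRU 1.1 encodes the whole
# request as ordinary GET parameters over HTTP.
def sru_url(base, cql_query, max_records=20):
    params = {
        "operation": "searchRetrieve",
        "version": "1.1",
        "query": cql_query,
        "maximumRecords": max_records,
    }
    return base + "?" + urlencode(params)

url = sru_url("http://example.org/sru", 'dc.subject = "Acipenser"')
print(url)
```

Because the request is self-describing and travels over port 80, a network administrator sees nothing more exotic than a Web query.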

While Linking Florida's Natural Heritage encountered few problems using Z39.50 to integrate bibliographic collections, individual customized views had to be created for each specimen collection, and the difficulties noted above occurred in the Linking project as well.

Notes
1. Available at: palmm.fcla.edu/lfnh/
2. Available at: palmm.fcla.edu/lfnh/thesauri/feol2/
3. Available at: www.itis.usda.gov
4. Available at: palmm.fcla.edu/lfnh/matrix/T/
5. Available at: www.ubio.org
6. Available at: www.museum.tulane.edu/geolocate/default.aspx
7. Available at: www.perseus.tufts.edu/

References
Relationship of DiGIR to Species Analyst (2003), available at: http://speciesanalyst.net/docs/digir/
Walker, T.J. and Moore, T.E. (n.d.), Singing Insects of North America, available at: http://buzz.ifas.ufl.edu/


Online multimedia museum exhibits: a case study in technology and collaboration
Matthew F. Nickerson

The author
Matthew F. Nickerson is a Professor at Southern Utah University, Cedar City, Utah, USA.

Keywords
Grants, Libraries, Museums, Multimedia, Online operations

Abstract
Eight partners, including three university libraries and five regional museums, worked together to create the Voices of the Colorado Plateau online exhibit. The site features multimedia exhibits that combine oral history recordings and historic photographs to create a new and engaging online museum experience. Computer and telecommunication technologies were vital in the collaboration, creation and dissemination processes. Collaborative projects among libraries and museums can capitalize on both similarities and differences between these cultural heritage institutions, and working in consortia can produce results that cannot be achieved alone. Both the number and the geographical separation of the partners in this project represent a unique level of cooperation and integration. The extensive use of oral history in a multimedia museum exhibit is also unique to this project.

Electronic access
The Emerald Research Register for this journal is available at www.emeraldinsight.com/researchregister
The current issue and full text archive of this journal is available at www.emeraldinsight.com/0737-8831.htm

Library Hi Tech · Volume 22 · Number 3 · 2004 · pp. 270–276
© Emerald Group Publishing Limited · ISSN 0737-8831 · DOI 10.1108/07378830410560062
Received 21 January 2004; revised 8 April 2004; accepted 12 June 2004

Background

True to our mission, the Sherratt Library at Southern Utah University (SUU) strives to provide our faculty and students with the latest technological tools for finding, accessing and evaluating information resources. The rapid development of computer and telecommunication technologies is changing not only how we search, retrieve and use information, but also the formats and media in which information is packaged and shared. We assist faculty, students and community members in accessing all kinds of information. Resources arrive as text, images, animations, audio, and video, and each of these media can be found stored and shared in a wide variety of formats reflected by a mesmerizing list of file extensions: .txt, .gif, .jpg, .png, .pdf, .swf, .ra, .wav, .aif, .mid, .mp3, .wma, .avi, .mov, .ram, .wmv. Computer capabilities that were considered "extras" a few short years ago are now becoming essential for our patrons. CD drives, sound cards, browser plug-ins, media players, and access to high-capacity storage devices (ZIP drives, CD burners, etc.) are now basic tools available on all our public access machines.

As we work to provide our patrons with the tools they need to access and use the vast amount of information now available, we have also studied how we can take advantage of these same tools to provide greater access to our own in-house resources. In particular, during the past five years we have made concerted efforts to use the power of new technologies to improve access to, and encourage the use of, materials in our Special Collections. Ever since our card catalog gave way to the online public access catalog, we have searched for ways to provide our Special Collections patrons with the same level of computer searching power enjoyed by users of our library's book and serial collections. In 1998, we joined the beta test for the encoded archival description (EAD), and since that time we have created a large library of finding aids for our manuscript and photograph collections. The online search engine not only provides powerful searching capabilities for the EAD-based catalogs but also offers direct access to digital surrogates of rare photos, documents, diaries, and more. Our patrons have responded very positively to the increased access, and the use of these important resources continues to rise.

As media technologies continued to evolve, we saw an additional opportunity for expanding accessibility to our Special Collections, and we began to experiment with ways to distribute non-paper-based artifacts. Our Dean of Libraries and our Special Collections librarian shared an interest in oral history, and these audio records became our next focus of study. We joined with other cultural heritage institutions in our region to explore ways of digitizing and sharing oral history, and submitted a proposal to a federal granting agency to support an innovative oral history project. That is how we began experimenting with .mp3, .swf, .rma and the rest, and how our online, multimedia exhibits were born.

Getting started

In 2000, we received a National Leadership Grant from the Institute of Museum and Library Services (IMLS) to explore the creation and dissemination of "new media" during a two-year project that we called Voices of the Colorado Plateau. The Voices project was a large-scale collaborative effort involving three university libraries and five regional museums. The participating institutions and their locations (numbered) on the map in Figure 1 are:
(1) Cline Library, Northern Arizona University;
(2) Lied Library, University of Nevada, Las Vegas;
(3) Sherratt Library, SUU;
(4) Edge of the Cedars State Park, Blanding, UT;
(5) Iron Mission State Park, Cedar City, UT;
(6) John Wesley Powell Memorial Museum, Page, AZ;
(7) Museum of Northern Arizona, Flagstaff, AZ; and
(8) Utah State Historical Society, Salt Lake City, UT.

Figure 1 A map of the participating libraries
Together we wanted to take a fresh look at oral history and devise a way to introduce these valuable yet underused resources to a wider audience. Building on the example of the American Memory Project and other Web sites that distribute oral history recordings via the Internet, we wanted to move beyond simply sharing the recordings and include multiple media elements that would make the audio more engaging and tell a greater story. We envisioned the final product as a series of online, multimedia museum exhibits and began our work with this model in mind. To realize the vision, we relied heavily on the combined experience and expertise of the library, archive and museum professionals who had joined in the collaboration. As will be discussed in greater detail in a subsequent section, these professional partnerships were a key element in the project's success. We not only explored the uses of technology in creating and distributing the multimedia exhibits but also explored the use of technology in facilitating this important collaborative process.

The overall goals of the project were the following:
(1) To expand access to the cultural heritage collections of five regional museums and three university libraries by creating a virtual museum.
(2) To establish a collaborative environment where a consortium of institutions could work together, sharing materials and expertise to create online exhibits.
(3) To establish and share innovative models for:
. distributing audio artifacts in virtual museums; and
. integrating a variety of museum/archive artifacts in various formats from several institutions to provide greater interest and context to virtual museum exhibits.
(4) To provide educational and enriching exhibits on the popular Colorado Plateau region to a world-wide audience.

As the title of the project implies, the thematic focus of the exhibits was the Colorado Plateau, more particularly the history and culture of the region in the early 20th century.
In our inaugural meeting, held face-to-face on the campus of SUU, representatives from each institution were introduced to the project, and we created a project outline to guide us through the two-year effort. We also reviewed an initial design concept for the online exhibits prepared by the SUU team. For such a relatively small group, we had a surprisingly wide array of individuals: men and women from a variety of institutions and professions, including archivists, librarians, historians, and curators, representing small, medium, and large institutions.

The prototype unveiled at the first meeting was created using Macromedia Flash 4.0. All the partners liked the concept and the look and feel of the Flash "movies", but the overall design and theme of the site required considerable renovation. The design team received valuable feedback at the initial meeting, but more online evaluations were necessary in the weeks following the meeting to perfect the layout and menu structure. Experience and the literature agree that time and energy upfront will more than pay for themselves if they negate the need for serious revisions further down the road. To our credit and advantage, we were able to finalize the design of the primary pages and menu system before serious work began on the exhibits. However, we were tweaking second-level pages for an entire year, and, technology notwithstanding, our entire design was not finalized until a face-to-face meeting a year into the project!

Gathering resources

The initial outline for the project included over 50 topics arranged into three sections according to their importance, as agreed upon by consensus among the group. Armed with the list, each representative returned to his or her institution with the assignment to search their oral history collections for stories related to our designated topics of interest. In addition to addressing the topic, we agreed that we were also interested in stories that were engaging. Admittedly a subjective measure, we took into consideration such things as the quality of the recording, the charisma of the interviewee, emotional content, humor, and other intangibles. Choosing recordings was a difficult and often time-consuming process. We relied heavily on the local knowledge at each institution and their familiarity with their own collection. As histories were chosen, copies of the interviews were sent to the project "headquarters" at SUU, where the design and creation team was assembled. Choosing oral histories to include was always a balancing act between covering the identified topics and including truly captivating narratives. Short informative stories were valuable and necessary, as were the painful or humorous anecdotes from everyday life. We trusted the collaborative process, and the final result is a quality compromise: educational and engaging.

Responsibility for final selection, digitization, creation, and dissemination rested with the Project Director, Prof. Matthew Nickerson, and a team of talented students at SUU. The first task, undertaken by Nickerson, was to identify a particularly fine three- to five-minute segment or story from the suggested interviews being sent in by the various partners. He relied on the partners' knowledge and familiarity with their own materials, and more often than not they would suggest a specific section for him to focus upon. Oral history is a rich medium, and once energetic and engaging interviews were identified there were many possibilities to choose from. Though it was sometimes difficult to make the final cut, the number and quality of stories available validated the project's initial view of the value and timelessness of oral history. As decisions were made and audio segments were isolated, the selected stories were then shared with the entire group.

Telecommunication and computer technology were key to the collaborative process. The sheer size of the geography involved and the physical distance between partners were staggering for a project requiring this level of cooperation and sharing. The eight partners serve a geographical region of approximately 83,000 square miles, and to complicate matters further, this mountainous region is bisected by the Colorado River and the Grand Canyon. We knew from the outset that physical meetings would be limited and that snail mail would prove too slow and cumbersome. Though we managed two face-to-face meetings per year, we were able to extend those discussions throughout the project period using a variety of electronic means. For example, teleconferencing allowed us to "meet" in groups at various times during the project when weather or schedule conflicts made face-to-face meetings impossible. Geography and distance aside, in any collaborative effort of this size, scheduling alone will prevent frequent meetings. No one working on the Voices of the Colorado Plateau project had the luxury of devoting full time to the effort; in fact, most of the participants had several projects, administrative responsibilities and/or teaching assignments to worry about as well. Yet all were able to stay in touch and contribute to the project through time-proximate but asynchronous communication made possible by e-mail. This type of read-it-when-you-can-and-respond-as-soon-as-possible communication is a key component of successful collaboration in the Internet Age.

Technology also played a vital role in the sharing of resources required to create the collaborative exhibits. While e-mail was the mainstay of our asynchronous discussions, a wider variety of both software and hardware was required to send and share the text, images and sound files for constructing the exhibits. E-mail was a simple and common application among all our partners, but when it came to the more complex process of sharing resources, we had to learn to work within the realities of our situation. Just as they varied in size and mission, the project partners enjoyed various levels of computer support and technological experience as well. We learned, or were reminded, that the latest leading-edge technology is neither always necessary nor even desirable: "what works?" proved more important than "what's the latest?" As each partner worked with the equipment they had, at the level they were comfortable with, we quickly developed means for completing the tasks at hand. Sharing of the audio is a good example of the range of protocols employed. Among the partners, we had original sound files recorded in a variety of formats including Edison wire, reel-to-reel tape, audio cassettes, VHS video cassettes, and digital video cassettes. To employ any of these resources in the exhibits, they needed to be digitized and sent to the design team at SUU, and this was accomplished in a variety of ways. The Utah State Historical Society (USHS) and the Cline Library at Northern Arizona University (NAU) both enjoyed considerable experience and expertise in digitization technologies. USHS provided their audio on CD-ROMs, while NAU uploaded audio files to the project ftp server. Edge of the Cedars Museum, at a small state park, was able to provide their interviews on CD-ROM after only minimal discussion with the design team at SUU. The John Wesley Powell Memorial Museum (JWPM) in Page, Arizona, developed a productive working relationship with NAU, its closest "large partner", and they worked together to digitize some of the JWPM material, while other interviews were sent directly to the design team on VHS video tape.

The selection and acquisition of audio segments was an ongoing process throughout the first 18 months of the project. As final audio selections were made, the process would continue individually for each story. The next step would be to share the final audio snippet with the entire consortium. As described above, the original sound files were recorded in many different formats. The standard format for the exhibits was .wav, the format simplest to import and use in Flash. The larger institutions used a variety of creative hardware/software combinations to transform selected audio into the required format. SUU relied on its library's Special Collections not only for the historic recordings but also had to borrow a reel-to-reel player/recorder from them as well. The fifty-year-old player connected to the state-of-the-art multimedia computer was not only an interesting arrangement but nicely illustrated the unique vision behind the entire project.

The final .wav files were then shared throughout the partnership via Flash. Members would have a chance to review the story and offer input. When the story selection was finalized, a call went out to all partners for images to illustrate the story. This process offers the clearest evidence of the power of the collaborative process in this project. Images were sought from all collections, not just from the source of the original oral history. By collating related materials from a variety of collections, we were able to create educational and engaging exhibits that would be impossible for any of us working alone. All partners were given equal opportunities to share, and each participated as it was able. Images were sent to the design team at SUU via a variety of means similar to the audio file sharing described earlier.
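A small utility in the spirit of the workflow above: once a snippet is digitized to .wav, its duration can be read with Python's standard `wave` module and checked against the three-to-five-minute story length the project targeted. The range check is our illustration, not project code:

```python
import wave

# Return the playing time of a .wav file in seconds.
def duration_seconds(path):
    with wave.open(path, "rb") as w:
        return w.getnframes() / w.getframerate()

# Check a digitized snippet against the targeted 3-5 minute range.
def in_target_range(path, lo=180.0, hi=300.0):
    return lo <= duration_seconds(path) <= hi
```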

The exhibits
Once the audio and images were selected, the SUU design team could begin assembling the exhibits. The overall layout for the site had been worked out previously with input from all of the partners. The first step was to import the sound into the new Flash project; the timeline for the multimedia exhibit was determined by the sound file. Selected images were then imported into the Flash timeline to correspond to the audio they illustrated (Figure 1). The designers used varying image sizes, background sound, cross-fades, pans, and other video techniques to add variety and interest and to bring the story to life. Exhibits were begun as their materials became available, so that the SUU computer lab was kept busy (Figure 2). As exhibits were completed, the team would post them to a temporary Web server where project partners could log in and review them. Partners were particularly adept at editing and commenting on materials (either sound or images) from their own collections, so it was valuable to have everyone view each exhibit as it was completed to ensure quality and accuracy. As the number of finished exhibits grew, the design team began creating the menus that would give patrons access to them. Early in the process the partners had agreed that we wanted various access points to the exhibits, and we settled on three menus: People, Places, and Topics.
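The timeline logic described above can be sketched outside Flash: the audio fixes the exhibit's length, and each image holds the screen until the next one is cued. The cue times and image names here are invented for illustration:

```python
# Each cue gives the time (in seconds) at which an image should appear;
# an image stays on screen until the next cue or the end of the audio.
AUDIO_LENGTH = 95.0  # hypothetical length of the .wav narration

cues = [
    (0.0, "sawmill_exterior.jpg"),
    (22.5, "crosscut_saw_crew.jpg"),
    (48.0, "early_chainsaw.jpg"),
    (71.0, "log_truck.jpg"),
]

def display_intervals(cues, audio_length):
    """Return (image, start, end) spans covering the whole timeline."""
    spans = []
    for i, (start, image) in enumerate(cues):
        end = cues[i + 1][0] if i + 1 < len(cues) else audio_length
        spans.append((image, start, end))
    return spans

spans = display_intervals(cues, AUDIO_LENGTH)
```

The design choice mirrors the article's description: the narration is the master track, and images are slaved to it rather than the other way around.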

273

Online multimedia museum exhibits

Library Hi Tech

Matthew F. Nickerson

Volume 22 · Number 3 · 2004 · 270–276

Figure 2 Voices of the Colorado Plateau

In this way, patrons could find stories in any of three ways: (1) by the name of the interviewee; (2) by where the story (exhibit) took place; or (3) by the subject addressed in the story. For example, John Williams' enlightening account of how chainsaws changed the work of lumberjacks can be found under: (1) John L. Williams; (2) Coconino County, AZ; and (3) Timber. The site was designed so that each menu item points to at most four exhibits. Continuing the above example, choosing Timber from the Topics menu leads to four stories, from four different oral histories, discussing various aspects of the timber industry within the Colorado Plateau.
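The three-menu scheme amounts to building three inverted indexes over the same exhibit records, with each menu entry capped at four exhibits. A minimal sketch, using a hypothetical reconstruction of the John Williams record:

```python
# Hypothetical exhibit records; each exhibit carries the three access
# points described in the article (person, place, topic).
exhibits = [
    {"id": "chainsaws", "person": "John L. Williams",
     "place": "Coconino County, AZ", "topic": "Timber"},
    # ...further exhibits would be added the same way
]

def build_menus(exhibits, cap=4):
    """Index every exhibit under its People, Places, and Topics entries,
    with each menu item pointing to at most `cap` exhibits."""
    menus = {"People": {}, "Places": {}, "Topics": {}}
    fields = {"People": "person", "Places": "place", "Topics": "topic"}
    for ex in exhibits:
        for menu, field in fields.items():
            entry = menus[menu].setdefault(ex[field], [])
            if len(entry) < cap:
                entry.append(ex["id"])
    return menus

menus = build_menus(exhibits)
```

The same exhibit thus appears under all three menus, which is exactly the redundancy the partners wanted: a patron can arrive at a story through whichever access point occurs to them first.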

Interpretation
Initial discussions among the partners revealed a desire to add interpretive information beyond the multimedia presentation. Here, the different backgrounds and institutions represented in the partnership fueled pointed discussion, resulting in a very valuable addition to the Voices Web site. Going into the project, we all recognized the many similarities between libraries, archives, and museums. As cultural heritage institutions, we are dedicated to education, outreach, and quality patron services. We were all stewards of quality collections of which we were justifiably proud, and we all endeavored to care for them while at the same time sharing them with the communities we serve. Yet there were significant differences among us as well. Generally speaking, a librarian's main interest is providing quality access to the collection. Librarians are anxious to use all the information resources at their disposal in order to satisfy a patron's educational and informational needs, while at the same time offering very limited, if any, interpretation. Example: librarians are anxious to assist patrons in locating any or all of the art history resources in their collection but shy away from offering opinions on artists or art works themselves. Museum curators also have valuable collections that they wish to share with patrons, yet they work from a very different paradigm. They give patrons access to only a limited part of their collections and offer high-quality, professional, educational interpretation for each artifact. They are qualified and anxious to interpret artifacts, but they must bar patrons from most of their collection. Example: curators will display and explain an ancient artifact, but they cannot offer interested patrons access to the hundreds of other similar items held in their collection. Generally speaking, archivists take a view somewhere in between. Their knowledge of and familiarity with their own collections qualifies them for varying levels of interpretive assistance, and they usually offer more general access to their collections, but under controlled supervision.


With input from the entire group, we agreed upon a second level of information/interpretation that would build upon the multimedia presentation. This addition was designed to clarify, explain, and broaden the exhibit experience. At the conclusion of each exhibit three additional links appear: Context, Images, and Full Interview. The Context button provides the patron with a short historical essay placing the first-person account of the exhibit into a larger historical context (Figure 3). Reminiscent of the kind of historical interpretation that might be provided at a bricks-and-mortar museum, this component has been applauded by reviewers. We hope that educators using the site will point their students to these important support materials. The essays were prepared under the direction of the Assistant Project Director, Dr Earl Mulderink, Associate Professor of History at SUU. This section also includes a short bibliography of sources appropriate for further study. The Images link allows users to review all of the photographs used in the multimedia presentation, in their original format. The Flash designers would occasionally crop photos to focus attention, to fit the image into a desired layout, or for other aesthetic reasons. The archivists and curators in the group were adamant that viewers of the exhibits have access to the originals. In addition, the librarians, archivists, and curators felt it was vital that complete attribution for the images be included as well. The Images link provides both services, allowing patrons to scroll through all the exhibit images with the accompanying

attribution identifying where the original is housed, and cataloging information down to the item level. The Full Interview button links to an archive outside the Flash environment where the entire recorded interview, from which the exhibit was derived, is available for review via RealPlayer streaming technology. Again, it was the combined work of the group that led to this important innovation. Along with the complete recording, the user has the option to view a transcript (text) of the interview as well. The combination of audio and text has proved very useful to users researching within the full interviews. Working with both resources simultaneously allows researchers to experience the emotion and spontaneity of the recording while clarifying unclear passages by reading along in the transcript. The text has been particularly helpful with poor or aging recordings and when the interviewee speaks with an accent or vocabulary unfamiliar to the listener.
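Keeping a transcript aligned with streaming audio, as the full-interview page does, reduces to looking up the transcript segment whose start time most recently passed. The cue data below is invented, and the original site used RealPlayer rather than anything like this sketch:

```python
import bisect

# Hypothetical transcript cues: (start time in seconds, caption text).
transcript = [
    (0.0, "Well, we moved up on the Plateau in '31."),
    (6.5, "My father ran sheep near the rim back then."),
    (13.0, "Everything changed once the roads came through."),
]

starts = [t for t, _ in transcript]

def caption_at(seconds):
    """Return the transcript line to show at a playback position:
    the segment with the latest start time <= seconds."""
    i = bisect.bisect_right(starts, seconds) - 1
    return transcript[max(i, 0)][1]
```

Whether the lookup drives a scrolling transcript pane or on-screen captions, the underlying synchronization problem is the same, which is why the transcript text could later be reused for closed captioning with so little extra work.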

Figure 3 Sheep herding example with a "Context" button

Evaluation and review
A great lesson learned from this project was the value of input from unexpected sources. Though we had a very talented and knowledgeable group working on this project, our initial plan included outside experts to review the site once our design was finalized and we had sufficient exhibits to adequately represent our vision of the final product. Three historians with expertise in American history, and in this region in particular, were asked to review the site. Their "outside" view was extremely helpful, and their insightful evaluation covered all aspects of the site. Their reports were instrumental in several important additions. One reviewer, a published expert on the environmental history of the San Juan River (part of the Colorado Plateau river system), had a keen interest in the interaction between the river, land, and people. His insights into the factual content and history were valuable but not unexpected. However, his input in improving the overall aesthetic of the site came as a surprise to most of the design team. From his initial visit to the site, he was unhappy with the principal image and felt more could be done to make the home page reflect the overall theme of the site. Working with the project designers and our large collection of photos, he helped find a new image that, all agreed once it was displayed, was a tremendous asset to the initial impact of the home page.

The project director made a presentation to a group of faculty colleagues as part of a "brown bag" program aimed at sharing research results across his campus. An art professor (and good friend) who is deaf spoke to him immediately following the presentation and challenged him to make the exhibits more accessible to the hearing impaired; the closed captioning appeared shortly thereafter (Figure 4).

Figure 4 An example of closed-captioning

Again, the "outside" look proved hugely valuable. The captioning

amounted to very little additional work for a major improvement: the text resources were already available, and the programming was simple and straightforward. Hearing "what's wrong" with a project can be difficult for creators who have invested time, energy, and professional passion into a large project such as this one. The project directors fostered a culture of continuous improvement from the very beginning, so that feedback, suggestions, and honest evaluation were always part of the ongoing process. Improvement and fine-tuning require multiple and frequent evaluations, and team members learned to be open and accepting of evaluative reviews. Not all "expert advice" was taken, and often good suggestions did not warrant changes, but the review process was essential to our success. It is clear that even big collaborative projects with many partners can benefit from additional reviewers from outside the partnership. Feedback from Web patrons indicates their appreciation for the multimedia nature of our site and the use of first-person narratives to bring history to life. Though we are proud of our innovative use of technology in creating these online multimedia museum exhibits, we feel that our model and system of collaboration are equally important, and we hope that libraries, museums, archives, and other cultural heritage institutions can benefit from the cooperative process described above.



An online guide to Walt Whitman's dispersed manuscripts

Katherine L. Walter and Kenneth M. Price

The authors Katherine L. Walter and Kenneth M. Price are based at the University of Nebraska-Lincoln, Lincoln, USA.

Keywords Digital libraries, Archives

Abstract In November 2002, with funding from the Institute of Museum and Library Services, the University of Nebraska-Lincoln and the University of Virginia embarked on a project to create a unified finding aid to Walt Whitman manuscript collections held in many different institutions. By working collaboratively, the project team is developing a finding aid that is tailored to the needs of Whitman scholars while following a standard developed in the archival community, encoded archival description (EAD). XSLT stylesheets are used to harvest information from various repositories’ finding aids and to create an integrated finding aid with links back to the original versions. Digital images of poetry manuscripts and descriptive information contribute to an ambitious thematic research collection. The authors describe the National Leadership Grant project, identify key technical issues being addressed, and discuss collaborative aspects of the project.

Electronic access The Emerald Research Register for this journal is available at www.emeraldinsight.com/researchregister The current issue and full text archive of this journal is available at www.emeraldinsight.com/0737-8831.htm

Library Hi Tech, Volume 22 · Number 3 · 2004 · pp. 277-282 · © Emerald Group Publishing Limited · ISSN 0737-8831 · DOI 10.1108/07378830410560071

Introduction

Walt Whitman (1819-1892), a highly influential poet and one of the most innovative writers in US history, is famous for his inclusive vision of democracy, for his celebration of ordinary people, and for his masterpiece, Leaves of Grass, which redefined American literature. Despite Whitman's centrality in American culture, his manuscripts have been little studied, and the poetry manuscripts, in particular, have never been collected and edited. Beginning in his teenage years, Whitman scattered his manuscripts widely, sending documents to friends and leaving them with newspaper publishers. As a correspondent, he did not routinely keep copies of letters. Visitors to his house in Camden, New Jersey, often described Whitman dipping into the sea of paper that surrounded him, a seemingly endless source of manuscripts that was divided among three literary executors after his death. Many of the papers left with the literary executors were dispersed at auction and then further dispersed at subsequent sales. From an archival perspective, it is impossible to determine an original order for the entire corpus of Whitman manuscripts. The chaos of Whitman's papers was borne home to the project archivists by the fact that his manuscripts are now scattered across more than 60 institutional repositories; poetry manuscripts alone have been located in 29 repositories. Because the materials are widely dispersed and irregularly documented, scholars or general readers interested in the development of Whitman's poetry - through multiple drafts to finished work - cannot locate and examine the relevant documents without great expense of time and money. Whitman scholarship is complicated also by the fact that the poet only occasionally titled his manuscripts, and when he did, he often used a title different from that employed in any of the six distinct editions of Leaves of Grass.

Furthermore, Whitman's drafts of ideas for his poems, his first treatments of key images, and his initial explorations of rhythmic utterances sometimes began as prose jottings that were gradually transformed into verse. For example, in the case of his great elegy for Lincoln, "When Lilacs Last in the Dooryard Bloom'd," Whitman jotted down bare lists of words that provided a kind of chromosomal code for the fully realized poem. Thus, for a number of reasons, it is difficult to correctly identify and categorize Whitman's manuscripts.

Received: 5 February 2004; Revised: 26 April 2004; Accepted: 12 June 2004

277

An online guide to Walt Whitman’s dispersed manuscripts

Library Hi Tech

Katherine L. Walter and Kenneth M. Price

Volume 22 · Number 3 · 2004 · 277–282

The Walt Whitman Archive, an ambitious online scholarly project conceived by a team of scholars headed by Kenneth M. Price, University of Nebraska-Lincoln, and Ed Folsom, University of Iowa, began in 1995. It is a thematic research collection (Palmer, 2004) that sets out to make Whitman’s vast work electronically accessible to scholars, students, and general readers. The site located at http://whitmanarchive.org is maintained on a server at the University of Virginia’s Institute for Advanced Technology in the Humanities (IATH). The goal of the Whitman Archive is to create a dynamic site for research and teaching that will grow and change over the years. Editorial work on the poetry manuscripts is supported by the National Endowment for the Humanities. In order to advance the editing project and to increase the understanding of encoded archival description (EAD), a complementary project was undertaken by the University of Nebraska-Lincoln and the University of Virginia. This project, funded by the Institute of Museum and Library Services (IMLS) from 2002 to 2004, is entitled “An Integrated Finding Guide to Walt Whitman’s Poetry Manuscripts.” The purpose of the IMLS-funded project is to make an inventory of Walt Whitman manuscripts in various repositories and to provide access to the manuscripts through the Walt Whitman Archive. Aided by such standard references as Walt Whitman: a Descriptive Bibliography by Joel Myerson, American Literary Manuscripts, and Archival Resources, the Whitman Archive team has identified an estimated 70,000 manuscript items produced by or relating to Walt Whitman. Several thousand of these manuscripts are poetry manuscripts. The scholars and archivists working on the project fully expect that other manuscripts will appear as private collections pass into institutional hands or are offered on the auction market. 
Though all of the manuscripts located will be included in the finding aids of individual repositories online, we are particularly focusing on enhancing the descriptions of poetry manuscripts. One of the goals of the IMLS project is to increase the public's understanding of Whitman as a foundational figure in American culture. The enhanced finding aids and the accompanying digital images developed as part of this project help the public gain new insight into the development of Whitman's poetry, providing a wide audience with new understanding of the creative process that brought about some of the most moving and memorable poems ever written in the US. Readers who otherwise have little access to manuscript reading rooms are able to see that Whitman, who often praised spontaneity, was himself an incessant reviser: his works did not magically appear fully formed but instead reached their often majestic state through complex processes of trial and error and painstaking reiterations and revisions. Whitman scholars are adding to the archival descriptions of poetry manuscripts to help readers understand those processes by situating the archival material in its wider intellectual context. As part of the freely accessible Walt Whitman Archive, the images and finding aids provide teaching and scholarly opportunities not otherwise available.

The University of Nebraska-Lincoln's project objectives are to:
- create a unified guide to Whitman's manuscripts, with special enhancement of the descriptions of poetry manuscripts;
- produce several thousand digital images from poetry drafts found in various collections;
- establish best practices for using EAD to identify manuscripts for dispersed, cross-repository collections;
- develop a model for scholar-archivist collaboration; and
- build a search interface.

Collaboration
From the beginning, the project was conceived as one in which the scholars, archivists, and librarians would work together collaboratively. Each community brings special skills to the development of the online integrated guide. Two teams of individuals have been involved. The first team is the overall research group, including consultants on the project:
- Scholars: Kenneth M. Price, UNL; Ed Folsom, University of Iowa; John Unsworth, University of Virginia's IATH; Brett Barney, UNL; Andrew Jewell, UNL; and other graduate students in English departments at all three institutions.
- Librarians: Katherine Walter and Brian Pytlik Zillig, UNL; Terry Catapano, Columbia University; and Heike Kordish, New York Public Library.
- Archivists: Mary Ellen Ducey, UNL; Daniel Pitti, IATH; Kris Kiesling, University of Texas at Austin; Steve Hensen, Duke University; and Anne Van Camp, Research Libraries Group.

The second team, the UNL EAD project team, is a subset of the larger group, composed of the UNL and IATH faculty and staff noted above.

One of the first meetings of the overall research group (University of Nebraska-Lincoln, University of Virginia, and collaborators) was held in Lincoln,


Nebraska and facilitated by IATH’s Daniel Pitti. As the group discussed the issue of enhancing descriptions, most of the archivists on the team noted that special funding is typically required to provide item-level descriptions or calendar-level information, and that such information is not usually needed by most users. The scholars and the archivists recognized, however, that a digital thematic research collection that centers on a national icon like Whitman may require item-level descriptions, whereas other collections may not merit the staff time. In such instances, the collaboration between scholars and archivists can be very valuable. The overall research group participates in an archived e-mail discussion list, and various individuals have had both face-to-face and phone meetings. Daniel Pitti and others from collaborating institutions have responded to encoding models proposed by members of the EAD project team at UNL. The EAD project team, including Ducey, Jewell, Barney, and Pytlik Zillig, worked to coordinate EAD implementation across collections in a way that ensures the interoperability of records produced by different institutions and participants. All members of the Nebraska team meet weekly to discuss issues concerning the unified finding aid. Frequent communication among the librarians, archivists, and scholars has enriched the project and kept it on track.

Technical issues

Descriptive information
Descriptive information about Walt Whitman's manuscripts was received from the many repositories in various forms, such as:
- photocopies of older catalog cards for single items, apparently never digitized;
- hand-typed finding aids;
- finding aids produced with word-processing software, but not encoded;
- finding aids produced in HTML; and
- fully encoded EAD finding aids (XML).

Not surprisingly, some finding aids, containing only a limited number of original Whitman manuscripts, describe the papers of Whitman associates or Whitman collectors. An early decision was to limit, at least for now, the descriptions created for the Walt Whitman Archive to those manuscripts actually written by Whitman himself. Someday it may be possible to add other archival materials, including documents by associates and collectors.

One of the most important decisions was to develop repository-specific finding aids, with the intent to harvest the descriptions of poetry manuscripts into a single unified finding aid later in the project. As mentioned earlier, many of the individual repositories' finding aids or catalog cards were not in digital form, and the ones that were in digital form were not necessarily encoded following current archival standards. In order to harvest data, the finding aids had to be encoded. EAD is a standard for encoding archival finding aids using SGML or XML, and it is this standard on which the primary work of the Whitman project is based. As a standard for archival description, EAD is designed to encode finding aids in such a way that the contents of various collections can be searched uniformly online. Consequently, Nebraska developed a project-specific model for institutions not contributing their own encoding. Mary Ellen Ducey, UNL archivist, and Andrew Jewell, a graduate student in English at UNL, developed draft EAD documents for institutions without electronic finding aids and received recommendations for changes from Daniel Pitti, IATH, and Kris Kiesling, University of Texas at Austin. This model was shared among the major participating institutions and accepted with slight revisions by the participating archivists. Establishing a project-specific model ensured that specific fields could later be harvested from the repository finding aids using XSLT.

As part of the project, we request digital images of poetry manuscripts from the holding repositories to allow scholars to enhance the finding aids. Repositories are asked to provide 24-bit color TIFF images with a minimum resolution of 600 dpi, presented in context. Thus, when a poem is written on the back of a letter or an envelope, or when it is one of a group of related pages, images of the contextual materials are also obtained. We also seek permission to post derivative JPEG and thumbnail images on the Walt Whitman Archive.

With the image in hand, the individual repository's finding aid is reviewed to determine whether there is additional information that would be useful to the scholarly community. For example, in the University of Tulsa's collection of Walt Whitman Ephemera, there is a manuscript called "[Poem describing a perfect school.]" This leaf has writing on both sides, as noted by the archivist at the University of Tulsa: "Written in pencil on 8vo sheet with portion of another poem, also in his hand, on verso." We have identified the poetic lines written on the verso as part of an extremely important Whitman poem, "To Think of Time," and have enriched the description as follows: "The verso lines, beginning 'The three or four poets are well,' were included, in a revised form, in Whitman's


poem 'To Think of Time,' first published without a title in the 1855 edition of Leaves of Grass, as 'Burial Poem' in 1856, 'Burial' in 1860 and 1867, and under its final title in 1871."

Aside from the fact that many repositories did not have finding aids per se to provide, a further complication was that the levels of description received from various repositories ran from extremely sketchy to very detailed. The Nebraska project team produced finding aids as best it could, based on the existing description and, typically, digital surrogates of the items, though some repositories continue to be slow in providing scans. The following description, taken from the paper finding aid of the Livezey-Walt Whitman Collection at the Bancroft Library of the University of California-Berkeley, demonstrates how little the project sometimes had to work with:

"Wood Odors" (poem) Holograph Ms.

Working from an image of this manuscript and using other supplemental information from reference works on Whitman, scholars on the project were able to elaborate this description considerably. The description as it now appears in the Whitman Archive follows:

Item: 1
Title: "Wood Odors"
Date: ca. 1875
Physical Description: 1 leaf, handwritten

A draft of a poem unpublished in Whitman's lifetime entitled "Wood Odors." The poem was apparently written as Whitman was making notes for his 1882-1883 book, Specimen Days. Specifically, the poem appears to respond to the visit he made to the Stafford farm in New Jersey in the mid-1870s. Some have argued that this draft is not a poem at all, but a list of phrases toward the composition of Specimen Days (see David Goodale, "Wood Odors," Walt Whitman Review 8 [March 1962], 17). "Wood Odors" was first published in Harper's Magazine, 221 (December 1960), 43.

Integrated guide to dispersed Walt Whitman manuscripts
The importance of uniform approaches to EAD became very evident when working on the integrated guide. In March 2003, IATH convened a group of scholars, archivists, and librarians in Lincoln, Nebraska. Scholars Price and Folsom described the value of developing a resource where one would be able to find information on all of Whitman's manuscripts and, especially, on the various manuscript drafts and notebook versions of each of the more than 300 poems Whitman published in Leaves of Grass and of the approximately 125 additional poems that he did not include in his masterpiece. Led by Daniel Pitti of IATH, one of the architects of EAD, representatives from the University of Virginia, New York Public Library, Columbia University, the University of Texas at Austin, Duke University, the University of Iowa, and the Research Libraries Group discussed the encoding needed to develop a unified finding aid to dispersed manuscripts and articulated the desired outcomes of the project.

As described above, the wide dispersion of Whitman's manuscripts throughout his lifetime and after his death makes it impossible to determine an original order. In an article entitled "Disrespecting Original Order," Frank Boles notes that the concept of original order is less relevant for collections of personal papers than for governmental or institutional records. He argues that "original order is to be respected when it is usable, but ... a theory of simple usability can guide archivists when original order becomes inadequate" (Boles, 1982, p. 32). In the case of Whitman's manuscripts, the research group concluded that scholars would be best served by a single, integrated guide to Whitman's poetry manuscripts. As envisioned, the Walt Whitman Archive would display the images of the poetry manuscripts (work by the author) with enhanced descriptions (work by librarians or archivists and scholars), and a citation and link to the individual holding institution's finding aid (work by archivists). Thus, an image of an 1881 corrected proof of Whitman's "The Dalliance of the Eagles" would be accompanied by the citation "Library of Congress, Charles A. Feinberg Collection," by additional scholarly descriptions or notes concerning the poem, and by a link to the Library of Congress's finding aid on the Walt Whitman Archive. Then, using XSLT stylesheets, poetry manuscripts for each of the poems would be united into a single alphabetical list regardless of location. In effect, the online unified guide creates a virtual order for Whitman's poetry manuscripts.

Though simple in concept and design, the unified guide proved more complicated to develop. Planning how to unite descriptions of poetry manuscripts drawn from over 30 finding aids at 29 repositories (some repositories have more than one Whitman collection) offered some interesting challenges. As the project progressed, the UNL EAD team had to address how to assign uniform titles to various drafts. This process is described in the "Work identification or uniform titles" section that follows. Once this issue was resolved, the team was able to develop a series of stylesheets to create an integrated guide. The steps are shown in Figure 1, "Integrated Guide to Walt Whitman's Poetry Manuscripts: the XSLT transformations."


Figure 1 Sequence of XSLT transformations

Essentially, item-level information is drawn from many different levels of the constituent finding aids (from <c01> through <c05>) and redeployed in a "flat" file structure, so that all item-level information about poetry manuscripts is expressed at the same level (<c01>). For a more detailed illustration of the stylesheet, and to see how the component EAD files are gathered, see Figure 2. Next, a second stylesheet (Figure 3) organizes related manuscript items as <c02>s within <c01>s. A third stylesheet transforms the EAD integrated guide to HTML for display in the browser. In this way, we are able to display and group all drafts of a particular poem together. To see the Integrated Guide to Walt Whitman's Poetry Manuscripts, go to http://whitmanarchive.org and click on "Manuscripts."

Figure 2 First stylesheet gathers all component EAD files and creates a flat <c01>

Work identification or uniform titles
The first attempt to generate the integrated union finding aid was exciting (i.e. the stylesheets flattened the files as desired), but grouping the poetry drafts was impossible without some means of identifying like drafts. As noted earlier, Whitman was amazingly prolific, and he did not consistently name his poetry drafts in ways that logically or meaningfully grouped them.

To address this problem (and for other practical reasons), Price, Folsom, Pitti, and Brett Barney, an encoding specialist on the project, developed a system of identification which embeds in the union finding aid the relationships between manuscripts of Whitman poems and the conceptual "work" they


Figure 3 Second stylesheet organizes related manuscript items as <c02>s within <c01>s


contribute to. The title of the "work" is derived from the final manifestation of the poem, most often the version Whitman published in his final, or "deathbed," edition of Leaves of Grass (1892). For example, manuscript drafts of poetic lines that were later incorporated into "Song of Myself" are flagged with the ID for that poem. This innovative encoding of an individual manuscript's relationship to a Whitman "work" enables a highly valuable, automated organization of the integrated finding guide: we can group all dispersed manuscripts that relate to a specific poem. The use of "work" identifiers, for example, allows a user of the integrated guide to use a well-known title, such as "When Lilacs Last in the Dooryard Bloom'd," to perform searches and retrieve several manuscripts – from notes to lists of words to trial lines to corrected proofs – that may be held at different repositories.
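The grouping that the work identifiers enable amounts to collecting dispersed item records under a shared key, regardless of holding repository. A minimal sketch (record fields and ID strings invented for illustration):

```python
from collections import defaultdict

# Hypothetical item-level records: each manuscript carries the ID of
# the Whitman "work" it contributes to.
items = [
    {"repository": "Duke", "title": "trial lines", "work": "song-of-myself"},
    {"repository": "LOC", "title": "notebook draft", "work": "lilacs"},
    {"repository": "Texas", "title": "corrected proof", "work": "song-of-myself"},
]

def group_by_work(items):
    """Group dispersed manuscript records under their shared work ID."""
    groups = defaultdict(list)
    for item in items:
        groups[item["work"]].append(item)
    return dict(groups)

groups = group_by_work(items)
print(sorted(groups))  # → ['lilacs', 'song-of-myself']
```

A search on a well-known title then resolves to a work ID and retrieves every associated manuscript, whatever its repository.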

Whether a particular collection in an archive has a finding aid with a single summary description or a more detailed finding aid is determined by the priorities of individual repositories. While scholarly and general user interests are a factor in setting such priorities, many factors may be considered in making decisions concerning collection processing and retrospective conversion of finding aids. Rarely does an archive have the luxury of creating item-level descriptions for manuscript or other collections. Occasionally, however, there are collections of such importance that item-level descriptions are essential. We believe that Whitman’s role as a foundational figure in the US merits this approach. Few of America’s great writers continue to generate as much interest in the wider culture as the poet of Leaves of Grass. Over a century after his death, Whitman is a vital presence in cultural memory: television shows and films depict him, musicians allude to him, advertisers appropriate him, schools and bridges are named after him, and politicians invoke him. Truck stops, think tanks, summer camps, corporate centers, and shopping malls bear his name. He is part of the very fabric of American life, its past, present, and no doubt future as well. Electronic access to manuscripts offers opportunities for studying his scattered works that were unimaginable in the past. When our project began, the Walt Whitman Archive was receiving an average of 3,000 visits per day. As of January 2004, the site averages 10,000 visits per day. Based on the experience of the Walt Whitman Archive and the usage of the site, we believe that the general framework for developing the unified finding guide, with enhanced descriptive work by scholars, is worth replicating in other research communities to enrich resources pertaining to other highly significant writers or topics.

Conclusion

On the access side, the project is providing researchers an opportunity to experiment with methods for virtually reintegrating dispersed collections of Whitman manuscript materials using the standard for archival description, EAD; on the social and intellectual side, the project offers an unusual opportunity to experiment with a deeper engagement between scholars and archivists, in which scholars might enrich the item-level descriptions of archival materials. Through the coordinated effort of a large number of libraries and institutions, the project has demonstrated how best to utilize and integrate EAD records made and maintained at disparate institutions by different creators. By applying EAD consistently and by taking advantage of XML and XSLT, the project team has been able to develop a virtual collection of Whitman poetry manuscripts. Though archivists and librarians may not typically construct virtual collections (Westbrook, 2002), there may be times when manuscripts of a particular individual are so scattered that a virtual collection is the only way to make sense of the chaos. The collaborative team for the Whitman project has found that the different perspectives brought from each of the communities (scholars, librarians, and archivists) have enriched discussions and offered a creative approach to uniting split collections.

References

Boles, F. (1982), "Disrespecting original order", The American Archivist, Vol. 45 No. 1, pp. 26-32.

Palmer, C.L. (2004), "Thematic research collections", in Schreibman, S., Siemens, R. and Unsworth, J. (Eds), A Companion to Digital Humanities, Blackwell, Malden, MA (in press).

Westbrook, B.D. (2002), "Prospecting virtual collections", Journal of Archival Organization, Vol. 1 No. 1, pp. 73-80.



The Maine music box: a pilot project to create a digital music library

Marilyn Lutz

The author Marilyn Lutz is Director for Library Information Technology Planning at the University of Maine, Orono, Maine, USA.

Keywords Music, Audiovisual media, Digital libraries

Abstract The Maine Music Box is an interactive, multimedia digital music library that enables users to view images of sheet music, scores and cover art, play back audio and video renditions, and manipulate the arrangement of selected pieces by changing the key and instrumentation. In this pilot project the partners are exploring the feasibility and obstacles of combining collections, digital library infrastructure, and technical and pedagogical expertise from different institutions to implement a digital music library and integrate it into Maine’s classrooms. This paper describes the methodology for digitizing, processing and providing access to electronic resources owned by two libraries and hosted by another, and the use of those collections to develop an instructional tool keyed to the digital library.

Electronic access The Emerald Research Register for this journal is available at www.emeraldinsight.com/researchregister The current issue and full text archive of this journal is available at www.emeraldinsight.com/0737-8831.htm

Library Hi Tech Volume 22 · Number 3 · 2004 · pp. 283–294 © Emerald Group Publishing Limited · ISSN 0737-8831 DOI 10.1108/07378830410560080

Introduction

The Maine Music Box (MMB) is an interactive, multimedia digital music library that enables users to view images of sheet music, scores and cover art, play back audio and video renditions, and manipulate the arrangement of selected pieces by changing the key and instrumentation. This library of digital resources is integral to an online music education channel that provides an instructional process keyed to the images of scores. The impetus for the endeavor is a unique collaborative effort within and among diverse institutions and individuals. The project demonstrates how the collections of one library can be enriched with the information technology tools of another library, and the resulting digital collection and services made available to support and advance the broad education mission of libraries. Rural Maine libraries have a significant history of collaboration, sharing common interests and goals in order to deliver information and services that support all aspects of business, education, government, and recreation. Through collaborative initiatives led by the organizational partnership of the University of Maine Library and the State Library, Maine's libraries and cultural organizations have worked in a coordinated manner to ensure wise use of limited resources and, more recently, to use digital content in the service of the state's economic, educational and cultural development. Building on this history of collaboration and common vision, the Maine Music Box is a two-year pilot project funded by a grant from the Institute of Museum and Library Services (IMLS) (October 2002) with matching funds from Fogler Library, University of Maine (Orono, ME), and its partners, the Bagaduce Music Lending Library (Blue Hill, ME) and the Bangor Public Library (Bangor, ME).
Received: 13 January 2004; Revised: 30 March 2004; Accepted: 12 June 2004. This project is funded in part by the Institute of Museum and Library Services. The Maine Music Box was co-directed with the author by Kurt Stoll, Executive Director of the Bagaduce Music Lending Library, and implemented by members of the project team: Patrick Harris, Matthew Aplin, Eugene Daigle, Sharon Fitzgerald, Nancy Lewis, Curtis Meadow (Trefoil, Inc.), Richard Merrill (Pine Graphics) and members of the Curriculum Advisory Board: Anatole Wieck, Laura Gallucci.

The project seeks to explore the feasibility of and obstacles to combining collections, digital library infrastructure, and technical and pedagogical expertise from different institutions to implement a digital music library and integrate it


into Maine's classrooms. The joint effort has brought together individuals with widely varying backgrounds, each of whom has in-depth knowledge of a particular domain critical to providing access to the music: metadata and cataloging, music and music education, library science, collections of printed sheet music, scores, graphic design, database design, interactive Web programming, and network administration. When complete, the MMB will make available collections that expand the virtual catalog of sheet music in the United States, provide access to a digital learning environment for educators and students, and enable residents of the state's rural, scattered population to enjoy music resources that would otherwise be inaccessible. This paper provides an overview of the Maine project and, within that context, discusses the particular nature of the collaboration and technological effort needed to create digital access to important music collections within a learning environment. While the project is still in progress, we are able to identify a number of important considerations to bear in mind as you read the project story.
. The most important decision you will make is the choice of collections to digitize.
. The second most important decision is to hire staff who are musicians: graduate and undergraduate music students are an excellent source of expertise.
. Outsourcing the digitization made the project possible, and the choice of an experienced and accommodating vendor was critical to our success.
. Issues of preservation copies of electronic resources are unresolved.
. Do not underestimate the task of inventory control, physical processing, scheduling, and the space needed to store and move the data.
. The question of whether optical music recognition (OMR) software saves time depends on the quality and type of music notation at hand, and on the staff editing the music. Pianists and composers who are able to read multiple staves of music perform most efficiently.
. Collaboration among individuals and institutions is a work of art in itself.

Overview of the project proposal

Background

In the winter of 2001 preliminary discussions took place between Fogler Library[1] and the Bagaduce Music Lending Library[2], a unique, non-profit organization that houses a collection of over 400,000 pieces of sheet music, scores, and printed music that the library lends to professional musicians and educational institutions. Of concern to the Bagaduce Board was providing broader access to their rapidly expanding collections and creating archival copies of fragile and deteriorating paper sheet music and scores. For a small organization like the Bagaduce Library, the digitization of parts of the collection was impossible without outside fiscal and technical assistance. Fogler Library wanted to leverage its investment over the past decade in information technology infrastructure and expand its experience in the digitization of text, image and audio collections by supporting access to significant music collections in Maine. Both libraries had examined other digital sheet music collections on the Web[3], and followed music research at a number of institutions, including Johns Hopkins University with the Levy Collection[4] (Choudhury, 2000a, b) and the VARIATIONS Project at Indiana University[5] (Dunn and Constance, 1999). Beyond the digitization of the collection they were interested in creating an instructional tool in a digital learning environment with the potential to enrich the experience of music educators and students. The Bangor Public Library[6], which serves as the music library for the Bangor Symphony Orchestra (the oldest continuously performing community orchestra in the US), held a number of unique music manuscripts which complemented the Bagaduce collections, and the library shared the project's overarching interests.

Fogler Library proposed undertaking a digitization project, using off-the-shelf software, to deliver a digital music library and instruction tool built on the Bagaduce and Bangor Libraries' collections of sheet music, scores and manuscripts and hosted by Fogler's technology infrastructure and technical staff, thereby broadening access and increasing the scholarly value of the collections. Through digitization, musicians, scholars, educators, students, and the general public would be able to search textual data and retrieve images of scores or sheet music and cover art, link to the full text of lyrics, hear selected computer-generated sound files, and link to other digital versions of a piece. The system interface, which manages the delivery of the images combined with the customizable options for the associated sound files, would enable instructors to integrate the digital music library in teaching and learning. Preservation copies would be created and delivered to the partner libraries. The archive would also be accessible through a Web-based instructional channel integrated with the music database.


The Bagaduce staff selected four collections of music scores, manuscripts and sheet music, totaling 22,641 titles or 114,517 pages, for digitization and inclusion in the pilot project based upon a set of agreed-upon criteria. The Bangor Public Library contributed one collection (see Appendix 1, Music Box Collections). The condition of the original materials, their historical importance, and the need to preserve and broaden access to them through digital conversion were primary considerations. Other criteria in the selection process were the following.
. Copyright status. Of the selected works 68 percent are in the public domain and 32 percent are protected by copyright.
. Availability of metadata. Collections with existing cataloging were preferable.
. Feasibility of image capture. The physical condition of the published sheet music and scores and their overall characteristics (format, text and illustrations, color, art, level of detail in music notation) tested well; manuscript collections would prove to be more of a challenge.
. Feasibility of sound file conversion. The quality of the sound rendition will depend on the quality of the bitmap obtained from the image files.
. Relationship to other digital sheet music collections. Eighty percent of the collections are unique or rare.

The Maine Project, which began officially in October 2002, faced two technical challenges:
(1) to determine the methodology for digitizing, processing and providing access to the electronic collections; and
(2) to develop a digital music library tool to support teaching and instruction in Maine schools.

Digital conversion

Outsourcing the digitization

The project budget included funds to outsource the digitization process. Since the primary collections were housed at the Bagaduce Library, a collection manager on their staff was responsible for pre-digitization and post-digitization activities. Project staff were housed at Fogler Library, where equipment and staff were dedicated to the project and responsible for tasks involved in the implementation of the MMB. The decision to outsource the digitization of the collections was the most cost-effective way to proceed within the project budget, and the Bagaduce Library was interested in finding a vendor from whom services beyond the project could be purchased. After three months of investigation and an additional two months to satisfy the University's purchasing requirements, we selected VTLS, Inc. as the vendor[7]. In a series of conference calls the project co-directors finalized specifications with VTLS for delivery of digital files. For each score or piece of sheet music VTLS would deliver:
. TIFF (300 dpi RGB) and bitmap (300 dpi, 1-bit) file formats;
. derivative JPEG images (access quality [72 dpi RGB] and thumbnail [115 × 150 pixels]);
. minimal-level MARC records;
. text of lyrics;
. administrative metadata: filename, resolution (pixels), bit depth, image height (pixels), image width (pixels), ICC color profile, date/timestamp and BML number; and
. preservation CDs (see Appendix 2, Vital Statistics).

A BML (Bagaduce Music Library) number assigned to each piece of sheet music or score became the unique key linking all the electronic files for a given piece of music. Processing schedules were arranged for the delivery of print music to and from VTLS, and the FTP delivery of the electronic files to Fogler Library for further processing by project staff and loading into the database system. The contract called for VTLS to complete digital conversion during the ensuing eight months.

Changes resulting from a test run

During a test run using 100 digitized pieces of sheet music and scores, typical of the overall collections, a number of changes in methodology became necessary. It quickly became apparent that moving large image files across the network from Virginia (VTLS) to Maine (MMB) via FTP would take too long, and appeared too risky to all concerned. We decided to purchase four 200 GB external disk drives, which could be shipped back and forth in the processing cycle. This change in plan required a series of programs to copy image files from external drives to the appropriate directory structure on the MMB server. Based on the time needed to create CDs of TIFF files for the 100 test scores, we decided to shift the preservation copy work from VTLS and have project staff copy TIFF files delivered from VTLS onto DVDs (est. 800 DVDs) rather than the estimated 4,000 CDs, twice the original estimate and therefore twice the cost, for which there was neither sufficient budget nor storage space. We developed a proposal to create "archival copies" in-house for the allotted budget and integrated the production into the workflow.
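The role of the BML number as a linking key can be illustrated with a small sketch. The directory layout, naming pattern, and file extensions below are invented for illustration; the project's actual file organization may differ:

```python
# Hypothetical layout: every derivative of a piece shares the BML
# number, so all associated paths can be derived from it plus a page count.
def files_for(bml, pages):
    """Enumerate the per-page image derivatives and the piece-level
    MARC record for one BML number (invented naming scheme)."""
    out = []
    for p in range(1, pages + 1):
        out.append(f"tiff/{bml}_{p:03d}.tif")    # archival master
        out.append(f"bitmap/{bml}_{p:03d}.bmp")  # source for sound files
        out.append(f"jpeg/{bml}_{p:03d}.jpg")    # access image
        out.append(f"thumb/{bml}_{p:03d}.jpg")   # thumbnail
    out.append(f"marc/{bml}.mrc")                # bibliographic record
    return out

paths = files_for("BML012345", pages=2)
```

With a single key per piece, any derivative (or the catalog record) can be located, verified, or regenerated without consulting a second identifier.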


While consensus is still forming on an "archival" medium for electronic resources, the DVD format has a life expectancy of 50-plus years. Our choice to make preservation copies on DVDs was dictated by budget considerations. TIFF files are copied from an external drive shipment to the DVD burning workstation, and organized in DVD folders. A directory listing text file is created for each DVD, and the DVD is burned using the DVD-R standard. We selected the DVD-R standard because it has emerged as the more popular DVD standard over DVD+R. Each DVD holds a maximum of 4.35 GB (or 35-50 scores). Labels are printed for the DVD and its jewel case, and each DVD is tested by selecting seven files at random to verify the quality of the burned image and contents before shipment to the Bagaduce Library, where staff conduct additional quality control checks. With the test run, we obtained some realistic estimates of the storage needed to support the project. The Music Box system will require an estimated 3.86 TB of storage. TIFF archival files require 2.36 TB, and the bitmap files used to create sound files require an estimated 1.5 TB. Access images (JPEGs) require 23 GB. The current system has 1.5 TB. TIFF files are copied to DVD, and are also being stored on tape for the duration of the project. Bitmap files used to create sound files are also temporarily stored on tape once an associated sound file is created. Lastly, the test run made clear the need to avoid any lulls in the delivery of print materials to VTLS, electronic files from VTLS to Fogler, and the return of print materials to Bagaduce. The schedule was crucial in order to avoid a situation where staff and equipment would be idle during project time, and resources unavailable to Bagaduce patrons.
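Organizing TIFF files into DVD folders under the 4.35 GB capacity limit is, at its simplest, a greedy sequential packing problem. A minimal sketch (the per-score size of 0.1 GB is an invented round number, chosen only to be consistent with the article's 35-50 scores per disc):

```python
DVD_CAPACITY_GB = 4.35

def pack_into_dvds(files, capacity=DVD_CAPACITY_GB):
    """Greedy sequential packing: start a new DVD folder whenever the
    next file would overflow the current disc."""
    dvds, current, used = [], [], 0.0
    for name, size_gb in files:
        if current and used + size_gb > capacity:
            dvds.append(current)
            current, used = [], 0.0
        current.append(name)
        used += size_gb
    if current:
        dvds.append(current)
    return dvds

# 100 hypothetical scores at roughly 0.1 GB of TIFFs each.
files = [(f"BML{i:05d}", 0.1) for i in range(100)]
discs = pack_into_dvds(files)
```

Sequential (rather than size-sorted) packing keeps each disc's contents contiguous by BML number, which simplifies the directory listing file and later retrieval.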

Creating metadata

Keeping to a tight schedule was essential. Of particular concern was the need to coordinate delivery of images and catalog records, since the project cataloger, responsible for enhancing the minimal MARC records from VTLS, was working from images of the sheet music or scores. To address this concern VTLS provided access to a Virtua Cataloging Module on their server: base records created by VTLS are loaded into the Virtua database and the project cataloger, using image files on the MMB server, adds subject headings, performs authority verification (Library of Congress Authorities, Library of Congress Subject Headings, Art and Architecture Thesaurus, and the Library of Congress Thesaurus of Graphic Materials), adds local thesaurus terms provided by Bagaduce staff, and further enhances the records as necessary according to AACR2 rev. Chapter 5, Music. Bibliographic records are then exported in batches and loaded into the MMB server. Final revisions to the MARC records are made using a cataloging interface to the MMB database, subsequently designed for this purpose. The MARC records are mapped to the Dublin Core Metadata Element Set for an Open Archives Initiative (OAI) data provider service. Records are currently harvested for an IMLS project that is building a collection registry and item-level metadata repository of IMLS digitization projects[8] at the University of Illinois at Urbana-Champaign. We also anticipate contributing metadata to the Sheet Music Consortium OAI Project[9], based at the University of California, Los Angeles.
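The MARC-to-Dublin Core step is a field crosswalk. The sketch below is illustrative only, not the project's actual mapping table; the tag-to-element pairs shown are a simplified subset, and real crosswalks work at the subfield level:

```python
# Illustrative crosswalk (invented subset; real mappings use subfields).
MARC_TO_DC = {
    "245": "title",      # title statement
    "100": "creator",    # main entry, personal name
    "650": "subject",    # topical subject heading
    "856": "identifier", # electronic location
}

def marc_to_dc(record):
    """record: {tag: [values]} -> {dc element: [values]}.
    Tags without a mapping (e.g. general notes) are dropped."""
    dc = {}
    for tag, values in record.items():
        element = MARC_TO_DC.get(tag)
        if element:
            dc.setdefault(element, []).extend(values)
    return dc

dc = marc_to_dc({"245": ["After the Ball"], "650": ["Waltzes"], "500": ["a note"]})
```

An OAI data provider would then serialize the resulting Dublin Core dictionary as `oai_dc` XML for harvesters.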

Inventory control and physical processing

The level of effort involved in managing the volume of electronic files for each piece of sheet music or score delivered in 200 GB shipments was much higher than anticipated. This miscalculation resulted in unanticipated programming needs to keep the inventory database current. Project staff set up an MS Access database on the MMB server to manage file inventory as it was delivered from VTLS and to monitor in-house processing of digital files (cataloging, creation of sound and archival files), copyright status of a title, and counts of image pages. A VBScript program from the inventory database generates database records giving URLs of various associated files. A set of Access queries and VBScript programs are used to reconcile the inventory database with the actual contents of the server file system and the music database, report inconsistencies, duplications, and missing images, and automatically update the inventory and workflow database (delivery date, batch name, individual file types, and number of pages). Another program merges data from an incoming shipment with files in the database. In addition to providing a snapshot of progress, the reports are essential for reconciling the vendor's invoices with data delivery.

Quality control and creating associated sound files – optical music recognition

In addition to the quality control procedures conducted by VTLS before files are delivered, image quality is checked by the music cataloger, who uses the JPEG images to enhance bibliographic records, by staff creating DVDs using TIFF files, and especially by staff creating audio files, who use the bitmap files. Optical music recognition (OMR) software has three commercial contenders: SmartScore Pro 3.1, PhotoScore 3.x, and SharpEye. MusicXML 1.0 is the most recent entry (January 2004) for encoding music, and research on OMR is in progress at a number of institutions. Opinion varies as to which program is most accurate for converting images of sheet music into computerized music notation. As with optical character recognition (OCR) programs, the issue is whether any OMR program that requires editing saves time over manual encoding from the start. The quality and type of the original music influence the choice and speed with which OMR can be used. We chose to use Sibelius[10], which is packaged with PhotoScore, to produce the interactive Scorch sound files. The cataloger flagged pieces of sheet music and scores that were candidates for scanning (PhotoScore software) and saving either as Scorch files (read by Sibelius software) or as MIDI files (read by most music editors). Criteria for creating MIDI files with associated sheet music and scores included whether the piece was written by a well-known composer, remains popular today, is significant to the study of popular culture of the time, has historical significance, is representative of the social life of the time, is important to the study of music, or is maritime- or Maine-related. MIDI files are not created if the source document is poor or restricted by copyright. Sheet music and scores translated to the Scorch format are based on the recommendations of the Curriculum Advisory Board or faculty request (which may include scores other than sheet music). Both PhotoScore and Sibelius are used to edit the music notation mistakes that PhotoScore makes. Editing the scanned scores is extremely labor-intensive and requires staff who are able to read multiple staves of music. In our experience the musicians best suited to the task were pianists and composers, and the project team was fortunate to have staff with these skills. We averaged four sheet music pages an hour.
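The reconciliation described under inventory control, comparing the inventory database's expected files against the server file system, reduces to set comparison. A minimal sketch (file names invented; the project used Access queries and VBScript rather than Python):

```python
def reconcile(inventory, on_disk):
    """Compare expected files (inventory database) against actual files
    (server file system); report missing and unexpected entries."""
    expected, actual = set(inventory), set(on_disk)
    return {
        "missing": sorted(expected - actual),      # invoiced but absent
        "unexpected": sorted(actual - expected),   # on disk, not invoiced
    }

report = reconcile(
    inventory=["BML00001_001.tif", "BML00001_002.tif", "BML00002_001.tif"],
    on_disk=["BML00001_001.tif", "BML00002_001.tif", "BML00003_001.tif"],
)
```

Running such a comparison after each 200 GB shipment is what makes the reports usable for checking vendor invoices against actual data delivery.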
While sheet music and scores saved in both MIDI and Scorch formats can play back music, only Scorch files are interactive, using a plug-in from the Sibelius Web site. The user can view the formatted sheet music or score within a Web page and listen to the music rendition as a cursor follows the sound played back. The user also has options to transpose and vary the tempo.

Creating access to a digital music library

Database design and indexing

We built a searchable database on the Web using Microsoft SQL Server 2000 and standard SQL for compatibility with other institutional software (detail in Appendix 2). In addition to the bibliographic records, the database had to accommodate the storage and display of music lyrics and an interactive instructional component with specialized database structures. The bibliographic records and lyrics were therefore ported into a standard relational database. Though bibliographic records are received in MARC format from VTLS, the records are processed with MARC Breaker software[11] (freeware available from the Library of Congress) into a text format, which is then parsed and processed by the database loader. Full-text indexing was critical to retrieval, and SQL Server full-text indexing and search capabilities allow for fast, simultaneous searches across any number of fields by combining fields into appropriate index columns. We implemented five different index columns: General (containing all names, titles, notes, and subjects); Names (containing names associated with a piece of music); Art (containing names and subjects associated with artwork only); Subject (containing subjects only); and Lyrics (containing lyrics only). The Keyword search uses both the General and Lyrics indexes while the other searches (Name, Lyrics, Art and Subject) use only the appropriate single index. Figure 1 shows a diagram of the relational database schema. Lyrics, a particular challenge, are received in text format separate from the MARC records or images. A lyric database loader, hosted in MS Access, processes them. After checking for primary key conflicts and other problems, lyrics are stamped with load date and batch number. A VBA program then examines the lyric records for each piece of music and assigns a display sequence for the entire set of lyrics. The Web scripts use this sequence to determine display order. The lyrics are then added to the SQL Server database through an ODBC connection. Problem lyrics that cannot be resolved automatically are queued for music authorities at the Bagaduce Library to resolve.
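The index-column idea, concatenating groups of fields into one searchable column per search type, can be sketched as follows. The field groupings are invented to mirror the article's description; the project itself relied on SQL Server full-text indexing, not application code:

```python
# Invented field groupings mirroring the article's index columns.
INDEX_COLUMNS = {
    "general": ["names", "titles", "notes", "subjects"],
    "names": ["names"],
    "art": ["art_subjects"],
    "subject": ["subjects"],
    "lyrics": ["lyrics"],
}

def build_index_text(record, column):
    """Concatenate a record's fields for one index column, lowercased."""
    fields = INDEX_COLUMNS[column]
    return " ".join(" ".join(record.get(f, [])) for f in fields).lower()

def keyword_search(records, term):
    """Keyword search spans the General and Lyrics columns, as in the MMB."""
    term = term.lower()
    return [r for r in records
            if term in build_index_text(r, "general")
            or term in build_index_text(r, "lyrics")]

records = [
    {"titles": ["After the Ball"], "lyrics": ["after the ball is over"]},
    {"titles": ["Maine Stein Song"], "subjects": ["College songs"]},
]
hits = keyword_search(records, "ball")
```

The real system gains its speed from the database's full-text indexes over such combined columns rather than from per-query scanning as above.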

User interface

The indexing that supports the search options on the Web interface provides for Keyword, Name, Subject, Art and Lyric searches. Browsing options are by LC Subject (MARC 650), Local Subject (MARC 653), Art Subject (MARC 650 Ind 2 = 7) and Collection (MARC 001 [BML #]). Browsing is handled using the appropriate subject table because browse terms are not user-entered but displayed in lists drawn from the database; consequently we do not have to be concerned with user spelling. The design of the interface identifies page functions and menu requirements while keeping


the code behind the pages minimal, for long-term ease of maintenance. Menus are simple and direct, and navigational indicators just above the page content keep users informed of where they are in the sometimes-complex process of searching and evaluating. The interface pages were created as HTML templates, and transformed into Active Server Pages (ASP) scripts for dynamic search and display functionality. Cascading style sheets (CSS) are used for presentation purposes as much as possible in order to lower development and maintenance costs.

Figure 1 MMB relational database schema

Figures 2 and 3 are screen shots of the top level of the MMB Web site and the opening search screen to the collections. Users have access to the collection of sheet music and scores in several formats. They may browse and display images of published sheet music and scores; listen to notation translated into MIDI format for playback; or view translated notation within a Web page and simultaneously listen to playback as a cursor follows the sound (Scorch format). Plans to associate a video performance and/or sound recordings for selected musical performances are under development. Thumbnail images are linked to larger images sized to fit the browser window. Links to the individual pages of music display at the top of the enlarged image, as well as a link to the full-size images (which may display larger than the browser window and may require scrolling). Images and associated lyrics and sound files do not display for music published after 1929 that is not in the public

domain. MIDI and Scorch files may be associated with sheet music or scores that were published prior to 1929. Listening to the computer-generated MIDI files requires an MP3-compatible player. Most Web browsers come bundled with an MP3 player, but other players are available as free downloads. Some sheet music and scores may provide a link to a "Scorch" file that requires a compatible plug-in in order to listen to and interact with the associated file. Selected scores digitized by faculty request include streaming video files of live performances and require a QuickTime plug-in. Actual sound recordings, like the video files, will be linked on request.
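The copyright gate described above, suppressing images, lyrics, and sound files for post-1929 music not in the public domain, is a simple predicate. A hedged sketch (the exact cutoff logic and field names are our reading of the text, not the project's code):

```python
def displayable(year_published, public_domain):
    """True if images, lyrics, and sound files may be shown on the
    public interface: public-domain works, or works published
    through 1929 (assumed inclusive cutoff)."""
    return public_domain or year_published < 1930

checks = [
    displayable(1905, public_domain=False),  # pre-1930: shown
    displayable(1950, public_domain=False),  # in copyright: suppressed
    displayable(1950, public_domain=True),   # public domain: shown
]
```

The same predicate is effectively bypassed inside the password-protected instruction channel, which may expose in-copyright music to enrolled students.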

Designing an instruction channel

The greatest technical challenge for the project continues to be the development of an interactive educational channel that provides an online instructional tool keyed to the database for listening and analysis. By itself, the MMB is a resource that has the potential to change the way in which educators approach the teaching of music, and to affect the way in which students learn. The instruction channel on the MMB Web site leads to an online instruction tool keyed to the database. It is intended to support a variety of approaches for meeting state and nationally mandated standards required for middle and high school students (Maine Learning Results, National Standards for Arts Education – Music). The interface allows the

288

The Maine music box: a pilot project to create a digital music library

Library Hi Tech

Marilyn Lutz

Volume 22 · Number 3 · 2004 · 283–294

Figure 2 Top level of the MMB Web site

instructor to select a list of music and design a “lesson” around it with specific assignment directions. (Figure 4 is a screen shot of the instructional Web module). The module is password-protected and provides access to music that is still under copyright and cannot be displayed in searches from the public interface. After the instructor creates lessons, there is a button on the administrator panel that e-mails URLs to the instructor for distribution to students. Lessons are made available to students through a login created by the instructor that provides access to the lessons and music using these URLs. (Figure 5 is a schema of the instructional channel database). The design for the module had to be a practical tool, flexible enough to allow for lesson plans from a number of different disciplines, and simple enough for users with moderate computer skills. The development process had to use functions that were available, given the programming languages, technology and project budget. While the educational module is a instructional software at an elementary level, its integration with the sheet music and scores to create lessons has applications

across a number of disciplines, including history, social life and popular culture, and graphic art, lending itself to meeting the goals inherent in state and national standards requirements for: Education – Music, Visual and Performing Arts and Creative Expression, Social Studies/History, Cultural History, Performing a Varied Repertoire of Music. Very early on in the project we realized that changing the way in which educators approach the teaching of music, and affecting the way in which students learn, necessitated access to a collection of music from a range of musical styles and types, as well as several formats. With this understanding, we enlisted assistance from several music faculty and created digital collections on demand for their use, including associated files in MIDI, Scorch and video formats. At this point in the project, it is not clear whether the music digital library will alter their form of instruction, or how they will interact with their personal digital library, or what effect this instructional tool will have on learners. A number of observations about the impact of the instructional tool are apparent but untested.
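The lesson workflow above (instructor selects pieces, the system mints lesson URLs, students reach copyrighted music only through a login) can be sketched as a minimal data model. All class, field and URL names here are hypothetical, not the MMB's actual ASP/SQL Server schema.

```python
# Illustrative sketch of the instruction-channel lesson workflow: a lesson
# bundles selected pieces with assignment directions, gets a stable URL
# token for e-mail distribution, and is reachable only by enrolled students.
import hashlib

class Lesson:
    def __init__(self, instructor, title, piece_ids, directions):
        self.instructor = instructor
        self.title = title
        self.piece_ids = piece_ids      # may include music still under copyright
        self.directions = directions
        # a stable, hard-to-guess token standing in for the generated lesson URL
        token = hashlib.sha1(f"{instructor}:{title}".encode()).hexdigest()[:10]
        self.url = f"https://mmb.example.edu/lessons/{token}"

    def accessible_by(self, user, enrolled):
        # copyright-restricted music is reachable only through a lesson login
        return user in enrolled

lesson = Lesson("prof_smith", "Ragtime analysis", [101, 102], "Compare meters.")
print(lesson.url)
print(lesson.accessible_by("student_a", {"student_a", "student_b"}))
```

The point of the design is that the public search interface and the lesson channel share one database, with access control applied at the lesson layer rather than by duplicating records.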


Figure 3 Opening search to MMB collection

. The digital library can provide access to a limited number of difficult-to-find pieces of music and make them available to a wide audience of students and instructors.
. Students can play in unison with the sound, evaluating pitch and adjusting speed.
. With Scorch files, the tracking of the playback on the music develops sight-reading ability, while the ability to transpose key signatures and chord symbols automatically enhances the visualization process.
. Music for one instrument can be transposed for another, thereby enlarging the repertoire for multiple instruments.
. Students can compare the computer-generated files with an actual recording, and come to understand more readily the subtle inflections of phrasing, articulation and expression, that is, how changes in tempo and rhythm accent artistry.
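The transposition capability mentioned above reduces, at its simplest, to shifting every pitch by a fixed number of semitones, e.g. rewriting a concert-pitch part up a whole tone for a B-flat instrument. This is an illustrative sketch, not Scorch's actual engine, and it ignores enharmonic spelling and octave bookkeeping.

```python
# Minimal chromatic transposition: shift each note by a semitone offset.
NOTES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

def transpose(melody, semitones):
    """Shift a list of pitch names by the given number of semitones."""
    return [NOTES[(NOTES.index(n) + semitones) % 12] for n in melody]

print(transpose(["C", "E", "G"], 2))  # a C major triad up a whole tone
```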

The instructional strategies made possible can provide learners with an integrated music experience. Playing one's own part, reading the score, active listening, live performance and orchestration are blended to give the student the whole picture of the music experience. This level of integration often takes years of experience to cultivate in the mind of a music student.

Future of the Maine Music Box

The last eight months of the project will be occupied with a variety of evaluation activities to measure the effectiveness of the MMB[12] in meeting the identified objectives (Figure 6):
. to create a digital music library of manuscripts, sheet music, and scores in order to preserve and expand access to the collections;
. to build a music digital library learning environment by developing a tool that supports teaching and instruction; and
. to evaluate the educational impact of the MMB on teaching and learning.

Plans include a series of focus group meetings to introduce the MMB and solicit feedback from teachers at selected middle and high schools around the state; working with the Maine Alliance for Arts Education and the Maine Chapter of the Music Educators National Conference; a series of user testing focus groups to evaluate the interface design; and discussion and analysis of the ways in which university music faculty make use of the digitized collections for their specific curriculum, and how students respond to the digital library. Other methods of understanding usage will include a user satisfaction questionnaire and analysis of session logs.


Figure 4 Screenshot of interface instructors use to create online assignments

Figure 5 Database schema for instructional channel


Figure 6 Example of assignment

Beyond the project timeline and budget, plans are underway to continue the digitization of unique portions of the Bagaduce Library collections, and to develop a collaborative strategy for the ongoing processing, cataloging, and technical support needed to host and provide access to the digital music library. A plan for the digitization of a wider range of music types and styles, along with links to associated files (sound or video recordings, text documents [biographies, dictionaries, reviews, critical comments]), will depend on the outcome of the evaluation process for the current project. Critical to this development will be interest from music educators in further realizing the potential of the digital music library.

Notes

1 Fogler Library, University of Maine, available at: www.library.umaine.edu/
2 Bagaduce Music Lending Library, available at: www.bagaducemusic.org/
3 Music Library Association Sheet Music Collections, available at: www.lib.duke.edu/music/sheetmusic/collections.html; Music for the Nation: American Sheet Music 1820-1860, 1870-1885, available at: www.memory.loc.gov/ammem/mussmhtml/
4 Lester S. Levy Collection of Sheet Music, available at: http://levysheetmusic.mse.jhu.edu/
5 Variations Project, available at: www.dlib.indiana.edu/variations/
6 Bangor Public Library, available at: www.bpl.lib.me.us
7 VTLS, Inc., available at: www.vtls.com
8 IMLS NLG Collection Registry and Item-Level Metadata Repository, available at: http://imlsdcc.grainger.uiuc.edu/; http://imlsdcc.grainger.uiuc.edu/oaiprotocol.htm
9 OAI Sheet Music Project, available at: http://digital.library.ucla.edu/sheetmusic/
10 Sibelius, available at: www.sibelius.com/
11 MARC Breaker, available at: www.loc.gov/marc/makrbrkr.html
12 Maine Music Box, available at: http://mainemusicbox.library.umaine.edu

References

Choudhury, G.S., DiLauro, T., Droettboom, M., Fujinaga, I., Harrington, B. and MacMillan, K. (2000a), "Optical music recognition system within a large-scale digitization project", paper presented at Music IR 2000: International Symposium on Music Information Retrieval, October 2000.
Choudhury, G.S., Requardt, C., Fujinaga, I., DiLauro, T., Brown, E.W., Warner, J.W. and Harrington, B. (2000b), "Digital workflow management: the Lester S. Levy digitized collection of sheet music", First Monday, Vol. 5 No. 6, available at: http://firstmonday.org/issues/issue5_6/choudhury/index.html
Dunn, J.W. and Mayer, C.A. (1999), "Variations: a digital music library system at Indiana University", DL99: Proceedings of the Fourth ACM Conference on Digital Libraries, Berkeley, CA, August 1999, pp. 12-19.


Further reading

Dunn, J.W. (2000), "Beyond VARIATIONS: creating a digital music library", paper presented at Music IR 2000: International Symposium on Music Information Retrieval, October 2000.

Appendix 1. The Music Box collections

Vocal, popular sheet music collection – consists of over 16,500 pieces of popular American music representing the many vocal styles of the late 19th through the 20th century. While the collection spans the years 1865-1990, its strength is in music published between the 1920s and late 1990s. The collection has been cataloged and organized under 26 topics that describe different aspects of American life: romantic love and broken hearts; transportation (cars, boats, railroads); the sea (ships, harbors, lighthouses); patriotic themes (war, elections, peace); geographical places (rivers, mountains, countries); humor; holidays; stars of musical theater and film; and period music (blues, jazz, ragtime, waltzes, marches). These songs, together with the illustrated, color sheet music covers (engravings, lithographs, photographs), are a valuable resource for the history, social life and popular culture of America.

Parlor/Salon collection – consists of 3,569 scores organized in three unique collections: Vocal Parlor/Salon, Piano Parlor/Salon and Violin Parlor/Salon. This music was composed, published and widely played from the mid-19th century (pre-Civil War) until approximately World War I. It was performed in homes (parlors) and salons in intimate circles all over the world. This genre of music is generally unavailable to the public, and only a few pieces have survived in private collections. Like the Vocal, popular sheet music collection (see above), this collection also reflects American society, culture and history. The collection is also an important resource for music history, serving to help students develop a better understanding of the progression of the art of composition and the mastery of the then-traditional musical instruments.

Music for two pianos, eight hands – consists of 223 rare scores that are out-of-print and/or out-of-copyright. These pieces are almost entirely arrangements of orchestral and chamber ensemble works. The scores are highly sought-after by teachers, and students value them as a vehicle for learning ensemble techniques, as there is little opportunity to practice them otherwise. The music is also in great demand by amateur musicians who enjoy playing major instrumental works.

Maine collection – consists of over 2,200 pieces ranging from 1845 to 1997 and includes keyboard, choral, vocal and instrumental music. It is the largest known collection of music by Maine composers or about Maine. The collection originated at the Maine State Library and was donated to the Bagaduce Library. A rich tool for developing Maine ties in the school music curriculum, this collection is also of significance to scholars of Maine's history.

Haywood Jones collection (Bangor Public) – consists of 28 original manuscript scores of primarily marches and school songs, composed by Haywood Jones for local high school bands in Bangor and New England. Jones was an amateur musician and composer whose popular marches and school songs continue to be played by bands throughout the region and nationally. The "town band" is a New England tradition and one that still thrives in Maine's communities. The preservation of these scores is important to the social history of the region, and access to them would benefit a broad community of users.

Appendix 2

(1) File specifications
. Bitmap: 300 dpi, 1-bit
. TIFF: 300 dpi, RGB
. JPEG: 72 dpi, RGB
. Thumbnail: 115 by 150 pixels
. Lyrics: ASCII text

(2) Metadata – Descriptive Metadata
. MARC format
. Dublin Core Metadata Element Set

(3) Authorities
. Library of Congress Authorities
. Library of Congress Subject Headings
. Library of Congress Thesaurus of Graphic Materials
. Art and Architecture Thesaurus
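The derivative specifications above can be expressed as a small table that a loading script could validate against. The values come from the appendix; the table layout and checking function are an illustrative sketch, not the project's actual tooling.

```python
# File-format specs from Appendix 2, as a validation table.
DERIVATIVE_SPECS = {
    "bitmap":    {"dpi": 300, "depth": "1-bit"},
    "tiff":      {"dpi": 300, "color": "RGB"},
    "jpeg":      {"dpi": 72,  "color": "RGB"},
    "thumbnail": {"width_px": 115, "height_px": 150},
    "lyrics":    {"encoding": "ASCII"},
}

def conforms(kind, **attrs):
    """True if the supplied attributes match the published spec for `kind`."""
    spec = DERIVATIVE_SPECS[kind]
    return all(attrs.get(k) == v for k, v in spec.items())

print(conforms("thumbnail", width_px=115, height_px=150))
print(conforms("jpeg", dpi=300, color="RGB"))  # access JPEGs are 72 dpi
```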

Administrative metadata
Filename, resolution (pixels), bit depth, image height (pixels), image width (pixels), ICC color profile, date/timestamp, BML id.

System hardware
The MMB equipment is configured with 1.44 TB of usable space: Lian Li aluminum server case, Intel P4 2.6 GHz, 533 FSB, 1.5 GB RAM, 2 x 120 GB system disks in a mirror set, 10 Seagate Ultra 320 SCSI3 disks of 147 GB each arranged in a RAID5 array (1.47 TB array set, 1.323 TB usable), total usable space (system disks plus data array) of 1.44 TB, CD-ROM, floppy, keyboard, monitor, Windows 2000 Server SP4, Internet Information Services 5.0, MS SQL Server 2000 SP3, Veritas Backup Exec with SQL Server Agent, and a Quantum DLT 40-80 GB SCSI tape drive. External drives for data transfer: Western Digital 200 GB external disks with Firewire and USB2 interfaces.

DVD production
Lian Li all-aluminum workstation tower with six cooling fans and 450 W power supply, Intel D845PESV 533 MHz motherboard, Intel P4 2.4 GHz CPU, Sony DVD+-R drive, Teac DVD-ROM, Adaptec 1200 ATA RAID controller, ATI Radeon 8200 AGP4 video controller with 64 MB RAM, 512 MB Crucial SDRAM, 6 x 120 GB Ultra ATA100 Seagate EIDE hard disks, Teac 1.44 MB floppy disk, Iomega 250 MB zip drive. All were assembled, tested, and installed at Fogler Library by staff members. Workstation software was composed of Windows 2000 Professional, Office 2000 Professional, and DVD mastering using Ahead Nero V 6.0 (Figure A1).

Figure A1 Vital Statistics and off-the-shelf system software
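The usable-capacity figures quoted above follow from standard RAID arithmetic: a RAID 5 array of n disks stores (n - 1) disks' worth of data (one disk's worth goes to parity), and a two-disk mirror stores one disk's worth. A quick check:

```python
# Verifying the storage figures: 10 x 147 GB disks in RAID 5 yield
# (10 - 1) x 147 = 1,323 GB (the stated 1.323 TB), and the 2 x 120 GB
# mirrored system pair yields 120 GB, for about 1.44 TB total usable.
def raid5_usable(disks, size_gb):
    return (disks - 1) * size_gb  # one disk's capacity is consumed by parity

def mirror_usable(size_gb):
    return size_gb  # a mirror stores a single copy's worth of data

data_array = raid5_usable(10, 147)
system = mirror_usable(120)
print(data_array, system, data_array + system)
```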


Enabling technologies and service designs for collaborative Internet collection building

Steve Mitchell, Julie Mason and Lori Pender

The authors
Steve Mitchell, Julie Mason and Lori Pender are all based at the University of California, Riverside Library, California, USA.

Keywords
Collecting, Internet, Classification, Portals

Abstract
The following describes a number of technologies and exemplary service designs that foster better Internet finding tools in libraries and more cooperative and efficient effort in Internet resource collection building. Our library and partner institutions have been involved in this work for over a decade. The open source software and projects discussed represent appropriate technologies and sustainable strategies that will help Internet portals, digital libraries, virtual libraries and library catalogs-with-portal-like-capabilities (IPDVLCs) to scale better and to anticipate and meet the needs of scholarly and educational users.

Electronic access
The Emerald Research Register for this journal is available at www.emeraldinsight.com/researchregister
The current issue and full text archive of this journal is available at www.emeraldinsight.com/0737-8831.htm

Library Hi Tech
Volume 22 · Number 3 · 2004 · pp. 295–306
© Emerald Group Publishing Limited · ISSN 0737-8831
DOI 10.1108/07378830410560099

Introduction

Over the last decade we have acted upon the belief that developing a portal or virtual library that would provide significant coverage of important Internet resources for researchers and students would be a large task requiring a cooperative, multi-institutional approach. We also believed that academic library Internet collections, while differing substantially in some ways, tend to include many of the same high-quality, core resources. To have thousands of libraries building very similar collections of links and related metadata would be wasteful and, moreover, would result in collections that would be much less substantial and useful than collections built through multi-institutional and combined effort. In looking for synergy and reduction of redundant effort, we have supported, through our INFOMINE, iVia and Data Fountains projects, multiple efforts and approaches involving multiple institutions and projects. In this paper we will discuss these projects and collaboration-enabling technologies (and related innovations in systems, services, metadata, and collection design), which support more efficient, more cooperative collection building and better user access. We will introduce the projects with which we have been involved, discuss relevant general trends and new information environments and possibilities, and review the general approaches, technologies, methods, designs, software and algorithms that are utilized in our software.

Received 30 January 2004; revised 8 April 2004; accepted 12 June 2004

Pioneering projects based in collaborative technology

INFOMINE description and goals

The INFOMINE[1] virtual library service (established in 1994) has the mission of identifying, describing and making visible to the academic community the significant scholarly and educational resources on the Internet. It contains over 115,000 links to resources in multiple subjects and represents the collaborative efforts of many librarians and faculty at the University of California (Riverside, Los Angeles, Santa Cruz, Davis and Irvine), Wake Forest University, California State University (Fresno and Sacramento) and the University of Detroit (Mercy). We have begun collaboration with the National Science Digital Library, the Library of Congress, and the University of California Shared Cataloging Project to share content. The collection is a hybrid one (Mason et al., 2000) consisting of metadata that is created by subject experts, machine processes, or machine processes with expert refinement. INFOMINE is used for end-user searching as well as for collection development on the part of other Internet portals, digital libraries, virtual libraries and library catalogs-with-portal-like-capabilities (IPDVLCs). INFOMINE was conceived as a multi-institutional, collaborative effort. It uses iVia software as its system platform. INFOMINE support has come from the US Institute of Museum and Library Services (IMLS), the National Science Digital Library (National Science Foundation), the Fund for the Improvement of Post-Secondary Education (US Department of Education), the Library of the University of California, Riverside and the Librarians Association of the University of California.

iVia description and goals

iVia[2] is an open source portal or virtual library collection-building software platform (Mitchell et al., 2003). Figure 1 is a graphical overview of the functional design of iVia. It was designed to support multiple institutions and projects in cooperative collection building efforts. The system is used by INFOMINE and the National Science Digital Library (NSDL), among others. Written in C++, it features a very large number of custom-configurable user interface and information retrieval options. For content builders, multiple metadata creation options, including support for multiple "production lines" and levels of editorial control, are available. Machine assistance, to semi- and/or fully automate a number of tasks, is featured. Automated resource identification is made possible through innovations in focused crawling. Automated metadata generation involves classification software that assigns Library of Congress Subject Headings (LCSH) and Library of Congress Classifications (LCC). iVia development is supported by the US Institute of Museum and Library Services, the National Science Digital Library (National Science Foundation), and the Library of the University of California, Riverside.

Figure 1 iVia overview (Mitchell et al., 2003)

Data Fountains description and goals

Data Fountains[3] will be an open source software system and a service for automated or semi-automated Internet resource discovery and metadata generation. Currently under development, it is based in the iVia system but expands beyond this by creating an array of collection building systems for cooperating projects, with the goal of generating the basic "ore" (awareness of and links to important Internet resources together with associated metadata records and rich full-text) for these projects. Figure 2 shows an overview of the Data Fountains system. In the Data Fountains model, each collaborating project and/or subject community will work with its own focused crawler and classifiers. Expert interaction is designed into the system and is crucial in developing, fine-tuning and significantly improving the performance of the system. Data Fountains work is supported by the US Institute of Museum and Library Services and the Library of the University of California, Riverside.

Figure 2 Data Fountains overview

Collaborative, participatory technology

The technology that the INFOMINE, iVia, and Data Fountains projects rely on has been designed to enable and facilitate cooperative service and collection building. That is, the underlying technology of the iVia and Data Fountains systems is designed to be cooperative and participatory and to support many modes of collaboration. The machine learning components actually gain significant increases in accuracy through expert input from collaborators in the form of training, "truing" and refinement of machine processes. While both fully automated and fully manual processes for collection building are supported, semi-automated processes involving interactive subject domain expertise are emphasized. Both the Data Fountains and iVia projects are concerned with developing enabling technology that amplifies, augments and makes expensive expert effort more efficient while, at the same time, employing this expertise to interact dynamically with and improve the performance of automated systems. iVia and Data Fountains are examples of IPDVLC communityware. The remainder of this paper focuses on the underlying concepts of IPDVLC communityware and the development work that went into building our systems.
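The focused crawling behind automated resource identification can be sketched as a priority queue: candidate links are scored for estimated topical relevance, so crawl effort concentrates on promising pages first. The toy keyword-overlap scorer below stands in for iVia's actual classifiers, which the article does not detail; all names and URLs are illustrative.

```python
# Minimal focused-crawl ordering: score candidates by topical relevance
# and visit the most promising first via a max-priority queue.
import heapq

def relevance(text, topic_terms):
    """Toy relevance: fraction of words matching the topic vocabulary."""
    words = text.lower().split()
    return sum(words.count(t) for t in topic_terms) / max(len(words), 1)

def focused_crawl_order(candidates, topic_terms):
    """candidates: list of (url, anchor_or_summary_text). Returns URLs best-first."""
    heap = [(-relevance(text, topic_terms), url) for url, text in candidates]
    heapq.heapify(heap)  # negate scores: heapq is a min-heap
    return [heapq.heappop(heap)[1] for _ in range(len(heap))]

order = focused_crawl_order(
    [("http://a.example/geology", "volcano basalt geology field guide"),
     ("http://b.example/cooking", "recipes for pasta and bread")],
    ["geology", "volcano", "basalt"])
print(order)
```

In a real focused crawler the score would come from a trained classifier, and expert "truing" of its training data is exactly the human-in-the-loop refinement the paragraph above describes.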

Innovative service, metadata and collection designs for cooperative Internet collection building

Cooperative collection building

Efforts to encourage cooperative collection building have been moderately successful to date, even though organizational barriers to participation have been and remain substantial. Fortunately, this is changing: many of the challenges the library and academic community face have intensified to the point where managers are recognizing that insularity and competition are not productive. The best way to compete, or at least persevere, as an organization may in fact be through intensified cooperative effort. Time and improvements in technology are facilitating cooperation: computing power and related expertise are less expensive; scholarly and educational Internet usage is assuming more regular forms and service patterns; and librarians are increasingly technologically savvy. Finally, ever-eroding library budgets are the most powerful catalyst for encouraging cooperation.

Cooperative collection building increasingly takes place within the larger context of search or finding tools that feature, at the core of their design, accommodation of multiple, diverse end-user needs, intents and skills. Increasingly, finding tools (and the collections that support them) are or will be emphasizing and benefiting from such developments, discussed below, as: dynamic landscaping (Nicholson, 2002); multiple data views and means of customized/personalized access; shared, heterogeneous data and collections; and modular and recombinant data, collections and finding tools. Collections and collection building processes must therefore be designed with this end-user centered context and new functionality in mind. Major components of this new information environment include the following.

Cooperatively created, multipurpose metadata and data

Metadata should be increasingly multipurpose and flexible in intent and support a wide expanse of possible data views. There are a number of considerations that collection builders need to reflect upon in designing their collections and metadata with maximum collaboration and multiple uses in mind. While it is hard to anticipate newly developing metadata needs and usages, projects should strive to be as inclusive as possible within economic constraints. The development of multipurpose metadata, conceived expansively, may have to rely on automated or semi-automated metadata generation before becoming affordable and widespread. More accurate automated translations and mappings from natural language to and among metadata subject schema could play an important role here (Wake and Nicholson, 2001). INFOMINE/iVia supports versatility in metadata usage by including parallel fields for annotations (to support both short and long annotations), subjects and other fields. Also included are fields for resource format and subject research disciplines, which are not used by the public INFOMINE service at this time, but which should prove to be of value to others.

Metadata should be conceived in a modular manner. It should be easy to swap between subject schema depending on the user audience accessing a search service. For example, LCSH and other library-standard subject schema (while providing a wonderful generalized language for describing the widest spectrum of subjects) are often seen by increasingly specialized scholarly communities as too generalized or too alien from community vocabulary to represent their interests well. Specialized subject searches would be more effective if done in more specialized subject schema rather than in a generalist subject schema. Generalist subject schema, on the other hand, can provide important background context and are useful for making connections across multiple specialized communities. In this vein, it has been important in INFOMINE to assign natural language keyphrases and annotations and to make selected full-text (from the resource itself) available and searchable. Natural language text search approaches can often counteract retrieval problems associated with the limitations of generalist and controlled vocabularies.

We also recognize that in some uses of the metadata that are shared with other projects, our choice of emphasized subject schema may not be appropriate. Of great value in this situation is the realization that the natural language text we identify, extract and apply is important baseline data relevant to a great number of applications and projects. Other IPDVLCs may, for example, be able to take this text data (e.g. keyphrases and annotations as well as selected rich full-text from the resource) and reprocess and/or refine it, through their own manual or automated processes, into the subject or other metadata of their choice. For example, augmenting library catalog records with rich full-text would be a useful aid to user retrieval, given that this type of metadata has the problems associated with generalist subject schema as well as, typically, a very limited number of thematic access points for user retrieval. This is an application of full-text that we are currently exploring. The natural language text we identify, extract and/or apply is quite important in and of itself, beyond the value of the metadata generated from it.

Increasingly, managing multiple incoming metadata streams is also a necessary prerequisite in cooperative finding tool collection building, as many projects share and augment their collections with metadata from other cooperative collections or sources. Metadata within INFOMINE comes from a number of sources. It is manually or semi-automatically created by multiple institutional participants within the project. Manually-created metadata is imported from other collections, including virtual or digital libraries (as is the case with National Science Digital Library metadata) and standard MARC cataloging efforts (as mentioned, INFOMINE imports, translates and includes metadata from the University of California's Shared Cataloging Project).
Finally, INFOMINE includes automatically-created metadata from three crawlers (an expert-guided crawler, a virtual library crawler, and a focused crawler, described further below) and a number of metadata generation, or classification, programs (i.e. classifiers). Data Fountains, when operational, will provide and receive multiple metadata streams to and from a number of participating subject communities. Strongly implied in the notion of cooperativelycreated metadata and collection building is the ability to accommodate diverse means of metadata creation. Cooperating projects and subject communities often have evolved their own ways of doing things, and these methods need to be supported, within reason. If the net product of

multiple cooperative approaches results in metadata that shares a critical number of core fields then everyone gets benefited. To this end, software known as “Theme-ing” software has been developed for cooperators in content building (it is also used in user interface building). This software makes the building of customizable, modular, template-based interfaces for content creation and editing possible. These interfaces are constructed to suit the differing content-building needs and traditions of cooperating projects. Projects differ in their workflow needs (e.g. some require many tiers of editorial review while others require fewer) reflecting differences in divisions of labor, amount of labor available and organizational culture. For example, while most library catalogs rely almost exclusively on trained catalogers, many virtual libraries rely on public services librarians in assigning all metadata, including controlled subjects. Still other virtual libraries augment with catalogers (working with public services librarians) to assign the more exacting controlled subject terms. Customizable content builder interfaces and the many approaches to metadata generation supported all serve to increase collaboration. Notably, by having associations with other projects and varying methods, comfort zones and then convergences develop. Metadata sharing characterizes cooperating projects of the INFOMINE type. While many INFOMINE cooperators share the same iVia backend (i.e. core databases and system) and do not move data around, one of iVia’s major features is the ability to move data flexibly among differing projects and databases. Data format and transfer standards supported include Dublin Core (DC), including DC variants with added fields, the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) and Standard Delimited Format (SDF). OAI-PMH is used as an essential part of iVia and Data Fountains architecture to move data among INFOMINE databases (e.g. 
the crawler fed databases are harvested by the main database via OAI-PMH). OAI-PMH is used to share records with NSDL. Uniformity is also gained through usage of important library-standard subject schema such as LCSH, LCC and, shortly perhaps, Dewey Decimal Classifications (DDC). These are important because they are the subject description languages used in most US libraries and this bodes well for providing a common finding tool that will provide a uniform, consistent subject access bridge to both Internet resources and print record in library catalogs. It is disappointing that these standards remain proprietary (with an open model they might be more widely used both globally and,

298

Enabling technologies and service designs

Library Hi Tech

Steve Mitchell, Julie Mason and Lori Pender

Volume 22 · Number 3 · 2004 · 295–306

more critically, outside of the library community). The iVia and Data Fountains projects will develop or improve classifiers for each standard while also working to improve often-inadequate mapping approaches.
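As a concrete illustration of the OAI-PMH-based record movement described above, the sketch below parses a ListRecords response and extracts the Dublin Core fields. This is not iVia code; the sample record is invented, and only the namespaces are taken from the OAI-PMH and DC specifications.

```python
# Sketch: extracting Dublin Core fields from an OAI-PMH ListRecords
# response, the mechanism the article describes for moving records
# among databases. The SAMPLE response below is fabricated.
import xml.etree.ElementTree as ET

OAI = "{http://www.openarchives.org/OAI/2.0/}"
DC = "{http://purl.org/dc/elements/1.1/}"

SAMPLE = """<?xml version="1.0"?>
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <metadata>
        <dc xmlns="http://purl.org/dc/elements/1.1/">
          <title>Example Internet Resource</title>
          <subject>Geology</subject>
          <identifier>http://example.org/resource</identifier>
        </dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""

def harvest(xml_text):
    """Return a list of {field: [values]} dicts, one per OAI record."""
    root = ET.fromstring(xml_text)
    records = []
    for rec in root.iter(OAI + "record"):
        fields = {}
        for elem in rec.iter():
            # keep only DC elements that carry text (skips the container)
            if elem.tag.startswith(DC) and elem.text and elem.text.strip():
                fields.setdefault(elem.tag[len(DC):], []).append(elem.text)
        records.append(fields)
    return records

records = harvest(SAMPLE)
print(records[0]["title"][0])   # Example Internet Resource
```

A real harvester would fetch successive `ListRecords` responses (following resumption tokens) from a cooperator's OAI-PMH endpoint rather than parse a literal string.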

Cooperatively created dynamic information landscapings

Notions that are informing both current and future developments of our projects are many. Chief among these is that databases and collections, user interfaces and retrieval options, metadata, and even computing itself will be increasingly recombinant (Xerox Corporation Palo Alto Research Station (PARC), 2001; Lynch, 2001; Dempsey, 2003) – that is to say, multipurpose, modular and dynamically configurable, in both intent and design. This notion goes beyond predominantly static portals and preconfigured assemblages of distributed collections accessible through meta-searching approaches with often limited, common-denominator retrieval features. As the major components in a dynamic, user-centered information landscape (Nicholson, 2002), the information system building blocks mentioned will need to be modularly designed and capable of being combined, dynamically on the fly, in very diverse ways (many unanticipated by their creators). In this way, finding tools will be better able to accommodate user differences in information-finding skills, intellectual levels, intent of search (e.g. commercial, educational or research), culture, language skills, ability to pay for resource usage, and subject focus, among others. In pursuit of delivering a much richer user finding experience and optimizing access to the ever-expanding riches of increasingly diverse metadata and collections, the term “recombinant” strongly implies that more fully featured, intelligent, dynamic interfaces, backends and middleware (to broker among collections and interfaces) will be required. Additionally, the library community will have to think beyond portals as institutionally constrained gateways to relatively static assemblages of resources and data views, more effectively pool resources and develop collaborative effort, and more effectively develop and employ resource-saving means of machine assistance to help with these not inexpensive tasks.

For end-users, at this point, INFOMINE/iVia remains relatively static, though it is capable of flexibly supporting many portals, interfaces or brands. Currently it lends itself on a basic level to dynamic landscaping and recombinant approaches, given that subject collections can be linked to or searched separately or as a whole, either through the native interface or others. Any combination of the collections can be searched together. Additionally, cooperating institutions can create very specific canned searches (i.e. stored searches that are linked and execute when clicked) of any combination of records for users. Crucially, we feature a field which allows multiple projects to put in their own custom data elements to support particular data views. Some of these are for limited and/or short-term needs, such as supporting a faculty member giving a course by providing a brief “Webliography” of Internet resources on the subject, as generated from INFOMINE, at the course’s site (e.g. records relevant to a Geology 110 course receive the data element UCRGeol110 in this field for the 12-week period of the course). Others may wish to add elements to this field to function as major filters in all searches where, for example, a cooperating institution wants its audience to view only the fee-based resources to which it, and not others, subscribes.

iVia and Data Fountains support dynamic landscaping and recombinant metadata and collection building for IPDVLCs and their users by supporting the creation of a controlled number of parallel metadata fields. Parallel fields for metadata enable differing sets of related metadata (e.g. differing combinations of thematically oriented fields) to be flexibly and dynamically assembled for, and become searchable in custom ways by, users whose retrieval and metadata needs differ between search sessions or even between individual searches. Parallel fields help accommodate differences in data views and in the depth, type and exactitude of metadata to be applied in specific types or groupings of fields. For example, portal Y may enable users to search short annotations and LCSH1 (cataloger-vetted LCSH), while virtual library Z may broaden this to enable users to search long annotations, LCSH1, LCSH2 (public services librarian-created LCSH), LCSH3 (machine-created LCSH), Keyphrase1 (expert-created) and Keyphrase3 (machine-generated).

These various sets of community-specific LCSH terms would exist as overlays of, and share, “common denominator” and/or “best” metadata (in this case LCSH1, with or without other LCSHx as needed) in a foundation record. What constitutes a foundation record and its common-denominator, “best” metadata (upon which metadata that is perceived to be less important or less accurate could be overlaid) would vary depending on the data views and needs emphasized by the community(ies) involved. In most cases it would be that metadata thought to be critical and that is shared by members of multiple, usually allied, subject communities. Generally though, as a minimum for most if not all, the notion


of title as well as author or creator would be present in all foundation records, as would a subject approach. For most US academic libraries, following the example above, LCSH1 (cataloger vetted) would be seen as an inherent part of any foundation record, while LCSH3 (machine-created LCSH), keyphrases, and descriptions might be seen as useful overlays (to augment and round out), but not as primary. Specialist science communities, on the other hand, might not use LCSH in any form, it being perceived as a language of generalists, but instead emphasize annotations, keyphrases and specialized vocabularies. Also of value to the recombinant concept from the content builder’s perspective is that foundation records contain generic textual data (natural language keyphrases, annotation text and rich full-text), in addition to controlled metadata. This text, as mentioned earlier, is useful both as-is and in supporting projects that use new classifiers to populate new subject schema to meet new metadata needs. Our metadata model increasingly anticipates the development of interfaces with support for dynamically changing information landscapes and data views and differing mixes and layers of metadata (foundation records augmented by various combinations of overlays), which can be assembled in diverse ways instantaneously and interactively or be pre-configured statically. Flexible, modular, recombinant interfaces are necessary to support dynamic landscaping and provide users access to fluidly changing constellations of collections and metadata elements (Xerox Corporation Palo Alto Research Station (PARC), 2001; Kerne and Sundaram, 2003). Such interfaces will further develop and improve personalized access and interest profiling for individuals[5], as is increasingly practiced (though often within narrow ranges of relatively pre-configured subsets of data views and access) in many institutional portals (Dempsey, 2003).
Recombinant interfaces will eventually go beyond this by employing machine learning techniques to encourage interactive, personalized user dialogue that intelligently detects the most promising finding features as well as the optimal mix of resources and metadata for the user. Interfaces could reassemble themselves dynamically on the fly in response to user choice (e.g. browse options might be emphasized over search, the visually impaired would see large fonts, query translations could occur for searching collections in multiple languages, Spanish could replace English in the interface). In doing so, the interface would work with each user to assemble the ideal interface for the situation. These interfaces are on our drawing boards.
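The foundation-record-with-overlays model described above can be sketched in a few lines. The field names (LCSH1, LCSH2, LCSH3, Keyphrase3) follow the article's own example; the record content and community names are invented for illustration.

```python
# Sketch of the foundation-record/overlay idea: a shared base record
# carries "best" metadata (here LCSH1), and community-specific overlays
# add further subject fields for their own data views.

foundation = {
    "title": "Plate Tectonics Tutorial",
    "creator": "Example University",
    "LCSH1": ["Plate tectonics"],                     # cataloger-vetted
}

overlays = {
    "portal_Y": {},                                   # foundation only
    "library_Z": {
        "LCSH2": ["Geology -- Study and teaching"],   # librarian-created
        "LCSH3": ["Earth sciences"],                  # machine-created
        "Keyphrase3": ["subduction zones"],           # machine-generated
    },
}

def view_for(community):
    """Assemble the record a given community sees: base plus overlay."""
    record = dict(foundation)
    record.update(overlays.get(community, {}))
    return record

def searchable_subjects(community):
    """Collect the subject terms searchable in that community's view."""
    terms = []
    for field, values in view_for(community).items():
        if field.startswith(("LCSH", "Keyphrase")):
            terms.extend(values)
    return terms

print(searchable_subjects("portal_Y"))   # ['Plate tectonics']
```

The same foundation record thus yields a narrow subject index for portal Y and a broader one for virtual library Z, without duplicating the record itself.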

INFOMINE, iVia and Data Fountains currently emphasize a maximum-choice, flexible user interface that features a very large set of options for users at both individual and institutional levels. These include, but are not limited to: fielded and full-text searching with Boolean and proximity operators; nested Boolean searching; searching any combination of general subjects or fields; limiting searches using filters for, by way of example, resource types or institution-specific data views; and multiple formats for displaying differing amounts of metadata. Though static at this time for the user, these interface capabilities are recombinant options for institutional cooperators through the “Theme-ing” software described earlier. INFOMINE and iVia search, browse and display capabilities can be assembled in unique interfaces that support the presentation (i.e. “brand” and “look and feel”) and information-finding needs of cooperating organizations. “Theme-ing” provides support for institutional identity management. The net result is that an institution can employ and share INFOMINE’s information retrieval and display capabilities and/or cooperatively created metadata and have it appear as if these emanate from the specific institution and its portal. The interface for California State University, Sacramento, is an example of this[4].

Hybrid collections of heterogeneous data

In INFOMINE, as mentioned, there is support for metadata created by experts from multiple institutions within the project as well as for metadata imported from other collections. Support is also available for metadata that has been created wholly automatically by crawler/classifiers as well as for metadata that is created automatically and then refined and augmented by experts. The INFOMINE collection is designed as a two-tiered, hybrid collection.
In this approach, the collection of machine-created records is meant to support another wholly expert-created collection that is generally more accurate, but which is comparatively labor intensive, expensive to create and maintain, and more limited in breadth or depth. The INFOMINE hybrid collection model features multiple collections of records of distinctly different origin and somewhat varying fields and data elements. This hybrid collection model has encouraged cooperation among projects with metadata of varying types.
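The two-tiered hybrid collection just described can be sketched as a simple merge in which expert records take precedence over machine-created ones while the machine tier extends coverage. The records and URLs below are invented; iVia's actual storage is of course a full database, not two dicts.

```python
# Sketch of the two-tiered hybrid collection: the expert tier wins on
# conflicts; the machine tier fills in breadth the experts cannot reach.

expert_tier = {
    "http://example.org/a": {"title": "Resource A", "origin": "expert"},
}
machine_tier = {
    "http://example.org/a": {"title": "resource a", "origin": "machine"},
    "http://example.org/b": {"title": "Resource B", "origin": "machine"},
}

def merged_collection(expert, machine):
    """Machine records fill gaps; expert records override duplicates."""
    merged = dict(machine)
    merged.update(expert)   # expert tier overwrites machine duplicates
    return merged

collection = merged_collection(expert_tier, machine_tier)
print(collection["http://example.org/a"]["origin"])  # expert
print(len(collection))                               # 2
```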

Collaborative service design and harnessing new technologies

INFOMINE has supported librarians working together in a united effort in virtual library service


and systems building for over a decade. The service has been cooperatively created and is shared by a number of institutions. Users and cooperators are offered a great variety of metadata and full-text searching options as well as choice in interface development. Metadata building processes that emphasize cooperative approaches, shared and heterogeneous metadata from multiple sources, and the ability to support unique needs and data views are featured and supported by the iVia software platform upon which INFOMINE is based. Data Fountains, currently under development, will be an Internet collection building service and software platform consisting of a resource discovery capability as well as metadata generation and rich full-text harvesting capabilities. Like INFOMINE, the Data Fountains service is being designed organizationally as a cooperative of sharing organizations. Data Fountains differs, though, in that it is not specifically concerned with cooperator or end-user interfaces or searching, institutional identity management, individual finding tool end products, or specific portal mixes or assemblages of collections. Rather, it is a generic service (and underlying software) that uses machine learning-based technology to facilitate federated efforts at content building by identifying useful Internet resources and generating the generic metadata and full-text products valued by most IPDVLCs. As iVia does for INFOMINE, the Data Fountains system will enable the creation and sharing of collection building technologies and processes by assemblages of cooperating projects, with the difference that the metadata and full-text produced will reside in entirely external and independent systems and projects. Data Fountains also differs somewhat from iVia in that there is more of an emphasis on machine assistance to experts and expert truing of machine processes within the context of semi-automated resource discovery and metadata generation.
The INFOMINE service, iVia software and Data Fountains (both as service and software) are serious efforts to share not only the products of advanced machine learning and information technologies but also the technologies themselves. These projects have attempted to provide this technology, which is not generally available and is expensive to develop, both in modular components and as integrated systems. The technology underlying these projects supports traditional manual, expert-based metadata creation as well as fully- and semi-automated means of metadata creation. It is the technology that will help make possible the richer information environments of the future. The hope is that the

information retrieval, crawling and classification, import/export, record building, interface, communityware and database management software developed as open source code, and the services related to these, will function to introduce and sustain the cooperation-engendering and scaling technologies that IPDVLCs (especially those in the public domain) will need in order to survive and thrive.

Characteristics of and constraints facing collaborations, cooperative organizations, and libraries

Innovative, collaborative technologies and design facilitate, and are nurtured by, innovative, cooperative organizations and collaborations. We have observed that these organizations and collaborations are characterized by: relatively flat, non-hierarchical management structures; flexibility; simplicity; a high degree of interaction and communication; modularity and multipurposiveness; non-aversion to risk taking; pursuit of mutually efficient effort or even synergy; and the ability to work and co-evolve with other organizations. Though this can change, in our experience libraries face many organizational challenges that are barriers to innovation and impede the development of new technologies, cooperative efforts, and better information tools for users. It is our belief that, currently, many new, innovative, public domain, Internet-based services related to academia do not map well to traditional library and other academic organizations, though they do map very well to the medium (in which they were born) and its potential. New organizational and service models must therefore be developed in libraries that better map to the capabilities and opportunities inherent in the new medium. Scholarly, Internet-based communication may reach its potential only if new organizational structures and approaches are realized that expand beyond traditional models of the physical library and the print-based collection.
Enabling technologies in support of collaborative effort and designed to amplify expert effort

Given the size of the Internet as a medium and the great number of significant resources available, existing academic Internet collections are, at best, representative in coverage and are only partial efforts toward what should be a cooperative assemblage of much larger, integrated collections. The task of providing roads, highways and generally thoughtful access to the “intelligent Internet” in as comprehensive and objective a way as possible (while adhering to traditions in library-based


collection quality) in cooperatively building one or more Academic Resource Webs is not a project that the library or academic community should cede to commercial interests. The projects described have been working in pursuit of, and have made good progress in achieving, a significant public domain Academic Resource Web over the last ten years. Over the last five years, they have been concerned with contributing to the technologies that should help such an effort scale. Machine learning-based technology is maturing rapidly in areas of application relevant to Internet finding tools and is familiar to, and used daily by, many in academia through Google (which employs automated, intelligent crawling for Internet resource discovery). Automated resource discovery and classification technology is very promising and starting to yield important breakthroughs that, we believe, will affect libraries in many important ways. In this vein, it should be noted that Google started as an academic research effort and much of its technology was developed (and continues to be developed) by academics. We feel it is crucial that those of us in academic libraries avail ourselves of machine learning technology and the related expertise in academia that surrounds us and can be organized to work with us. It is important to harness and chart our own course with this technology if we expect to meet the finding tool needs of our users during the “Internet Age”. Currently, though, many of us rely on Google’s charity. This comes without any guarantee of future support. The assumption is that their business model will continue as it has been in perpetuity. Many of us assume their search results placement is objective. It often is not, because of such phenomena as paid-for results placement and “Google-bombing” (faked usage to increase a site’s prominence among Google’s ranking algorithms).
We assume that their particular machine learning technologies and algorithms will continue to work for an almost infinite number of audiences and resources. With a lot at stake, these assumptions are questionable. Given this, we feel it is important (and accomplishable) to develop or adapt relevant parts of this type of enabling technology to work in a sustainable, appropriate technology manner that is suitable for academic finding tools, and which maximizes the unique capabilities and expertise the library community has to offer in terms of subject knowledge and collection building. A finding and collection building tool that utilizes machine assistance and machine learning to augment and amplify librarian effort promises to

be a superb Internet finding tool. Such a finding tool would offer the excellence and objectivity required in academia, which is characteristic of librarian collection building efforts, while also offering some of the reach and large-scale coverage needed to represent the great number of relevant resources of value to the academic community. This is the promise that the iVia and Data Fountains work has been striving towards. The scale of the new information environment requires increasing roles for both automation and cooperative, expert effort. While looking to automate much of the collection building process, it is our belief that the main advances for academic finding tool applications will be in weaving librarian and other subject domain expertise into this technology. The iVia and Data Fountains projects are plugging in the expert to true the technology while using the technology to expand and improve the expert’s effort. Machine/expert interaction is an area where large commercial operations cannot easily follow or serve us well. This is because libraries have the domain expertise to pursue and develop machine/expert interaction on a continuing basis, which is critical to improving finding tools for academic research and education, and because academia is not, relatively speaking, a large or lucrative market. An overall architecture and design that supports modular software and systems has been built into iVia and Data Fountains. In this way, these projects can anticipate and make quick use of new technology and breakthroughs. Designing these systems so that the crawlers, classifiers, database management software and other components can be swapped out as needed will help keep iVia and Data Fountains evolving more easily. Modularity in design also helps maintain place markers for uses, components and collaborations not anticipated during design.
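The swappable-component design described above can be sketched with small interfaces behind which crawlers and classifiers are interchangeable. The class and method names here are ours, chosen for illustration; they are not iVia's actual API.

```python
# Sketch of modular, swappable pipeline components: any crawler or
# classifier satisfying the interface can be plugged into the pipeline
# without touching the rest of the system.
from typing import Protocol


class Crawler(Protocol):
    def discover(self, seeds): ...


class Classifier(Protocol):
    def classify(self, url): ...


class DirectedCrawler:
    """A stand-in for the expert-guided, directed crawler: a real one
    would fetch pages and follow links; here it simply echoes seeds."""
    def discover(self, seeds):
        return list(seeds)


class SubjectClassifier:
    """A stand-in classifier returning a placeholder LCSH-style label."""
    def classify(self, url):
        return ["General works"]


def build_records(crawler: Crawler, classifier: Classifier, seeds):
    """The pipeline itself never names a concrete component."""
    return {url: classifier.classify(url) for url in crawler.discover(seeds)}


records = build_records(DirectedCrawler(), SubjectClassifier(),
                        ["http://example.org/"])
print(records)   # {'http://example.org/': ['General works']}
```

Swapping in a focused crawler or an SVM-backed classifier means writing one new class, not rewriting the pipeline, which is the point of the place-marker modularity described.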
It is important to realize that the iVia and Data Fountains software will be licensed as free open source software because we believe that this is a good strategy for sustainable software development. As open source, its continued development will be sustained through the efforts of IPDVLCs that wish to extend its capabilities for their own uses. Open source software also provides a good forum for developing fuller collaborations among like-minded projects having the same or similar software needs. iVia and Data Fountains include much technology supportive of groups and institutions that build Internet collections that otherwise could not avail themselves of it. Open source systems may eventually contribute to


helping these efforts scale by making efficient, state-of-the-art collaborations and services possible. Both the iVia and Data Fountains systems are intended to be free IPDVLC communityware.

Technologies for Internet resource discovery and identification

A number of crawling systems are used in iVia and Data Fountains. The simplest is an expert-guided, directed crawler that is given a list of URLs and crawls these to specified depths. There is also a virtual library crawler that crawls a set of close to 1,000 expert-selected academic virtual libraries (i.e. intelligently organized subject directories of Internet resources). The most sophisticated crawler, and one that continues to be developed, is the Nalanda iVia Focused Crawler (NIFC). Focused crawling makes possible the accurate identification of significant Internet resources within specific communities of shared subject interest (Chakrabarti, 2003) and represents an appropriately scaled approach for many library and academic community applications. NIFC is a program that crawls the Internet to find resources that are strongly inter-linked and which are part of, and contain content similar to, the same or related learning communities as those represented in INFOMINE and other significant academic IPDVLCs. The high-quality data from IPDVLCs is often used in seed sets, or for training, to guide the crawler. As the crawling progresses, an inter-linkage graph is developed of which resources link to one another (i.e. cite and co-cite). Good resources focused around a common topic often cite one another. Highly linked resources are evaluated, differentiated and rated as to the degree to which they are linked to/from, as well as for their capacities as authoritative resources (e.g. an important resource, such as a database, which receives many in-links from other resources) or hubs (e.g. secondary sources, such as virtual library collections, which provide out-links to other, authoritative resources).

After such assessments have occurred, a second automated process is put into play which rates resources, as a second, indirect measure of resource quality, by comparing the potential new resources and the resources already in the collection for similarity of content (e.g. similarities in key words and vocabulary). The most linked-to/from authorities and hubs, with terminology most similar to that in other high-quality collections, thus become prime candidates either for adding to the collection as automatically created records or for expert review and metadata refinement. There are numerous algorithms and approaches for detecting relevant resources through co-citation or linkage analysis and through text similarity analysis. These areas of inquiry and software development are rapidly expanding frontiers in computer science research where great advances are being made. The rewards are much greater efficiency and accuracy in the automated discovery of significant Internet resources. The NIFC work has benefited, and will continue to benefit, iVia and Data Fountains. Improved preferential focused crawling has become an emphasis; it addresses improving accuracy in automatically selecting the “better” links to crawl among all those available on a page (i.e. the URL frontier). This involves an “apprentice” learning program (Chakrabarti et al., 2002) that intelligently detects clues in a resource, which a human user would notice, regarding which links are the most promising to follow (e.g. visually emphasized links, link placement on the page, anchor text and text windows around anchor text) (Chakrabarti et al., 2002; Chakrabarti, 2001; Glover et al., 2002; Menczer et al., 2004; Menczer, 2004). The concepts of reinforcement learning in focused crawling, as well as soft focused crawling, are being examined, e.g. relaxing crawls to overcome bottlenecks (Chakrabarti et al., 2002; Chakrabarti, 2003; Flake et al., 2002; Menczer et al., 2004; Rennie and McCallum, 1999). Expert interaction and/or semi-automated approaches to improve crawling are particular research focuses for us, as mentioned above. Figure 3 shows some of the areas of expert interaction. Experts create and refine seed sets for the crawl by selectively choosing among seeds generated from their own collections and other sources. Expert feedback on crawler results is important (e.g. specifying positive and/or negative examples), either interactively or after a crawl, to indicate to the crawler the resources most relevant to a subject and to increase accuracy. Expert truing of crawler Web graph results (i.e. manually

Figure 3 Points of expert interaction


“lifting” the values of selected hubs and authorities) either during or after a crawl is being explored to improve accuracy. The truing of crawler results through tools that visualize the crawl, so that the expert can identify the most promising areas of a Web graph for the crawler to concentrate on, is also being examined. Expert community-created “blacklists” of URLs for types of sites or pages that are not valuable save crawling time. There is such a blacklist for iVia, and there will be one for each Data Fountain. Rules for determining sites that should be fully crawled, as opposed to being sampled or crawled only at top levels, are being developed as well to save crawler time. Generally, the crawler runs are examined with an eye towards the quality of the results, with feedback going to improve the crawling process.
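The hub/authority rating and expert “lifting” described above can be illustrated with a simple iterative hub/authority computation in the style of Kleinberg's HITS, which underlies the focused-crawling literature cited. The link graph and lift factor are invented; NIFC's actual scoring is far more elaborate.

```python
# Sketch: hub/authority scoring on a small link graph, followed by an
# expert "lift" of a chosen authority after the automated pass.

links = {                      # page -> pages it links out to
    "hub1": ["auth1", "auth2"],
    "hub2": ["auth1"],
    "auth1": [],
    "auth2": [],
}

def hits(links, iterations=20):
    pages = list(links)
    auth = {p: 1.0 for p in pages}
    hub = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # authority score: sum of hub scores of pages linking in
        auth = {p: sum(hub[q] for q in pages if p in links[q])
                for p in pages}
        # hub score: sum of authority scores of pages linked out to
        hub = {p: sum(auth[t] for t in links[p]) for p in pages}
        # normalize so scores stay comparable across iterations
        norm_a = sum(auth.values()) or 1.0
        norm_h = sum(hub.values()) or 1.0
        auth = {p: v / norm_a for p, v in auth.items()}
        hub = {p: v / norm_h for p, v in hub.items()}
    return auth, hub

auth, hub = hits(links)
print(max(auth, key=auth.get))   # auth1: the most linked-to authority
auth["auth2"] *= 3.0             # expert "lifting" of a known authority
print(max(auth, key=auth.get))   # auth2, after the manual lift
```

The manual lift shows how expert truing can override purely structural evidence when a librarian knows a resource is authoritative despite sparse in-links.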

Technologies for metadata generation and rich text identification and extraction systems

iVia and Data Fountains, as mentioned earlier, involve innovations and improvements in automated metadata generation, including the identification and application of appropriate controlled subject terms (using library-standard subject schema), keywords, and annotation-like constructs. Figure 4 shows the iVia metadata generation process. Automated classifier programs apply these and other metadata generation approaches and are part of a suite of programs known as the Record Builder. Controlled subject terminology currently applied includes LCSH and LCC. In assigning LCSH, a set of keywords and keyphrases is derived which serves as a surrogate representing each Internet resource and which summarizes the resource’s content. Then, using a model that encapsulates the relationships between natural language keyphrases and the set of controlled language terms making up LCSH, the closest corresponding set of LCSHs is assigned. The model is learned from training datasets from library catalogs and virtual libraries where both LCSH and keyphrase metadata are used to describe a given resource. With LCC, the aim has been to assign one or more LCCs to a resource based on the set of LCSH associated with that resource (Eibe and Paynter, 2003). Support Vector Machine (SVM) algorithms have been used. The work has been successful, but accuracy could be significantly improved (Mitchell et al., 2003). A significant amount of current research on classification algorithms promises improvements in the quality of classification done. Work on SVM improvement is focused on improving its speed and scalability (Shih et al., 2002; Yu et al., 2003). Much of this involves combining it with faster algorithms that do pre-processing to help speed up the SVM. Current research on Naïve Bayes classifiers is being examined, since these relatively fast algorithms are being improved in terms of accuracy (Rennie et al., 2003). Given that improved and larger sets of training data increase accuracy in most forms of classification, new means of using boosting techniques and partially labeled data for training classifiers are being explored (Ghani and Jones, 2002; Jones et al., 2003; Park and Zhang, 2003; Thelen and Riloff, 2002). Leveraging expert-labeled training data with partially labeled training data using co-training techniques is among the areas being examined. This will allow us to exploit the large amount of potentially useful library catalog data available (Sarawagi et al., 2003). If this partially labeled catalog data could be used more effectively as training data, it would yield much greater accuracy in automated library-standard subject schema assignment. Classification is another area where INFOMINE, iVia and Data Fountains will increasingly use expert input from participants to improve the process and end results. Rules reflecting the semantics of resources in each major subject area will be developed by Data Fountains and INFOMINE experts for project crawlers and classifiers. Also being pursued are advancements in processes that improve the groundwork for more accurate classification, notably rich text identification and extraction. An important goal in this is refinement of an “aboutness” measure (Mitchell et al., 2003) for use in identifying the most relevant pages in a resource, or sections of a document, which are intended by the author(s) to be “rich” in descriptive information about the topics within and from which accurate mining of more relevant keyphrases can occur. Involved is improved detection of the type and boundaries of the Internet object encountered and better determination of author-created structures and conventions in document and resource layout (e.g. introductions, summaries, etc.). Better, more accurate natural language keyphrase harvesting from rich text means more accurate automated application of controlled subject schema terms and annotations.
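The keyphrase-to-LCSH mapping described can be illustrated with a deliberately simplified stand-in: instead of the SVM model learned from catalog data, the sketch below ranks headings by keyphrase overlap (Jaccard score) against invented training pairs. It shows the shape of the mapping, not the actual iVia classifier.

```python
# Sketch: assign LCSH headings to a resource by comparing its extracted
# keyphrases against keyphrases previously seen with each heading.
# Training pairs are fabricated; a real system would learn this mapping
# from library catalog and virtual library data, as the article notes.

# training data: LCSH heading -> keyphrases seen with it in catalogs
TRAINING = {
    "Plate tectonics": {"subduction", "continental drift", "faults"},
    "Volcanoes": {"eruption", "lava", "magma"},
}

def assign_lcsh(keyphrases, training=TRAINING):
    """Return headings ranked by Jaccard overlap with the keyphrases."""
    kp = set(keyphrases)
    scored = []
    for heading, terms in training.items():
        score = len(kp & terms) / len(kp | terms)
        if score > 0:
            scored.append((score, heading))
    return [heading for _, heading in sorted(scored, reverse=True)]

print(assign_lcsh(["lava", "magma", "eruption"]))   # ['Volcanoes']
```

The surrogate keyphrase set plays the role the article assigns it: it stands in for the resource, and the controlled heading is chosen by the learned (here, merely counted) correspondence between free and controlled vocabularies.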

Conclusion

In summary, the iVia and Data Fountains open source systems function, from their foundations


Figure 4 iVia metadata generation

up, as model IPDVLC communityware. The INFOMINE virtual library service is a cooperatively built Internet finding tool for its collaborators and their end-users. The Data Fountains Internet resource identification and metadata generation service will produce metadata products for use by the full spectrum of IPDVLCs. All these systems and services are the fruit of, and based in, collaborative and participatory technologies that will help libraries and others to cooperate better, to serve the information needs of their users better, and to scale better, given the magnitude of the new information environments.

Notes

1 INFOMINE, available at: http://infomine.ucr.edu/
2 iVia, available at: http://infomine.ucr.edu/iVia/
3 Data Fountains, available at: http://infomine.ucr.edu/Data_Fountains/
4 California State University, Sacramento, Scholarly Internet Resources, available at: http://infomine.ucr.edu/csus/
5 North Carolina State University, MyLibrary, available at: http://my.lib.ncsu.edu/


References

Chakrabarti, S. (2001), "Integrating the document object model with hyperlinks for enhanced topic distillation and information extraction", WWW 10, Hong Kong, May 2001, available at: www10.org/cdrom/papers/489/
Chakrabarti, S. (2003), Mining the Web: Discovering Knowledge from Hypertext, Morgan Kaufmann, San Francisco, CA.
Chakrabarti, S. et al. (2002), "Accelerated focused crawling through online relevance feedback", WWW 2002, Honolulu, HI, available at: www2002.org/CDROM/refereed/336/
Dempsey, L. (2003), "The recombinant library: portals and people", Journal of Library Administration, available at: www.oclc.org/research/staff/dempsey/dempsey_recombinant_library/
Eibe, F. and Paynter, G.W. (2003), "Predicting Library of Congress classifications from Library of Congress subject headings", Journal of the American Society for Information Science, Vol. 55 No. 3, available at: www.asis.org/Publications/JASIS/vol55n03.html
Flake, G. et al. (2002), "Self-organization and identification of Web communities", IEEE Computer, Vol. 35 No. 3, available at: computer.org/computer/co2002/r3066abs.htm
Ghani, R. and Jones, R. (2002), "A comparison of efficacy and assumptions of bootstrapping algorithms for training information extraction systems", paper presented at the LREC 2002 Workshop on Linguistic Knowledge Acquisition and Representation: Bootstrapping Annotated Language Data, available at: www-2.cs.cmu.edu/~rosie/papers/ghanijoneslrec2002.pdf
Glover, E.J. et al. (2002), "Using Web structure for classifying and describing Web pages", WWW 2002, Honolulu, HI, 7-11 May, available at: www2002.org/CDROM/refereed/504/index.html
Jones, R., Ghani, R., Mitchell, T. and Riloff, E. (2003), "Active learning for information extraction with multiple view feature sets", paper presented at the ECML-03 Workshop on Adaptive Text Extraction and Mining, available at: www.cs.utah.edu/~riloff/psfiles/ecml-wkshp03.pdf
Kerne, A. and Sundaram, V. (2003), "A recombinant information space", paper presented at COSIGN 2003: Computational Semiotics, University of Teesside, Middlesbrough, September 2003, available at: www.cosignconference.org/cosign2003/papers/Kerne.pdf
Lynch, C.A. (2001), "Metadata harvesting and the Open Archives Initiative", ARL Bimonthly Report 217, available at: www.arl.org/newsltr/217/mhp.html
Mason, J. et al. (2000), "INFOMINE: promising directions in virtual library development", First Monday, Vol. 5 No. 6, available at: www.firstmonday.dk/issues/issue5_6/mason/index.html
Menczer, F. (2004), "Correlated topologies in citation networks and the Web", working paper, available at: www.informatics.indiana.edu/fil/Papers/webtopologies.pdf
Menczer, F., Pant, G. and Srinivasan, P. (2004), "Topical Web crawlers: evaluating adaptive algorithms", ACM Transactions on Internet Technology (TOIT), available at: www.informatics.indiana.edu/fil/Papers/TOIT.pdf
Mitchell, S. et al. (2003), "iVia open source virtual library system", D-Lib Magazine, Vol. 9 No. 1, available at: www.dlib.org/dlib/january03/mitchell/01mitchell.html
Nicholson, D. (2002), "Sketch for an automated approach to 'Scoping Ahead' in Digital Scotland and the DNER based on distributed 'Collection Strength' indices", Annexe A.4 in Final Report of the RSLP SCONE Project, available at: http://scone.strath.ac.uk/FinalReport/SCONEFPNXA4.pdf
Park, S.B. and Zhang, B.T. (2003), "Large scale unstructured document classification using unlabeled data and syntactic information", Lecture Notes in Artificial Intelligence, Vol. 2637, pp. 88-99, available at: http://bi.snu.ac.kr/Publications/Journals/International/LNAI2637_Park.pdf
Rennie, J. and McCallum, A. (1999), "Using reinforcement learning to spider the Web efficiently", Proceedings of the Sixteenth International Conference on Machine Learning, available at: www.cs.cmu.edu/~mccallum/papers/rlspidericml99s.ps.gz
Rennie, J., Shih, L., Teevan, J. and Karger, D.R. (2003), "Tackling the poor assumptions of naive Bayes text classifiers", paper presented at the Twentieth International Conference on Machine Learning (ICML 2003), available at: http://haystack.lcs.mit.edu/papers/rennie.icml03.pdf
Sarawagi, S., Chakrabarti, S. and Godbole, S. (2003), "Cross-training: learning probabilistic mappings between topics", Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Washington, DC, 24-27 August, available at: www.cs.berkeley.edu/~soumen/doc/sigkdd2003/sigkdd2003.pdf
Shih, L. et al. (2002), "Not too hot, not too cold: the bundled-SVM is just right!", Proceedings of the ICML-2002 Workshop on Text Learning, available at: www.ai.mit.edu/people/jrennie/papers/icml02-bundled.pdf
Thelen, M. and Riloff, E. (2002), "A bootstrapping method for learning semantic lexicons using extraction pattern contexts", Proceedings of the 2002 Conference on Empirical Methods in Natural Language Processing, available at: www.cs.utah.edu/~riloff/psfiles/emnlp02-thelen.pdf
Wake, S. and Nicholson, D. (2001), "HILT – High-Level Thesaurus project: building consensus for interoperable subject access across communities", D-Lib Magazine, Vol. 7 No. 9, available at: www.dlib.org/dlib/september01/wake/09wake.html
Xerox Palo Alto Research Center (PARC) (2001), The Speakeasy Research Project, available at: www2.parc.com/csl/projects/speakeasy/
Yu, H., Yang, J. and Han, J. (2003), "Classifying large data sets using SVM with hierarchical clusters", Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Washington, DC, available at: http://citeseer.nj.nec.com/yu03classifying.html


Search and discovery across collections: the IMLS digital collections and content project

Timothy W. Cole and Sarah L. Shreeves

The authors
Timothy W. Cole is Mathematics Librarian and Professor of Library Administration at the University of Illinois, Urbana, USA. Sarah L. Shreeves is Project Coordinator and Assistant Professor of Library Administration at the University of Illinois, Urbana, USA.

Keywords
Collecting, Information retrieval, Digital storage, Grants

Abstract
In the fall of 2002, the University of Illinois Library at Urbana-Champaign received a grant from the Institute of Museum and Library Services (IMLS) to implement a collection registry and item-level metadata repository for digital collections and content created by or associated with projects funded under the IMLS National Leadership Grant (NLG) program. When built, the registry and metadata repository will facilitate retrieval of information about digital content related to past and present NLG projects. The process of creating these services is also allowing us to research and gain insight into the many issues associated with implementing such services and the magnitude of the potential benefit and utility of such services as a way to connect, bring together, and make more visible a broad range of heterogeneous digital content. This paper describes the genesis of the project, the rationale for architectural design decisions, challenges faced, and our progress to date.

Electronic access
The Emerald Research Register for this journal is available at www.emeraldinsight.com/researchregister. The current issue and full text archive of this journal is available at www.emeraldinsight.com/0737-8831.htm

Library Hi Tech, Volume 22 · Number 3 · 2004 · pp. 307-322
© Emerald Group Publishing Limited · ISSN 0737-8831 · DOI 10.1108/07378830410560107

Introduction

The World Wide Web offers cultural heritage institutions opportunities to enhance end-user services and reach larger and more widely distributed constituencies. Over the past few years there has been an explosion in the number of online information resources implemented by museums, libraries, archives, historical societies, and other cultural heritage institutions as they attempt to exploit more aggressively the potential of the Web. The benefit of having a rich diversity of quality, authoritative information available online is clear, but the magnitude of that benefit is tempered for many end-users by the difficulty of locating specific, desired information resources within the almost overwhelming aggregation of information now available. Every week there is more useful information available to find, but every week the amount of information that must be sorted through to find a specific piece of information grows as well (Lyman and Varian, 2003). In addition, much of the information is "hidden" or "invisible", i.e. held in databases and other locations less accessible to Web search engines (Sherman and Price, 2003). The community continues to struggle to develop new techniques for managing the glut of information and to transform traditional methods of curation and librarianship in order to better organize available digital information in aggregate and make it easier for end-users to find the specific online information they want and need to answer specific questions.
In 2001, the Institute of Museum and Library Services (IMLS)[1] commissioned a Digital Library Forum to "discuss the implementation and management of networked digital libraries, including issues of infrastructure, metadata, thesauri and other vocabularies, and content enrichment such as curriculum materials and teacher guides" (Institute of Museum and Library Services, 2001). In particular, the IMLS asked the Forum members to examine and comment on the opportunities for bringing the rich collections created with IMLS funding into digital libraries of national scope, an exemplar of which was (and is) the National Science Foundation's National Science Digital Library (NSDL)[2]. The report of the Forum, developed with significant input from several NSDL participants, included general recommendations to the IMLS as well as specific recommendations for projects funded by IMLS. The IMLS Forum also developed and promulgated a Framework of Guidance for Building Good Digital Collections (Cole, 2002), which has since been adopted by the National Information Standards Organization (NISO)[3]. Among the general suggestions to IMLS, the Forum recommended that IMLS "should maintain its own registry of funded digital collections" (Institute of Museum and Library Services, 2001). Acting on this recommendation and on other input, the IMLS in the fall of 2002 funded the University of Illinois Library at Urbana-Champaign to research, design, develop, and demonstrate a pilot implementation of a collection registry and item-level metadata repository based on the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH)[4] to hold information describing digital collections and content created by or directly associated with National Leadership Grant (NLG) projects funded by IMLS since the inception of the NLG program in 1998. The Illinois research and demonstration project, now at its midpoint, is the subject of this paper. We describe in order the motivations for and objectives of our project, specific high-level architecture design decisions made, the nature of the challenges we have encountered, and our accomplishments to date. Along the way we discuss lessons learned to date and consider the relevance of this project to similar work going on elsewhere in the US and in Europe. We conclude with a brief discussion of open issues and planned work through the rest of our project.

(Received 25 May 2004; revised 2 June 2004; accepted 14 June 2004)

Project rationale, goals, and objectives

In undertaking to provide a collection registry and item-level metadata repository for digital collections and content associated with IMLS NLG projects, there are two levels of motivation. On the one hand, a NLG registry and metadata repository can serve several immediate needs and parochial interests of the IMLS. In its final report the IMLS Digital Library Forum suggested that a registry of IMLS funded digital content "can aid grant applicants looking for models and practical examples of acceptable practice, can help further the sense of community among past and present awardees, and can provide a mechanism for identifying collections with various features (for example, those existing collections which might be appropriate for future inclusion in the NSDL)" (Institute of Museum and Library Services, 2001). To that we would also add that a NLG registry and metadata repository can provide a more comprehensive view of some of the best products and outcomes of the IMLS NLG program, which in turn would be useful for those in the field and in the general public looking for a single entry point through which they can learn more about the scope and accomplishments of IMLS funded digitization and content management programs.

But our project is also a NLG project in its own right, and so is designed to demonstrate more generally certain infrastructure components potentially useful for building dispersed and dynamic user-centric digital library services. In this regard, our project is informed by, and seeks to test in practice, certain assumptions and hypotheses about the organization of digital content and the way in which such content can be shared and accessed effectively. The Framework of Guidance for Building Good Digital Collections draws an implicit distinction between digital collections and the value-added services which define digital libraries. The Framework articulates, in one of its principles for "good" digital collections, the definition that "a good [digital] collection fits into the larger context of significant related national and international digital library initiatives." The Framework expands on this statement of principle in its introduction:

In today's digital environment, the context of content is a vast international network of digital materials and services. Objects, metadata and collections should be viewed not only within the context of the projects that created them but as building blocks that others can reuse, repackage, and build services upon. Indicators of goodness correspondingly must now also emphasize factors contributing to interoperability, reusability, persistence, verification and documentation (National Information Standards Organization, 2004).

This view of information collections and objects as reusable building blocks or "recombinant" components of broader information systems (Seaman, 2003), and the distinction between digital collections and information objects on the one hand and the value-added services that access and make use of them on the other (Dempsey, 2003; Lynch, 2002), is consistent with a model of collections suggested by earlier researchers which describes bodies of information content as "information landscapes". As defined by Michael Heaney:


The information landscape can be seen as a contour map in which there are mountains, hillocks, valleys, plains and plateaux. . . A specialized collection of particular importance is like a sharp peak. Upon a plateau there might be undulations representing strengths and weaknesses. . . The landscape is, however, multidimensional. Where one scholar may see a peak another may see a trough. The task is to devise mapping conventions which enable scholars to read the map of the landscape fruitfully, at the appropriate level of generality or specificity. (Heaney, 2000).


If these views of scholarly collections and content are correct, an essential role of digital libraries must be to offer the value-added services that provide the dynamic mapping functions described by Heaney, so as to allow scholars and students to view the information landscapes they encounter in the ways most useful to each individual. We hypothesize that infrastructure components such as collection registries and item-level metadata aggregations, assuming they are populated with collection-level and item-level descriptive metadata records of adequate quality, can support these essential classes of digital library services. It is our view that such infrastructure components have the potential to facilitate the reuse of digital content in new and different ways – by enabling more effective search and discovery across multiple collections, and by supporting the kinds of dynamic mapping between collections, and among and between individual information objects, that will allow communities of scholarly interest to view an information landscape as best meets their needs. Thus, a second rationale for our project is to create a testbed suitable for examining this hypothesis and the degree to which it might be valid within selected communities of interest. As discussed below, other projects are also looking at similar models and hypotheses, but not all researchers in the community favor collection registries and heterogeneous metadata aggregations as cost-effective ways to map the information landscape. Standardized approaches to collection-level description in particular have not been well explored or tested in the US as a piece of the cross-domain resource discovery puzzle.
At one end of the spectrum of opinion there are concerns that approaches relying on ad hoc connections between relatively small dispersed collections, and on dynamic recombinant approaches for associating widely dispersed and heterogeneous information objects, lack the scale, sophistication, and tight coupling to specific target audiences necessary to sustain digital collections and make them truly useful in a scholarly context. In First Monday, Donald Waters of the Mellon Foundation, summarizing in narrative form the main points of his presentation at the 2004 Web-Wise meeting, suggests that "There is as yet on the horizon no real substitute for the vision, discipline, and commitment needed to build digital collections at a scale and level of generality that will attract a broad audience of users and have such an impact on scholarship that their disappearance is not an option." Waters goes on to express explicitly his concern that ad hoc collection registries and metadata repositories over heterogeneous collections will not be adequate and sufficiently persistent to support scholarship over the long term (Waters, 2004).

At the other extreme are proponents of services like Google, which assume the ad hoc reusability of content but (currently at least) are at best ambivalent about the need for any special accommodation for reuse and repurposing, such as the creation of quality collection-level and item-level descriptive metadata. Google at present makes essentially no explicit use of manually generated descriptive metadata or collection-level description, relying instead on brute computing force and free-text keyword indices and queries to provide search services over heterogeneous, disorganized full content (or partial full content). While there are indications of change here – search engine system designers are showing renewed interest in metadata and are undertaking new initiatives to expose pieces of the "hidden Web", particularly those of interest to researchers, as recent collaborations between Google and DSpace, and between Yahoo and OAIster, demonstrate (Young, 2004; Suber, 2004) – this school of thought assumes that most traditional metadata paradigms are superseded by information retrieval operations over full content and in any event do not scale well enough to be useful in the Web environment.

Clearly both of these contrary perspectives are correct in some contexts. The question is whether there exist contexts and real-world use cases that fall in a middle-ground niche between these two extremes. Are there information needs in practice that are not met by Google-like approaches but for which large-scale (and accordingly high-cost) monolithic digital library solutions of the sort envisioned by Waters would be overkill? Are there in fact information needs in practice that can be well enough met by services built over ad hoc collection registries and item-level metadata aggregations? Through the implementation of generic formal and informal standards for sharing collection information and item-level descriptions, can communities of interest build effective and useful digital library services across distributed collections of digital content developed originally for diverse audiences and with diverse intended purposes? And is the time and effort spent on achieving this middle-road capability worth it? The answers are not immediately obvious. Our current effort is not of sufficient scope to answer these questions fully, but it is an explicit goal of the IMLS Digital Collections and Content (DCC) project[5] to make progress on this point and at least offer a contribution to furthering our understanding of the potential utility of general-purpose collection-level and item-level metadata in the implementation of search and discovery



services across heterogeneous digital collections. By constructing and investigating the utility of a collection registry and metadata aggregation for IMLS NLG digital collections and content, we hope to provide at least anecdotal evidence pertinent to these issues. To accomplish these goals, as well as to meet the immediate needs of the IMLS NLG community for a shared content registry, we identified several intermediate project objectives prior to the start of our work a year and a half ago:
• Survey IMLS grantees to establish a baseline of current practice, attitudes towards metadata, and technical readiness to implement OAI-PMH.
• Define a collection-level metadata schema for the collection registry; concurrently define initial models for searching and browsing of the collection registry.
• Make available software and provide technical advice to encourage and facilitate grantee implementations of OAI-PMH.
• Implement a working and updatable collection registry; target participation 90 percent.
• Harvest grantee metadata and implement a search service across the harvested aggregation of metadata; target participation 50 percent.
• Analyze the quality and consistency of harvested item-level metadata from the perspective of usefulness for interoperability.
• Investigate the research question: "How can resource developers best represent collections and items to meet the needs of service providers and end-users?"
• Test the usefulness of the collection registry and item-level metadata aggregation with selected user populations.
• Report on observations and issues regarding barriers to interoperability, the potential for useful and marketable digital library services built on ad hoc collection registries and item-level repositories, and the challenges and prerequisites for production implementations of registries and repositories.

Near the end of the project, based in part on our findings, the IMLS will decide whether to migrate the prototypes of the collection registry and metadata repository we develop into permanent, production services.

The IMLS DCC project is a collaboration between the University of Illinois Library and the University of Illinois Graduate School of Library and Information Science (GSLIS). The focus of the library project team is on the implementation of the collection registry and the item-level metadata repository, and draws on an extensive background in digital library infrastructure work, particularly with the OAI protocol. Timothy W. Cole, Mathematics Librarian, is the Principal Investigator (PI), and the co-PIs within the UIUC Library are William Mischo, Engineering Librarian, and Nuala Koetter, Interim Head of the Digital Media and Technology Initiative. The focus of the GSLIS project team is on the research question mentioned above: "How can resource developers best represent collections and items to meet the needs of service providers and end-users?" To this end Associate Professors Carole Palmer and Michael Twidale, in conjunction with Research Assistants Ellen Knutson and Besiki Stvilia, are conducting interviews with IMLS NLG recipients and assessing metadata quality and use issues within the context of the local environment as well as within aggregations. This paper does not address the GSLIS research directly, but preliminary reports on this work are available (Knutson et al., 2003; Palmer and Knutson, n.d.).

Top-level architecture decisions

To create working prototypes of an IMLS collection registry and item-level metadata repository, we needed to make early design decisions in two critical areas:
(1) selection of a model for cross-collection searching of item-level metadata; and
(2) selection of a model for the collection registry, specifically which entities would be described and included in the registry.

Cross-collection searching of item-level metadata

The first decision, to use an OAI-PMH based metadata harvesting approach for collecting, aggregating, and searching item-level metadata, was explicitly required by the terms of the IMLS request for proposals (RFP) for this project, reflecting an assessment by IMLS that OAI-PMH is appropriate for the project's objectives and practicable within project constraints. Based on our prior experience with OAI-PMH, we concurred in this assessment (Shreeves et al., 2003). From the perspective of the IMLS, OAI-PMH offers a low-barrier approach to metadata sharing that is technically within reach of at least most NLG projects developing digital content. Both turnkey commercial and Open Source OAI-PMH solutions are available. Technical barriers have been further lowered with the recent addition of recommended guidelines for OAI Static Repositories and Gateways[6], and new work is now underway to create a module for Open Source



Apache Web servers that would automatically export via OAI-PMH the Web page metadata contained in HTML <meta> tags[7]. Though unfamiliar to some classes of cultural heritage institutions, the metadata harvesting model of OAI-PMH follows naturally from union catalog traditions in the library domain, and is conceptually congruent with the way large, well-established library cataloging utilities such as OCLC aggregate metadata for print content. While broadcast search approaches by definition are not designed to aggregate metadata locally on a single server, such approaches can be and have been used to create virtual metadata aggregations and support cross-collection searching (e.g. the initial Dienst/NCSTRL implementation, which supported one-stop searching of metadata records describing university computer science reports issued by institutions from across the globe (Davis and Lagoze, 2000)). As compared to broadcast search approaches, the OAI-PMH harvesting and aggregation approach offers net pluses. Like OAI-PMH, broadcast search models assume widely distributed primary content, and most broadcast approaches rely on metadata for search and discovery. The primary difference is that broadcast search approaches rely on real-time, simultaneous processing of end-user search requests by all content providers sharing content. This approach is technically challenging for smaller content providers and does not scale well in heterogeneous computing environments as the number of participating content providers grows. Broadcast search is only as reliable as the least reliable content provider in the group. Often in broadcast search models there is also divergence in how search semantics are interpreted across a heterogeneous union of content providers. Metadata aggregation approaches like OAI-PMH allow the harvester to normalize and enrich the aggregated metadata, and help ensure a more uniform and consistent search across the full catalog of metadata being shared. Aggregators can also more easily analyze the full body of metadata being made available, thereby providing useful and more complete feedback to content providers about the consistency and quality of their metadata (at least in terms of its utility for interoperability). For all these reasons OAI-PMH made sense for this project as the preferred model for cross-collection searching of item-level metadata.
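To make the harvesting model concrete, the sketch below parses a minimal, hand-made OAI-PMH ListRecords response in the oai_dc format and extracts the Dublin Core titles, roughly the first step an aggregator performs before normalizing and indexing harvested records. The sample record, identifier, and function name are invented; a real harvester would fetch such XML from a provider's OAI-PMH endpoint and handle resumption tokens.

```python
import xml.etree.ElementTree as ET

# A minimal, hand-made ListRecords response (one record, oai_dc format),
# standing in for what a request with verb=ListRecords and
# metadataPrefix=oai_dc would return from a data provider.
RESPONSE = """<?xml version="1.0"?>
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:example.org:1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Whitman manuscript leaf</dc:title>
          <dc:subject>Poetry</dc:subject>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""

NS = {"dc": "http://purl.org/dc/elements/1.1/"}

def harvested_titles(xml_text):
    """Pull dc:title values out of a ListRecords response, as an
    aggregator would before normalizing and indexing them."""
    root = ET.fromstring(xml_text)
    return [el.text for el in root.findall(".//dc:title", NS)]

print(harvested_titles(RESPONSE))  # ['Whitman manuscript leaf']
```

Because all harvested records arrive as XML in a known metadata format, the aggregator can apply normalization and enrichment at this point, which is precisely the advantage over broadcast search noted above.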

We initially equated NLG projects with collections; that is, we assumed our NLG collection registry would simultaneously be a NLG project registry. This was an oversimplification. Closer investigation and input from the project steering committee made it clear that this approach led to confusion. How were we to deal with projects that involved multiple collections? How were we to deal with collections that were developed over the course of multiple projects? How were we to deal with situations where collection description attributes were not congruent with project description attributes – for instance, where the project administering institution was not the same as the collection owning institution? To resolve these issues we quickly moved towards a registry model which distinguished collections from projects. The primary entity in our registry is now explicitly the collection. Projects and other entities (e.g. related collections and agents) are maintained as separate entities and only described as necessary to establish their linkage and relation to a collection(s). The decoupling in our registry scheme of NLG projects and the collections to which they are related represented an important (although perhaps obvious in retrospect) design decision. Often times a NLG project’s primary goals are not the creation of a digital collection; instead they are training or collaboration or development of infrastructure. The digital collection created as a result of these activities is an important, but not fundamental end result of the project. Equating the digital collection with the IMLS NLG project would be misleading at best. An IMLS NLG project registry, though potentially a valuable resource and a recognized need within the IMLS community (such a registry was mentioned several times during the IMLS-sponsored “Digital Resources for Cultural Heritage: Current Status, Future Needs: A Strategic Assessment Workshop” held in August 2003) is not a goal for our project. 
This brought us to the question of what is a collection within the context of an IMLS collection registry. That the collection was digital and was created or developed with at least some IMLS NLG funding are two givens. Beyond that the definition of “collection” runs a wide gamut. Definitions vary from broad (“any aggregation of individual items”, including an aggregation of one, based upon almost any criteria (Johnston and Robinson, 2002)) to specific (information environments which facilitate information seeking by providing a context for resources selected and organized with a particular focus on the user (Lee, 2000)). Our particular question is not unique. Hill et al. have documented the struggles

Collection registry model Our design decisions for the collection registry centered on two distinct issues: . What are the entities to be included in the collection registry? . How will those entities be described?

311

Search and discovery across collections

Library Hi Tech

Timothy W. Cole and Sarah L. Shreeves

Volume 22 · Number 3 · 2004 · 307–322

of the Alexandria Digital Library team to define a digital collection (Hill et al., 1999). Following from this research and discussions within the project team, we determined some necessarily broad criteria for inclusion of collections in the registry. In addition to the requirements mentioned above, collections were also to be:
. cohesive (whether by topic area, type of material, etc.);
. searchable as a distinct collection; and
. available through a unique point of entry (i.e. a unique URL).

A collection could have multiple sub-collections, provided these meet the same criteria. The last criterion is largely practical and based upon the following user scenario. Imagine that a large collection has multiple sub-collections without distinct URLs. If a search retrieves several of these sub-collections but the entry point is always the same top-level URL, a user may not understand the distinction between the various sub-collections. Requiring a unique URL helps eliminate the confusion of being directed to the same URL multiple times. Once we had decided what to describe, we faced the natural follow-on: how to describe collections. We began by surveying what work had already been done on collection description. The use of systematic, standardized collection-level description, or collection-level metadata, for digital content is not very common in the US except in the domain of archives, where the Encoded Archival Description (EAD) is used to mark up finding aids. Archival finding aids are what Heaney calls "hierarchic finding-aids", i.e. the collection description contains information about the collection as a whole as well as information about the individual items within the collection. Because the IMLS DCC project does not aim to provide both levels of description in a single registry, and because the creation of finding aids, whether in EAD or not, is a resource-intensive enterprise, the use of EAD as the internal collection-level description schema of our registry was discarded as an option. Much work, however, has been done in the UK on "unitary finding aids", or collection-level descriptions that contain only information about the collection as a whole and not about the individual items within it. The Research Support Libraries Programme (RSLP) Collection Description Schema (hereafter referred to as the RSLP CD schema) contains descriptive attributes about a collection, its location, agents associated with the collection, and relationships with internal or external collections[8]. The RSLP CD schema is well documented and has been implemented – often with some modifications – by RSLP projects throughout the UK[9]. However, it has not been well tested for use in describing digital collections. The Dublin Core (DC) collection description application profile[10], currently in development, is based heavily on the RSLP CD schema, but has been adapted and somewhat simplified for digital collections. For instance, it does not attempt to describe the location(s) or agent(s) associated with a collection. We also spent some time examining the metadata schemas used in large, active collection registries such as Cornucopia, a database of museum collections in the UK[11], the NSDL[2], and EnrichUK[12], a registry of collections created through the New Opportunities Fund in the UK. After an analysis of these registries and schemas, as well as discussions with the authors and maintainers of the RSLP CD schema, we determined that an adaptation of both the RSLP CD schema and the DC Collection Description Application Profile would best fit our needs. We discuss the further development of the IMLS DCC Collection Description Metadata Schema below.
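A unitary, collection-level description of the kind such a schema supports might look like the following sketch. The element names loosely echo RSLP CD and DC Collection Description terms, but they are illustrative only; the finalized IMLS DCC element set differs in its details, as does the sample record.

```python
# Hypothetical collection-level ("unitary") description: information about the
# collection as a whole, with no item-level entries. Element names and the
# record itself are invented for illustration.
record = {
    "Title": "Voices of the Prairie Oral Histories",
    "Description": "Digitized oral history interviews from rural Illinois, 1930-1960.",
    "Subject": ["Oral histories", "Rural life"],
    "AccessPoint": "http://example.org/voices/",   # the unique-URL criterion above
    "HostingInstitution": "Example State Library",
    "AssociatedProject": "Voices of the Prairie NLG project",
}

# A plausible (again, illustrative) set of required elements for registry entry.
REQUIRED = {"Title", "Description", "AccessPoint", "HostingInstitution"}

def missing_elements(rec):
    """Return the required collection-description elements absent from a record."""
    return sorted(REQUIRED - set(rec))

print(missing_elements(record))          # complete record
print(missing_elements({"Title": "x"}))  # record missing several required elements
```

A registry entry/edit form would run a check of this kind before accepting a submitted collection record.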

Challenges faced

Three significant challenges we encountered early on in the implementation of the IMLS collection registry and item-level metadata repository were:
(1) the heterogeneity of the IMLS-funded digital collections and content;
(2) issues of metadata quality and consistency; and
(3) the wide range of readiness, willingness, and technical capabilities among the NLG projects for implementing the OAI protocol.

Heterogeneity of IMLS funded digital collections and content
When the Illinois team was awarded the grant, IMLS provided us with the grant proposals of all NLGs with digital content funded from 1998 through 2002. These proposals allowed us to document, at least to first order, key characteristics across a range of 95 NLG projects. Specifically, we used the proposals to identify institutions involved, project goals, collections created, content digitized or created, descriptive metadata schemes used, and technical specifications such as the content management system and whether an OAI data provider had already been planned or implemented. This information was updated and supplemented through a survey distributed in September 2003 to


92 PIs representing 94 NLG projects with digital content[13]. We identified five non-active projects through the survey or other communications, leaving a survey population of 87. Our return rate was 76 percent. (The survey was sent to an additional 27 recipients of 2003 National Leadership Grants and to non-respondents from the 1998-2002 pool in early May 2004; the information below refers only to the first round sent to the 1998-2002 NLG pool.) The results of the survey and grant proposal analysis provide evidence of a diverse universe of IMLS funded digital collections and content. We found in particular:
. a wide diversity of institutions and collaborations;
. many different types of digital collections and sub-collections; and
. a broad range of item-level metadata schemas and controlled vocabularies in use.

Each of these characteristics of the population of collections and content for our project represents a challenge that must be addressed.

Diversity of institutions and collaborations
Of the 95 NLG grant proposals examined, over half (54 percent) were collaborative efforts between multiple institutions. Including these collaborative partners, we identified at least 237 distinct institutions from the grant proposals alone; however, after incorporating survey results and creating 84 preliminary collection registry entries, we actually documented 330 distinct institutions which have contributed to the digital collections described in our registry. Many of these contributors were not recorded on the grant proposal. The types of institutions range from large academic libraries with established digital library programs to small historical societies with little or no expertise in digital content creation. Figure 1 shows the types and numbers of institutions involved in the creation of the digital content as indicated in the grant proposals. The diversity of institutions – particularly within collaborative efforts – has an immediate, direct impact, as well as a more intangible impact, on our efforts. Our decision to enumerate each institution which contributes to a digital collection has meant that some collections, created through state-wide or broader collaborative efforts, are linked to literally 100 or more institutions. Collections could potentially have many sub-collections organized by contributing institution. These were considerations in the design of the database supporting the collection registry. At a more granular level, the item-level metadata often points back to the institution hosting the aggregate digital collection, rather than the actual contributor. Although this depends on how the metadata is created and mapped at the data provider end, it affects how we might link institutions to the content they created or contributed to a larger collection. The less tangible aspect of institutional diversity is in the world-views of the types of institutions represented here. Although all are broadly cultural heritage organizations, it is well recognized that museums, archives, and libraries each view the use and presentation of collections and content differently. Moreover, although we began the paper speaking about cultural heritage, NLGs are also awarded to scientific organizations such as zoological societies and herbariums. NLG-funded projects also often have specific uses in mind for the digital collections they create, such as use by the K-12 community or by specialists. These differences in perspective directly affect how collections, and particularly content, are described. Table I shows two metadata records exported in simple DC through the OAI protocol. Each describes a separate instance of the same World War II poster. The first metadata record is from a

Figure 1 Types of institutions represented in NLG projects (from 1998-2002 grant proposals only)


Table I Comparison of metadata records describing separate instances of the same object

Dublin core element | Record one (traditional library cataloging) | Record two (access for educators/students)
Title | Wanted! For murder: her careless talk costs lives | Wanted! For murder: her careless talk costs lives
Author | Keppler, Victor | Not used
Subject | World War, 1939-1945; US; Espionage | World War II; War posters, American; National security; World War, 1939-1945 – Social aspects – US
Description | "US Government Printing Office: 1944 – O-595600"; Woman's photograph; Poster promotes vigilance | Poster, b/w, 27.9 x 20 in., published by the US Government Printing Office; During wartime, concerns about national security increase, and World War II was no exception. This poster reminds citizens that sharing any military information, such as troop movements or other details, could help the enemy sabotage the war effort.
Coverage | Not used | World War II; 16 History; 14 Political Systems
Date | 1944 | 1944; 3-22-02
Rights | Subject to US and international copyright laws; please contact the owning repository | http://images.library.uiuc.edu/projects/tdc/conditions.htm
Language | Eng | Not used
Contributor | US Office of War Information | Not used
Type | Poster | Image
Format | Image/jpeg | Not used

large academic library and has been cataloged in a manner consistent with traditional library practice. The second metadata record is from a NLG project whose primary goal was to promote the use of digital content within the curriculum of elementary and middle school teachers through collaboration between the teachers and content creators. To this end, the metadata includes interpretive information and the learning standards (16 History, for example) to which the poster could belong. Aggregators of item-level metadata from diverse organizations have to find mechanisms to cope with metadata created for different use environments and to identify metadata records describing duplicate or closely related information objects.

Types of digital collections and sub-collections
As we began to examine the output of each of the NLG projects and as we received the survey results, we found that although most NLG projects created more or less traditional collections, albeit digital, a few of the collections were highly nontraditional, for instance: a multimedia exhibit that allows users to experience the oral history and visual images of a region simultaneously (Voices of the Colorado Plateau[14]), a Web site that actively tracks wildlife conservation efforts in the field (Field Trip Earth[15]), and digital art projects ("Banana", "Code City", and "Hard Place" at the

Lower East Side Tenement Museum[16]). Some of the more traditional digital collections included significant investment in peripheral material such as lesson plans, bibliographies, and contextual essays. The objects represented in these collections vary widely and include almost any type of material, from manuscripts to maps to data sets to artifacts. Our challenge was to develop a collection-level metadata schema that could describe this wide range of digital collections. Strategies we developed include adding descriptive fields, such as "Supplementary materials", to the IMLS DCC collection-level description schema. Sub-collections also represent a potential challenge. Early in our discussions, we decided that we would allow the inclusion in our collection registry of records describing sub-collections at one level down from the parent collection (i.e. sub-collections could not have children). In order to gauge the number of sub-collections that might be created, we asked in the survey whether the respondents had sub-collections, how many, and how these were organized. Seventy-six percent of the respondents reported that their collection was divided into sub-collections. Thirty-eight percent reported that they had between two and five sub-collections, while 22 percent reported that they had six to ten. Interestingly, a handful of respondents reported having many hundreds of sub-collections. In these cases the division was based on the subject headings used; every subject heading represented a


distinct sub-collection. Table II shows the organization of sub-collections. Note that 36 percent of the respondents reported organizing sub-collections on the basis of two or more factors. The challenge here is twofold. Again, the collection-level description metadata schema used must be robust enough to handle a variety of descriptions. The RSLP CD schema and the Dublin Core Collection Description Application Profile have proven generally satisfactory in this regard, though we did have to make a few small customizations[17]. The structure of the database must also handle a proliferation of sub-collections, and the registry display must communicate these structures and relationships to the user. This last requirement is especially difficult, and we are still working on ways to satisfy this need.

Table II Basis of sub-collection organization (results from survey of 1998-2002 NLG recipients)

Basis of sub-collection organization | Number (percent) of respondents with sub-collections
Administrative unit only | 6 (12)
Topic only | 10 (20)
Type of material only | 8 (16)
Other basis only | 8 (16)
Based on two factors:
Administrative unit and topic | 2 (4)
Administrative unit and type of material | 1 (2)
Administrative unit and other | 4 (8)
Topic and type of material | 5 (10)
Topic and other | 2 (4)
Based on three factors:
Topic, type of material, and administrative unit | 4 (8)

Notes: Other responses included: learning standards; grade level appropriateness; keywords; time-period; audience; donating individual or organization

Item-level metadata schemas and controlled vocabularies in use
Eighty-six percent of the survey respondents reported using item-level metadata to describe the resources within their collections. The metadata standards most often in use are DC (56 percent of respondents with item-level metadata) and MARC (33 percent). Other standards used include EAD, the Text Encoding Initiative (TEI) Header, Visual Resources Association (VRA) Core, Darwin Core, Making of America (MOA) 2, and the Taxonomic Data Working Group-Structure for Descriptive Data (TDWG-SDD). The diversity of item-level metadata in use by NLG projects is not surprising. Perhaps what is surprising is the number of respondents with item-level metadata who use locally developed schemas (39 percent) and the number who use multiple schemas (61 percent). Figure 2 shows the diversity of metadata standards (and non-standards) in use. It should be noted that not all of the digital content created through the NLG program has item-level metadata. Fourteen percent of the respondents reported not using descriptive metadata to describe the contents of their digital collection. These respondents for the most part had created collections that are not easily divisible into discrete items, such as multimedia exhibits, learning objects, or heavily integrated Web pages, and provide no search services for specific individual resources. We cannot, of course, include these collections in the item-level metadata repository, although they will be represented in the collection registry. The diversity of metadata schemas can pose a significant challenge for the implementation of OAI data provider services. The OAI protocol requires the provision of metadata in at least simple DC. In order to implement OAI data provider services, NLG projects need to map their native metadata schemas to simple DC. Crosswalking between metadata schemas is not a trivial process and can be a barrier to implementation, as many organizations are understandably reluctant to lose the complexity and semantic structure of their chosen metadata schema to the bluntness of simple DC. OAI-PMH supports the use of metadata schemas in addition to DC, and we continue to encourage implementers of OAI data provider services to provide their metadata in its native schema as well. This does, however, require (for validation purposes) that content providers implement or point to valid and correct XML schemas for all metadata formats other than DC that they export. Locating or creating such XML schemas is not necessarily a simple task, particularly when working from a unique, local metadata schema. Eighty-four percent of the respondents with item-level metadata reported using some form of controlled vocabulary in their item-level metadata. Table III identifies the most used controlled vocabularies for five types of values: subject, format, type, personal names, and geographic names. The diversity of controlled vocabularies and metadata schemas complicates the creation of an effective item-level metadata aggregation and has an impact on the utility of metadata for interoperability. This impact is additive to the impact of metadata quality and consistency generally within the sets of records contributed by each participating repository, as discussed in the next section.


Figure 2 Metadata schemas in use (results from survey of 1998-2002 NLG recipients)

Table III Most used controlled vocabularies for five value types (results from survey of 1998-2002 NLG recipients)

Element | Top three controlled vocabularies used (percent of respondents who identified a CV)
Subject | LCSH (73 percent); LC TGM I (27 percent); AAT (17 percent)
Format | LC TGM II (17 percent); AAT (10 percent); MIME types (8 percent); AACR2 (8 percent)
Type | LC TGM II (21 percent); DCMI Type (13 percent); AACR2 (10 percent)
Personal names | LC Name Authority File (67 percent)
Geographic names | LCSH (27 percent); LC Name Authority File (25 percent); Getty Thesaurus of Geographic Names (15 percent)
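The lossiness of crosswalking to simple DC can be seen in a small sketch. The native field names below are invented for illustration; the point is that several semantically distinct native fields collapse into a single unqualified DC element, and fields with no DC home drop out entirely.

```python
# Illustrative crosswalk: distinct native fields collapse into one unqualified
# DC element, losing the distinctions the native schema expressed.
CROSSWALK = {
    "main_title": "title",
    "topical_subject": "subject",
    "geographic_subject": "subject",   # distinction from topical subject is lost
    "curriculum_standard": "subject",  # learning-standard semantics are lost
    "date_created": "date",
    "date_digitized": "date",          # which date is which is no longer explicit
}

def to_simple_dc(native_record):
    """Map a native record to simple (unqualified) Dublin Core."""
    dc = {}
    for field, value in native_record.items():
        element = CROSSWALK.get(field)
        if element:  # fields with no mapping are dropped entirely
            dc.setdefault(element, []).append(value)
    return dc

native = {
    "main_title": "Wanted! For murder",
    "topical_subject": "National security",
    "geographic_subject": "United States",
    "curriculum_standard": "16 History",
    "date_created": "1944",
    "date_digitized": "2002-03-22",
    "physical_dimensions": "27.9 x 20 in.",  # no simple DC mapping here; dropped
}
dc = to_simple_dc(native)
print(dc["subject"])
print(dc["date"])
```

This is why exposing the native schema alongside simple DC via OAI-PMH is worth the extra effort of locating or writing an XML schema for validation.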

Metadata quality in aggregate
The challenges that are faced by OAI service providers (metadata harvesters) when aggregating metadata from multiple data providers are well documented (Shreeves et al., 2003; Halbert, 2003; Arms et al., 2003), and those facing the IMLS item-level metadata repository are no different. Briefly, some of the aggregation issues include:
. disparate and inconsistent use of DC elements;
. loss of information when providers map from more complex and expressive metadata schemas to simple DC;
. loss of browse capabilities due to the diversity of controlled vocabularies and encoding schemes being used;
. varying practice in granularity of description and in distinctions about what is described (e.g. the physical artifact photographed or the digital manifestation/surrogate of the physical artifact); and
. variations due to the broad range of types of material described.

What these issues illustrate is that the OAI community has yet to come to grips with what constitutes quality "shareable" metadata. While there has certainly been work done on best practices for specific communities or domains (the Western States Dublin Core Metadata Best Practices[18] and the Open Language Archives Community[19] are two examples), there has been little research into the key attributes or metrics of quality for "shareable" or "interoperable" metadata. It may be that metadata of high quality within a local context is of significantly lesser quality (at least in terms of utility) when taken out of its local context and aggregated with other metadata records. Just as libraries had to come to grips with these sorts of interoperability and quality issues when MARC records were shared via OCLC (Maciuszko, 1984), so too does the digital library community need to address these issues in the age of federated digital content. The GSLIS research team is currently investigating how to measure and assure metadata quality in aggregated digital collections. They are empirically examining the harvested metadata to develop systematic techniques for metadata quality assessment and assurance. We anticipate that this work not only will help content providers create metadata that is more useful in a shared context, but also will suggest ways in which OAI service providers can better normalize and/or enrich aggregated collections of metadata.
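Normalization of the kind just described can be sketched as a simple vocabulary-mapping pass over harvested values. The synonym table below is invented for illustration and would in practice be built and tuned per repository; the target terms are from the DCMI Type vocabulary.

```python
# A normalization pass: map free-text "type" values from many providers onto
# the DCMI Type vocabulary. The synonym table is illustrative only.
DCMI_TYPE_SYNONYMS = {
    "image": "Image", "photograph": "Image", "poster": "Image",
    "text": "Text", "manuscript": "Text",
    "sound": "Sound", "audio": "Sound",
    "physicalobject": "PhysicalObject", "artifact": "PhysicalObject",
}

def normalize_type(value):
    """Return a DCMI Type term for a provider's type value, or None if unmapped."""
    key = value.strip().lower().replace(" ", "")
    return DCMI_TYPE_SYNONYMS.get(key)

harvested = ["Poster", "photograph", "Manuscript", "Audio", "lantern slide"]
normalized = [normalize_type(v) for v in harvested]
print(normalized)
```

Unmapped values (here, "lantern slide") are exactly the cases that need repository-by-repository attention rather than an aggregation-wide rule.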


OAI capability and readiness
Based on proposal analysis and the results of our initial survey, Table IV gives a preliminary assessment of the capability of the original pool of 94 1998-2002 NLG projects to implement OAI data provider services. While 44 percent of NLG projects either have implemented or actively plan to implement the OAI protocol, 20 percent of respondents to the survey indicated that they had not heard of it. Beyond marketing the capabilities and potential of OAI through tutorials, presentations, and one-on-one conversations, we are also tracking why NLG projects might not be able or ready to implement data provider services. A preliminary review indicates that NLG projects may not be in a position to implement OAI data provider services for any of several reasons:
. There is no item-level metadata. This is true for many exhibit- and learning-object-focused projects.
. The collection is not yet public. NLG projects wish to wait until they unveil their digital collection before sharing the metadata.
. Infrastructure is not in place. The metadata may not be mapped into DC or stored in a manner that easily supports implementation of OAI data provider services, and the necessary technical expertise may not be available. (These are especially problems for projects that have fully expended their NLG grants and may have no other resources available to implement infrastructure enhancements.)
. The technical infrastructure is, or soon will be, in transition. NLG projects are reluctant to implement OAI provider services in the midst of a migration to a new content management system.
. Agreement has not yet been reached among all collaborators of a specific project to share that project's metadata via OAI.

Some of these barriers are insurmountable (we cannot harvest item-level metadata if there is none!), but we are actively working on ways to facilitate implementation of OAI by NLG grantees in other instances.

Accomplishments to date
At this, the mid-point of the IMLS DCC project, we have made progress on several fronts. Accomplishments to date are listed here and discussed in more detail below:
. creation of the IMLS DCC Collection Description Metadata Schema;
. development of a beta IMLS Digital Collection Registry and registry entry/edit forms;
. facilitation of the implementation of OAI data providers for NLG funded collections; and
. development of a beta repository for item-level metadata harvested from NLG funded digital collections.

IMLS DCC Collection Description Metadata Schema
The IMLS DCC Collection Description Metadata Schema[17], as mentioned above, is based on the RSLP Collection Description Metadata Schema and the Dublin Core Collection Description Application Profile. The IMLS DCC project has adapted these schemas to reflect the particular nature of the project and to incorporate the specific needs of our NLG collection registry. The resulting schema is meant to describe the digital collections created through IMLS funded NLG projects and does not describe the projects themselves in any detail. This metadata schema forms the basis of the IMLS NLG Collection Registry, currently in the beta phase of development. The following is meant only to give a cursory overview of the schema. There are four classes of entities described by the schema:
. collections, including both NLG collections and physical or digital collections associated with (related to) a NLG collection;

Table IV Breakdown of NLG recipients according to readiness to implement OAI data provider services

Category of 1998-2002 NLG recipients | Number (percent) of NLG projects
Group 1 – projects with OAI data provider sites for NLG content | 21 (22 percent)
Group 2 – projects whose institutions have an OAI implementation (not yet being used for NLG content) or projects that have explicitly expressed plans to add OAI functionality | 21 (22 percent)
Group 3 – projects who meet certain technical criteria – e.g. have item-level metadata and a maintained Web site | 23 (24 percent)
Group 4 – projects with no item-level metadata, no interest in providing metadata via OAI, or whose grants were given up | 13 (14 percent)
Unknown | 17 (18 percent)
Total | 95



. NLG projects associated with a NLG collection;
. institution(s) associated with a collection and/or a NLG project; and
. administrators of NLG collections.

A collection may have been created by multiple NLG projects and have multiple administrators. A collection may have only one hosting institution, but may have multiple contributing institutions. A collection may have multiple sub-collections, associated collections, or source physical collections. A NLG project may have only one administering institution, but may have multiple participating (or collaborating) institutions. Figure 3 below shows the relationships between these entities. The complete list of schema elements (i.e. entity descriptive attributes) is available on the project Web site. An XML schema definition (.xsd) file appropriate for validating collection description metadata records is currently being finalized and will be added to the Web site soon.
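The cardinality rules just stated can be expressed as a small entity model. The class and attribute names below are ours for illustration, not the schema's own element names, and the sample entities are invented.

```python
from dataclasses import dataclass, field
from typing import List

# Entity sketch following the cardinality rules above: exactly one hosting
# institution but many contributors per collection; exactly one administering
# institution but many participants per project; collections may span several
# projects and have sub-collections (one level down, per the registry policy).

@dataclass
class Institution:
    name: str

@dataclass
class Project:
    title: str
    administering_institution: Institution          # exactly one
    participating_institutions: List[Institution] = field(default_factory=list)

@dataclass
class Collection:
    title: str
    hosting_institution: Institution                # exactly one
    contributing_institutions: List[Institution] = field(default_factory=list)
    projects: List[Project] = field(default_factory=list)
    subcollections: List["Collection"] = field(default_factory=list)

host = Institution("Example University Library")
proj = Project("Example NLG Project", administering_institution=host)
coll = Collection("Example Digital Collection",
                  hosting_institution=host,
                  contributing_institutions=[Institution("County Historical Society")],
                  projects=[proj])
print(len(coll.projects), coll.hosting_institution.name)
```

Encoding the single-valued attributes as scalars and the multi-valued ones as lists makes the one-versus-many rules self-enforcing in the registry's data layer.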

IMLS digital collection registry
As mentioned above, several existing, active collection registries were examined for functionality, interface design, and metadata schema. In January 2003, we consulted with David Dawson of Resource: The Council for Museums, Archives & Libraries (now MLA) in the UK about Cornucopia and about plans to develop a registry for the NOF-Digitise project (UK) and the Minerva project (Europe). Through examination of these registries and conversations with David and others, we identified several functions to be included in the IMLS DCC registry. They include browsing by topic area, expressing relationships among collections (e.g. parent-child), and limiting searches by time period, geographic area, audience, and/or type of material. We also wanted the NLG projects to be able to edit their own collection registry records, so we examined several collection registry input forms, such as those used by the NSDL and RSLP, for design and functionality issues. Much of our development work thus far has been to design and test the database for the collection registry records and to design the registry entry/edit forms. We are currently in the last stages of iterative design of the registry entry/edit forms. In the winter of 2003/04, 84 collection records were created from the survey results, then edited and expanded with information gleaned from collection Web sites and other communications. A preliminary browse interface for the collection registry has been developed as well. This, however, will undergo several more iterations. A staff view (partial) of a collection registry record is shown in Figure 4.

Figure 3 Relationships between entities in the IMLS DCC collection description metadata schema

Facilitating implementation of OAI data providers
We have pursued several strategies for facilitating OAI-PMH implementation by IMLS NLG projects interested in participating in our IMLS DCC item-level aggregated metadata repository. While project funding has generally precluded on-site visits to implement OAI-PMH software, we have been able in a few cases to customize existing Open Source software solutions for use in specific grantee environments. In other cases we have been able to assist by exercising (testing) and vetting OAI-PMH implementations created by grantees. Often the implementations in these latter cases have been commercial turnkey solutions. Our testing has helped identify bugs and other possible issues or concerns with the implementation of those solutions in specific grantee environments. A few examples are given here to illustrate the nature of this phase of our activities. In February 2003, we completed remote installation of an OAI data provider service for the Colorado Digitization Program (CDP).
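The Illinois provider implementations described below are Java servlet applications, but the core task of any OAI data provider (assembling a namespaced XML response envelope around stored metadata) can be sketched briefly. This is a skeletal illustration only: a conforming response also carries responseDate, request, and datestamp elements, and the identifier and field values here are invented.

```python
import xml.etree.ElementTree as ET

# Namespace URIs from the OAI-PMH and simple DC specifications.
OAI = "http://www.openarchives.org/OAI/2.0/"
OAI_DC = "http://www.openarchives.org/OAI/2.0/oai_dc/"
DC = "http://purl.org/dc/elements/1.1/"

def get_record_response(identifier, dc_fields):
    """Assemble a skeletal GetRecord response around stored simple DC values."""
    root = ET.Element(f"{{{OAI}}}OAI-PMH")
    record = ET.SubElement(ET.SubElement(root, f"{{{OAI}}}GetRecord"),
                           f"{{{OAI}}}record")
    header = ET.SubElement(record, f"{{{OAI}}}header")
    ET.SubElement(header, f"{{{OAI}}}identifier").text = identifier
    metadata = ET.SubElement(record, f"{{{OAI}}}metadata")
    dc = ET.SubElement(metadata, f"{{{OAI_DC}}}dc")
    for element, value in dc_fields:
        ET.SubElement(dc, f"{{{DC}}}{element}").text = value
    return ET.tostring(root, encoding="unicode")

xml_out = get_record_response("oai:example.org:item-1",
                              [("title", "Wanted! For murder"), ("date", "1944")])
print(xml_out[:80])
```

A production provider wraps logic of this kind in a Web service (servlets, in the Illinois case) that parses the incoming verb and arguments and pulls the field values from the underlying metadata store.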
This service was implemented on top of the metadata storage infrastructure already in use by CDP, and did not require any changes to their existing metadata processing or workflow. The CDP service supports exporting metadata in a qualified DC schema, as well as in simple DC. The implementation took advantage of pre-existing Apache Web server and MySQL installations in place on the CDP servers. Tomcat extensions were added to the Apache installation to support the Java servlets that implement the actual OAI metadata provider protocol services. We customized an existing generic Java servlet Open Source OAI provider application we had previously developed on an


Figure 4 Browser screen showing partial record from IMLS DCC collection registry (beta version)

earlier project. (Generic versions of all the University of Illinois Library metadata provider implementations and associated XML schema definitions created as part of this work are available on SourceForge under UIUC/NCSA Open Source licensing[20].) In July 2003, we set up an OAI Static Repository[6] for the NLG project "American Natural Science in the First Half of the Nineteenth Century", based at the Academy of Natural Science. A recent development in the OAI protocol, designed for use with small, relatively static metadata collections, a static OAI repository is a single XML file which contains all repository metadata records and which sits on the metadata provider's existing Web server. A third party acts as a gateway through which an OAI service provider can then harvest individual metadata records contained in that static XML file. This obviates the need for the source metadata provider to implement a dynamic Web service. The project team worked with Eileen Mathias at the Academy

of Natural Science to map metadata from MARC records to simple DC and to produce a single XML file (with both MARC and DC records available for harvest), which is now available through an OAI Static Repository Gateway running on our servers at Illinois. The success of this implementation indicates that the static repository approach is a good solution for institutions lacking the technical infrastructure to implement new, dynamic Web services. In July 2003, we also worked with the Washington State Library to test harvest metadata from their CONTENTdm data provider service. CONTENTdm is a digital library management system with a built-in OAI data provider service. At that time, the then-current version of CONTENTdm (3.5) did not support resumption tokens. These are an optional feature of version 2.0 of the OAI protocol which aid in "flow control" by allowing a data provider to issue records in manageable chunks to a service provider, thus limiting the peak load on both systems. Although


Search and discovery across collections

Library Hi Tech

Timothy W. Cole and Sarah L. Shreeves

Volume 22 · Number 3 · 2004 · 307–322

optional, the implementation of resumption tokens is particularly important for large data providers. The Washington State Library repository proved too large to function reliably without resumption tokens. We examined other possible avenues for harvesting these records. We determined that dividing the metadata into smaller sets (a maximum of 10,000 records per set) could facilitate harvesting without flow control. We also developed a successful workaround in which we harvested records individually (using the OAI-PMH GetRecord verb instead of the more typically used ListRecords verb). While this workaround was slow, it put little to no stress on the Web server, and all metadata records were harvested successfully. However, based in part on feedback from us and Washington State, CONTENTdm has since implemented resumption tokens in its OAI provider module, improving robustness for large repositories using that software.

Lastly, we created an OAI data provider for the IMLS-funded Illinois Alive project. The Illinois Alive collection consists of a series of Web pages about Illinois history. DC metadata for each Web page is embedded in the HTML head element of that page. The IMLS DCC team developed a spider that crawls the Illinois Alive pages to collect the DC metadata and store it in a SQL database on one of our servers. The metadata is then exposed via the OAI protocol. Similar functionality (implemented in a simpler and more robust fashion) is expected from the in-progress project mentioned above to create an Open Source OAI-PMH extension module for Apache Web servers.
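The embedded-metadata approach lends itself to a very small harvester. The sketch below assumes nothing about the actual IMLS DCC spider (the class name and page content are invented for illustration); it simply shows how Dublin Core elements embedded as meta tags in an HTML head can be collected with Python's standard html.parser:

```python
from html.parser import HTMLParser

class DCMetaExtractor(HTMLParser):
    """Collect Dublin Core elements embedded as <meta> tags in an HTML head."""
    def __init__(self):
        super().__init__()
        self.dc = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = attributes.get("name", "")
        if name.lower().startswith("dc."):
            element = name[3:].lower()   # "DC.Title" -> "title"
            self.dc.setdefault(element, []).append(attributes.get("content", ""))

# A made-up page of the kind such a spider would fetch:
page = """<html><head>
<meta name="DC.Title" content="Early Illinois Railroads">
<meta name="DC.Creator" content="Smith, Jane">
<meta name="DC.Subject" content="Illinois -- History">
</head><body>...</body></html>"""

parser = DCMetaExtractor()
parser.feed(page)
print(parser.dc["title"])   # -> ['Early Illinois Railroads']
```

A production spider would additionally fetch each page over HTTP, follow links within the collection, and write the accumulated elements to the database backing the OAI data provider.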

Item-level metadata repository

To date, 87,537 item-level metadata records have been harvested from 20 IMLS NLG collections using the OAI-PMH. Initial harvesting has been done exclusively in the DC metadata format. Harvested records have been indexed in a Microsoft SQL database, and a preliminary, early beta version of a Web interface has been implemented to allow searching of the metadata aggregation. An illustrative search result screen from this preliminary interface is shown in Figure 5. Repositories are revisited every three weeks for incremental harvesting, and once every three months full re-harvests are done of each repository. Periodic full re-harvests are required because most OAI repositories from which we are harvesting for this project do not support the optional feature of the OAI protocol requiring providers to maintain in perpetuity a record of all metadata items ever deleted from their repository. To this point in the project, little normalization or augmentation of the harvested metadata records has been done.

Based on our preliminary inspection of the metadata harvested so far, we have identified several automated normalization and augmentation functions that will be implemented soon. Some normalization and augmentation will need to be done on a repository-by-repository basis, and some can be applied across the entire aggregation. We anticipate that the systematic analysis of metadata quality and consistency currently being performed by our GSLIS colleagues will suggest additional normalization and augmentation functions. We also anticipate that the output of metadata normalization and augmentation processes will need to be stored (and indexed), internally at least, in a more expressive metadata schema than simple DC. We are currently testing the use of a qualified DC schema, extended with the addition of project-specific encoding and refinement semantics. This approach will allow us to harvest and take fuller advantage of the optional richer metadata formats made available by some of the participating OAI metadata providers. Crosswalks from these formats to qualified DC will be less lossy.

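As an illustration of the kind of automated normalization function discussed above, the following sketch maps a few free-text date forms commonly seen in harvested simple-DC records onto a single sortable form. The patterns and target format are assumptions for illustration, not the project's actual rules:

```python
import re

# Illustrative rewrite rules: each maps one observed date pattern onto a
# sortable YYYY or YYYY-MM-DD form.
PATTERNS = [
    (re.compile(r"^\d{4}-\d{2}-\d{2}$"), lambda m: m.group(0)),           # already W3CDTF-like
    (re.compile(r"^(\d{1,2})/(\d{1,2})/(\d{4})$"),                        # US-style M/D/YYYY
     lambda m: f"{m.group(3)}-{int(m.group(1)):02d}-{int(m.group(2)):02d}"),
    (re.compile(r"^(?:ca\.?\s*)?(\d{4})s?\.?$"), lambda m: m.group(1)),   # "ca. 1850", "1850s."
]

def normalize_date(value):
    """Return a normalized date string, or None if no rule recognizes it."""
    value = value.strip()
    for pattern, rewrite in PATTERNS:
        match = pattern.match(value)
        if match:
            return rewrite(match)
    return None

print(normalize_date("ca. 1850"))   # -> 1850
print(normalize_date("3/7/1904"))   # -> 1904-03-07
```

Unrecognized values deliberately fall through to None so that they can be routed to repository-by-repository rules rather than silently altered.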
Conclusions

The advent of the Web and other related digital technologies presents a good opportunity for increased content sharing and collaboration in the development of information systems. While a measure of interoperability, e.g. sharing generic HTML Web resources via Google, has proven relatively easy to accomplish, search and discovery across aggregations of more varied and complex digital content in a robust and full-featured manner is proving harder than many of us initially perceived. Making specialized scholarly digital content – primary content that is frequently non-textual, often hidden within complex database structures and collection contexts – more visible and easily accessible requires higher-precision search and discovery systems that can exploit richer and more highly structured metadata. Issues of granularity and context are proving especially important when dealing with aggregations of such content. It is not yet clear whether ad hoc collection registries and item-level metadata aggregations built using a generic metadata harvesting protocol such as OAI-PMH are sufficient to implement the next generation of cross-repository digital library search and discovery services. As described above, a number of challenges exist, even in the context of our relatively controlled experiment with IMLS NLG digital collections and content. Based on our experience so far, part of the problem appears to be a lack of clear guidance and well-established


Figure 5 Sample result list from simple search of IMLS DCC metadata repository (beta version)

best practices, not for creating metadata generally, but for creating metadata optimized for aggregation and interoperability. Our project and several similar projects currently in progress will help the community address this need. New metrics for metadata quality as defined in this context are emerging (Bruce and Hillmann, 2004), and at the very least we hope to help establish benchmarks for current metadata authoring practice and the implications of state-of-the-art practices for metadata harvesting and aggregation services. A further goal, and one that we have borne in mind as we develop both the collection registry and the item-level metadata repository, is to link the two so that users can move between one and the other. The lack of context for any given individual resource in an aggregation could perhaps be mitigated by the effective delivery and integration of collection-level description for that resource alongside its item-level description. Finally, in the next phase of work on our IMLS DCC project, we hope to develop preliminary anecdotal evidence as to the potential benefit and utility of these kinds of services for one or two specific user populations. While a full-blown user study and analysis is beyond the scope of our

current grant, we do plan, during the final year of the project, further small-scale user focus groups, usability experimentation, and transaction log analysis, building on early work in this vein on an earlier OAI-PMH-based metadata harvesting service project (Shreeves and Kirkham, in press).
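The collection-registry/item-repository linkage described above can be pictured as a simple join at display time, in which a registry entry supplies collection context for each item-level hit. In this sketch, all identifiers and field values are invented for illustration; only "Voices of the Colorado Plateau" is a real IMLS NLG collection:

```python
# Hypothetical registry of collection-level descriptions, keyed by a
# short collection identifier stored with each harvested item record.
collections = {
    "voices": {"title": "Voices of the Colorado Plateau",
               "hosting_institution": "Southern Utah University"},
}

# Hypothetical item-level records as they might sit in the aggregation.
items = [
    {"oai_id": "oai:voices:42", "collection": "voices",
     "title": "Interview with a park ranger"},
]

def with_context(item, registry):
    """Merge an item-level record with its collection-level description."""
    context = registry.get(item["collection"], {})
    return {**item, "collection_description": context}

hit = with_context(items[0], collections)
print(hit["collection_description"]["title"])   # -> Voices of the Colorado Plateau
```

Storing only a collection key with each harvested record keeps the item index small while letting an interface surface the richer collection-level description on demand.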

Notes


1 Institute of Museum and Library Services, available at: www.imls.gov/
2 National Science Digital Library, available at: www.nsdl.org/
3 National Information Standards Organization, available at: www.niso.org/
4 Open Archives Initiative, available at: www.openarchives.org/
5 IMLS Digital Collections and Content, available at: imlsdcc.grainger.uiuc.edu/
6 Specification for an OAI Static Repository and an OAI Static Repository Gateway, available at: www.openarchives.org/OAI/2.0/guidelines-static-repository.htm
7 Mod_OAI Project Homepage, available at: www.modoai.org/
8 RSLP Collection Description Schema, available at: www.ukoln.ac.uk/metadata/rslp/schema/


9 For examples of projects that have implemented the RSLP Collection Description Schema, see: www.ukoln.ac.uk/cd-focus/cdfocus-tutorial/lookslike.html
10 Dublin Core Collection Description Application Profile Summary, available at: www.ukoln.ac.uk/metadata/dcmi/collection-ap-summary/
11 Cornucopia: Discovering UK Collections, available at: www.cornucopia.org.uk/
12 EnrichUK, available at: www.enrichuk.net/
13 Of the original 95 projects, two of the NLG projects were follow-ons to previous NLG grants; we sent only one survey in these cases. One NLG award was returned due to the dissolution of the institution receiving it.
14 Voices of the Colorado Plateau, available at: http://archive.li.suu.edu/voices/
15 Field Trip Earth, available at: www.fieldtripearth.org/
16 See "Banana", "Code City", and "Hard Place" on the Lower East Side Tenement Museum Web site, available at: www.tenement.org/features.html
17 IMLS DCC Collection Description Metadata Schema, available at: imlsdcc.grainger.uiuc.edu/CDschema_overview.htm
18 Western States Dublin Core Metadata Best Practices, available at: www.cdpheritage.org/resource/metadata/wsdcmbp/index.html
19 Open Language Archives Community, available at: www.language-archives.org/
20 UIUC Open Source OAI Metadata Harvesting Project on SourceForge, available at: http://uiliboai.sourceforge.net/

References

Arms, W.Y., Dushay, N., Fulker, D. and Lagoze, C. (2003), "A case study in metadata harvesting: the NSDL", Library Hi Tech, Vol. 21 No. 2, pp. 228-37.
Bruce, T.R. and Hillmann, D.I. (2004), "The continuum of metadata quality: defining, expressing, exploiting", in Hillmann, D. and Westbrooks, E. (Eds), Metadata in Practice, ALA Editions, Chicago, IL, pp. 238-56.
Cole, T.W. (2002), "Creating a framework of guidance for building good digital collections", First Monday, Vol. 7 No. 5, available at: www.firstmonday.org/issues/issue7_5/cole/index.html (accessed 13 May 2004).
Davis, J.R. and Lagoze, C. (2000), "NCSTRL: design and deployment of a globally distributed digital library", Journal of the American Society for Information Science, Vol. 51 No. 3, pp. 273-80.
Dempsey, L. (2003), "The recombinant library: portals and people", Journal of Library Administration, Vol. 39 No. 4, pp. 103-36.
Halbert, M. (2003), "The MetaScholar Initiative: AmericanSouth.Org and MetaArchive.Org", Library Hi Tech, Vol. 21 No. 2, pp. 182-98.
Heaney, M. (2000), An Analytical Model of Collections and their Catalogues, available at: www.ukoln.ac.uk/metadata/rslp/model/ (accessed 13 May 2004).
Hill, L.L., Janée, G., Dolin, R., Frew, J. and Larsgaard, M. (1999), "Collection metadata solutions for digital library applications", Journal of the American Society for Information Science, Vol. 50 No. 13, pp. 1169-81.
Institute of Museum and Library Services (2001), Report of the IMLS Digital Library Forum on the National Science Digital Library Program, available at: www.imls.gov/pubs/natscidiglibrary.htm (accessed 13 May 2004).
Johnston, P. and Robinson, B. (2002), "Collections and collection description", Collection Description Focus Briefing Paper No. 1, available at: www.ukoln.ac.uk/cd_focus/briefings/bp1/bp1.pdf (accessed 17 May 2004).
Knutson, E., Palmer, C. and Twidale, M. (2003), "Tracking metadata use for digital collections [poster abstract]", DC-2003: Proceedings of the International DCMI Metadata Conference and Workshop, pp. 243-4, available at: www.siderean.com/dc2003/706_Poster49-color.pdf (accessed 17 May 2004).
Lee, H. (2000), "What is a collection?", Journal of the American Society for Information Science, Vol. 51 No. 12, pp. 1106-13.
Lyman, P. and Varian, H.R. (2003), How Much Information?, available at: www.sims.berkeley.edu/how-much-info-2003 (accessed 13 May 2004).
Lynch, C. (2002), "Digital collections, digital libraries, and digitization of cultural heritage information", First Monday, Vol. 7 No. 5, available at: www.firstmonday.org/issues/issue7_5/lynch/index.html (accessed 1 June 2004).
Maciuszko, K.M. (1984), OCLC: A Decade of Development 1967-1977, Libraries Unlimited, Littleton, CO.
National Information Standards Organization (2004), A Framework of Guidance for Building Good Digital Collections, available at: www.niso.org/framework/forumframework.html (accessed 14 May 2004).
Palmer, C. and Knutson, E. (in press), "Metadata practices and implications for federated collections", ASIS&T 2004: Managing and Enhancing Information: Cultures and Conflicts, Proceedings of the 67th Annual Meeting of the American Society for Information Science & Technology, Information Today, Medford, NJ.
Seaman, D. (2003), "Deep sharing: a case for the federated digital library", EDUCAUSE Review, Vol. 38 No. 4, pp. 10-11.
Sherman, C. and Price, G. (2003), "The invisible Web: uncovering sources search engines can't see", Library Trends, Vol. 52 No. 2, pp. 282-98.
Shreeves, S.L. and Kirkham, C.M. (in press), "Experiences of educators using a portal of aggregated metadata", Journal of Digital Information, Vol. 5 No. 3.
Shreeves, S.L., Kaczmarek, J.S. and Cole, T.W. (2003), "Harvesting cultural heritage metadata using the OAI protocol", Library Hi Tech, Vol. 21 No. 2, pp. 159-69.
Suber, P. (2004), "The case for OAI in the age of Google", SPARC Open Access Newsletter, No. 73, available at: www.earlham.edu/~peters/fos/newsletter/05-03-04.htm (accessed 21 May 2004).
Waters, D.J. (2004), "Building on success, forging new ground: the question of sustainability", First Monday, Vol. 9 No. 5, available at: www.firstmonday.org/issues/issue9_5/waters/index.html (accessed 17 May 2004).
Young, J.R. (2004), "Libraries aim to widen Google's eyes: search engines want to make scholarly work more visible on the Web", The Chronicle of Higher Education, Vol. 50 No. 37, p. A1.

Further reading

Shreeves, S.L. and Cole, T.W. (2003), "Developing a collection registry for IMLS digital collections [poster abstract]", DC-2003: Proceedings of the International DCMI Metadata Conference and Workshop, pp. 241-2, available at: www.siderean.com/dc2003/705_Poster43.pdf (accessed 17 May 2004).


Architectural
The way ahead: learning cafés in the academic marketplace
Morell D. Boone

The author
Morell D. Boone is Professor in the Department of Interdisciplinary Technology at Eastern Michigan University, Ypsilanti, Michigan, USA.

Keywords
Learning, Architecture, Library facilities

Abstract
Libraries, like the universities they serve, are faced with the daunting task of reconciling their traditional role as repository and provider of information with the increasing demands of a market-driven society. Learning cafés can provide a place where these two divergent demands are potentially reconciled. By providing sophisticated technologies within a sociable environment, learning cafés seek to enhance the potential for interactive learning among their users. They have the potential to host an increasingly diverse array of emerging library services. Before incorporating a learning café within a new or existing library, however, planners must keep in mind the types of learning best suited to this kind of space and maintain a flexible design model so that the café can be adapted to future needs.

Electronic access
The Emerald Research Register for this journal is available at www.emeraldinsight.com/researchregister
The current issue and full text archive of this journal is available at www.emeraldinsight.com/0737-8831.htm

Library Hi Tech, Volume 22 · Number 3 · 2004 · pp. 323-327
© Emerald Group Publishing Limited · ISSN 0737-8831
DOI 10.1108/07378830410560116

A changing reality

Since the 1990s, libraries have moved steadily away from being mere repositories of printed materials to something more complex, more expansive, more exciting, and more indeterminate. Library designers have been progressively incorporating ever more sophisticated technology, and technological support systems, into buildings that are increasingly diversified in their functions. Students and faculty expect not only network accessibility and electronic catalogs, but also media collections, efficient delivery systems, and quasi-public spaces in which to engage in interactive learning. Library design and development parallel the changes, and tensions, occurring in universities as a whole. In my last column, "Monastery to marketplace: a paradigm shift", this tension was highlighted by introducing an illuminating article by Nancy Cantor and Steven Schomberg published in the March/April 2003 issue of EDUCAUSE Review. Universities are poised between two "worlds" or, perhaps more precisely, two major – and divergent – world views. First, they strive to maintain their traditional, monastic function as centers for the production of disinterested knowledge; however, they are increasingly responsive, especially in an age of shrinking appropriations, to the concerns and demands of the "marketplace" of the larger society. Libraries, as the authors contend, are no exception to this general trend:

Libraries, even academic research libraries, can no longer avoid the noise and turmoil and unvetted free-for-all of the marketplace, yet they exist at least in a large part to remind us of our many pasts, including all the ideas and discoveries that never flourished in the marketplace (Cantor and Schomberg, 2003, pp. 14-15).

One can go further and say that library architecture has been embodying and articulating this tension for some time. The increasing complexity behind many library design projects, especially those that I have profiled before, is due to the need for libraries to provide their traditional services while improving their adaptability to emerging, often market-driven, changes within universities and society. For example, as distance education technology is continually refined, libraries will play a major role in providing university instruction to off-site, often nonacademic, institutions. Likewise, online databases and e-books are becoming indispensable to working, commuter students who cannot always make time to get to campus. Even the development


of efficient delivery systems, like automated storage and retrieval systems, is a response to growing demands for libraries to provide more efficiency and "free up" space for new needs by relocating many books and printed materials away from public areas. The duality of monastery/marketplace is a helpful conceptual tool for getting a handle on the dramatic changes occurring in the form and function of university libraries. The previous column concluded with a section called "The learning café acknowledges the paradigm shift", in which the notion of the library either having or being a "learning café" fits the idea of "a partnership in the learning process" (Miller, 2002). This fits the model of providing physical spaces ensconced in comfortable surroundings where people feel they can combine eating and drinking with collaborative work: the cybrary café as a place that offers a social learning environment that integrates technology but is not dominated by it (Boone, 2003). It also introduced the example of an entire city, Glasgow, Scotland, declaring itself a "learning city" and how one of its universities, Glasgow Caledonian University, is developing a learning café approach into a more inclusive library service system. As Jan Howden, its director, puts it:

The learning café was therefore always more than a new kind of group learning environment for Glasgow Caledonian students. This, and our general enthusiasm for what a "learning café" should mean, has helped us move on and develop a range of aims and objectives for the café (Howden, 2002).

Since the column was published in September 2003, I have used it as the basis for two international presentations. The first was in October 2003 at Boole Library, University College Cork (Ireland) and the other was in November 2003 at a seminar hosted by SEMESP (the consortium of Brazilian private universities) in Sao Paulo, Brazil. At both venues, the idea of a "learning café" serving as a transitional element between the monastery and marketplace was very well received. My next scheduled stop for presenting these ideas will be an October 2004 international conference hosted by Tianjin University of Technology and Education (People's Republic of China). After Ireland, South America and Asia, can the USA be far behind?

The learning café: sketching its potential

Perhaps no other place better embodies the tension, and potential, inherent in the monastery/marketplace duality than the learning café. The idea of putting a café in a library is not universally accepted. Many library professionals believe that this degrades the library to a commercial operation and detracts from its intellectual purpose – too much marketplace, not enough monastery. Cafés, however, have a much more complex reality; they are not merely sites of commercial transactions. Historically, as well as today, cafés have functioned as spheres of collaborative activity and shared learning. From Sartre and de Beauvoir holding court at their favorite Parisian haunts to contemporary reading groups meeting at Starbucks, the sociability and free play of intellectual interaction have always been a feature of the café. Thus, despite the apparent incompatibility, learning cafés are actually a sophisticated mediation – and potential reconciliation site – between the monastic mission of academic libraries and the commercial realities they face. They can, as Cantor and Schomberg put it, "best achieve the desired vibrancy of exchange [by working] on the balance of monastic and marketplace characteristics included in the world in-between" (Cantor and Schomberg, 2003, p. 21).

What is in a name?

The learning café as a concept focuses on providing a "gathering place" for students, faculty, researchers and others. There they will find a greater emphasis on active and collaborative interaction, with information access and learning technologies to meet their needs. The café concept requires shared use and stewardship of resources and technologies by a wide variety of both internal and external clients. How can the learning café accomplish this? By incorporating the advanced features of a technologically adaptable cybrary within a sociable, interactive environment. The learning café, therefore, integrates information services and technology within its confines, allowing clients to move seamlessly between knowledge acquisition and learning interaction. As the Internet becomes more sophisticated and new knowledge media develop, the functions of the café will become more apparent to its users. For example, when Internet2 is fully deployed and the next generation of advanced networking is adopted, the parameters of what can be imported and exported will change drastically. Providing high-bandwidth connectivity and hardware within a café environment will enable clients to engage in accelerated interactive learning, allowing for near-simultaneous forays into research, knowledge acquisition, and shared learning. New media like virtual reality services will also permit greater interactivity – within simulated learning environments – for research and training.

Baseline and specialized services

Although its scope is evolving, I envision learning cafés taking on baseline information services, specialized learning technology services and experimental learning activities that embrace emerging concepts. The first, baseline information services, are what most academic libraries are doing today. Examples of the baseline information services required of the learning café are:
. print and electronic text materials;
. public access workstations;
. special purpose workstations;
. individual and small group workspaces;
. training;
. search assistance; and
. technical assistance.

The environment of the café plays as important a role as the technologies to be incorporated within it. Since, as I have already said, cafés are natural spheres for sociability and interaction, it follows that learning cafés have the potential for facilitating group learning. Learning cafés can feature small, medium, and large group collaborative spaces where students and faculty can interact with advanced technology for group project work. In such fields as computer-aided design and engineering technology, where many students hold full-time jobs and find collaborative work difficult to coordinate, learning cafés can provide the resources for utilizing technology, provide a discussion area in a relaxed environment, and permit off-site interaction for students unable to come to campus. The café's social environment can thus be adapted to project team spaces and videoconferencing. As universities adopt more leading-edge technology, learning cafés will permit a variety of existing and new technology-centered instructional and learning activities. Examples of specialized learning technology services:
. provide clients with unparalleled technology services in direct support of their academic experience;
. promote interdisciplinary collaboration between clients in different disciplines and programs;
. promote development of a basic digital institutional repository for student learning outcomes; and
. provide a test bed for developing and deploying advanced instructional and learning systems.

What the future holds: emerging services in the learning café

Up to this point, I have characterized learning cafés as sociable environments in which baseline information services and specialized learning technology services provide for the integration of sophisticated technologies that will promote advanced interactive learning. I also suggested that certain types of emerging concepts in experimental learning may assist in the creation of a leading-edge learning café. Let us take a closer look at some of these emerging instruction/learning concepts.

Advanced digital institutional repositories

Many institutions are turning to institutional repositories to manage the content produced by their faculty, staff, and students. A digital repository, in the words of Alan McCord, is "a centrally managed collection of institutionally-generated digital objects designed to be maintained in perpetuity" (McCord, 2003). These digital objects may also include student information systems, course management systems, automated library catalogs, and numerous Web sites. In addition to these university-level digital repositories, there are individual repositories, such as those for specific disciplines (e.g. biology), where the specialized needs of students, faculty, and researchers may be met. A digital repository within the learning café will provide a focal point for integrating disparate digital information sources into a unified system that provides seamless and intuitive access for clients.

Rich media production services

In the near future, faculty and students' learning activities will focus more and more on developing rich media content: digital video, interactive Web sites, and the integration of static and time-based assets into voice-over presentations. A learning café can provide advanced hardware and software to facilitate the capture, editing, production, and sharing of rich media objects, especially within a group project format. These objects may then be stored in a digital repository for access by students, faculty, staff, researchers and the general public.

"Triage approach" to service

Integrating both information and learning technology services into a learning café allows clients to move quickly from separate information-gathering activities into a seamless learning support system that promotes greater learning productivity.

Advanced Internet usage

As Internet2 is fully deployed and as the next version of advanced networking is adopted, the parameters of what can be imported and exported via the Internet will change drastically. Providing high-bandwidth connectivity, videoconferencing facilities, and advanced hardware within a learning café will enable clients to quickly apply these new services to their academic and professional work.

Video recruiting and training

Many segments of the economy are increasingly using Internet-based videoconferencing technology to support their recruiting activities. Videoconferencing technology reduces travel costs and makes it possible for recruiters to "visit" more campuses. Learning cafés can facilitate this recruiting process by enabling students to interact with prospective employers at off-site locations. Through its flexible design format, the café can be converted into a simulated workplace environment that provides students with technologies and physical spaces to display the contents of their professional portfolios. A similar setup can be employed for specialized training sessions that involve group learning and integration of computer-based training application software.

Smart facilities

Learning cafés will be able to accommodate smart meeting rooms, holographic videoconferencing, collaborative learning spaces that embrace new directions in interdisciplinary project teams, and wireless communication.

Advanced and emerging learning technology training

A learning café may provide a physical home for trainers to work directly with faculty, students, staff and others, providing training in both advanced learning technology systems and emerging technologies that are in the beta stage. Such a sub-facility within the café may also provide an experimental environment in which technical and instructional materials can be tested and evaluated before they are broadly deployed throughout the institution and beyond.

Extreme programming collaboration

A leading-edge café may feature an "extreme programming" collaboration environment to provide clients with the opportunity to experiment with advanced software and systems development methodologies. The environment, in a sub-facility within the café, may be designed to promote extreme programming practices such as user participation, storyboarding, pair programming, iterative documentation, and constant unit testing.

Incorporating existing and developing new virtual reality environments

Incorporating existing virtual reality services, and taking newly developed ones into a learning café, may provide clients with the opportunity to incorporate VR technologies into their instruction, learning, training and research.

Adaptability is the key

At this point, you may be thinking: are we still talking about a café here? After all, is not the point of a café to provide a sphere away from actual work? Are we not simply integrating typical instructional and learning activities within the café structure and, perhaps, increasing the tension between the monastery and the marketplace rather than mitigating it? Can you create a social environment, traditional services, a technological delivery point, and a test bed of emerging technologies without the possibility of each of them canceling out the others? I do not know; however, I do know that Elizabeth Daley, Dean of the School of Cinema-Television at the University of Southern California, says, "Print supports linear argument, but it does not value the aspects of experience that cannot be contained in books... Even the most cursory knowledge of media is not included in the general education curricula of most colleges or universities" (Daley, 2003, pp. 35, 38). Daley is adamant that a truly literate person in the 21st century will be one who has learned to both read and write "the multimedia language of the screen" (Daley, 2003, p. 34). Her statements highlight the crucial task that facility planners face in integrating technology within an interactive spatial environment to facilitate learning across different media. The key point to remember is that adaptability must be embedded before any technology is integrated. Poor planning can lead to over-determined environments that are not adaptable to new needs. A singular rectangular or square environment will provide fewer options for future use than more complex designs that allow for marked-off areas and "nooks" and "crannies" where group learners can congregate. If possible, creating "bi-leveled" cafés, where group activities can be separated from others by being up a level (or even half a level) while still enjoying the café's sociable ambiance, is a good way to build potential into your design. Similarly, if you place too many desktop computing stations in your learning café, you will limit what can be done. By stressing laptop connectivity and especially wireless communication, you expand your design potential and also avoid turning your learning café into a computer lab. Flexibility and adaptability do not, however, mean trying to "satisfy all parties". It is important to remember what kinds of learning are best suited for the learning café environment – group learning,

326

The way ahead: learning cafe´s in the academic marketplace

Library Hi Tech

Morell D. Boone

Volume 22 · Number 3 · 2004 · 323–327

project collaboration, and technological interaction; the learning cafe´ is neither a substitute computer lab nor should it be designated a “quiet area” for individual study. For a learning cafe´, to fulfill its mediating and transitional function between learning and talking, working and sociability, monastery and marketplace, the interactive nature of the cafe´ environment must be emphasized. However, in the face of all this talk about facilities that can adapt to flexible needs I still come back to what I call the four-A’s of facility design: (1) Adaptability: . integrated technology system; . communication quadrants; . dual-track collection management systems; . floor access cable raceway system; . non-load bearing walls; and . anticipate program developments in advance. (2) Accessibility: . high impact entrance; . gathering places; . philosophy of “spaces and pathways”; . clear and understandable service points; . extended hours; and . go beyond the letter of the law regarding ADA. (3) Aesthetics: . positive perceptions equal a positive image; . warmth; . friendliness; . comfort; . design; . color; and . art. (4) Accommodation: . safety and security; . individual and collaborative work spaces; . special collections and services; . socialization; . enjoyment and celebration; and . coffee/tea and more.

integration of developing technologies with the increasing demands for group-based, distance, and other forms of interactive learning. The potential for learning cafe´s to be spheres of knowledge acquisition, through access to everything from library OPAC’s to institutional repositories, and dissemination, through sociable interaction or interaction mediated through technology (distance education, video recruiting/training), is great. Yet, learning cafe´s will only succeed if library planners develop flexible designs that promote specific types of learning, and do not fall into a “one cafe´ for all purposes” approach. Learning cafe´s are best in permitting interaction with advanced technologies, especially wireless and other technologies that promote laptop and other lightweight formats, and sociable learning through group interaction. Keeping these conditions in mind, learning cafe´s contain the possibility of reconciling the age-old, but evermore pressing, tension between a library’s academic mission and the ever increasing technology-centered demands of student, society, and the marketplace. As a parting thought – would it not be ironic if the academic learning cafe´ movement takes off from Scotland, the country of Andrew Carnegie’s birth? Carnegie (2004) the 19th century industrialist who had so much to do with investing in American libraries: “I think Carnegie’s genius was first of all, an ability to foresee how things were going to change,” says historian John Ingram (Public Broadcasting System, 2004). Where is Andy when we need him?

Conclusion I have tried to provide a sketch of why I think learning cafe´s are going to be one of the most important emerging design elements in library/ learning center planning in the next decade. They have the potential to provide a seamless

References
Boone, M. (2003), "Architectural – monastery to marketplace: a paradigm shift", Library Hi Tech, Vol. 21 No. 3.
Cantor, N. and Schomberg, S. (2003), "Poised between two worlds: the university as monastery and marketplace", Educause Review, Vol. 12 No. 21.
Carnegie, A. (2004), Public Broadcasting System, available at: www.pbs.org/wgbh/amex/carnegie
Daley, E. (2003), "Expanding the concept of literacy", Educause Review, Vol. 12 No. 21.
Howden, J. (2002), interviewed at Glasgow Caledonian University, 22 November.
McCord, A. (2003), "Institutional repositories: enhancing teaching, learning, and research", Educause Evolving Technologies Committee, Anaheim, CA, 14 October.
Miller, W. (2002), "The library as a place: tradition and evolution", Library Issues, Vol. 22 No. 3.
Public Broadcasting System (2004), "Andrew Carnegie: the richest man in the world", transcript, available at: www.pbs.org/wgbh/amex/carnegie/filmmore/transcript/index.html


On copyright
Copyright in the networked world: interlibrary services
Michael Seadle

The author, Michael Seadle, is Assistant Director for Systems and Digital Services at Michigan State University, East Lansing, Michigan, USA.

Keywords: Copyright law, Interlending, Canada, Germany, United States of America

Abstract: Interlibrary lending and document delivery have become an integral part of the services that contemporary libraries offer. The copyright laws in most countries authorize this copying within reasonable limits, but tensions with publishers may be growing. For interlibrary services to remain effective, libraries must continue to lobby politicians to defend their legal basis. Libraries must also continue to work with publishers to address legitimate economic concerns. This paper looks at the legal basis for interlibrary services, particularly document delivery, in US, Canadian, and German law.

Interlibrary lending and document delivery have become an integral part of the services that contemporary libraries offer. These services mean that libraries do not have to buy every work on a subject of marginal interest, as long as they belong to a network where at least one member owns that work and is willing to share it. The positive image of rich libraries sharing with poorer ones is balanced by publishers' fears of reduced sales. The concern is not merely US or North American. Elmar Mittler of Goettingen University wrote in 1996 that:

Publishers frequently see competition to their own activities in the document delivery services of libraries and other service providers (Mittler and Ecker, 1996).


When interlibrary lending mainly consisted of sending a physical volume to the requesting library, the potential effect on sales was minimal. Heavily used works like serials were rarely loaned, since no one at the home library could use a work until it came back. As copying technologies improved in the 1960s, it became possible to photocopy a single article rather than send the whole volume. The copyright laws in most countries authorized this copying within reasonable limits, but tensions with publishers may be growing. New contract language limiting document delivery services for materials from online databases and new court cases in Germany over international lending show two areas where such tensions are evident. Desktop delivery services have not yet had a legal challenge, but seem potentially vulnerable. This paper looks at the legal basis for interlibrary services, particularly document delivery, in US, Canadian, and German law.


US Copyright law Section 108 of the US copyright law explicitly authorizes document delivery for articles when a user at another library makes a single explicit request: (d) The rights of reproduction and distribution under this section apply to a copy, made from the collection of a library or archives where the user makes his or her request or from that of another library or archives, of no more than one article or other contribution to a copyrighted collection or periodical issue, or to a copy or phonorecord of a small part of any other copyrighted work. . . (17 USC 108, 2004).

Library Hi Tech, Volume 22 · Number 3 · 2004 · pp. 328–332 · © Emerald Group Publishing Limited · ISSN 0737-8831 · DOI 10.1108/07378830410560125

Received: 20 June 2004 Revised: 20 June 2004 Accepted: 21 June 2004



But certain conditions must apply, particularly in terms of the ownership and use of the copy:

(1) the copy or phonorecord becomes the property of the user, and the library or archives has had no notice that the copy or phonorecord would be used for any purpose other than private study, scholarship, or research. . . (17 USC 108, 2004).

The lending library must also display a copyright warning:

(2) the library or archives displays prominently, at the place where orders are accepted, and includes on its order form, a warning of copyright in accordance with requirements that the Register of Copyrights shall prescribe by regulation (17 USC 108, 2004).

Requests for copies of a whole work or the greater part of it are allowed under certain circumstances, if the work is out of print and not available at a "fair" price:

(e) The rights of reproduction and distribution under this section apply to the entire work, or to a substantial part of it, made from the collection of a library or archives where the user makes his or her request or from that of another library or archives, if the library or archives has first determined, on the basis of a reasonable investigation, that a copy or phonorecord of the copyrighted work cannot be obtained at a fair price. . . (17 USC 108, 2004).

The same requirements for ownership, use, and the display of a copyright warning apply. The law also explicitly allows the lending of copies of an "audiovisual news program", thanks to an exemption that Senator Howard Baker added to protect the Vanderbilt news archive:

(f) Nothing in this section — . . . (3) shall be construed to limit the reproduction and distribution by lending of a limited number of copies and excerpts by a library or archives of an audiovisual news program, subject to clauses (1), (2), and (3) of subsection (a). . . (17 USC 108, 2004).

One of the most important rights that apply is fair use, but this is balanced by a clause that puts contractual obligations before any of the privileges granted in the law:

(f) Nothing in this section — . . . (4) in any way affects the right of fair use as provided by section 107, or any contractual obligations assumed at any time by the library or archives when it obtained a copy or phonorecord of a work in its collections.

This section means that a library's license agreement with the vendor of an article database takes precedence over any interlibrary lending rights in the law. It also means that click-through license contracts can limit software lending, and shrink-wrap licenses on music CDs or video DVDs can restrict use. Large contracts get fairly careful scrutiny, but click-through and shrink-wrap licenses may provoke little notice: not only may they be accepted on behalf of the institution by lower-level staff, but they may well be thrown out (in the case of shrink-wrap) or never recorded (in the case of click-through). It is easy for a library to acquire item-level contractual obligations that could prevent some forms of lending, and be completely unaware of them. The US law warns strenuously against systematic reproduction that could substitute for subscription or purchase of a work:

(g) The rights of reproduction and distribution under this section. . . do not extend to cases where the library or archives, or its employee. . . (2) engages in the systematic reproduction or distribution of single or multiple copies or phonorecords of material described in subsection (d): Provided that nothing in this clause prevents a library or archives from participating in interlibrary arrangements that do not have, as their purpose or effect, that the library or archives receiving such copies or phonorecords for distribution does so in such aggregate quantities as to substitute for a subscription to or purchase of such work (17 USC 108, 2004).

Precisely what constitutes quantities that would substitute for a subscription is not written into the law. In practice many US research libraries pay a royalty via the Copyright Clearance Center for titles used more than five times over the prior five years. This works out to an average of once per year. These royalty costs can be expensive. According to the Copyright Clearance Center (2004), they generally vary from $1 to $14, and can go higher. Those libraries that charge for document delivery can shift the costs of access from underfunded collections budgets to user fees. Many university libraries do not charge, however, and must absorb the costs in other ways. New online document delivery request mechanisms make guarding against abuse harder. It is not unusual for a graduate student to search an abstracts database and then request every article in a journal issue, because it happens to be a theme issue on that topic. Institutional practice in dealing with this kind of request varies, but it seems too obviously like a substitute for buying a copy of that issue for comfort. Yet the reason for denying such a request can be difficult to explain to users, especially when the request would almost certainly go through if made more slowly, article by article, over several days or weeks, simply because, in a large-volume operation, no one would notice. From a publisher's viewpoint, such evasion seems too easy. Users who consistently request articles from a journal that their institution can no longer afford to get on subscription receive the benefit of its intellectual content without having to fight colleagues over subscription priorities. And royalties for selected articles do not necessarily add up to a subscription cost. The spiral of increasing subscription prices as fewer and fewer institutions subscribe only forces greater reliance on document delivery. If royalty prices were to escalate in similar fashion, more libraries would be forced to pass the costs through to end-users, who might well resort to asking friends to make copies – bypassing the royalty-payment process. While this would be illegal, it could be hard to detect and stop.
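The "more than five times over the prior five years" royalty threshold described above can be expressed as a simple check. The sketch below is purely illustrative — the function name, the data shape, and the per-title bookkeeping are assumptions for the example, not part of any actual Copyright Clearance Center or interlibrary loan system:

```python
def needs_royalty(filled_request_years, current_year, title, limit=5, window=5):
    """Return True if filling one more request for `title` would exceed the
    informal threshold: more than `limit` filled requests within the prior
    `window` years. `filled_request_years` maps a journal title to the years
    of requests the library has already filled (illustrative data shape)."""
    years = filled_request_years.get(title, [])
    recent = [y for y in years if current_year - y < window]
    return len(recent) + 1 > limit

# Five requests already filled in the window: the sixth triggers a royalty.
filled = {"Journal of Example Studies": [2004, 2004, 2003, 2002, 2001]}
print(needs_royalty(filled, 2004, "Journal of Example Studies"))  # True
```

Real systems track copying per title in more detail; the sketch only mirrors the rule of thumb as stated in the text, which averages out to one royalty-free use per year.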

Canadian law

Interlibrary services function across borders, especially across the US-Canadian border, and of course journals are equally international. Canadian copyright law authorizes document delivery in language relatively similar to that in the US. A user must make an explicit personal request:

30.2 (1) It is not an infringement of copyright for a library, archive or museum or a person acting under its authority to do anything on behalf of any person that the person may do personally under section 29 or 29.1 (Canada, 2003).

As with the US law, Canadian law specifies the purposes for which the request may be made. It also specifies the type of publication:

(2) It is not an infringement of copyright for a library, archive or museum or a person acting under the authority of a library, archive or museum to make, by reprographic reproduction, for any person requesting to use the copy for research or private study, a copy of a work that is, or that is contained in, an article published in (a) a scholarly, scientific or technical periodical; or (b) a newspaper or periodical, other than a scholarly, scientific or technical periodical, if the newspaper or periodical was published more than one year before the copy is made (Canada, 2003).

Certain types of publications are also explicitly excluded:

(3) Paragraph (2)(b) does not apply in respect of a work of fiction or poetry or a dramatic or musical work.

This exclusion corresponds to works that courts generally assume were written for profit rather than for the advancement of knowledge. Canadian law sets specific conditions on the copying:

(4) A library, archive or museum may make a copy under subsection (2) only on condition that (a) the person for whom the copy will be made has satisfied the library, archive or museum that the person will not use the copy for a purpose other than research or private study; and (b) the person is provided with a single copy of the work.

The language of the law implies a more active interrogation of the user than may ordinarily take place; otherwise the required purpose strongly resembles the "private study, scholarship, or research" clause in the US law. Canadian law also has a paragraph explicitly authorizing interlibrary services, but with some additional limitations:

(5) A library, archive or museum or a person acting under the authority of a library, archive or museum may do, on behalf of a person who is a patron of another library, archive or museum, anything under subsection (1) or (2) in relation to printed matter that it is authorized by this section to do on behalf of a person who is one of its patrons, but the copy given to the patron must not be in digital form.

The ban on giving the copy to a patron in digital form restricts the use of the PDF-based desktop delivery services that have grown extremely popular in US research libraries – so popular, in fact, that for paper-only publications some faculty reputedly prefer requesting articles from other libraries, because then they do not have to go to the library, find the volume, and make the photocopy themselves. The intent of this digital-copy restriction is presumably to make it more difficult for the user to distribute the article any further, but the prevalence of low-cost scanners makes that barrier merely a minor annoyance.

German law

The similarities between the US and Canadian law are not surprising, given their geographic proximity and shared common law tradition. German copyright law belongs to a continental legal system that grew out of Roman law and often makes different assumptions. German interlibrary lending and document delivery services matter for research libraries because Germany remains one of the world's most significant producers (and consumers) of scholarly publications. The Association of Research Libraries' German-North American Resources Partnership (GNARP, formerly called the German Resources Project) had as one of its earliest missions the establishment of interlibrary loan and document delivery between the German universities and their North American partners:

The goal of the Document Delivery Working Group is to improve document delivery and interlibrary loan for German-language materials, both among ARL libraries and between German and North American research libraries (German-North American Resources Partnership, 2004).

The Working Group's prototype document delivery service used SUBITO e.V., a registered society (eingetragener Verein) whose online service stemmed from an initiative of "the German Ministry for Education and Research and the German states. . ." (SUBITO, 2004). Unfortunately the prototype service has had to be suspended because of legal issues. The German copyright law is less explicit about authorizing copying for interlibrary services than the North American laws, but the intent of the language seems similar. Part 1, Section 6, paragraph 52a of the German copyright law authorizes making materials available for teaching (paragraph 1) and research (paragraph 2). The following is my translation:

1. Making available small parts of a published work, or works of limited extent such as single articles from a newspaper or journal, for use as part of instruction in schools, institutions of higher education, or non-commercial establishments for vocational training, is permitted for a specifically limited group of participants in so far as necessary for these purposes, and not for the pursuit of commercial goals.

2. Making available published parts of a work or a work of limited extent such as single articles from newspapers or journals, exclusively for a limited group of persons for their own scholarly research, is permitted in so far as necessary for these purposes, and not for the pursuit of commercial goals (Germany, Federal Republic, 2003).

Materials explicitly created for use in classroom teaching are excluded, and the use of film is allowed only after an appropriate waiting period:

(2) Making a work available that was created specifically for use in teaching is permissible only with the consent of the rights holder. Making a film publicly available before the passage of two years after its appearance in theaters in the region where this law applies is permissible only with the consent of the rights holder.

The copying necessary to these permissions is also explicitly authorized: (3) The duplication required for making works available in accordance with section (1) is permitted.

But this copying is not royalty free, and it requires the use of the German equivalent of the Copyright Clearance Center. (4) A reasonable amount must be paid for making works available in accordance with paragraph 1. The payment must be handled through a collecting society.

At present (June 2004) a legal challenge against SUBITO is underway that appears to attack not the legal basis of the international document delivery service, but the costs mentioned in the section above:

The trade association of the German book dealers has stopped the Berlin-based SUBITO document delivery service. The background has to do with a controversy over sending digital extracts from books and articles abroad. But the core of the issue turns on the size of royalties for books and magazines in this digital age. Thus, providing electronic services will become significantly more expensive and complex. SUBITO chairman Uwe Rosemann fears this will torpedo document delivery to researchers. Modern remote lending could be threatened (Heise Online, 2004).

While negotiations are underway to address the challenge to SUBITO, this suit represents only one instance of the problem of maintaining an economic balance that does justice to the interests of both publishers and libraries and their users. A similar legal challenge could occur anywhere if publishers begin to feel that document delivery has become a substitute for subscriptions.

Conclusion

Interlibrary lending and particularly document delivery have been one of the success stories of library collaboration in recent decades. The demand for these services is growing, and their quality has improved greatly because of the use of digital techniques and Internet-based delivery. They do a great deal to equalize resources among institutions, and generally help students to get the materials they need at little or no cost. Publishers have not resisted the growth of interlibrary services, partly because of the strong legal basis for them, partly because of the compensation they receive. But the more digital the process becomes, the more publishers worry about the specter of a KaZaA-style free exchange that bypasses the royalty process. While some publishers have tried to write contract language to restrict the use of articles in online databases for interlibrary lending, many universities resist such provisions. The current balance of interlibrary services and royalty payments represents only one of many possible economic models. Large-scale consortial deals for access to article databases effectively eliminate the need for document delivery for those works among those institutions, though they do so at a high cost to libraries and often with the inclusion of unwanted titles. It is not clear that such a model would scale well to meet the demands of scholars. For interlibrary services to remain effective, libraries must continue to lobby politicians to defend their legal basis. Libraries must also continue to work with publishers to address legitimate economic concerns. The current system works far too well not to continue to defend it.



References
17 USC 108 (2004), United States Code, Title 17, Chapter 1, Section 108, available at: www.copyright.gov/title17/92chap1.html#108 (accessed June 2004).
Canada (2003), Consolidated Statutes and Regulations, Copyright Act, available at: http://laws.justice.gc.ca/en/C-42/39062.html#rid-39082 (accessed June 2004).
Copyright Clearance Center (2004), "Transactional reporting service: frequently asked questions", available at: www.copyright.com/Help/HelpTrsFAQ.asp#2 (accessed June 2004).
German-North American Resources Partnership (2004), "Document delivery working group", available at: http://grp.lib.msu.edu/docdelivery.html (accessed June 2004).
Germany, Federal Republic (2003), Federal Laws, Copyright Law [Urhebergesetz], Paragraph 52a, my translation, available at: http://bundesrecht.juris.de/bundesrecht/urhg/__52a.html (accessed June 2004).
Heise Online (2004), "The book trade lobby goes against libraries because of document delivery", my translation, available at: www.heise.de/newsticker/meldung/48024 (accessed June 2004).
SUBITO (2004), "What is SUBITO?", available at: www.subitodoc.com/ (accessed June 2004).

Further reading
Mittler, E. and Ecker, R. (1996), "European copyright user platform: ECUP and ECUP+", Bibliotheksdienst, Heft 8/9, 1996, my translation, available at: http://deposit.ddb.de/ep/netpub/89/96/96/967969689/_data_stat/www.dbi-berlin.de/dbi_pub/bd_art/96_08_08.htm (accessed June 2004).
