
Library Hi Tech
Volume 22, Number 1, 2004
ISSN 0737-8831
ISBN 0-86176-941-4

MARC and metadata: METS, MODS and MARCXML: current and future implications. Part 1
Theme Editor: Bradford Lee Eden

www.emeraldinsight.com

Contents

3 Access this journal online
4 Abstracts & keywords
6 EDITORIAL Metadata and librarianship: will MARC survive? Bradford Lee Eden

Theme articles
8 Cyril: expanding the horizons of MARC 21. Jane W. Jacobs, Ed Summers and Elizabeth Ankersen
18 The MARC standard and encoded archival description. Peter Carini and Kelcy Shepherd
28 MARC to ENC MARC: bringing the collection forward. Janet Kahkonen Smith, Roger L. Cunningham and Stephen P. Sarapata
40 After MARC – what then? Leif Andresen
52 An introduction to the Metadata Encoding and Transmission Standard (METS). Morgan V. Cundiff
65 Pulling it all together: use of METS in RLG Cultural Materials service. Merrilee Proffitt
69 A preliminary crosswalk from METS to IMS content packaging. Raymond Yee and Rick Beaubien
82 An introduction to the Metadata Object Description Schema (MODS). Sally H. McCallum
89 Using the Metadata Object Description Schema (MODS) for resource description: guidelines and applications. Rebecca S. Guenther

Access this journal electronically
The current and past volumes of this journal are available at: www.emeraldinsight.com/0737-8831.htm
You can also search over 100 additional Emerald journals in Emerald Fulltext: www.emeraldinsight.com/ft
See the page following the contents for full details of what your access includes.

Contents (continued)

Columns
99 ABOUT XML Whither HTML? Judith Wusteman
106 ON COPYRIGHT Copyright in a networked world: ethics and infringement. Michael Seadle
111 Note from the publisher

www.emeraldinsight.com/lht.htm
As a subscriber to this journal, you can benefit from instant, electronic access to this title via Emerald Fulltext. Your access includes a variety of features that increase the value of your journal subscription.

How to access this journal electronically
To benefit from electronic access to this journal you first need to register via the Internet. Registration is simple and full instructions are available online at www.emeraldinsight.com/rpsv/librariantoolkit/emeraldadmin. Once registration is completed, your institution will have instant access to all articles through the journal's Table of Contents page at www.emeraldinsight.com/0737-8831.htm. More information about the journal is also available at www.emeraldinsight.com/lht.htm. Our liberal institution-wide licence allows everyone within your institution to access your journal electronically, making your subscription more cost-effective. Our Web site has been designed to provide you with a comprehensive, simple system that needs only minimum administration. Access is available via IP authentication or username and password.

Key features of Emerald electronic journals
Automatic permission to make up to 25 copies of individual articles. This facility can be used for training purposes, course notes, seminars etc. It applies only to articles of which Emerald owns copyright. For further details visit www.emeraldinsight.com/copyright
Online publishing and archiving. As well as current volumes of the journal, you can also gain access to past volumes on the internet via Emerald Fulltext. Archives go back to 1994 and abstracts back to 1989. You can browse or search the database for relevant articles.
Key readings. This feature provides abstracts of related articles chosen by the journal editor, selected to provide readers with current awareness of interesting articles from other publications in the field.
Reference linking. Direct links from the journal article references to abstracts of the most influential articles cited. Where possible, this link is to the full text of the article.
E-mail an article. Allows users to e-mail links to relevant and interesting articles to another computer for later use, reference or printing purposes.

Additional complementary services available
Your access includes a variety of features that add to the functionality and value of your journal subscription:
E-mail alert services. These services allow you to be kept up to date with the latest additions to the journal via e-mail, as soon as new material enters the database. Further information about the services available can be found at www.emeraldinsight.com/usertoolkit/emailalerts
Emerald WIRE (World Independent Reviews). A fully searchable, subject-specific database, brought to you by Emerald Management Reviews, providing article reviews from the world's top management journals.
Research register. A Web-based research forum that provides insider information on research activity world-wide, located at www.emeraldinsight.com/researchregister. You can also register your research activity here.
User services. Comprehensive librarian and user toolkits have been created to help you get the most from your journal subscription. For further information about what is available visit www.emeraldinsight.com/usagetoolkit

Choice of access
Electronic access to this journal is available via a number of channels. Our Web site www.emeraldinsight.com is the recommended means of electronic access, as it provides fully searchable and value-added access to the complete content of the journal. However, you can also access and search the article content of this journal through the following journal delivery services:
EBSCOHost Electronic Journals Service ejournals.ebsco.com
Huber E-Journals e-journals.hanshuber.com/english/index.htm
Ingenta www.ingenta.com
Minerva Electronic Online Services www.minerva.at
OCLC FirstSearch www.oclc.org/firstsearch
SilverLinker www.ovid.com
SwetsWise www.swetswise.com

Emerald Customer Support
For customer support and technical help contact:
E-mail [email protected]
Web www.emeraldinsight.com/customercharter
Tel +44 (0) 1274 785278
Fax +44 (0) 1274 785204

Abstracts & keywords
Library Hi Tech, Volume 22, Number 1, 2004, pp. 4-5. © Emerald Group Publishing Limited, ISSN 0737-8831

EDITORIAL Metadata and librarianship: will MARC survive?
Bradford Lee Eden
Keywords: Administrative data processing, Technical services, Librarians, Information management
Metadata schemas and standards are now a part of the information landscape. Librarianship has slowly realized that MARC is only one of a proliferation of metadata standards, and that MARC has many pros and cons related to its age, original conception, and biases. Should librarianship continue to promote the MARC standard? Are there better metadata standards out there that are more robust, user-friendly, and dynamic in the organization and presentation of information? This special issue will examine current initiatives that are actively incorporating MARC standards and concepts into new metadata schemas, while also predicting a future where MARC may not be the metadata schema of choice for the organization and description of information.

Cyril: expanding the horizons of MARC 21
Jane W. Jacobs, Ed Summers and Elizabeth Ankersen
Keywords: Languages, Russian, Records management, Libraries
Describes the construction of the authors' Perl program, Cyril, to add vernacular Russian (Cyrillic) characters to existing MARC records. The program takes advantage of the ALA-LC (Library of Congress, 1997) standards for Romanization to create character mappings that "de-transliterate" specified MARC fields. The creation of Cyril raises both linguistic and technical issues, which are thoroughly examined. Concludes by considering the implications for cataloging and authority control standards as we move to a multilingual, multi-script bibliographic environment.

The MARC standard and encoded archival description
Peter Carini and Kelcy Shepherd
Keywords: Archives, Descriptive cataloguing, Standards, Administrative data processing
This case study details the evolution of descriptive practices and standards used in the Mount Holyoke College Archives and the Five College Finding Aids Access Project, discusses the relationship of EAD and the MARC standard in reference to archival description, and addresses the challenges and opportunities of transferring data from one metadata standard to another. The study demonstrates that greater standardization in archival description allows archivists to respond more effectively to technological change.

MARC to ENC MARC: bringing the collection forward
Janet Kahkonen Smith, Roger L. Cunningham and Stephen P. Sarapata
Keywords: Standards, Administrative data processing, Archives
This paper will describe the way in which the USMARC cataloging schema is used at the Eisenhower National Clearinghouse (ENC). Discussion will include how ENC MARC extensions were developed for cataloging mathematics and science curriculum resources, and how the ENC workflow is integrated into the cataloging interface. The discussion will conclude with a historical look at the in-house data transfer from ENC MARC to the current production of IEEE LOM XML encoding for record sharing and OAI compliance required under the NSDL project guidelines.

After MARC – what then?
Leif Andresen
Keywords: Databases, Online cataloguing, Libraries, Denmark
The article discusses the future of the MARC formats and outlines how future cataloguing practice and bibliographic records might look. The background and basic functionality of the MARC formats are outlined, and it is pointed out that MARC is manifest in several different formats. This is illustrated through a comparison between the MARC 21 format and the Danish MARC format "danMARC2". It is argued that our present cataloguing codes and MARC formats are based primarily upon the Paris principles, and that the "functional requirements for bibliographic records" (FRBR) would serve as a more solid and user-oriented platform for future development of cataloguing codes and formats. Furthermore, it is argued that MARC is a library-specific format, with the result that neither exchange with sectors outside the library community nor the inclusion of other kinds of text is facilitated. XML could serve as the technical platform for a model for future registrations, consisting of some core data and different supplements of data necessary for different sectors and purposes.

An introduction to the Metadata Encoding and Transmission Standard (METS)
Morgan V. Cundiff
Keywords: Online cataloguing, Digital libraries, Standards
This article provides an introductory overview of the Metadata Encoding and Transmission Standard, better known as METS. It will be of most use to librarians and technical staff who are encountering METS for the first time. The article contains a brief history of the development of METS, a primer covering the basic structure and content of METS documents, and a discussion of several issues relevant to the implementation and continuing development of METS, including object models, extension schemas, and application profiles.

Pulling it all together: use of METS in RLG Cultural Materials service
Merrilee Proffitt
Keywords: Online cataloguing, Standards
RLG has used METS for a particular application: as a wrapper for structural metadata. When RLG Cultural Materials was launched, there was no single way to deal with "complex digital objects". METS provides a standard means of encoding metadata regarding the digital objects represented in RCM, and METS has now been fully integrated into the workflow for this service.

A preliminary crosswalk from METS to IMS content packaging
Raymond Yee and Rick Beaubien
Keywords: Extensible markup language, Education
As educational technology becomes pervasive, demand will grow for library content to be incorporated into courseware. Among the barriers impeding interoperability between libraries and educational tools is the difference in specifications commonly used for the exchange of digital objects and metadata. Among libraries, METS is a new but increasingly popular standard; the IMS content package (IMS-CP) plays a parallel role in educational technology. This article describes how METS-encoded library content can be converted into digital objects for IMS-compliant systems through an XSLT-based crosswalk. The conceptual models behind METS and IMS-CP are compared; the design and limitations of an XSLT-based translation are described; and the crosswalks are related to other techniques to enhance interoperability.

An introduction to the Metadata Object Description Schema (MODS)
Sally H. McCallum
Keywords: Online cataloguing, Extensible markup language
This paper provides an introduction to the Metadata Object Description Schema (MODS), a MARC 21-compatible XML schema for descriptive metadata. It explains the requirements that the schema targets and the special features that differentiate it from MARC, such as user-oriented tags, regrouped data elements, linking, recursion, and accommodations for electronic resources.

Using the Metadata Object Description Schema (MODS) for resource description: guidelines and applications
Rebecca S. Guenther
Keywords: Online cataloguing, Standards, Archives
This paper describes the Metadata Object Description Schema (MODS), its accompanying documentation and some of its applications. It reviews the MODS user guidelines provided by the Library of Congress and how they enable a user of the schema to apply MODS consistently as a metadata scheme. Because the schema itself could not fully document appropriate usage, the guidelines provide element definitions, history, relationships to other elements, usage conventions, and examples. Short descriptions of some MODS applications are given, with a more detailed discussion of its use in the Library of Congress' Minerva project for Web archiving.

Whither HTML?
Judith Wusteman
Keywords: Hypertext markup language, Extensible markup language, World Wide Web, Development
HTML has reinvented itself as an XML application. The working draft of the latest version, XHTML 2.0, is causing controversy due to its lack of backward compatibility and the deprecation – and in some cases disappearance – of some popular tags. But is this commotion distracting us from the big picture of what XHTML has to offer? Where is HTML going? And is it taking the Web community with it?

Copyright in a networked world: ethics and infringement
Michael Seadle
The statutes themselves are not the only basis for deciding whether an intellectual property rights infringement has occurred. Ethical judgments can also influence judicial rulings. This column will look at three issues of intellectual property ethics: the nature of the property, written guidelines for behavior, and enforcement mechanisms. For most active infringers digital property seems unreal, the rules ambiguous, and enforcement statistically unlikely.

Editorial
Metadata and librarianship: will MARC survive?
Bradford Lee Eden

The author: Bradford Lee Eden is Head, Web and Digitization Services, for the University of Nevada, Las Vegas Libraries.

Keywords: Administrative data processing, Technical services, Librarians, Information management

Abstract: Metadata schemas and standards are now a part of the information landscape. Librarianship has slowly realized that MARC is only one of a proliferation of metadata standards, and that MARC has many pros and cons related to its age, original conception, and biases. Should librarianship continue to promote the MARC standard? Are there better metadata standards out there that are more robust, user-friendly, and dynamic in the organization and presentation of information? This special issue will examine current initiatives that are actively incorporating MARC standards and concepts into new metadata schemas, while also predicting a future where MARC may not be the metadata schema of choice for the organization and description of information.

Electronic access: The Emerald Research Register for this journal is available at www.emeraldinsight.com/researchregister. The current issue and full text archive of this journal is available at www.emeraldinsight.com/0737-8831.htm

Library Hi Tech, Volume 22, Number 1, 2004, pp. 6-7. © Emerald Group Publishing Limited, ISSN 0737-8831, DOI 10.1108/07378830410524440

The recent proliferation of metadata standards and schemas is part of the information explosion that has occurred in the past ten years. The appearance of the Internet and the World Wide Web; the ease of use and affordability of computers and e-mail at home and at work; the rapid pace at which scholarship and research can now take place thanks to technology and e-collegiality; and the rise in the number and complexity of formats in which information can be contained and stored have meant that numerous communities and individuals are challenged to find ways to effectively send, store, preserve, exchange, and migrate information in the electronic environment. While libraries have usually led the way in the presentation and storage of information in the print environment, their place in this new electronic age has yet to be determined and measured.

The development of the MARC format for the exchange of information between and among computers in the early 1970s, along with the development of AACR to standardize the use of MARC, helped libraries to take advantage of the power of computers to access and organize information, and to electronically present and display information to their public. In this new environment of CDs, DVDs, aggregator databases, full text, data sets, PDFs, JPEGs, TIFFs, OCR, and streaming media (to name just a few), and with scholarly and commercial entities constructing their own information/metadata standards to deal with the challenges and problems of electronic interchange and storage, libraries are often left out of the loop, if not totally forgotten. Does the MARC format have a place in this new information age? Is it outdated, archaic, too grounded in the print/card-catalog era? Is it robust and dynamic enough to describe and document the current information explosion of formats, containers, data, hypertext, and media, as well as future technological information packages? Are libraries and librarians exploring and experimenting with the MARC format in order to make it more viable, useable, and interoperable for today's information needs?

This Library Hi Tech issue presents case studies, experiments, and opinions of the MARC format by a variety of individuals and organizations. In some cases, MARC has been retooled and revamped for specific projects; in others, MARC has been transformed into something more dynamic and useable for the foreseeable future.

The first section (articles 1-4) deals with expanding the current uses of MARC by transforming it towards the needs of a particular project or group of users. Jacobs et al. describe the construction of a Perl program to add vernacular Russian (Cyrillic) characters to existing MARC records. Carini and Shepherd present a case study of how MARC and EAD have been used to address challenges in the archival area. Smith et al. discuss the evolution of MARC into a particular brand they call ENC MARC, to assist them with the challenges of cataloging materials for the Eisenhower National Clearinghouse. Andresen describes the development of the danMARC2 format, as well as the future of MARC in an XML world.

The second section (articles 5-7) discusses the development and uses of the Metadata Encoding and Transmission Standard (METS) at the Library of Congress. Cundiff provides an overview of the development and uses of METS, as well as a short primer and relevant issues for the future. Proffitt discusses the current uses of METS at the Research Libraries Group. Yee and Beaubien describe the challenges of converting METS-encoded content for use in educational technology environments, through the IMS content package (IMS-CP).

The third section (articles 8-9) discusses the development and uses of the Metadata Object Description Schema (MODS) at the Library of Congress. McCallum provides an introduction to MODS, a MARC 21-compatible XML schema for descriptive metadata. Guenther discusses how to apply MODS as a metadata schema, as well as guidelines and current applications of the standard.

Section four (articles 10-11), which for space reasons will be published in the next issue of Library Hi Tech along with sections five and six, presents information on MARCXML and the use of XML to transform the MARC standard for future use. Keith describes the development of the MARCXML schema, and the toolkit and tutorial developed at the Library of Congress for its use. Carvalho et al. provide details regarding the transformation of MARC into an XML structure, its advantages, and how it can be a part of a more complete bibliographic framework.

Section five (articles 12-14) presents case studies and discussions of repurposing or re-examining the MARC format in light of metadata management and workflow. Lubas et al. discuss the metadata practices developed at MIT for the OpenCourseWare Project. Farb and Riggio present information on the challenges of constructing a metadata schema for managing electronic resources. Kurth et al. provide interesting documentation regarding the mapping, transformation, and manipulation of metadata in digital collections at Cornell University.

Finally, section six (articles 15-16) presents opinions on the future of the MARC format, and what is really needed in a metadata standard for information organization and description in the future. Coyle performs a "thought experiment" on the concept of a bibliographic record based on the functional requirements for bibliographic records (FRBR), within the context of a flexible, functional library systems record structure. Tennant, a well-known futurist and writer on the complexities and failings of the MARC format, discusses the new kind of metadata infrastructure that he envisions libraries needing in the future.

Will MARC be a viable metadata format for the future? In terms of its continued use and availability in library OPACs throughout the world, yes; in terms of its future competing against the likes of Dublin Core (DC), encoded archival description (EAD), the text encoding initiative (TEI), instructional management systems (IMS), and the scores of other current and developing metadata standards, perhaps. As this special issue illustrates, METS, MODS, and MARCXML were developed to carry the advantages of the MARC format into the digital environment. Whether they are used or accepted by the library community will depend on marketing, instruction, training, and their eventual use in library digital projects. In other words, the success of these MARC-based metadata schemas depends on libraries and librarians themselves. In that sense, the "ball is in our court".

Theme articles

Cyril: expanding the horizons of MARC 21
Jane W. Jacobs, Ed Summers and Elizabeth Ankersen

The authors: Jane W. Jacobs is Assistant Coordinator, Catalog Division, Queens Borough Public Library, Jamaica, New York, USA. Ed Summers is Applications Programmer, Follett Library Resources. Elizabeth Ankersen is Senior Librarian, Queens Borough Public Library, Jamaica, New York, USA.

Keywords: Languages, Russian, Records management, Libraries

Abstract: Describes the construction of the authors' Perl program, Cyril, to add vernacular Russian (Cyrillic) characters to existing MARC records. The program takes advantage of the ALA-LC (Library of Congress, 1997) standards for Romanization to create character mappings that "de-transliterate" specified MARC fields. The creation of Cyril raises both linguistic and technical issues, which are thoroughly examined. Concludes by considering the implications for cataloging and authority control standards as we move to a multilingual, multi-script bibliographic environment.

Received August 2003; revised September 2003; accepted November 2003.

Special thanks to Andy Lester, who provided the Perl solution to the DRA extract problem; LC Slavic expert Pamela Perry and the members of SLAVLIB, who helped with the list of Russian names; Galina Pershman, Olga Davydova and Galina Meyerovich, native Russian speakers who reviewed the output; Fred Fishel, David Hill, Malabika Das, Syed Rasool and Desire Simmons, IT colleagues who helped Cyril through many technical difficulties; and Stuart Rosenthal, who championed the project at the Queens Library. © Jane W. Jacobs, Ed Summers and Elizabeth Ankersen.

Electronic access: The Emerald Research Register for this journal is available at www.emeraldinsight.com/researchregister. The current issue and full text archive of this journal is available at www.emeraldinsight.com/0737-8831.htm

Library Hi Tech, Volume 22, Number 1, 2004, pp. 8-17. Emerald Group Publishing Limited, ISSN 0737-8831

While other formats vie to replace it, MARC will still be with us for some time to come. Few of us can truthfully say that we have exploited the full potential of the MARC standard; when we try to use MARC to its fullest, we may be surprised at how far we can go. The newly created program Cyril demonstrates both the untapped potential of some of our existing standards and some of their striking shortcomings. Cyril also highlights how open source software such as Perl and the MARC::Record extension may be used to enhance and leverage existing MARC data in new ways.

The MARC 21 specifications for character sets provide for multiple non-Roman alphabets under MARC-8 coding: Arabic, Chinese, Japanese, Korean, Hebrew, Greek and Cyrillic. OCLC offers Chinese, Japanese and Korean (CJK) and Arabic scripts, but not Cyrillic or Hebrew. RLIN, on the other hand, offers CJK, Hebrew, Cyrillic and Arabic. MARC 21, at least theoretically, allows for records in the universal coded character set (UCS/Unicode) (Library of Congress, n.d.a). Integrated online library system (IOLS) vendors all claim to be on the road to Unicode, but display and input capabilities, in many cases, remain vaporware. Meanwhile, few libraries are able to mount multi-script catalogs that cover even the entire MARC-8 range. Despite the availability of Cyrillic vernacular input in RLIN, it is not widely used: of over one million Russian-language records in RLIN, fewer than 5,000 contain Cyrillic vernacular characters.

Why Cyrillic?

From a customer service perspective, a significant and increasing Russian-speaking customer population makes enhancing records with Cyrillic vernacular a priority. At the Queens Borough Public Library the multilingual catalog is already a reality and multi-script capabilities are gaining ground. Over 90 languages are represented in the library's collection, and substantial collections exist in more than 30. The current DRA Classic system is capable of displaying all the MARC 21 character sets, and the Queens Library is already providing records in Chinese, Japanese, Korean, Arabic, Persian and Urdu vernacular.

As noted above, MARC 21 allows for the storage of vernacular Cyrillic and Hebrew; however, OCLC's WorldCat database has not yet provided this input facility. Even if the input capability instantly became available, any library with a substantial collection of Russian-language materials would face the question of how to deal with its back file. The Queens Library collection contains more than 13,000 Russian-language records, and the thought of setting Slavic catalogers to revamping Russian-language catalogs manually seems, at best, impractical. And is it really necessary? The idea of "de-transliteration" is not new: OCLC's Arabic software offers a field-by-field algorithm for adding vernacular equivalents to existing transliterated Arabic-language MARC fields. Could one not just "de-transliterate" Russian in batch mode? Ms Ankersen, Queens Library's Slavic cataloger, has long believed that a well-thought-out program could provide the vernacular Cyrillic through "de-transliteration": the bibliographic information is all there, albeit in transliteration. On that premise Cyril was born.

There are several linguistic reasons why Russian should be the best candidate for a machine conversion. Although the origin of the Cyrillic alphabet is obscure, it is closely related to the Greek alphabet and shares some important characteristics with Latin scripts. Russian:
- is consistently written in Cyrillic;
- uses a limited number of characters (fewer than 40);
- includes vowels in written form;
- uses the same direction of writing as Latin scripts (left to right);
- uses capital and small letters; and
- uses the same punctuation as Latin scripts.

Record selection and character mapping

In setting up Cyril the authors considered what conditions must be present in a MARC record to ensure its suitability for "Cyrillicization". An obvious disqualification is the presence of an 880 field, indicating that vernacular characters are already present: if field 066 or 880 is present, Cyril will not process the record. Cyril also deliberately excludes non-Russian records, and will not process any record whose fixed-field language code (008 positions 35-37) is not "rus". Other Slavic languages, including Ukrainian, Belarusian, Bulgarian and Serbian, all use the Cyrillic alphabet, but with slightly different transliteration schemes, and a number of non-Slavic languages have adopted the Cyrillic alphabet and also apply different Romanizations. This restriction follows the authors' experience with OCLC's Arabic de-transliteration algorithm, which is generally quite reliable for Arabic but has bugs when applied to Persian or Urdu: although Urdu and Persian are written in the same script, they are not transliterated in precisely the same way.

MARC 21 character sets and their equivalent coding are laid out quite conveniently on the Library of Congress Web site under MARC 21 Specifications for Record Structure, Character Sets, and Exchange Media (Library of Congress, n.d.b). The authors began working with the Basic Cyrillic character set. Deep into the project, they realized that it is also necessary to invoke the Extended Cyrillic character set. For the most part the Extended Cyrillic set is needed for representing Slavic languages other than Russian (Ukrainian, Bulgarian, Belarusian, etc.) or non-Slavic languages which use Cyrillic characters. However, the Russian letter "ë" can be invoked only from the extended character set: in Russian it is not constructed with a combining umlaut. Although relatively uncommon, it is essential in correctly representing a few common words such as "eë" (her).

The American Library Association-Library of Congress (ALA-LC) standards for Romanization have been used routinely by American libraries, and are also conveniently available on LC's Web site. Hence this standard provides the obvious source for reversing the transliteration process.
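The record-selection tests just described reduce to a few lines of Perl. The following is a minimal sketch, not Cyril's actual code, using the MARC::Batch and MARC::Record modules mentioned later in this article; the file name is illustrative:

    use strict;
    use warnings;
    use MARC::Batch;

    # Read a file of extracted USMARC records (file name illustrative).
    my $batch = MARC::Batch->new('USMARC', 'extract.mrc');

    RECORD: while (my $record = $batch->next()) {
        # Reject records that already carry vernacular data.
        next RECORD if $record->field('066') or $record->field('880');

        # Reject any record whose fixed-field language code
        # (008 positions 35-37) is not Russian.
        my $f008 = $record->field('008');
        next RECORD unless $f008 and substr($f008->data(), 35, 3) eq 'rus';

        # ... the record is a candidate for "Cyrillicization" ...
    }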

Table I Mapping from ALA-LC romanization to Cyrillic. Unicode code points (upper case/lower case) are as given in the original table; the original's MARC-8 G0 code columns are not reliably recoverable in this copy and are omitted.

Roman (English)   Cyrillic   Unicode (upper/lower)   Notes
Bb                Б б        0411/0431               Simple transposition
Vv                В в        0412/0432               Simple transposition
Gg                Г г        0413/0433               Simple transposition
Dd                Д д        0414/0434               Simple transposition
Ll                Л л        041B/043B               Simple transposition
Mm                М м        041C/043C               Simple transposition
Nn                Н н        041D/043D               Simple transposition
Oo                О о        041E/043E               Simple transposition
Pp                П п        041F/043F               Simple transposition
Rr                Р р        0420/0440               Simple transposition
Ff                Ф ф        0424/0444               Simple transposition
Yy                Ы ы        042B/044B               Simple transposition (never an initial letter)
Ee                Е е        0415/0435               Simple transposition
Ë ë               Ё ё        0401/0451               Extended Cyrillic character set (umlaut e); never an initial letter
Ė ė               Э э        042D/044D               Superior dot
Ii                И и        0418/0438               Simple transposition
Ī ī (b)           Ӣ ӣ        04E2/04E3               Not available in MARC-8 (macron)
Ĭ ĭ               Й й        0419/0439               Breve
Zz                З з        0417/0437               Simple transposition
Zh zh             Ж ж        0416/0436               See note on "h" below
Kk                К к        041A/043A               Simple transposition
Kh kh             Х х        0425/0445               See note on "h" below
Ch ch             Ч ч        0427/0447               See note on "h" below
Ss                С с        0421/0441               Simple transposition
Sh sh             Ш ш        0428/0448               See note on "h" below
Shch shch (a)     Щ щ        0429/0449               See note on "h" below
Tt                Т т        0422/0442               Simple transposition
Aa                А а        0410/0430               Simple transposition
Uu                У у        0423/0443               Simple transposition
T͡S t͡s             Ц ц        0426/0446               See note on ligatures below
I͡U i͡u             Ю ю        042E/044E               See note on ligatures below
I͡A i͡a             Я я        042F/044F               See note on ligatures below
I͡E i͡e (b)         Ѣ ѣ        0462/0463               Extended Cyrillic character set
″ (02BA)          ъ          044A                    Hard sign; ignore the possibility of a capital
′ (02B9)          ь          044C                    Soft sign; ignore the possibility of a capital

The "h" combinations: the common denominator here is the "h". There is no letter "h" in Russian, so a transliterated "h" always modifies the letter before it.
Ligatures: the ligature marks (appearing around the first letter in the combination, e.g. t͡s) must alert the program to the different transliteration.
Notes: (a) The four-letter combination "shch" is the single letter "Щ", but the reverse combination "chsh" would be the two letters "чш". (b) These are used only in old orthography and are considered obsolete. Cyril ignores them. Luckily they are rare!

The first step was to develop a mapping from transliteration back to Cyrillic. The characters are divided into classes by the complexity required for their interpretation. Capital and small versions of the same letter are shown together because they consistently behave in the same way. The first and most obvious class is the simple transpositions: for example, B = Б and b = б. Next are the vowels. Most Russian vowels are represented both by individual Latin vowels and by vowels plus diacritical marks: for example, E = Е and e = е, but Ė = Э and ė = э. Similarly, some consonants are used both individually and in combination with others to represent different Cyrillic letters: for example, S = С and s = с, but Sh = Ш and sh = ш, and Shch = Щ and shch = щ. Perhaps the most complicated metamorphoses are those characters constructed of two Latin characters connected by ligatures: for example, T = Т and t = т and S = С and s = с, but T͡S = Ц and t͡s = ц (see Table I).

The final key part of the transliteration mapping is punctuation and numerals. This is fairly straightforward: the MARC 21 coding for punctuation is nearly the same in the Latin and Cyrillic character sets. Brackets, however, are a notable exception. Coding for the Latin left and right brackets produced a Щ (Shch) or ш (sh), respectively, in Cyrillic. This strange behavior took some time to account for and work around. After some research the authors discovered that a shift either from Basic Cyrillic to the Extended Cyrillic character set or back to Basic Latin was required. The left and right brackets (Unicode 005B and 005D, respectively) appear in Basic Latin (G0) as 5B and 5D and in Extended Cyrillic (G0) as 5B and 5D; MARC 21 reuses 5B and 5D in Basic Cyrillic (G0) for capital Shch (Щ) and small sh (ш). The shift to Extended Cyrillic was difficult to accomplish in the first place and, worse, did not seem to work anyway for the brackets (at least in DRA Classic). The authors' examination of several RLIN records with brackets confirms that the common practice is to make the shift back to Basic Latin. This seems to be a flaw inherent in the design of MARC 21 which should be corrected; until it is fixed, it is worth highlighting so that other developers may be forewarned.

Certain characters in the Latin alphabet are never used alone to represent Cyrillic characters in transliteration: "C", "H", "J", "Q", "W" and "X". Despite the caution taken to select only Russian records for Cyrillicization, western words, phrases and names have inevitably crept into Russian publishing and hence into MARC records. The appearance of these characters signals a mixture of English and Russian which needs manual review. For example:

Davis, Stephen R., 1956-
Learn Java now. Russian
Programmirovanie na Microsoft Visual Java++ / Stefen R. Devis; [perevod s angliiskogo pod red. S.V. Isakova].

Here the Cyrillic title should read "Программирование на Microsoft Visual Java++ . . .", not a version in which "Microsoft Visual Java" has itself been forced into Cyrillic. Cyril is designed to reject such a mixture of Russian and English for conversion: in the above example, Cyril would refuse to "de-transliterate" the entire 245 field. It will similarly refuse to operate on any field which contains "C", "H", "J", "Q", "W" or "X".
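The behavior described above can be approximated with an ordered substitution list in which the longest romanized token is always tried first, so that "shch" is consumed before "sh" and "sh" before "s", with the excluded-letter test applied only after the digraphs have been consumed. A minimal sketch with a deliberately tiny subset of the mappings; Cyril's real tables, MARC-8 handling and ligature logic are considerably larger:

    use strict;
    use warnings;
    use utf8;

    # Ordered mappings: longest romanized token first. Illustrative subset.
    my @map = (
        [ 'Shch' => 'Щ' ], [ 'shch' => 'щ' ],
        [ 'Sh'   => 'Ш' ], [ 'sh'   => 'ш' ],
        [ 'Ch'   => 'Ч' ], [ 'ch'   => 'ч' ],
        [ 'Zh'   => 'Ж' ], [ 'zh'   => 'ж' ],
        [ 'Kh'   => 'Х' ], [ 'kh'   => 'х' ],
        [ 'S'    => 'С' ], [ 's'    => 'с' ],
        [ 'K'    => 'К' ], [ 'k'    => 'к' ],
        [ 'A'    => 'А' ], [ 'a'    => 'а' ],
    );

    sub detransliterate {
        my ($text) = @_;
        for my $pair (@map) {
            my ($roman, $cyrillic) = @$pair;
            $text =~ s/\Q$roman\E/$cyrillic/g;
        }
        # Only after the digraphs are consumed: a surviving C, H, J,
        # Q, W or X signals non-Russian text, so refuse the field.
        return undef if $text =~ /[CcHhJjQqWwXx]/;
        return $text;
    }

With this ordering, "Kasha" comes out as "Каша", while a field containing "Java" is refused outright because of the surviving "J", mirroring Cyril's refusal behavior.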


One fairly common set of exceptions to this rejection premise involves Roman numerals, which appear with some frequency in Russian titles. Although the authors experimented with an algorithm to handle Roman numerals, they ultimately eliminated it as impractical. A major hindrance to dealing effectively with Roman numerals is the existence of two real Russian words: и (and), transliterated as "i", and в (in, into, at, etc.), transliterated as "v". It is impossible for the program to decide whether "i" and "v" represent Russian words or parts of Roman numerals.

Which fields should be de-transliterated? The available standards are not especially enlightening on this point. Any MARC field for which $6 is defined is a potential candidate for an 880 field and hence a conversion. This includes some rather surprising fields such as 020. Although some character sets include numerals different from those used in the Latin alphabet, Russian does not; an 880 field corresponding to the 020 in Cyrillic would therefore be nonsensical. Common practice is to provide vernacular characters for the following fields: 1XX, 245, 246, 250, 260, 440, 490, 505 and 8XX, and sometimes 6X0 and 7XX. However, this requires some cataloger judgment, which must be compensated for in a batch program. The authors quickly discovered that some subfields should not be de-transliterated. A good example is 260 $c, which commonly contains a copyright date such as "c2002": Cyrillization would fail because of the presence of the non-Russian letter "c". In practice catalogers simply reproduce $c of the 260 in the corresponding 880 field. Consequently the authors also created a "reproduce" option in Cyril which reproduces certain subfields without change in the corresponding 880. Cyril is not prescriptive about which fields or subfields it de-transliterates and reproduces; the user is free to choose fields at the command line. The only artificial intelligence programmed into Cyril is the choice to skip over certain fields when the record is for a work in translation. The choices for Queens Library records were as follows.

De-transliterate:
100† Main Entry – Personal Name – Not Repeatable: $a, $q
110† Main Entry – Corporate Name – Not Repeatable: $a, $b
111† Main Entry – Conference Name – Not Repeatable: $a, $e
130 Main Entry – Uniform Title – Not Repeatable: $a, $p
245 Title – Not Repeatable: $a, $b, $c, $p
246 Alternate Title – Repeatable: $a, $b, $p (do not transliterate for second indicator 1 or 5, or if the first subfield is $i)
250 Edition – Not Repeatable: $a
260 Publication Information: $a, $b
440 Traced Series Title – Repeatable: $a, $p
490 Untraced Series Title or Traced Differently – Repeatable: $a
505 Contents Note – Repeatable: $a, $t
700† Added Entry – Personal Name – Repeatable: $a, $c, $q, $t, $p
710† Added Entry – Corporate Name – Repeatable: $a, $b, $t, $p
711† Added Entry – Conference Publication – Repeatable: $a, $e, $t, $p
730† Added Entry – Uniform Title – Repeatable: $a, $p
740† Added Entry – Analytical Title: $a
800 Series Entry – Personal Name/Title – Repeatable: $a, $c, $q, $t, $p
830 Series Entry – Authorized Form of Title – Repeatable: $a, $p
† Do not transliterate if an 041 field is present.

Reproduce:
Dates associated with personal names: 100 $d, 700 $d, 800 $d
Number of part/section of a work: 245 $n, 440 $n, 700 $n, 710 $n, 730 $n, 800 $n, 830 $n
Date of publication: 260 $c
Volume number of a series: 440 $v, 490 $v, 800 $v, 830 $v
Miscellaneous information: 505 $g

It is theoretically possible to "de-transliterate" the 6XX fields:
600 Subject Entry – Personal Name – Repeatable
610 Subject Entry – Corporate Name – Repeatable
611 Subject Entry – Conference Publication – Repeatable
630 Subject Entry – Uniform Title – Repeatable

However, the authors deemed it unwise to try. Whereas it would work well with names like:

610 10 Kommunisticheskaia partiia Sovetskogo Soiuza
610 10 Коммунистическая партия Советского Союза

it did not work so well with:

600 10 Shakespeare, William, $d 1564-1616 $x Appreciation $z Russia

Personal names

Personal names presented a special challenge. The name by which a person is "commonly known" is sometimes difficult to discern in familiar scripts, but considerably more so in transliteration. Transliteration of names by their owners is generally done phonetically, without regard for ALA Romanization standards; furthermore, it is not even consistent across languages in Latin alphabets. In War and Peace, Пьер Безухов (Pierre Bezukhov) experiments with a French transliteration of his name, variously as "Besouhoff" and "Besuhof", to satisfy the requirements of his Masonic numerology (Tolstoĭ, 1995). There are some striking examples of famous Russians whose names are known by an old form of transliteration whose spelling has been "grandfathered" in. Tolstoy himself provides a good example. The correct (modern) transliteration would be:

Tolstoĭ, Lev, $c graf, $d 1828-1910
Толстой, Лев, $c граф, $d 1828-1910

as opposed to the name by which he is commonly known, and the LC authority form:

Tolstoy, Leo, $c graf, $d 1828-1910

There are also more modern examples, such as Boris Yeltsin:

El′tsin, Boris Nikolaevich, $d 1931-
Ельцин, Борис Николаевич, $d 1931-

as against:

Yeltsin, Boris Nikolayevich, $d 1931-

Obviously reverse transliteration will give a false result here. After consulting experts in the field, the authors compiled a list of these "non-conformers" and established a subroutine to provide the correct Cyrillic form of their names. A similar problem crops up with corporate names, for example Institut istorii iskusstv (Moscow, Russia). Here there is no problem with the name itself, but the geographic qualifier must be revised to produce accurate results:

Moskva Rossiia
Москва Россия

as against:

Moscow Russia

The same thing can happen with non-Russian names which appear in Russian-language records. Shakespeare (who himself spelled his name in different ways) and Jeffrey Archer appear variously as:

Vil′iam Shekspir
Uil′iam Shekspir

and:

Dzheffri Archer
Dzheffari Archer

Hence the authors decided to restrict Cyril to operating on the 1XX and 7XX fields only when field 041 was not present. With the help of colleagues from the Library of Congress and from the SLAVLIB listserv, the authors established a list of almost 100 names of prominent Russians whose authorized forms do not correspond to the ALA-LC transliteration of their names. The authors included a loop to seek out and replace these names with their correct Russian form. They also provided separately for changing Moscow and Russia to their correct Cyrillic forms.

Housekeeping

Cyril takes care of a few housekeeping details. The fixed field Modified Record (008/38) is automatically set to blank. The 066 field is set to show "$c (N" in accordance with MARC 21 requirements. Cyril adds a Local Use (940) field to show which records it has operated upon: "$a Perl/Cyrillic". The log file logs records rejected as non-Russian (008/35-37 ≠ "rus"), records which already contain Cyrillic characters (066 or 880 present), and failures resulting from the presence of a non-Russian character.
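A sketch of these housekeeping steps with MARC::Field follows, including the $6 linkage between a romanized field and its new 880 counterpart. It assumes the $record from the selection loop and the detransliterate() routine sketched earlier; the occurrence number "01" is an illustrative placeholder, and it is only a sketch of the mechanics, not Cyril's own code:

    use MARC::Field;

    # 008/38 (Modified record): set to blank.
    my $f008 = $record->field('008');
    my $data = $f008->data();
    substr($data, 38, 1) = ' ';
    $f008->update($data);

    # 066 $c (N: declare the Basic Cyrillic character set.
    $record->insert_fields_ordered(
        MARC::Field->new('066', ' ', ' ', 'c' => '(N'));

    # Local-use 940 flags records Cyril has touched.
    $record->insert_fields_ordered(
        MARC::Field->new('940', ' ', ' ', 'a' => 'Perl/Cyrillic'));

    # Pair the romanized 245 with a vernacular 880 through $6 linkage.
    # (The 245 itself must also receive a leading $6 "880-01", which
    # MARC::Record can only do by rebuilding the field.)
    my $f245  = $record->field('245');
    my $title = detransliterate($f245->subfield('a'));
    if (defined $title) {
        $record->insert_fields_ordered(
            MARC::Field->new('880', $f245->indicator(1), $f245->indicator(2),
                             '6' => '245-01',
                             'a' => $title));
    }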


Initial testing and data extracts

Even before finalizing Cyril's design, the authors began to consider its actual operation. Initial testing was done by hooking Cyril up to a Web interface, which allowed for easy testing without having to become instantly familiar with Cyril's command-line options. The interface allowed a user to specify a list of fields/subfields to de-transliterate and to select a file of MARC records to upload. Once uploaded, the modified MARC records were presented (in interpreted format) along with statistics on which fields were successfully processed (this Web interface is available at http://marcpm.sourceforge.net/cyril).

In its final form, Cyril operates on extracted data. In a large catalog this can prove to be a non-trivial issue. In the authors' DRA Classic system, MARC records can be extracted by language only via a rather backward algorithm: instead of extracting a specific language, one excludes all other languages (in the Queens Library catalog this meant over 90, and still left a few stragglers). One immediate problem was with those records where no language code was present, which included many brief records and nearly all sound recordings. Excluding record types (leader byte 06) c, i, j and g (scores, non-musical and musical sound recordings, and projected media, respectively) takes care of most of the items where language coding was often appropriately left blank. To exclude brief records the extract program can reject records that lack specific fields; the authors chose 300. Finally, some records were excluded based on encoding level (leader byte 17); the authors chose 5, 7 and O.

For purposes of extracting records, the authors divided the database into seven segments based on consecutive DRA system control numbers. The extracts contained 1,146, 1,271, 6,021, 2,850, 1,270, 241 and 585 records, the final two being deliberately short to serve as a first test of the full program. The authors tested the extracts by attempting to import them into CatME and, initially, failed miserably. The worst of the problem was that the DRA extract program inserted excess characters when it wrote the data to the file. With some helpful analysis from NYLINK and colleagues on the perl4lib listserv, the authors were able to clean up this problem with a Perl program. That done, only a small but deadly smattering of incorrectly coded and corrupted records were left that required manual clean-up. Although the difficulties were specific to Queens Library data and the DRA extract program, the fact that the authors encountered problems should come as no surprise: anyone who contemplates running Cyril should start planning and testing his/her extracts early. What seems like a straightforward operation can be fraught with complications.
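These exclusion rules are equally mechanical. The following sketch applies the same tests while reading an extract with MARC::Batch; it is an illustration of the filtering logic described above, not the DRA-side extract program itself, and the file name is assumed:

    use strict;
    use warnings;
    use MARC::Batch;

    my $batch = MARC::Batch->new('USMARC', 'all_records.mrc');

    while (my $record = $batch->next()) {
        my $leader = $record->leader();

        # Leader/06: skip scores (c), non-musical (i) and musical (j)
        # sound recordings, and projected media (g), where the
        # language code is often legitimately blank.
        next if substr($leader, 6, 1) =~ /[cijg]/;

        # Skip brief records: require a 300 (physical description).
        next unless $record->field('300');

        # Leader/17: skip encoding levels 5, 7 and O.
        next if substr($leader, 17, 1) =~ /[57O]/;

        # ... keep the record for the extract ...
    }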

Running the program – try this at work!

Cyril is a Perl program which may be used in the Windows, Mac or Linux/Unix environment. If you are working in Windows, you will most likely need to download and install the Perl programming language; to do so, visit the Comprehensive Perl Archive Network (www.cpan.org). In addition, you will need to install the MARC::Record module, a Perl extension for working with MARC data, as well as the Cyril program itself. Both are available from the Cyril section of MARC/Perl (marcpm.sourceforge.net/cyril.html). You run Cyril from the command line, using the following options, which have been mentioned already (a parsing sketch follows this list):
-in. Specifies the file of original MARC data.
-out. Specifies a file to write the modified MARC data to.
-select. A repeatable option which indicates that you would like to build an 880 field for a selected MARC data element; each -select targets a specific field/subfield combination.
-reproduce. A repeatable option that indicates you would like the 880 to contain the particular field/subfield combination as is. The option is useful in situations where you want to transliterate a field and to copy certain subfields unchanged into the 880.
-log. Specifies a file to use for a log. If not specified, the log information is printed on the screen.
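The article does not show Cyril's own option handling; the following sketch shows how such repeatable command-line options are conventionally parsed in Perl with the standard Getopt::Long module, and is an assumption about the approach rather than Cyril's actual code:

    use strict;
    use warnings;
    use Getopt::Long;

    my ($in, $out, $log);
    my (@select, @reproduce);

    GetOptions(
        'in=s'        => \$in,        # original MARC data
        'out=s'       => \$out,       # modified MARC data
        'select=s'    => \@select,    # repeatable: build an 880 for these
        'reproduce=s' => \@reproduce, # repeatable: copy these as is
        'log=s'       => \$log,       # log file; default is the screen
    ) or die "usage: cyril.pl -in FILE -out FILE [-select 245abpc ...]\n";

With this style, each -select=245abpc on the command line pushes another field/subfield specification onto @select.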

For example, at Queens Public Library the following approximates the options used to run Cyril in a Windows environment:

C:\Record-1.25>perl cyril.pl -in=FIXED1.fil -select=100aq -reproduce=100d
  -select=110ab -select=111ae -select=130ap -select=245abpc -reproduce=245n
  -select=246abp -select=250a -select=260ab -reproduce=260c -select=440ap
  -reproduce=440nv -select=490a -reproduce=490v -select=505at -reproduce=505g
  -select=700aqtp -reproduce=700dn -select=710abtp -reproduce=710n
  -select=711aetp -select=730ap -select=740a -select=800aqtp -reproduce=800dnv
  -select=830ap -reproduce=830nv -out=Cyrized1.dat -log=logCyrized1.txt

Here the input file is "FIXED1.fil", the output file is "Cyrized1.dat" and the log file is "logCyrized1.txt". The job statistics (and fatal errors) are reported on screen; it is easy to preserve these statistics by copying from the screen to a text file.

An important consideration before running Cyril on your data and loading the results back into your catalog is back-up. The authors were careful to save the original extract files so that the database could be restored to its original condition if unforeseen problems arose. The authors did extensive testing of a few records at a time. Once satisfied that Cyril was working correctly, a test database was set up for large batches of the newly-enhanced records. When that proved unfeasible, the authors had to "fly blind" and load the Cyrillicized records directly into the database, overlaying their original counterparts. By virtue of hard work and luck this approach worked: running the DRA program Check MARC Analysis showed the Cyrillic records to be without corruption. Different IOLS offer different validation and analysis tools; it is essential to plan how you will check your newly-enhanced records and to select the appropriate tool.

Results

The final results were impressive:
- 13,099 records enhanced; and
- 50,193 880 fields added.

Despite fears about Anglicized Russian names, Cyril made only 187 name conversions in the database: 175 in 100 fields and 12 in 700 fields. This relatively small number nevertheless represented many of the most important figures in Russian culture, so it was well worth getting their names right. A total of 285 records were not processed because they were not Russian or already contained 066 or 880 fields.

Regrettably, Cyril also generated 3,500 "errors". These are more accurately regarded as "refusals" or "rejections": the 3,500 errors logged are 3,500 fields that were rejected for transliteration based upon the appearance of unacceptable characters such as "C", "c", "H", "h", "J", "j", "Q", "q", "W", "w", "X" or "x". This reflects the design philosophy that Cyril should perform transliteration only where it would be relatively certain to produce correct results, and reject situations where results would be questionable or wrong.

Cyril's true mistakes are not well documented; short of reviewing all 13,099 records, documenting them would be impossible. The testing and sampling so far, however, has confirmed that Cyril's performance was quite reliable. Cyril revealed far more examples of incorrect coding in the transliterated records than actual failures. One of the most subtle (and common) was the misuse of the hacek/caron (UCS 030C) for the breve (UCS 0306). Although Queens Library staff are well aware of the correct usage, it is difficult to spot in member-contributed OCLC copy. The only real bug discovered post-conversion was with repeatable fields: when Cyril rejected a repeatable field (such as 700) for Cyrillization, it then failed to Cyrillicize subsequent iterations of that field. This bug will be fixed in subsequent updates to Cyril.

Implications for MARC and its successors

MARC has been tuned up a bit in recent years to accommodate greater linguistic diversity. As mentioned above, MARC 21 recently authorized use of the universal coded character set (Unicode) and has issued some specific guidelines for such usage. If MARC is showing its age linguistically, it is at least partly because the industry for which it is a standard is still playing catch-up with its Unicode implementation. In the 040 field a $b has been added for coding the language of cataloging, in both bibliographic and authority records; this reflects the text portions of the record, such as notes and subject headings.


our Romano-centric practices. The question of whether to continue the practice of providing transliteration + vernacular has already been posed. MARC 21 allows input of records with only vernacular characters and no transliteration, but then fields must still be tagged with $6 880 + [sequential number]. Is transliteration necessary for those whose hardware/software restricts them to viewing only Latin alphabets? Is transliteration helpful to the native speaker, or just confusing? Problems quickly develop when one considers the implications for authority control. Theoretically, 880 is defined in MARC authorities. However, in other documentation MARC 21 Authority: LC Guidelines (the blue pages) specifically disallows its use by LC and Name Authority Cooperative (NACO) and Subject Authority Cooperative (SACO) participants (Library of Congress, n.d.a.). In practice, there are few if any systems that would handle correctly or even accept such a record. In any case, a multi-script authority record logically implies more than a single correct heading or 1XX field. This surely would strain the capabilities of MARC 21. An excellent effort at resolving this problem has been made at the Hong Kong University of Science and Technology Library (Lam and Kwok, 2002). Their strategy for dealing with multiple scripts and different Romanizations is well thought out and worth pursuing. It is worth noting, however, that their choice of XML format is probably not requisite. The same approach may well be technically feasible in MARC. Could we persuade the governors of MARC to accept it? Or better yet our vendors? Changes to MARC proceed slowly and cautiously at best, but neither are vendors rushing to implement XML products. Clearly improvements need to be made to MARC to accommodate multi-script data more logically. First and foremost, authority records need to be rethought with multiple script and transliteration forms of names in mind. Cyril’s list of headings for ‘‘LC conversion’’ represents a primitive form of authority control, and would be unnecessary if the national authority database contained the necessary information and local system authority control were up to the job. The fixed field coding should also be rethought to indicate, not only language, but

The obsolescence of these definitions for OPACs has resulted in inconsistent usage and made the coding virtually useless. The 066 field provides coding to show the character sets present and their directionality; it would not be entered in a Unicode record. Field 041 makes up for some of the deficiencies of the fixed field, and has just been changed to make its subfields repeatable. This manages language coding much better. Unfortunately, language coding implies a logical but inherently false equivalence with character sets. Serbian, for example, is nearly always written in Cyrillic characters, but occasionally appears in Latin characters. Punjabi is customarily written in Gurmukhi script, but sometimes in Arabic/Urdu script. MARC offers no convenient way to code the combination of language(s) and script(s) that can be present in one record. As a result, catalogers depend on the 546 note field to describe multiple languages and scripts. This data is human-legible and helpful to customers, but is inadequate for batch programs like Cyril.

880 tagging remains an inelegant, cumbersome way of handling non-Roman alphabets. The sequential numbering required in $6 to link 880s has proved to be only a minor hurdle for Cyril; however, it complicates the structure of MARC considerably. On the other hand, it is currently essential within OPACs to generate a correct vernacular display.
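A schematic pair of linked fields (record content invented for illustration) shows the mechanism: matching occurrence numbers in subfield $6 tie the Romanized field to its vernacular 880 counterpart, and the 880's $6 additionally carries the escape-sequence code for basic Cyrillic:

245 10 $6 880-01 $a Voina i mir / $c L.N. Tolstoi.
880 10 $6 245-01/(N $a Война и мир / $c Л.Н. Толстой.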

With the increasing globalization of cataloging, we will ultimately have to abandon our Romano-centric practices. The question of whether to continue the practice of providing transliteration + vernacular has already been posed. MARC 21 allows input of records with only vernacular characters and no transliteration, but such fields must still be tagged with $6 880 + [sequential number]. Is transliteration necessary for those whose hardware or software restricts them to viewing only Latin alphabets? Is transliteration helpful to the native speaker, or just confusing?

Problems quickly develop when one considers the implications for authority control. Theoretically, 880 is defined in MARC authorities. However, in other documentation the MARC 21 Authority: LC Guidelines (the blue pages) specifically disallow its use by LC and by Name Authority Cooperative (NACO) and Subject Authority Cooperative (SACO) participants (Library of Congress, n.d.c). In practice, there are few if any systems that would correctly handle, or even accept, such a record. In any case, a multi-script authority record logically implies more than a single correct heading or 1XX field. This surely would strain the capabilities of MARC 21. An excellent effort at resolving this problem has been made at the Hong Kong University of Science and Technology Library (Lam and Kwok, 2002). Their strategy for dealing with multiple scripts and different Romanizations is well thought out and worth pursuing. It is worth noting, however, that their choice of XML format is probably not requisite; the same approach may well be technically feasible in MARC. Could we persuade the governors of MARC to accept it? Or, better yet, our vendors? Changes to MARC proceed slowly and cautiously at best, but neither are vendors rushing to implement XML products.

Clearly, improvements need to be made to MARC to accommodate multi-script data more logically. First and foremost, authority records need to be rethought with multiple script and transliteration forms of names in mind. Cyril's list of headings for "LC conversion" represents a primitive form of authority control, and would be unnecessary if the national authority database contained the necessary information and local system authority control were up to the job. The fixed field coding should also be rethought to indicate not only language, but script and specific Romanization scheme. Ideally, we should reject 880 tagging in favor of making some or all fields repeatable in multiple scripts. Fields containing different iterations of the same information should still be linked, but we should reject the primary Roman/secondary non-Roman script approach that 880 fields imply, particularly where the script of the title cataloged is non-Roman.


Ironically, strong, sometimes inflexible, adherence to standards is what makes Cyril work in the first place. Without a standard Romanization, consistently followed, Cyril or any such conversion utility would be doomed to failure. On the whole, the authors' experience with MARC and its accompanying standards suggests that the stability and/or ossification of MARC (depending upon your perspective) is more often the result of our need to maintain a workable standard and a working OPAC than of any inherent inflexibility of MARC.

Looking forward, the authors hope to recreate Cyril for other Slavic languages and other MARC 21 scripts. XML files of character maps and names might provide an easy way to plug in the differing values needed for each language. Once the transition to UCS MARC 21 is made, the possibilities expand exponentially. It is even conceivable that a Cyril-like program could be written to provide transliteration of vernacular characters that had been entered directly. Someday, perhaps even soon, MARC will probably indeed be replaced, but we should not imagine that we must rush it out of existence or wait for its demise to think up and try out new ideas.
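As a hypothetical illustration of that idea (element and attribute names invented here, and only a few ALA-LC values shown), such a plug-in character map might be as simple as:

<charmap language="rus" scheme="ALA-LC">
  <map romanized="zh"   vernacular="ж"/>
  <map romanized="shch" vernacular="щ"/>
  <map romanized="iu"   vernacular="ю"/>
</charmap>

Swapping in a different map file would then be all that a Cyril-like converter needs to move from one Slavic language, or one Romanization scheme, to another.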

References

Lam, K.-T. and Kwok, L. (2002), "XML name access control metadata repository: an experiment at HKUST Library", Hong Kong University of Science and Technology, Hong Kong, available at: http://library.ust.hk/info/reports/xmlnac.html (accessed 26 August 2003).
Library of Congress (n.d.a), MARC 21 Specifications for Record Structure, Character Sets, and Exchange Media: Character Sets: Part 2 UCS/Unicode Environment, available at: www.loc.gov/marc/specifications/speccharucs.html (accessed 26 August 2003).
Library of Congress (n.d.b), MARC 21 Specifications for Record Structure, Character Sets, and Exchange Media: Character Sets: Part 3 Code Table Basic Cyrillic, available at: http://lcweb2.loc.gov/cocoon/codetables/4E.html (accessed 26 August 2003).
Library of Congress (n.d.c), LC Guidelines Supplement to the MARC 21 Format for Authority Data, available at: www.loc.gov/catdir/pcc/naco/trainers/marcblu02.pdf, 880, p. 1 (accessed 26 August 2003).
Library of Congress (1997), ALA-LC Romanization Tables, Library of Congress, Cataloging Distribution Service, Washington, DC, available at: http://lcweb.loc.gov/catdir/cpso/romanization/russian.pdf (accessed 26 August 2003).
Tolstoi, Lev (1995), Voina i mir, Karavella, St Petersburg, Tom III, chast' pervaia, Vol. XIX, p. 81.
Unicode Consortium (n.d.), Unicode Home Page: Code Charts, PDF version, Unicode Consortium, Mountain View, CA, available at: www.unicode.org/charts/ (accessed 26 August 2003).


Further reading

Gilliam, R. (2002), Unicode Demystified: A Practical Programmer's Guide to the Encoding Standard, Addison-Wesley, Boston, MA.
Library of Congress (1997), LC Guidelines Supplement to the MARC 21 Format for Authority Data, Library of Congress, Washington, DC, available at: www.loc.gov/catdir/pcc/naco/trainers/marcblu02.pdf (accessed 26 August 2003).
Library of Congress (2002), MARC 21 Concise Format for Authority Data, concise ed., Library of Congress, Washington, DC, available at: www.loc.gov/marc/authority/ecadhome.html (accessed 26 August 2003).
Library of Congress (2002), MARC 21 Concise Format for Bibliographic Data, concise ed., Library of Congress, Washington, DC, available at: www.loc.gov/marc/bibliographic/ecbdhome.html (accessed 26 August 2003).
Library of Congress (2002), MARC 21 Specifications for Record Structure, Character Sets, and Exchange Media, Library of Congress, Washington, DC, available at: www.loc.gov/marc/specifications/spechome.html (accessed 26 August 2003).



The MARC standard and encoded archival description

Peter Carini and Kelcy Shepherd

The authors
Peter Carini is Director of Archives and Special Collections at Mount Holyoke College, South Hadley, Massachusetts, USA, and a Part-time Lecturer in Archival Studies for the Simmons College Graduate School of Library and Information Science. Kelcy Shepherd is the Project Director, Five College Finding Aids Access Project, University of Massachusetts, Amherst, Massachusetts, USA.

Keywords
Archives, Descriptive cataloguing, Standards, Administrative data processing

Abstract
This case study details the evolution of descriptive practices and standards used in the Mount Holyoke College Archives and the Five College Finding Aids Access Project, discusses the relationship of EAD and the MARC standard in reference to archival description, and addresses the challenges and opportunities of transferring data from one metadata standard to another. The study demonstrates that greater standardization in archival description allows archivists to respond more effectively to technological change.

Library Hi Tech, Volume 22, Number 1, 2004, pp. 18-27. DOI 10.1108/07378830410524468.
Received 2 September 2003; revised 24 September 2003; accepted 7 November 2003.

The machine-readable cataloging (MARC) format should have a price on its head, according to Tennant (2002). In his article "MARC must die", Tennant decries MARC as antiquated and claims that it has outlived its usefulness; while he is perhaps a little premature in his call, he is very likely correct. Archivists have had mixed feelings about MARC since the development of the Archives and Manuscripts Control (MARC-AMC) format in 1983, in part due to its origin as a fundamentally bibliographic standard. Ironically, it is the adoption of the MARC-AMC format, and the resulting standardization of descriptive information, that may make it possible for some small archives to move into the new age of Extensible Markup Language (XML) fairly painlessly. This case study will discuss the MARC standard in the context of archival description, and follow the evolution of descriptive practices and use of standards at the Mount Holyoke College Archives and Special Collections. This evolution includes the automation of collection description using a database compliant with MARC-AMC, the export of finding aids from the database into HyperText Markup Language (HTML), and the procedures currently in use to convert records in the database from MARC binary files into Encoded Archival Description (EAD). In detailing these experiences, the study will demonstrate how greater standardization of archival information facilitates the ability to adopt new standards and technologies.


Archival description and the MARC standard

Archival description is a term used by archivists to denote the process of capturing, exchanging, and providing access to information about sets of historical records (Miller, 1990). MARC cataloging is just one aspect of this process. The primary descriptive tool used by archivists for decades has been a register or inventory, now ubiquitously, and somewhat erroneously, lumped under the very broad term finding aid. A finding aid is, in fact, any tool that assists with discovering and using archives and manuscript materials, and thus includes everything from repository guides to catalog cards describing individual items housed in a repository. Inventories and registers themselves are found in a variety of formats, but have become more regular as archivists adopt descriptive standards such as MARC and EAD. They also increasingly tend to represent collections, record groups or series. In other words, archives have moved away from the nineteenth-century manuscript library practice of describing individual items. The modern-day versions of these inventories and registers usually have a combination of standard data elements and narrative notes. The recommended minimum standard consists of a main entry or title, the dates covered in the materials, the extent or cubic/linear feet of the collection/record group, a biographical sketch or historical note (providing the user with some background on the creating entity), a note describing the scope and contents of the collection/record group, and a statement about arrangement (Tibbo and Meho, 2001). Other components may include a note on the provenance of the material, a list of series (subordinate parts, such as correspondence, writings, legal papers), descriptions of the series, and a container list detailing the contents of each box and folder.

Librarians developed MARC as the primary descriptive tool for creating metadata for books in online public access catalogs (OPACs). Archivists, recognizing its possibilities and having no electronic descriptive standard of their own at that time, adopted it mostly as a secondary method for describing collections, record groups and series, while maintaining paper finding aids as the primary descriptive tool. Archives commonly implemented MARC to gain representation in library catalogs and to facilitate subject searching and access, in the hope that students and faculty performing searches for books would stumble on descriptions of primary sources in the process. Prior to the advent of the Internet, it was the only cost-effective method for archives to disseminate descriptions of their collections on a national level. Some large repositories were able to produce published guides to their collections, but this was generally beyond the means of smaller institutions. In addition, the published guides soon became out-of-date, often even before they were actually printed and made available.



MARC has many limitations for archives, above and beyond the issues of granularity, extensibility, and technological obsolescence noted by Tennant. Some of its limitations are associated with its development as a format for the description of books, while others have more to do with the systems used to make it available to users. One of the main problems with MARC in the archival environment is that, while it serves the needs of collection-level description fairly readily, it does not lend itself as easily to the more detailed parts of the description. While it is possible in MARC to associate series-level records with each other and with the collection-level record, doing so is clunky in practice; the same is true for container lists. Problems associated with cataloging systems include limits on the number of fields and characters imposed by systems such as WorldCat, the Online Computer Library Center (OCLC) union catalog. Limiting the length of these records leads to difficulty in fully representing the collection in notes fields while still having enough fields for added entries and subject headings. Further problems arise with the display of archival MARC records in OPACs: the creator may be listed as the author, or the various note fields (scope and contents, biographical sketch, provenance, etc.) may all have the same generic label "note" (Spindler and Pearce-Moses, 1993). Despite these problems, and despite questions about users' abilities to actually find and interpret MARC records representing archival materials, archivists have worked hard to find ways to adapt their practices to the system or, more commonly, to adapt the system to their practices.


Automating the Mount Holyoke College archives

The Archives and Special Collections at Mount Holyoke College is small: about 8,000 linear feet of college records and manuscript collections and 9,000 volumes of rare books. The staff consists of 2.2 full-time equivalent professionals, supplemented by eight to ten undergraduate student workers and interns from the Simmons College Graduate School of Library and Information Science.


In 1995 the Archives and Special Collections, like most other repositories of primary sources, was at a crossroads. The Internet was a relatively new development. The college was in the process of forging a Web site that ran deeper than just a few pages, and there was a mandate to create a Web presence for the various areas within the college. At this point the department's Web page was just that: a single page that incorporated text from the archives' previous gopher site. Recognizing the potential for disseminating information about collections via this new tool, the Archives and Special Collections staff set out to enhance the Web site, with the specific goal of adding collection descriptions. At the time, what would become the EAD data structure standard had not yet been released for public beta testing (Pitti, 1998), so HTML was the solution broadly available to repositories interested in publishing finding aids online.

The finding aids for the collections were all in paper form, some typed on a typewriter and some created in a word processor. Even for the word-processed guides, almost no electronic files existed. The formats of the finding aids varied considerably, depending on the time period in which they were produced. Some of the earliest described more about how the archivist had decided to arrange the collection than about what the collection consisted of or documented. Within these descriptions there was little consistency of style or language. Users primarily gained access to the collections through a card catalog for which the shelf list had not been consistently maintained. There was no collection management database. Moving from a paper-based system of description to a purely electronic system via the Internet looked to be an enormous task, one that far exceeded the resources available.

In 1982, the Archives and Special Collections had begun to create MARC records for their manuscript collections, both for the local OPAC system and for upload to OCLC. There were about 330 descriptions of archival holdings in the OPAC in 1995; of these, about 230 represented collections. The remaining records represented individual items or sets of seemingly related record groups brought together to form an overview of certain institutional functions. The 230 collection descriptions were often incomplete, representing only those parts of the collection deemed most important by the Archives and Special Collections (mainly correspondence or diaries), even if the collection consisted of additional materials such as photographs, scrapbooks or other documents. Still, these records seemed to be the best starting point for the creation of electronic finding aids.

Given limited staff resources, it was imperative that the method for creating the online finding aids be integrated into the workflow and not require additional processing. Likewise, it was important that the data be created once and reused in as many ways as possible. Bowdoin College had such a model. After a visit to Bowdoin, the Mount Holyoke College Archives and Special Collections purchased a three-user license, network version of the Minaret database. Minaret, while not the most sophisticated archives management database on the market at the time, was both affordable and flexible. The database was populated by downloading the 230 partial records from the library OPAC. Archives staff and student employees then cleaned up the data and expanded descriptions over time. Staff, students, and interns continue to add new collection descriptions as new accessions are processed, or as old collections or record series are re-processed or re-described. Still in use, Minaret is based on MARC-AMC, a format that has since been eliminated with the implementation of format integration. A "finding aid" in the database consists of one or more MARC records, with the primary record at the collection or record-group level and additional linked MARC records representing series- and container-level information if necessary. Forms can be modified to hide the MARC structure from the user. This was important, since the Archives and Special Collections depends on student workers to perform minimal processing of collections and data entry. Teaching students how to work with the database is simple, but teaching the complexities of MARC would add an extra layer of work for the Archives and Special Collections staff. The database makes it possible for students to enter data unaware of MARC indicators, fields, and subfield delimiters. Yet the process creates 80 per cent of the MARC record, relieving Technical Services of all but the selection of subject headings and added entries.


Authority work is integrated into the workflow, with staff checking name authorities at the time the data is entered into the system. Students and staff enter data following guidelines outlined in Archives, Personal Papers, and Manuscripts (APPM), a content standard for archival MARC records (Hensen, 1989). While there are provisions for making Minaret accessible via the Web, this route was bypassed in favor of uploading collection-level MARC records to OCLC and then downloading them to the local OPAC. Making these records available in a national database and integrated into the OPAC, where students might find them while searching for published materials, was seen as an advantage. In addition, the search capabilities of the OPAC are greater than anything devised for Minaret. Series data and container lists are routinely created for collections that warrant this level of description. Though this information is also available in the MARC format in Minaret, it is not exported to the local or union catalog. Because Minaret is a simple, DOS-based, flat-file database, it is possible to "print" a form to a file. Forms can be created that consist of text and HTML markup surrounding database fields, and the resulting files can be exported directly to the Web server. Data from the database is also used to create HTML finding aids. This is where researchers can find the more detailed series- and container-level information not available in the library catalog, though the 856 field provides a link from the OPAC to the online finding aids.

As noted previously, MARC is not a perfect solution for archival description, and finding ways to include all the parts of a finding aid in a MARC record is difficult (though not impossible). The Archives and Special Collections did, however, derive unforeseen benefits from the adoption of this standard. Using MARC forced staff to standardize both the structure of the finding aid and the data used to describe the collections. MARC's stringent rules about what data should go into which field eliminated the duplication of data that had been prevalent in paper finding aids, and also compelled the staff to think about what kinds of data really belonged in a finding aid.

In most cases, the result was the inclusion of more data than had been collected previously. For instance, the physical size of the collection in linear feet had never been recorded prior to the adoption of MARC. Likewise, a series list had not been created in most cases. Even more importantly, the consistent adoption of APPM to structure the descriptive data led to major changes in the depth and breadth of the description of the collections. First, the process of evaluating the collections as described led to the realization of the problems and inconsistencies in early descriptions. It also brought about a re-evaluation of what level of description was necessary for which collections. While the extent of description varied within collections, there was no consistency behind the decisions about how much description was afforded to which collections. Second, the adoption of APPM as a finding aid content standard meant that dates, date forms, and data regarding the creator, the extent, and the physical size of the collection had to be consistent (although archives staff later discovered that they had not been as careful about consistent application as they had believed). The fact that the subject headings had to be drawn from or reflected in the scope and contents note (the 520 field) also led to changes in basic descriptive practices. Finally, due more to the use of student workers than to the effects of the adoption of MARC or APPM, information within the scope and contents note is now standardized as well. Because students wrote many of these as part of the re-description process, archives staff created and applied a standard structure for these descriptions. As a result, all of Mount Holyoke's scope and contents notes begin with a list of all the forms of materials in a collection. This is followed by a general statement regarding what is covered in the collection, after which the notable or unusual aspects of a collection are recorded; this is followed, if applicable, by a list of correspondents.

The addition of Encoded Archival Description (EAD)


The automation of archival description at Mount Holyoke College allowed the Archives and Special Collections to rapidly produce collection-level MARC records and publish complete finding aids in HTML on the Internet. It was not long, however, before the archivists of the Five College consortium (Amherst College, Hampshire College, Mount Holyoke College, Smith College, and the University of Massachusetts Amherst) began looking for a way to provide online access to finding aids from all of the institutions via a single, searchable interface. By this time, EAD Version 1.0 had been released and was being widely adopted by archivists eager to take advantage of the sophisticated searching capabilities made possible by this XML-based data structure standard. The desire of Five College history faculty for better access to information about the institutions' primary sources, and the growing acceptance of EAD in the archival community, led to the development of the Five College Finding Aids Access Project, a consortial EAD implementation project funded by the Mellon Foundation. The project's goals include the creation of both EAD-encoded finding aids and MARC records for uncataloged collections. Though many in the archival community questioned the need for continued MARC cataloging once full finding aids became readily available online, others argued that the benefit of attracting additional researchers by including MARC records in local and union catalogs outweighed the supplementary work involved in cataloging (Encoded Archival Description Working Group of the Society of American Archivists, 1999). This was certainly the case at the Five Colleges, where four of the institutions are private liberal arts colleges whose primary mission is undergraduate education. Undergraduate students often use the library catalog as their first approach to research, and may become aware of the existence of Five College primary sources only by discovering archival records in the catalog and linking to a finding aid on the Internet through the MARC 856 field. Though EAD is not currently considered to be a replacement for MARC records at the Five Colleges, new tools such as federated searching software and integrated library systems that accommodate metadata formats other than MARC may influence the need for archival MARC records in the coming years.

Because the archival finding aid can be unwieldy for display, some form of summary record, whether in MARC, the MARCXML schema, or another format, may be desirable even in these environments. Initially, project participants hoped to produce MARC records from EAD to make the workflow more efficient. This idea was eventually rejected because the two standards capture data at different levels of granularity, and because the majority of finding aids did not include authoritative forms of creator names or subject headings. Issues with the granularity of encoded data arose primarily in the subject headings. EAD encodes the entire term, including subdivisions, as a single element, so breaking each component into the proper v, x, y, or z subdivision needed for MARC was problematic. Generally, name authority work and subject analysis had not been done when the finding aids were created, so any record exported from EAD would not be complete and valid anyway. Other institutions had established procedures that utilized information from the EAD-encoded finding aid to create all or part of a MARC record. Transforming the EAD to a MARC text file that is pasted into cataloging software is an option described by Wisser and Roper (2003). Translation from MARC to EAD is less problematic, as authority work is commonly done as part of cataloging and the granular data elements can easily be concatenated in the conversion to EAD. A few archival institutions use tools developed to export some collection-level information into the corresponding EAD finding aid. While many archival repositories adopted MARC-AMC to create collection-level records for local and national union catalogs, the existence of complete finding aids in the MARC format, like those at Mount Holyoke, is unusual. To streamline the creation, modification, and maintenance of archival descriptions, it would be ideal if both could be produced from a single system where all editing takes place. The alternative requires updating a MARC record, a print finding aid, and an EAD-encoded finding aid whenever a collection accrues or other changes are made.
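The granularity mismatch is easiest to see side by side. In a schematic example (heading invented for illustration), EAD carries the whole heading, subdivisions included, as one string, while the corresponding MARC field separates the subdivisions into coded subfields whose boundaries cannot be inferred mechanically:

EAD:  <subject source="lcsh">Women--Education--Massachusetts--History</subject>
MARC: 650 _0 $a Women $x Education $z Massachusetts $x History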


MARC to EAD conversion

The Project Director and Mount Holyoke College Archivist originally planned to create forms for output of the Mount Holyoke finding aids into EAD, as was the procedure for exporting HTML finding aids from Minaret. Because Minaret is a flat-file system, the export would require numerous forms to accommodate areas of the finding aid where there is a one-to-many relationship, for example series listings and subject headings. The process for exporting HTML files from Minaret utilizes two to nine forms, each producing an individual file that requires merging. EAD export in this manner would be even more complex, because EAD encodes content rather than display: personal papers, institutional records, corporate manuscript collections, and conferences would each require different treatment. After mapping Minaret's MARC-based fields to the project's EAD template and realizing the complexity of export, it was clear that taking advantage of Minaret's built-in capacity to export MARC binary files would offer a more efficient conversion solution. Because the Project Director had explored the possibility of deriving MARC from EAD, project participants were aware of the existence of the MARC XML Document Type Definition (DTD) and of the tools available for conversion to this format. The project team was already in the process of developing the procedure in June 2002 when the Library of Congress announced the release of the MARCXML schema. This, combined with the fact that project staff were more familiar with DTDs, led to the use of the MARC XML DTD format instead. A variety of tools were then available to convert MARC files to the MARC XML DTD. The Project Assistant tested options written in Perl and in Java, including the MARC::XML Perl module (http://search.cpan.org/author/BBIRTH/MARC-XML-0.4/XML.pm) and the James application programming interface (www.bpeters.com/). The project team decided to use marc2xml.pl, the MARC to XML Conversion Utility script available from the Library of Congress (www.loc.gov/marc/marcxml.html), because it had the most functionality out of the box. After the team selected a conversion script, the Project Director mapped the MARC XML DTD elements to EAD.

Using the crosswalk, the Project Assistant developed an Extensible Stylesheet Language for Transformations (XSLT) stylesheet, shown in Figure 1, to convert the MARC XML DTD files to EAD. Most frequently used to render XML into HTML for display, XSLT can also be used for XML-to-XML conversion, as in this case. The assistant also wrote the Perl scripts necessary to automate the assignment of unique filenames and to insert the required entity references at the head of the resulting EAD file. (Entity references are not supported in the XSLT 1.0 data model, and therefore cannot be processed by the stylesheet.) The process that evolved consists of export from Minaret into MARC binary format, conversion of these files to the MARC XML DTD format, and transformation of the MARC XML DTD to EAD using an XSLT stylesheet. Perl scripts enable both steps of the conversion to occur with a single command, and allow batch processing of multiple files. The process produces a MARC XML DTD instance and an EAD instance, shown in Figures 2 and 3. The EAD finding aid is published on the project's Web site (http://asteria.fivecolleges.edu/), as shown in Figure 4.

The project team converted an initial group of 50 finding aids to EAD and reviewed the output. The Project Director checked the encoding for compliance with project guidelines and for validity against the EAD DTD. With the first batch of finding aids, it was clear that inconsistencies in the source data required cleanup before all of the finding aids could be converted in batch mode. Variances in punctuation from one finding aid to the next caused inconsistent output and complicated the reuse of certain data elements, such as the creator's personal name, which was also used, in direct order, to generate another element of the finding aid. Missing or incorrect indicators resulted in missing data in the EAD file, since the stylesheet tested for specific patterns to generate elements and attributes correctly. An abstract, useful in a searchable online environment because it allows the user to ascertain quickly the relevance of a collection to his or her research, was newly created for each collection. Once archives staff remedied these inconsistencies, the project team again converted the test group and made a few minor changes to the stylesheet to accommodate the newly standardized data. Batch processing of finding aids is now a simple matter, but one made possible by a month of initial development time on the procedure and two months' worth of data cleanup.
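The project's own Perl scripts are not reproduced in the article. A minimal driver in the same spirit, sketched here with stand-in tools (the later MARC::File::XML module, which emits MARCXML rather than the older MARC XML DTD the project used, and the xsltproc processor; "marc2ead.xsl" and the file names are placeholders), would chain the two steps for a batch of records:

#!/usr/bin/perl
# Sketch only, with stand-in tool and file names: convert each exported
# MARC binary record to XML, then transform the XML to EAD with XSLT.
use strict;
use MARC::Batch;
use MARC::File::XML ( BinaryEncoding => 'utf8' );

my $batch = MARC::Batch->new('USMARC', 'minaret_export.mrc');
my $n = 0;
while ( my $record = $batch->next() ) {
    my $base = sprintf('findingaid_%03d', ++$n);   # unique filename stem
    open my $out, '>', "$base.xml" or die $!;
    print $out MARC::File::XML::header(),
               $record->as_xml_record(),
               MARC::File::XML::footer();
    close $out;
    system("xsltproc marc2ead.xsl $base.xml > $base.ead.xml") == 0
        or warn "XSLT transform failed for $base.xml\n";
}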


Figure 1 Excerpt of XSLT stylesheet used to convert MARC XML DTD to EAD, showing handling of possible permutations of creator, title, and collection dates
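The printed figure is an image and only its caption survives here. Purely as a hedged illustration of the kind of branching the caption describes (element names follow the generic MARCXML convention rather than the project's exact DTD, and the mapping is a plausible sketch, not the project's stylesheet), a template handling the date permutations of the 245 field might read:

<xsl:template match="datafield[@tag='245']">
  <unittitle>
    <xsl:value-of select="normalize-space(subfield[@code='a'])"/>
    <!-- 245 $f carries inclusive dates and $g bulk dates; either may be absent -->
    <xsl:if test="subfield[@code='f']">
      <unitdate type="inclusive"><xsl:value-of select="subfield[@code='f']"/></unitdate>
    </xsl:if>
    <xsl:if test="subfield[@code='g']">
      <unitdate type="bulk"><xsl:value-of select="subfield[@code='g']"/></unitdate>
    </xsl:if>
  </unittitle>
</xsl:template>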

Conclusions

In detailing one institution's use of a MARC-based automation system to produce MARC, HTML, MARC XML DTD, and EAD outputs, this case study demonstrates the key role standardization plays in the ability to adapt to technological change. Traditionally, archivists have been skeptical of using standards, in part because the unique nature of archival and manuscript materials seems to require a fresh approach with each collection. As previously stated, this has begun to change with the adoption of data structure and content standards such as MARC, EAD, and APPM (Hensen, 1986; Sahli, 1986).


Figure 2 Excerpt of Beecher family papers finding aid in the MARC XML DTD format

In an unmediated online environment, it is becoming clear that standardized description will improve the usability of archival descriptions for new groups of users who may be inexperienced in archival research. The use of standards is also crucial if archival institutions are to respond easily to new technologies and user demands. MARC has never been an ideal standard for archival finding aids. Its adoption did, however, allow the Mount Holyoke College Archives and Special Collections to capitalize on new technologies (first HTML, now MARC in XML format and EAD), and may enable future conversion and reuse of data in ways that the archives staff have yet to foresee.

Use of an internationally supported data structure standard such as MARC also allowed the project team to benefit from the work of others, by using conversion tools created by the Library of Congress. Adherence to content standards is necessary for complete machine-processed conversion. If the parentheses surrounding the bulk dates of a collection appear inconsistently from one finding aid to the next, there is no human eye involved in the conversion process to remove the secondary parentheses that are applied in the transformation. The existence and consistent application of content standards at the time of finding aid creation make data cleanup unnecessary, or at least minimal.


Figure 3 Excerpt of Beecher family papers finding aid in EAD

collection’s finding aid into a Metadata Encoding and Transmission Standard (METS) record describing the digital surrogate of an item found in the collection. These conversions are possible for a technologically savvy non-programmer to do. This is a benefit in many archival institutions where extensive systems support isn’t available. Use of XML also creates an environment where it is easy to completely change the look and feel of hundreds of finding aids simply by changing a stylesheet, an advantage as archivists learn more about how novice users understand finding aids. Despite the reticence of archivists to adopt standard methods for the description of the materials in their care, the Mount Holyoke experience illustrates the need for not only following standard data structures, but also for regularizing the data that is put into the these structures. The early adoption of MARC and

These conversions are possible for a technologically savvy non-programmer to carry out, a benefit in the many archival institutions where extensive systems support is not available. Use of XML also creates an environment where it is easy to completely change the look and feel of hundreds of finding aids simply by changing a stylesheet, an advantage as archivists learn more about how novice users understand finding aids. Despite the reticence of archivists to adopt standard methods for the description of the materials in their care, the Mount Holyoke experience illustrates the need not only to follow standard data structures, but also to regularize the data that is put into these structures. The early adoption of MARC and adherence to APPM has meant that Mount Holyoke has been able to move quickly and easily to adopt new XML-based standards. In addition, the interoperability of XML has already shown itself useful in allowing this institution to leverage the data created to fulfill the requirements of one standard (MARC) by reusing it in a new standard (EAD).


Figure 4 Beecher family papers finding aid published on project Web site

References

Encoded Archival Description Working Group of the Society of American Archivists (1999), Encoded Archival Description Application Guidelines: Version 1.0, Society of American Archivists, Chicago, IL.
Hensen, S.L. (1986), "The use of standards in the application of the AMC format", American Archivist, Vol. 49 No. 1, pp. 31-40.
Hensen, S.L. (1989), Archives, Personal Papers, and Manuscripts: A Cataloging Manual for Archival Repositories, Historical Societies, and Manuscript Libraries, Society of American Archivists, Chicago, IL.
Miller, F.M. (1990), Arranging and Describing Archives and Manuscripts, Society of American Archivists, Chicago, IL.
Pitti, D.V. (1998), "Encoded Archival Description: the development of an encoding standard for archival finding aids", American Archivist, Vol. 60 No. 3, pp. 268-83.
Sahli, N.A. (1986), "Interpretation and application of the AMC format", American Archivist, Vol. 49 No. 1, pp. 9-20.
Spindler, R.P. and Pearce-Moses, R. (1993), "Does AMC mean 'archives made confusing'? Patron understanding of USMARC AMC catalog records", American Archivist, Vol. 56, pp. 330-41.
Tennant, R. (2002), "MARC must die", Library Journal, Vol. 127 No. 17, pp. 26, 28.
Tibbo, H.R. and Meho, L.I. (2001), "Finding finding aids on the World Wide Web", American Archivist, Vol. 64 No. 1, pp. 61-77.
Wisser, K.M. and Roper, J.O. (2003), "Maximizing metadata: exploring the EAD-MARC relationship", Library Resources & Technical Services, Vol. 47 No. 2, pp. 71-6.




MARC to ENC MARC: bringing the collection forward

Janet Kahkonen Smith, Roger L. Cunningham and Stephen P. Sarapata

The authors
Janet Kahkonen Smith is the Digital Library Catalog Coordinator, Roger L. Cunningham is Senior Systems Development Engineer and Stephen P. Sarapata is Senior Software Development Engineer, all at the Eisenhower National Clearinghouse, The Ohio State University, Columbus, Ohio, USA.

Keywords
Standards, Administrative data processing, Archives

Abstract
This paper will describe the way in which the USMARC cataloging schema is used at the Eisenhower National Clearinghouse (ENC). Discussion will include how ENC MARC extensions were developed for cataloging mathematics and science curriculum resources, and how the ENC workflow is integrated into the cataloging interface. The discussion will conclude with a historical look at the in-house data transfer from ENC MARC to the current production of IEEE LOM XML encoding for record sharing and OAI compliance, required under the NSDL project guidelines.

Library Hi Tech, Volume 22, Number 1, 2004, pp. 28-39. ISSN 0737-8831.
Received 1 September 2003; revised 22 September 2003; accepted 7 November 2003.
The authors wish to thank Kimberly S. Lightle, PhD and Judith S. Ridgway for their support in the development of this paper. © Janet Kahkonen Smith, Roger L. Cunningham and Stephen P. Sarapata.

Directed under federal legislation (Public Law 101-589, Section 204), as part of the United States Department of Education's Excellence in Mathematics, Science, and Engineering Education Act of 1990, the Eisenhower National Clearinghouse (ENC) supports classroom learning and the educational community. ENC acquires materials, creates and organizes catalog records for these resources, and provides public access to the collection via an online searchable database. Materials are collected from federal and state agencies, commercial publishers, professional organizations, local school districts, and individuals. ENC's goal is to facilitate educational reform by providing a collection of exemplary curriculum resources for K-12 mathematics and science learning. In addition to maintaining an online catalog of resources, ENC makes available for inspection and evaluation an on-site repository of physical materials.


Statement of work

ENC materials are described and cataloged with extensive content and bibliographic data by an integrated team of experienced mathematics and science professionals and cataloging librarians. The richly described resources, both digital (Internet-based) and non-digital (print, videotape, CD-ROM, graphic material, DVD, and multimedia kits), are entered into the ENC-developed cataloging database. The metadata schema chosen by ENC is based on the USMARC cataloging framework, into which added-value field extensions, developed by ENC for educational metadata, have been incorporated. Cataloging with the long-standing USMARC standard and the ENC MARC field extensions has resulted in a collection of richly described curriculum resources.



Each catalog record is divided into fields that are prefixed with a three-digit tag and a single-character subfield code that identifies the data that follow. There is a field for author, a field for publisher, a field for edition, and so on; a 245 tag, subfield a, for example, is always identified as the title field. The MARC records are stored and read by a computer program that is given instructions to display the information contained in each tag in the online catalog. Programs have been purchased and written to help the computer search and retrieve the MARC records by browsing certain fields.

Once the statement of work from the Department of Education was established, ENC began to plan how to locate, acquire, organize, describe, display, and maintain the collection of educational resources, and how to provide public access to them. The statement of work and the MARC framework spelled out in general terms what kind of data would be included in the bibliographic record created for each resource. Additionally, ENC wanted teacher input into which fields of information were important to include in each catalog record. What exactly did educators feel was important to them for search and retrieval purposes? Deciding what kind of data to incorporate into the catalog record, after receiving feedback from staff and the educational community, was ENC's next task.

The ENC cataloging framework was structured to hold bits of information about each curriculum resource, and was developed with many considerations in mind. The purpose of the record was to identify the materials in the collection and to make them respond to an online search and able to be retrieved by users. The discussion at ENC centered on what kind of information should appear in the catalog records beyond what was dictated by MARC, who would input the data into the cataloging tool, and how the record should look and act online. At this point in the development of the ENC project, the technology needed for searching and displaying content-rich Web pages was very limited. The Internet at the time was also limited, primarily being used by universities and governmental bodies. Web page design and browser capability were limited to HTML use and, though this environment worked well for sharing information that seldom or never changed, it did not lend itself well to the advanced searching capability required at ENC.

It was determined that the basic bibliographic record should include the data found in standard library-based cataloging: title, author and publisher, to name a few. In 1993 ENC decided to adopt the USMARC framework as a starting point for cataloging because MARC most closely fit ENC's cataloging requirements: a traditional library cataloging schema that could be extended with additional data fields as ENC required. ENC felt that the full description of educational resources required the addition of field extensions to the MARC record, extensions that would give the ENC-cataloged record a particularly educational feel. The original cataloging fields used were dependent on the fields built into the Dynix Marquis (n.d.) cataloging system, and followed the USMARC tagging schema (see Table I).

Table I Dynix USMARC tags used by ENC

Tag name            Tag   Tag name                    Tag
Date                008   Abstract                    520
ISBN                020   Grade level                 521
ENC number          090   Equipment                   538
ACQ number          092   Language                    546
Title               245   Funding agency              599
Variant titles      246   Subjects (base set) (ENC)   650
Edition             250   Geographic focus            651
Publisher address   260   Audience                    699
Specifications      300   Author/personal names       700
Series              440   Author/corporate names      710
Public notes        500   Author/conference names     711
Contents            505   Exemplary                   770

Note: subfields are not shown

ENC soon felt the need to expand on the standard MARC fields above, because the MARC fields, while fairly comprehensive, did not address the educational scope of the ENC collection. Additionally, the educational data reflected what the ENC user community thought were the important fields of information to include in the materials' records. In 1996, concurrent development of the database and the cataloging tool was begun, and the field extensions in Table II were added to the cataloging tool.

Table II ENC field extensions

Field name                           Tag
Exemplary                            770
Revision date                        009
ENC record of supporting materials   596
Vendor contact                       597
Standards                            658
Pedagogical type (non-Web type)      695
Benchmark analysis                   696
Empirical data                       697
Review                               698
Online address                       856
Special ENC projects                 996
Keywords                             997
End of the record                    999

Soon after deciding to use MARC, ENC chose another standard from the library field, the Anglo-American Cataloging Rules (AACR) (American Library Association, 1967). AACR2 provides an extensive set of cataloging guidelines. The goal was to create a catalog record that would truly describe the curriculum resource: the record is a container for the highly granular data necessary for optimal search and discovery of the resource.
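As a schematic illustration (tags drawn from Tables I and II, content invented), a single ENC record mixes the standard USMARC fields with the local extensions:

245 $a Exploring fractions with pattern blocks
521 $a Grades 3-5
650 $a Fractions
658 $a National mathematics standards correlation
695 $a Hands-on activity
856 $u http://example.org/fractions

Here 245, 521, and 650 come from the standard tag set in Table I, while 658, 695, and 856 are among the field extensions listed in Table II.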


Technology

In 1994, after the initial research was done, ENC technology staff put together a client/server version of a product named FileMaker (FileMaker, 1994), which used a protocol developed by Apple Computer (AppleTalk) to communicate with the desktop computers at ENC. The system was slow and had known database size constraints; however, it allowed staff to move forward with the early creation of bibliographic records. The library automation system chosen was a commercial system provided by Dynix Marquis Inc. It comprised a Sybase database with access via an interface running on OS/2 workstations. It soon became apparent that, while the Dynix Marquis system could handle a large number of records, as well as the fully defined set of MARC tags, it was lacking in both extensibility and accessibility. These were critical issues impeding the progress of staff in getting the cataloged records onto the public site. During this time, some discussion revolved around the evaluation of a product based on the Z39.50 Information Retrieval Protocol. After a considerable amount of time experimenting with this product, ENC decided against its use, instead directing efforts to a new relational database product, Microsoft SQL Server version 6 (MSSQL6), released by the Microsoft Corporation. This product was marketed as a small-business alternative to the more complex and expensive relational database systems in use at the time. Three features made MSSQL6 of interest to ENC: the databases and server were easy to maintain, the system made use of American National Standards Institute (ANSI) SQL, and the system could use non-proprietary Open Database Connectivity (ODBC) drivers. As a result of the evaluation of the system and the goals set for the organization, ENC decided to develop its own cataloging system with MSSQL6 as the database back end. The front end of the cataloging system would be developed using a product named PowerBuilder, version 4 (Pb4), from PowerSoft Corporation (n.d.), which was chosen for the wide variety of database interaction objects available with it.

The next step at ENC was to analyze the workflow, which encompassed many activities:
. acquiring materials;
. describing and indexing them;
. making a set of information for each record available on a public Web site in a way that could be displayed, searched, and retrieved; and
. maintaining an in-house collection of physical library resources.




The working queues, then, had to allow the flow of materials through a multi-step process. Breaking down the workflow, categories of work were identified: selection of materials, ordering and acquiring materials, descriptive (content-based) and bibliographic (traditional library) cataloging, abstracting, editing the fields of data, making the records "live" on the Web site, and maintaining an on-site repository of this collection of materials. ENC materials, therefore, would be required to move forward in a linear progression as they followed the workflow, allowing one staff member to work on one record at a time but also allowing any staff member to view the record at any time. Writing the programming instructions for the workflow for single-item records (a book or a videotape, for example) did not present a problem.

MARC to ENC MARC: bringing the collection forward

Library Hi Tech Volume 22 . Number 1 . 2004 . 28-39

Janet Kahkonen Smith, Roger L. Cunningham and Stephen P. Sarapata

For the technical team, the programming did become a challenge when a single record was to be a container for a resource with multiple pieces, i.e. a multimedia kit that included several textbooks, teacher guides, an accompanying CD-ROM, and a classroom poster. The predominant, most important piece in the kit had to be made the primary item in the record, and this determined the GMD (general material designation) for the record. The GMD was important because an online user can limit a search by the media type of the GMD; thus, an incorrectly assigned GMD can fail to retrieve all relevant records. Technical staff worked on how to proceed with programming the application tool, PowerBuilder 4, to meet ENC's workflow requirements and special cataloging needs. Staff named this internal cataloging tool QCat, after the multiple working queues.

Materials workflow

Establishing workflow queues as part of the cataloging tool became increasingly important as the steps in the processing of materials became more complicated. In order to collect and control the materials arriving at ENC, a workflow system had to be set up that would coordinate staff with very different tasks: materials selection, materials acquisition, record building, content editing, bibliographic cataloging, abstracting, and editing. Interspersed among these workflow queues were built-in quality control queues, placed after the cataloging, abstracting and editing queues, to help staff give each record the highest possible quality of attention and treatment.

The processing of materials through QCat relied on an integrated team with very specific job duties. Owing to the subject-specific focus of the ENC collection, a team of math and science professionals, capable of understanding the nature and content of the materials being acquired, was necessary to develop the collection and abstract each resource. The acquisitions staff, responsible for contact with publishers and vendors in the ordering and follow-up stages, start the initial processing of resources by building the record in QCat that will follow the material as it works its way through the working queues. A team of librarians catalog the bibliographic data, abstractors analyze the content of each resource, and editors proof the final records before they become "live" for public viewing.

The primary consideration for materials being reviewed for the collection was content validity and how useful the resource would be to educators. The following are questions that the content specialists posed when reviewing potential materials:
. Who is the intended audience?
. Does the resource provide outstanding math or science?
. Is the content accurate?
. Is the material equitable with regard to culture, gender, race, religion, etc.?
. Are viewpoints balanced?
. Are the authors or publishers reputable?
. Is the physical format of good quality and usable in the classroom?
. How well known are the sponsors of the material, what are the credentials of all contributors, and what are their affiliations?
. Is the material engaging and age-appropriate?
. Does the resource promote interaction and learning?
. Is the information current?
. Are the illustrations of good quality?
. Is opportunity presented for innovative teaching and learning in the classroom?

With confidence in the structure of the catalog record and the refining of the workflow queues, ENC was ready to focus on developing the collection. Parameters were set forth in the statement of work to build a library of science and math curriculum resources for K-12 classroom use and for the professional development of teachers. The initial collection of traditional library formats (books, videotapes, CD-ROMs, graphic materials, multimedia kits, and objects) was processed easily and posed no problems with respect to repository management. With the growing presence of the Internet, and the availability of Web-based resources and downloadable software applications for classroom use, it was desirable for ENC to consider acquiring these newer formats of curriculum resources. Virtual resources, by their very nature, are challenging to work with, and require some special considerations when being acquired, collected and maintained in what had been a traditional physical repository.

Janet Kahkonen Smith, Roger L. Cunningham and Stephen P. Sarapata

MARC to ENC MARC: bringing the collection forward

Library Hi Tech Volume 22 . Number 1 . 2004 . 28-39

Thought was given to how these new formats could be optimally stored, retrieved, and examined by educators. The workflow at ENC originates with the mathematics and science content specialists, who work under parameters set forth in the ENC collection development policy. It is their primary responsibility to locate materials through a variety of avenues: publisher catalogs, online sources, professional workshops, educator suggestions, conferences, and professional reading. On initial selection, the content specialist assigns the materials a general subject designation: mathematics, science, general education, or an interdisciplinary subject that combines math and science. Further along in the workflow, the abstractor will assign more specific subject headings. The content specialist assigns a pedagogical type and grade level. The resource is given a priority ranking (the urgency with which the material should make it through all the processing queues), from zero (generally unsolicited materials), to four or six (processing at a steady rate), to ten (high attention and processing rate). A priority ten ranking indicates that the material has been targeted for a special in-house project at ENC. Special projects have been set up for resources destined for Digital Dozen, ENC Focus, and other ENC projects. As the content specialists complete this phase of the work, the requests for materials go to the acquisitions staff, whose responsibility it is to start the ordering process.

The acquisitions team is responsible for getting new resources to ENC. Requests for materials can come from any of the mathematics or science specialists on the staff. The acquisitions team researches Web sites, publisher catalogs, and product literature for each item request, then creates a shell record with a title, subtitle, series name, GMD, pricing, ISBN or other ordering numbers, and a general subject identifier, to hold a place in the cataloging database until the item arrives. Once the order is successfully placed, dialogue is maintained with vendors via telephone, e-mail or facsimile as needed.

unpacked and verified against the order, it is barcoded, repackaged, and labeled. Additional data can be added to fields in the previously-built record as needed, for example, special handling instructions or ordering updates. The resource is then forwarded via the online database to the content queue. Either the mathematics or science content team member examines and verifies that the item is what they requested. They are responsible for adding additional data to the record by assigning pedagogy, rechecking the designated subject field (M for mathematics, S for science, G for general education, I for integrated/ interdisciplinary approaches, and E for education technology), grade level, special project status and priority, and, in the case of Internet resources, making sure that the URL points to the appropriate resource. Once again, the priority number that the resource is given may determine how the material is selected and cataloged. A cataloger selects the material from the queue and adds bibliographic data to the record. Some of the data are already filled in on the record by acquisition or content staff; catalogers review and modify these data as needed. Cataloging fields include title, subtitle, variant titles, edition, and series. Catalogers also assign authors, publishers, and other contributor roles for those responsible for the creation of the resource. Catalogers add publishing location, copyright information, and dates associated with the item. At ENC, funding data are important. In this locally-created field, catalogers assign those individual or groups responsible for funding sponsorship. This is an especially important field associated with ENC mission and goals; it shows to the Department of Education that ENC is fulfilling its governmental contract (see Figures 1-7 to view various screen shots of the ENC cataloging database). Catalogers provide a detailed description of the resource that includes pagination, illustrative material and physical dimensions. The resource is also identified by any distinguishing properties it may have, i.e. running time for videotape, CD-ROM, and DVD; any equipment requirements or downloads; or downloadable equipment for the 32

MARC to ENC MARC: bringing the collection forward

Library Hi Tech Volume 22 . Number 1 . 2004 . 28-39

Janet Kahkonen Smith, Roger L. Cunningham and Stephen P. Sarapata

Figure 1 ENC cataloging maintenance manager

Figure 2 ENC cataloging record manager displaying cataloging queue and record owner

resource to be accessed, viewed, or displayed by the user. If a resource has a table of contents, catalogers add this to the contents field. The table of contents, which is a searchable field in the public catalog, provides valuable keyword

search terms that aid in the discovery of the resource. Other cataloging fields include: language of the resource, geographic focus, availability notes, public notes, and supporting records (used to point to other materials in the 33

MARC to ENC MARC: bringing the collection forward

Library Hi Tech Volume 22 . Number 1 . 2004 . 28-39

Janet Kahkonen Smith, Roger L. Cunningham and Stephen P. Sarapata

resource, and federal statistics. Once the cataloging is complete, the resource moves forward to the cataloging QC (quality control queue), then on into the abstracting queue. The abstractors are responsible for helping the content team identify new resources to include in the collection. They assign appropriate subject headings and identifiers to the catalog record. In addition, they compose a 200-300-word descriptive abstract and select the curriculum resources that are the basis for all the ENC-based special projects. The abstractors identify standards, exemplary criteria, and state correlations, and add these fields of data to the cataloging database. From the abstractors, the record passes to abstract QC for review. As both the back-end of the QCat database and the front end of the cataloging interface were being concurrently developed and refined, MSSQL version 6.5 was released. A decision was made to delay a server upgrade until the second service pack was released by Microsoft. During this intervention, a new version 5 of PowerBuilder became available. Once service pack number two was released, both the database and the partially developed client interface were converted. The server and

Figure 3 ENC internal statistics, general material designator (GMD), and media types

ENC collection that support the usage of the first record, i.e. a teacher guide for a set of videotapes, a calculator that support the use of a student workbook, etc.). Catalogers review the number of items in the record, the barcode, the GMD, the special projects, and then assign the media type, the technical format of the Figure 4 ENC funding screen

As both the back end of the QCat database and the front end of the cataloging interface were being concurrently developed and refined, MSSQL version 6.5 was released. A decision was made to delay a server upgrade until the second service pack was released by Microsoft. During this interval, a new version 5 of PowerBuilder became available. Once service pack number two was released, both the database and the partially developed client interface were converted. The server and database conversions were straightforward and proceeded quickly.

The client interface presented some problems: compiling the PowerBuilder 4 code under PowerBuilder 5 resulted in over 1,100 errors due to changes in the development software. Not all the changes were negative; in fact, some very positive features had been added, primarily the ability to use non-visual data stores. Correcting the errors, as well as modifying the code to make use of some of the new features, took several months. Once the code was completed, tested, and ready for distribution, a week was set aside for the migration of the more than 5,500 completed catalog records. This proved to be the most difficult part of the development process. FileMaker stored data as one large flat table, which was divided into 42 SQL relational database tables. With the conversion completed, it took two weeks to train staff in the new version and to debug the client application.

With use, it became apparent that building records as a collection of items was not an efficient way to process and maintain the data. The client application was slow and cumbersome, being burdened with the task of maintaining the relationships between several items and their records. To resolve this, steps were taken to modify the processing so that the client interface could move as many of the data as possible from the items to the record itself. There was a limit, however, to how many data could be moved without a major modification to the client interface. The problem in the interface was that it had been designed to adhere to strict business rules with regard to the relationship between items and records. Problems also arose from the use of inheritance in the application, resulting in complications when changing the properties of the base objects. After considerable discussion, it was determined that record processing was a linear process, and that catalog records should move forward with only a few exceptions. The client interface was updated to the latest version, and the data entry windows were simplified. Once the server was upgraded to MSSQL 2000, the client software could be upgraded to PowerBuilder 7. By this time, the basic work-processing queues were well defined and capable of withstanding future modifications when necessary. Building stand-alone windows eliminated the dependence on inheritance and greatly improved the ability to make alterations. If a window needed modification, the code could be changed directly without concern for ancestors; if the modification was extensive, a new window could simply be built to replace the old one. The size of the code was reduced by a fourth, the high memory overhead of the original code was reduced, and development time was reduced by one third.

This new version of the ENC cataloging tool was appropriately renamed NeoCat (ENC, 2002). NeoCat has been in use at ENC since 2001 and contains over 25,000 catalog records. Problems have been kept to a minimum, while several new features have been added: a new spell-check system has been installed, fields have been modified and added, and controlled vocabulary lists have been enhanced and refined. Export functions have been expanded, so that records are generated as ENC XML files, Open Archives Initiative (OAI) files based on the OAI Protocol for Metadata Harvesting, version 2.0, and Learning Object Metadata files based on the IEEE draft standard (IEEE Learning Technology Standards Committee, 2002).
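The design decision that records move only forward through well-defined queues can be pictured with a small sketch. This is an illustrative reconstruction, not ENC code; the queue names follow the workflow described in this article, and the Record class is invented for the example.

import dataclasses

# Illustrative sketch of the linear, forward-only ENC-style queues.
QUEUES = [
    "acquisitions",   # shell record created, order placed
    "content",        # subject, pedagogy, grade level verified
    "cataloging",     # bibliographic data added
    "cataloging_qc",  # quality control after cataloging
    "abstracting",    # 200-300-word abstract, subject headings
    "abstract_qc",    # quality control after abstracting
    "done",
]

@dataclasses.dataclass
class Record:
    enc_number: str
    priority: int = 4          # 0 (unsolicited) .. 10 (special project)
    queue: str = QUEUES[0]

    def advance(self):
        """Move the record one queue forward; records never move back."""
        i = QUEUES.index(self.queue)
        if i < len(QUEUES) - 1:
            self.queue = QUEUES[i + 1]

rec = Record("012345", priority=10)   # priority ten: special in-house project
rec.advance()
print(rec.queue)                      # -> "content"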

Converting data files from the internal to the public site

The ENC catalog was released for online public use in 1997. The Web site featured search and display functions. Searches were performed by the Fulcrum SearchServer search engine (Fulcrum Technologies, n.d.). HTML was generated on demand from the MARC and ENC MARC extension-encoded flat files through PERL CGI scripts. MARC tagging and the ENC MARC extensions come into use when the daily single tab-delimited flat file is divided into individual records for the search engine, which creates and displays records for public view. The Microsoft Windows database server that holds the large MARC and ENC MARC-encoded flat text file moves the data to a UNIX server. This large flat file contains the complete set of data for each record created or modified during the day by ENC staff. Each record within the file contains a MARC field identifier or an ENC MARC field extension.

Several preprocessing steps must be applied to the new and updated ENC records before they can appear on the Web site. The MARC file is divided into separate files, one for each record created. The files are named using an ENC record number. The large MARC file becomes obsolete once the individual files are stored on the UNIX server. The Fulcrum SearchServer indexing engine was chosen by ENC in 1996 as the engine that would drive Web-based searches of the ENC catalog. Fulcrum also resides on the UNIX server. The Fulcrum engine is capable of searching the individual large text fields, such as those found in the abstract. These large text fields cannot be entered directly into the Fulcrum database; instead they are processed a second time, their fields being tagged with specific encoding character sequences. These tagged fields are written to individual files named using the ENC record number. As the separate files are developed and the large text fields identified, the records are passed into a preprocessor that builds SQL statements and automatically inserts the data into the Fulcrum database for full indexing. This is where the online searching takes place. The Fulcrum search engine allows ENC users to apply a simple or advanced search query against the database and receive a list of ENC records that match their search string. The resource identifier is passed on to a PERL CGI program which dynamically generates HTML.

At the beginning this procedure worked well but, as usage of the ENC site increased, a more efficient way of generating HTML became necessary. The Vignette product was introduced in 1999 and dealt well with the need to render completed Web pages. Vignette could also cache completed Web pages, so that re-rendering them was not necessary until some of the content on the page had changed (Vignette Corporation, n.d.). Vignette has the ability to read and parse XML, a much faster process than reading individual MARC-encoded files. In 2000 an additional step was introduced into the indexing process: the generation of XML records after Fulcrum indexing. The MARC-encoded files are used as input to the XML generator, which produces a one-to-one correlation of XML files to MARC files. The container tags used in the XML are the MARC tags used by ENC. As a user queries the ENC collection, Vignette reads the appropriate XML file, generates and displays HTML, and caches the HTML to eliminate the need to re-render the same record when it is next selected.
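The nightly split of the flat file into one file per record, as described above, amounts to grouping lines by record number. A minimal sketch follows; the file layout (one tagged field per line, keyed by ENC record number, tab-separated) and the file names are assumptions for illustration, not ENC's actual layout.

# Sketch of splitting a daily flat file into per-record files, assuming
# each line is laid out as: <enc_number> TAB <field tag> TAB <value>.
from collections import defaultdict
from pathlib import Path

records = defaultdict(list)
for line in Path("daily_export.txt").read_text(encoding="utf-8").splitlines():
    enc_number, tag, value = line.split("\t", 2)
    records[enc_number].append(f"{tag}\t{value}")

out_dir = Path("records")
out_dir.mkdir(exist_ok=True)
for enc_number, fields in records.items():
    # One file per record, named by ENC record number, ready for indexing.
    (out_dir / f"{enc_number}.txt").write_text("\n".join(fields), encoding="utf-8")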

ENC extensions and searchability

The new build of the ENC cataloging tool, QCat, once strictly MARC-based, was extended in 1996 to include ENC-developed fields (see Table II). ENC's collection, currently containing over 25,000 records, has 6,050 resources funded or published by governmental agencies. The statement of work from the US Department of Education charged ENC to tag each cataloged record in such a way as to show its "federal-ness". ENC extensions for funding, special projects, and statistics contain the data that signify the materials' federal creation or funding status. These fields were a significant addition to the catalog record for two reasons: they provide proof of the US government's presence in education and they lend credence to the ENC collection. Other ENC-developed cataloging fields hold vendor/distributor contact information, so that each record contains all the information educators need to order the materials for classroom use. Data in the acquisition fields include vendor address, telephone number, e-mail address, ISBN, ordering numbers, availability terms, and pricing for each resource in the collection, with the exception of Internet resources. Also added were cataloging fields that could hold exemplary data for the curriculum resource.

Decisions were made about how the fields included in the catalog record should be populated. Most importantly, the ENC user community had to be able to retrieve the resources for which they searched. In some fields, controlled vocabulary lists were deemed necessary to improve searchability. The controlled vocabulary for subject headings was developed in collaboration with focus groups of K-12 educators and ENC subject specialists. In both the simple and advanced searches, the user has the opportunity to fill in a search box with one or more topics or subject words. Although these searches pull from a controlled vocabulary list, i.e. words must be on ENC's list of subject terms, most subject terms familiar to educators are included in the list. There are specific ways, however, in which ENC staff use terms in order to be consistent and to describe teaching materials most accurately. For users who do not know precisely what terms ENC uses, the browse-by-subject function leads them through the different levels of ENC subject terms to find resources that match the specific term they request. For example, the first level of terms is general (mathematics); the next level is slightly more specific (algebra); the third level is quite specific (coefficients).

The ENC homepage offers two search strategies: a simple search, which is based on a question, word, or phrase and can be restricted by a curriculum resource's media type, grade level, or cost; and an advanced search, which offers a Boolean-type strategy at the first level: searches can be limited to records that contain "any of the words", "all of the words", "none of the words", or a particular phrase (see the sketch below). Recently, the Fulcrum search engine was replaced with a search engine named Autonomy (Autonomy, n.d.). Autonomy is programmed to perform searches in the following fields: subject lists, abstract, table of contents, title, series, language, publisher, author, funding agency, and ENC number. Resources that are relevant to a given search are assessed according to the way they are represented and interpreted in the metadata that describe them, and then displayed to the user. ENC staff realize that ensuring relevance in a search relies to a large extent on the expertise of the staff and on how well they understand indexing, cataloging, and the way the search and retrieval function works.
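The three word-level limits of the advanced search map directly onto any/all/none predicates. A minimal sketch, purely illustrative (the function name and tokenization are assumptions, not ENC's search engine code):

# Illustrative predicate for the "any/all/none of the words" limits of an
# advanced search; not the actual Fulcrum or Autonomy implementation.
def matches(text, words, mode):
    tokens = set(text.lower().split())
    words = [w.lower() for w in words]
    if mode == "any":
        return any(w in tokens for w in words)
    if mode == "all":
        return all(w in tokens for w in words)
    if mode == "none":
        return not any(w in tokens for w in words)
    raise ValueError(f"unknown mode: {mode}")

abstract = "Hands-on algebra activities for middle school classrooms"
print(matches(abstract, ["algebra", "geometry"], "any"))   # True
print(matches(abstract, ["algebra", "geometry"], "all"))   # False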

Handling the metadata today

While ENC has adopted the MARCXML record, the data lend themselves to a straight conversion to both the OAI XML format and the IEEE LOM XML format. The Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) version 2.0 came into use in 2002. This XML format has become the standard form for submitting records via the OAI-PMH to the National STEM (Science, Technology, Engineering, and Mathematics) Digital Library (NSDL). ENC has contributed all the digital resources from all its collections to the NSDL for harvesting. OAI specifies the Dublin Core Metadata Element Set (Dublin Core, n.d.).

Part of the daily maintenance of the ENC catalogs is the generation of OAI XML records. These OAI records are stored on a server, and their status is maintained through an SQL database. The database is used to ensure the proper distribution of records requested through an OAI harvest. Work on the IEEE Learning Object Metadata (LOM) standard began in 1998; it became IEEE 1484.12.1-2002 in June 2002. For standardization and compatibility within the digital library community, ENC has adopted the IEEE LOM format for its digital resources.
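Since OAI-PMH specifies unqualified Dublin Core as its minimum metadata format, generating an OAI record amounts to mapping a handful of catalog fields onto dc elements. A minimal sketch using the Python standard library; the input fields are invented for illustration and do not represent ENC's actual crosswalk.

# Sketch of generating an unqualified Dublin Core record for OAI-PMH,
# using only the standard library; the input dict is illustrative.
import xml.etree.ElementTree as ET

OAI_DC = "http://www.openarchives.org/OAI/2.0/oai_dc/"
DC = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("oai_dc", OAI_DC)
ET.register_namespace("dc", DC)

def to_oai_dc(record):
    root = ET.Element(f"{{{OAI_DC}}}dc")
    for element, value in record.items():   # element names: title, creator...
        ET.SubElement(root, f"{{{DC}}}{element}").text = value
    return ET.tostring(root, encoding="unicode")

print(to_oai_dc({"title": "Hands-on algebra",
                 "creator": "Smith, J.",
                 "subject": "Mathematics -- Algebra"}))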

Conclusion

The use of USMARC tags and the ability to enhance MARC with ENC's added-value educational extensions have resulted in a collection of richly described curriculum resources. A working model was established with input from the educational community. Each resource in the collection is cataloged completely and objectively by mathematics and science educators and library professionals based on this model. The working model itself has been reviewed and modified over time. Through the years, ENC has improved the discovery of current and relevant curriculum resources, to the benefit of the user. The workflow of resources has been optimized through the cataloging system and by streamlining the process of entering data. Applying these efforts to new technology has resulted in content-rich data that are easily transportable to the formats in use today and those of tomorrow. The ENC collections are available to the public through the World Wide Web and support both simple and advanced searching capabilities, providing ease of use in locating curriculum resources. Visit us at the following sites:
. http://enc.org
. http://gsdl.enc.org
. http://icontechlit.enc.org
. http://thelearningmatrix.enc.org

References

American Library Association (1967), Anglo-American Cataloging Rules, American Library Association, Chicago, IL.
Autonomy (n.d.), Autonomy Server, Version 3, Autonomy, San Francisco, CA.
Dublin Core (n.d.), Dublin Core Metadata Element Set, Version 1.1: Reference Description, available at: http://dublincore.org/documents/dces/
Dynix Marquis (n.d.), Marquis Automated Library System, Version 3.1, Provo, UT.
Eisenhower National Clearinghouse (ENC) (2002), NeoCat, ENC, Columbus, OH, March.
FileMaker (1994), FileMaker, FileMaker, Santa Clara, CA.
Fulcrum Technologies (n.d.), Fulcrum SearchServer, Version 3.5, Ottawa, Canada.
IEEE Learning Technology Standards Committee (2002), IEEE 1484.12.1-2002 LOM, available at: http://ltsc.ieee.org/
Powersoft Corporation (n.d.), PowerBuilder, Version 4, Concord, MA.
Vignette Corporation (n.d.), Vignette Content Management, Version 6, Austin, TX.

Further reading

ENC Online (n.d.), Eisenhower National Clearinghouse, available at: http://enc.org
Gender and Science Digital Library (n.d.), Eisenhower National Clearinghouse, available at: http://gsdl.enc.org/
Innovative Curriculum Online Network Digital Library (n.d.), Eisenhower National Clearinghouse, available at: http://icontechlit.enc.org/
International Business Machines (n.d.), OS/2, Version 2.1, Armonk, NY.
(The) Learning Matrix (n.d.), Eisenhower National Clearinghouse, available at: http://thelearningmatrix.enc.org/
Lynch, C.A. (1997), "The Z39.50 information retrieval standard", D-Lib Magazine, April, available at: www.dlib.org/dlib/april97/04lynch.html (accessed 22 May 2003).
Microsoft Corporation (n.d.), Microsoft SQL Server, Version 6.0, Redmond, WA.
Microsoft Corporation (n.d.), Microsoft SQL Server, Version 6.5, Redmond, WA.
Microsoft Corporation (n.d.), Microsoft SQL Server, Version 7.0, Redmond, WA.
Microsoft Corporation (n.d.), Microsoft SQL Server 2000, Redmond, WA.
National Science Digital Library (n.d.), available at: http://nsdl.org/
Open Archives Initiative (n.d.), available at: www.openarchives.org/
Sullivan, T. (n.d.), AccessFlorida: A Z39.50 Control System Programmers Reference and System Guide, Florida Center for Library Automation, Gainesville, FL.
Sybase Inc. (n.d.), PowerBuilder, Version 5, Emeryville, CA.
Sybase Inc. (n.d.), PowerBuilder, Version 7, Emeryville, CA.
Sybase Inc. (n.d.), Sybase SQL Server, Release 4.9.1, Emeryville, CA.


After MARC – what then?

Leif Andresen

The author: Leif Andresen is Library Advisory Officer, Danish National Library Authority, Copenhagen, and Chair of the Danish Standards Technical Committee for Information and Documentation, Denmark.

Keywords: Databases, Online cataloguing, Libraries, Denmark

Abstract: The article discusses the future of the MARC formats and outlines how future cataloguing practice and bibliographic records might look. The background and basic functionality of the MARC formats are outlined, and it is pointed out that MARC is manifest in several different formats. This is illustrated through a comparison between the MARC21 format and the Danish MARC format "danMARC2". It is argued that our present cataloguing codes and MARC formats are based primarily upon the Paris principles, and that "Functional Requirements for Bibliographic Records" (FRBR) would serve as a more solid and user-oriented platform for the future development of cataloguing codes and formats. Furthermore, it is argued that MARC is a library-specific format, which facilitates neither exchange with sectors outside the library world nor the inclusion of other texts. XML could serve as the technical platform for a model for future registrations, consisting of some core data and different supplements of data necessary for different sectors and purposes.

Received September 2003; revised October 2003; accepted November 2003.

The history of the MARC standard

The MARC format was developed to exchange bibliographic records. For the first few years the format was used to produce catalogue cards. The MARC format was designed and implemented many years before online catalogues and the electronic publishing of bibliographic records were introduced. ISO 2709 (ISO, 1996) provides the framework for MARC, and this is visible in the MARC formats' use of three-character field codes, indicators, and sub-field codes. This kind of database structure is suitable for a sequential dataflow (which is exactly what ISO 2709 was developed for) in order to handle data on magnetic tapes. ISO 2709 and the MARC formats can be used for data types other than bibliographic data. The format has not found much support in non-library environments, but there are many examples of library software being applied in contexts other than libraries. From the author's previous employment as a consultant for suppliers of library systems, mention can be made of the alternative use of library software as a tool for case management and for the registration of equipment, building materials, and hardware components in a telecommunications company.

There are (at least) three current schools/divisions of the MARC formats:
(1) USMARC – which became MARC21.
(2) BNBMARC – which became UKMARC.
(3) UNIMARC.

There are certain similarities between UKMARC and USMARC, while the Danish format danMARC2 (Katalogdatarådet, 1998) was developed from the original BNBMARC. Some years ago it was discussed whether IMARC should replace USMARC, CANMARC and UKMARC but, owing to differences in the three existing formats, including the question of multi-volume works and USMARC's use of ISBD characters as part of the record, IMARC did not replace any of the other formats.

USMARC and UKMARC are based on the MARC format developed by the Library of Congress in the 1960s. The purpose of that format was to exploit the new computer technology for the distribution of catalogue data. A number of basic characteristics, therefore, are common to both formats: the main field groups, divided according to their overall functions, are frequently the same. The descriptive title and originator entries are, in USMARC and UKMARC, placed in fields 210-24X, and notes are placed in fields 500-599. UNIMARC (IFLA, 1997), on the other hand, has a different division into main groups, and title information is placed not in field 245 but in field 200. Although MARC21 is often described as an international standard, it is only used in a limited number of countries. UNIMARC, which was created within IFLA, is in fact closer to an international standard. UNIMARC (first published in 1977) was meant to be an international exchange format between national bibliographic institutions. According to the preface to the latest edition (IFLA, 1994), it was designed as a model for the development of new machine-readable formats. UNIMARC has been applied in several southern and eastern European countries as well as in some Third World countries. This section has explained why MARC is not equal to MARC21. In the following section, a more detailed description of the differences between MARC21 and danMARC2 will be provided.

Differences between danMARC2 and MARC21

DanMARC2 was developed in Denmark in the 1990s. This was done in order to create a common format for all sectors and to modernise the registration of library materials, since online catalogues (the electronic publishing of bibliographic records) have obviously replaced the card catalogues. DanMARC2 is in many ways a more modern format than MARC21, adapted in various ways to the technological development as it appeared in the middle of the 1990s. In order to be able to handle types of material other than printed materials, all MARC formats have undergone considerable development. Today not only books but also continuing resources, computer files, maps, music, sound recordings, visual material, and mixed materials are handled by MARC formats. MARC21 appears more appropriate than danMARC2 when applied to cartographic materials and document collections. On the other hand, danMARC2 seems more appropriate for the registration of music and sound recordings. Fields which apparently correspond in the two formats are often used for data elements defined for different purposes.

Cataloguing rules and formats

The Danish format "danMARC" and the Danish cataloguing rules were revised at the same time, which meant that the many interdependent issues could be dealt with in the right context. For instance, all examples in the cataloguing rules are presented as they should be formatted in danMARC2. The revision meant moving away from AACR2. The most substantial change was the abolition of main and added entry elements, which were replaced by access points and a voluntary location element. This difference is not very important when searching on controlled forms of data, since these can be isolated from non-controlled forms in search indexes. Since the choice of main entries and location elements is based on different principles, however, presentation as well as location will differ according to the set of rules employed.

Format structure

In MARC21 as well as in danMARC2, each field contains at least one and often more sub-fields. Each field code is followed by two indicators, which can contain various information on the treatment or interpretation of data. MARC21 does not have indicators and sub-fields in control fields 001-008. In MARC21, indicators are widely used for different purposes, e.g. for generating text such as "former title" and for indicating how many characters should be ignored when alphabetising on the first sub-field. In danMARC2, indicators are only used in a very limited number of fields: relations of periodica in fields 860-879, and fields defined for local purposes.

In the descriptive fields in danMARC2, the division of fields into sub-fields follows ISBD's division into elements. This facilitates the generation of separating characters in displays and printouts in ISBD and other formats. MARC21's descriptive fields are not specified in as many sub-fields as danMARC2's, and the sub-fields are not repeatable to the same extent as in danMARC2. MARC21 also requires typographical separating characters to be keyed in, even in those cases where a separating character and a sub-field code coincide. In a number of fields, MARC21 uses an alphabetising indicator to state how many characters should be ignored when alphabetising on the first sub-field. In danMARC2, the alphabetising indicator is replaced by an alphabetising symbol, "¤", which is placed where the alphabetisation in a given sub-field is supposed to start. This is a more flexible indication of alphabetisation, since the alphabetisation is not tied solely to the first sub-field in a field.

Examples

MARC21:
245 03 *a Le Bureau = La Oficina = Das Büro
246 10 *a Oficina
246 10 *a Büro

danMARC2:
245 00 *a Le ¤Bureau *p La ¤Oficina *p Das ¤Büro

Sub-field p (*p)[1] and the alphabetising symbol (¤) facilitate direct indexing. In MARC21, an equal sign before parallel titles must be keyed in.

MARC21:
245 10 *a He who hunted birds in his father's village : *b the dimensions of a Haida myth /*c Gary Snyder ; preface by Nathaniel Tarn ; edited by Donald Allen

danMARC2:
245 00 *a He who hunted birds in his father's village *c the dimensions of a Haida myth *e Gary Snyder *f preface by Nathaniel Tarn *f edited by Donald Allen

MARC21 does not allow for differentiation between primary authors (sub-field *e) and secondary authors (sub-field *f). A colon before the sub-title and a slash before the authors have to be keyed in.

MARC21:
245 04 *a The listing attic ; The unstrung harp /*c by Edward Gorey
740 01 *a Listing attic
740 01 *a Unstrung harp

danMARC2:
245 00 *a The ¤listing attic *a The ¤unstrung harp *e by Edward Gorey

Repetition of sub-field a (*a) and the alphabetising symbol "¤" facilitate direct indexing. The smaller number of sub-fields in MARC21 means that the design of indexes is less flexible, and the possibilities for varying presentation formats are more limited than in danMARC2. To a certain extent, MARC21 compensates for this by allowing larger redundancy of data in the records. In MARC21, the author and title of an added work are placed in the same sub-field.

MARC21:
245 14 *a The analysis of the law /*c Sir Matthew Hale. The students companion / Giles Jacob

danMARC2:
245 00 *a The ¤analysis of the law *e Sir Matthew Hale *x The students companion *e Giles Jacob

With a specific sub-field for ISBN, it is possible to use this information for system functions, such as looking up the ISBN in another system.

MARC21:
020 ## *a 0877790019 (black leather) *z 0877780116 : *c $14.00

danMARC2:
021 00 *a 0877790019 *c sort læder *d $14.00 *x 0877780116
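The "¤" convention separates the filing form of a title from its display form. A small sketch of how an indexer might derive both from a danMARC2 sub-field value (illustrative code, not taken from any Danish library system):

# Illustrative handling of the danMARC2 alphabetising symbol "¤":
# everything before the symbol is ignored for filing but kept for display.
def display_and_filing_forms(subfield_value):
    if "¤" in subfield_value:
        prefix, rest = subfield_value.split("¤", 1)
        return prefix + rest, rest          # display form, filing form
    return subfield_value, subfield_value

display, filing = display_and_filing_forms("The ¤listing attic")
print(display)   # -> "The listing attic"
print(filing)    # -> "listing attic"   (files under "l", not "t")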

Codes are significantly different in MARC21 and danMARC2. The differences apply to the application areas, to the definitions of the individual codes, and to their values. In MARC21's field 007, a considerable number of position-determined one-character codes have been defined for the statement of physical details. They are entirely different from danMARC2's material type codes, and seem more suited to library-technical functions, such as the choice of play-back equipment and storage conditions, than to the generation of texts for search and presentation purposes.

Multi-volume works

In Denmark, two different methods for the formatting of multi-volume works are facilitated: either independent registration of the individual levels in separate bibliographic records, or independent registration of the individual levels in one bibliographic record. MARC21 does not offer a general instruction as to the formatting of multi-volume works. The most common practice seems to be a one-level method, where the collective work is described in one record. Both the British Library and the Swedish union catalogue LIBRIS have considered various methods in connection with their shift to MARC21, but they have not made an unambiguous choice. MARC21 is not very flexible here: there is no possibility of registering multi-volume works in several-record structures, which is facilitated and frequently used in danMARC2.

The previous description of the differences between the two formats has served to illustrate that USMARC, which has now been re-christened MARC21, is a product of 1960s technology. The Danish danMARC2 format has been modernised and adapted to the online age, but is nevertheless a product of 1980s technology. The implementation of a MARC format is in many ways time-limited, and it has hopefully been demonstrated that MARC is much more than MARC21.

Problems with MARC

This section deals with a number of problems associated with the MARC formats. It must be stressed that these problems are associated with the practical application of the MARC formats, and that the problems mentioned could possibly be solved within a MARC format. Nevertheless, it is important to address the actual applications of the MARC formats and not what might, under different conditions, already have been worked out.

MARC is used for text-based data input. The division into fields and sub-fields supports a relatively specific division of data. This is suitable for bibliographic records and is therefore an advantage of the MARC formats. Gradually, however, the text dependency has also turned into a disadvantage. In practice the MARC format allows for only strictly bibliographic data. Data such as reviews, indexes in their original forms, and image and sound files cannot be added. Supplementary information, which now and in the future will be used to enhance registrations in the catalogues, cannot be exchanged in a standardised form. Furthermore, there will often be a loss of data, because local additions cannot be read by a recipient system.

A MARC format is by nature a strict format, since it was originally intended for the production of printed catalogue cards; the interaction between different library systems is therefore not reflected. The lack of flexibility means that local additions might hinder exchange between local systems and union catalogue systems. The various library systems in the Danish public libraries have invented their own non-standardised peculiarities in order to update local records with national updates without overwriting local additions and corrections. The research libraries in Denmark do not interact as much with the union catalogue as the public libraries do. Nevertheless, the research libraries differ in cataloguing policy and practice, which causes matching problems in the union catalogue, where the identification and merging of records for the same title/edition is hampered.

Large volumes of data cannot easily be incorporated in a MARC record, although there is no fixed limit to the amount of data. The author has seen MARC-based database systems used for large volumes of text, and they have appeared rather clumsy. This is primarily caused by the database systems, but the construction of MARC does not really support the display of continuous texts.

The foundation of MARC upon the printed catalogue means that the implementation of the MARC format is directed towards the description and presentation of the individual document on card or screen. This means that it is difficult to handle relations between data that are described in different fields. In danMARC2, sub-field *å[2] is used to tack fields together, so that, for example, connected titles and artists on a music CD can be stated in different fields. A catalogue record describes an edition, and there is no suitable system for describing connections between editions. To show the relation between the main record and the sub-records of a multi-volume work, danMARC2 uses field 014 in the sub-record to link to field 001 in the main record (see the example below). This method for linking parts together seems quite rigid.

MARC is implemented in library systems and is thereby library-specific. MARC is not immediately usable for exchange with environments other than the internal library environment. This goes both for the technological side, in relation to packing (cf. the later section on MARC and XML), and for the construction of fields that hold information similar to information used in systems outside the library sector.

Although a number of problems with the MARC formats have been outlined, one must remember that the different MARC formats have served the libraries well. The observations so far are not a criticism of previous choices, but merely a basis for the discussion of future developments. And despite the problems with the MARC formats, it is evident that at the moment there is no alternative! Various future developments point to upcoming changes, where the main initiators will be IFLA's Functional Requirements for Bibliographic Records (FRBR) (IFLA Study Group on the Functional Requirements for Bibliographic Records, 1998) and the Extensible Markup Language (XML)[3].
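A minimal illustration of the danMARC2 main/sub-record link just described. The 001/014 linking follows the description above; the record identifiers and titles are invented for the example:

Main record:
001 *a 12345
245 00 *a Danmarks historie

Sub-record (one volume):
001 *a 12346
014 *a 12345
245 00 *a Bind 1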

Functionality of bibliographic records

Over the years, various theoretical demands have been formulated regarding the functionality of catalogues and/or bibliographic records. The Anglo-American cataloguing rules (AACR2) as well as the Danish cataloguing rules are attempts to implement the Paris principles from 1961. The Danish cataloguing rules were, in their original form, a modified version of the Anglo-American cataloguing rules.

The Paris principles

In the Paris principles (Die Deutsche Bibliothek, 1961) the catalogue's function is described as follows. The catalogue should be an efficient instrument for ascertaining:
(1) whether the library contains a particular book (including other library materials) specified by:
(a) its author and title; or
(b) if the author is not named in the book, its title alone; or
(c) if author and title are inappropriate or insufficient for identification, a suitable substitute for the title; and
(2) (a) which works by a particular author; and
(b) which editions of a particular work are in the library.

The first requirements support the catalogue's identifying function. The cataloguing rules prescribe the citing of title, creator, etc. The second requirements support the catalogue's collocating function. The obligation to show an author's works is fulfilled by bringing together an originator's production under a controlled form of his or her name, with references between varying and controlled forms of the name. In the same way, the obligatory overview of editions is secured by using uniform titles. Here, the American cataloguing tradition is better able to fulfil the demand than the Danish one. Apart from the application of uniform titles or standard titles for musical works, there is no general Danish tradition for stating titles in normative forms.

Functional requirements for bibliographic records (FRBR)

Since the Paris principles, no requirements for catalogues or bibliographic records were seriously worked out until IFLA in 1998 published Functional Requirements for Bibliographic Records (FRBR) (IFLA Study Group on the Functional Requirements for Bibliographic Records, 1998). These requirements are important because the Joint Steering Committee for the Revision of AACR, which is responsible for the development of AACR, has a policy of working towards accordance with FRBR.

FRBR recommends a basic level for the functionality of bibliographic records. The bibliographic record should help the user:
(1) To find all documents containing:
. the works for which a given person or corporation is responsible;
. the different representations of a given work;
. works on a given topic; and
. works in a given series.
(2) To find a certain document:
. when the name(s) of the person(s) and/or corporation(s) responsible for the work(s) in the document is/are known;
. when the title of the document is known; and
. when the document identification is known.
(3) To identify a work.
(4) To identify a representation of a work.
(5) To identify a document.
(6) To choose a work.
(7) To choose a representation of a work.
(8) To choose a document.
(9) To gain access to a document.

In these functional requirements the bibliographic entities are defined as follows:
. A work is a delimited intellectual or artistic product.
. A representation of a work (expression) is a work given intellectual or artistic form.
. A document (manifestation) is a representation of a work fixed in a physical form.
. An item is a copy of a document.
There is some discussion about the delimitation between these bibliographic entities (see e.g. Jonsson, 2002).

FRBR and MARC

An important difference between the Paris principles and FRBR is that FRBR requires that a document contain at least one representation of one work. It is not sufficient to facilitate a search on the documents' forms of title and originator data, to facilitate the collocation of an author's works by using controlled forms of names, and to facilitate the collocation of the representations of a work. It is also necessary for the records to contain sufficient, unambiguous data for identification at all levels, as well as characterising and individualising data at all four levels, in order to facilitate identification and a qualified choice among the records in a search set. The MARC formats, with their title and document orientation, support the Paris principles, while FRBR is orientated towards the handling of relations between works as well as towards subject-based access. Denton (2003) has written about the relationship between FRBR and the cataloguing rules. There is no pre-packaged model for the implementation of FRBR in practice, but work on implementing and applying FRBR will continue at an international level. So far, FRBR has not been incorporated in either a format or in cataloguing rules, but the problematic issues are being studied, as are the requirements that FRBR would entail for user interfaces in library systems.

XML

Extensible Markup Language (XML) is a simple and flexible text format derived from SGML (ISO 8879). Originally XML was designed to meet the challenges of large-scale electronic publishing. XML is also playing a more and more important role in the exchange of a wide variety of data on the Web and elsewhere. XML was developed by the World Wide Web Consortium (W3C), which develops interoperable technologies (specifications, guidelines, software, and tools) to lead the Web to its full potential. An example of XML, a sample of a MODS record with title information:

<titleInfo>
<title>Sound and fury :</title>
<subTitle>the making of the punditocracy /</subTitle>
</titleInfo>

In MARC21 this would look like:

245 10 *a Sound and fury :*b the making of the punditocracy

XML offers a more general standard for describing data for transport. No such standard existed when tools like MARC and ISO 2709 were developed. One changed condition is the cost of space on discs and cables: an XML-formatted record takes up far more space than a MARC record, but the requirement for space is no longer as important as before. With XML, a container for data exchange has been created that is being used globally. In itself XML is nothing – it can be both standardised and very private. The formal description of an XML document can be created in various ways, but it is more and more common to define it in XML Schemas, which are available on the Internet. This means that the data themselves can contain a format definition in the form of a URL, which points to the definition somewhere on the net. A local addition to a general schema is then a lesser problem, since the data can contain a pointer to the definitions.

With XML, related works can be linked (on the basis of the FRBR concept) and their relationships presented in a logical way to the user. It is also possible to create a tree structure within XML, so that information about the single items on a CD can be described as a whole (linked together on a "branch" of the tree). It is possible to embed a variety of other texts, either as complete texts or as links. Other objects, such as image files, can also be embedded in the form of links. Using authority records more or less freely on the net will also be a possibility; the ideal is to use the national bibliographic authority base for authors from the country in question.

XML has a number of tools attached to it, and XML has definitely become a mainstream technology. This means that the programming of IT systems in relation to library data changes from a specific to a mainstream technology. Among the standard tools attached to XML is the possibility of building conversions among various formats into a stylesheet, so that the individual record can be rendered in several formats. A registration can thus be presented in various ways: not only in shorter and longer presentation formats, but also in different MARC formats. This makes it possible to register at a detailed level, and automatically let the data be represented in simpler structures for international interoperability. The specification of an XML format in the form of an XML Schema makes it possible to incorporate data validation. This means that different suppliers of systems, by using the same XML Schema, can support a uniform quality of data.
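The stylesheet-based conversion mentioned above is usually written in XSLT; the same mapping can be sketched in a few lines of Python. The sketch below renders the MODS title example as the corresponding MARC21 field. It is illustrative only: the official MODS-to-MARC mapping covers far more than the title.

# Sketch of a format conversion of the kind described above:
# a MODS titleInfo element rendered as a MARC21 245 field.
import xml.etree.ElementTree as ET

mods = ET.fromstring(
    "<titleInfo>"
    "<title>Sound and fury :</title>"
    "<subTitle>the making of the punditocracy</subTitle>"
    "</titleInfo>"
)

title = mods.findtext("title")
subtitle = mods.findtext("subTitle")
field = f"245 10 *a {title}*b {subtitle}"
print(field)   # -> 245 10 *a Sound and fury :*b the making of the punditocracy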

MARC and XML

There are some initiatives concerned with the use of XML in relation to MARC records. In this context it is important to differentiate between content formats and transportation formats. A content format is a format that defines the actual descriptions and dictates, for example, that a title must be in field 245 sub-field *a, and that a field number must have three digits. A transportation format is a container into which various content formats can be placed. MARC21, UNIMARC and danMARC2 are all content formats. So far, the only internationally standardised transportation format is ISO 2709, which can contain all the different MARC formats.

A well-known content format presented in XML is the Metadata Object Description Schema (MODS)[4], which is a slightly reduced MARC21 format, described verbally rather than in field codes. MODS is not a general XML solution for MARC records, but it comes close to a MARC21 solution. Yet it does not solve (most of) the problems associated with MARC as such. As shown in the title example from a MODS record in the previous section, the ISBD character set is still applied.

The transportation format ISO 2709 uses a very compact form of data packing, which is eminently suited to the purpose it was originally designed for, namely packing on magnetic tape using the smallest possible space. There are various versions of XML packings for MARC: for example, a MARC21 version created by the Library of Congress (Library of Congress, 2003), and one made in connection with OAI-PMH (Open Archives Initiative – Protocol for Metadata Harvesting) (Open Archives Initiative, 2002). ISO 2709 will probably be replaced by an XML schema. This was discussed at a meeting of ISO's committee for interoperability within information and documentation (TC46/SC4) in May 2003. A process has begun to develop the Library of Congress MARC21 XML packing into an ISO standard with a more general XML-based transportation format, incorporating all MARC formats. Apart from editing the comments (where MARC21 is to be replaced by MARC), this transportation format only requires information about which formats are present in the record. The same title field in the example above would look like this in the MARC21 XML packing:

<datafield tag="245" ind1="1" ind2="0">
<subfield code="a">Sound and fury :</subfield>
<subfield code="b">the making of the punditocracy</subfield>
</datafield>

This kind of supplement to ISO 2709 is obviously made in XML, but it offers only some of the advantages inherent in XML. Certainly, XML is a mainstream technology, but it is still the MARC format, with fields, sub-fields and indicators, that is embedded in the record.
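The compactness of ISO 2709 comes from its fixed-width machinery: a 24-character leader, a directory of 12-character entries (3-digit tag, 4-digit field length, 5-digit starting position), and fields separated by control characters. A sketch of reading the directory, illustrative rather than a full parser:

# Sketch of the compact ISO 2709 packing: 24-byte leader, then a directory
# of 12-byte entries (tag 3, length 4, start 5), terminated by 0x1E.
# Illustrative only; a real parser must also handle indicators, the
# subfield delimiter 0x1F and the record terminator 0x1D.
def read_directory(record: bytes):
    base_address = int(record[12:17])          # from the leader
    directory = record[24:record.index(b"\x1e")]
    fields = []
    for i in range(0, len(directory), 12):
        tag = directory[i:i + 3].decode()
        length = int(directory[i + 3:i + 7])
        start = int(directory[i + 7:i + 12])
        data = record[base_address + start:base_address + start + length]
        fields.append((tag, data.rstrip(b"\x1e")))
    return fields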

Interplay with other sectors and functions

MARC is library-specific and in practice cannot interact with environments other than the internal library environment. Previously this was not a problem, but with the increased exchange of data between different sectors it has now become problematic. The world around the libraries has changed, and the barriers between different sectors have broken down in various ways. A library is no longer an independent island; interplay between libraries and with external sectors is a definite requirement. A university library must be able to interrelate with the university's administrative systems and export bibliographic records to applications that are often quite different and simpler, such as various support tools for writing reports.

Denmark has launched a project on recommendations for cooperation between institutions within the fields of archives, museums and libraries. The objective is to reach a common presentation of data from the three sectors. Dublin Core has been chosen, since the overall structure of this format makes a common presentation possible. On the other hand, it is not feasible to use Dublin Core internally in the sectors, as it is far too general and unable to cope with specific needs. The choice of Dublin Core is problematic, because this rather simple format basically focuses on Web resources. But it is an attempt to establish a cross-domain registration schema, and an ever-increasing part of the materials from libraries, archives and museums is published on the Internet:

Common data elements for Danish Archives-Libraries-Museums:
Title
Creator
Contributor
Publisher
Subject
Description
Date – Refinements by: Created, Modified
Type
Format
Identifier
Source
Language
Relation
Coverage – Refinements by: Spatial, Temporal
Rights

It turns out that there were data for all 15 Dublin Core elements (ISO, 2003). The need to go beyond the 15 general elements was primarily caused by the need to express relations to other documents/objects in a more specific manner. Furthermore, there is a need to describe other things than just the time of publication.

Another example is various governmental or municipal administrations' need to exchange data. Here, Dublin Core has been chosen as the basis, and a set of metadata elements has been defined for exchange between Danish authorities' case management systems (DokForm-gruppen, 2002):

Core metadata for exchange of cases:
Title
Creator
Subject
DateCreated
DateSend
DateAvailable
Identifier
RightsSecurityClassification
RightsAccessRights
Receiver
Type
RelationHasPart
RelationIsPartOf
Process
DateAction
Owner
SendBy

It turns out that public authorities have the same need to handle relations as archives and museums. This points to FRBR as a feasible model. Regarding the publishing sector, there are at the moment tremendous efforts going on to ensure interoperability between different systems, such as Online Information eXchange (ONIX)[5]. Some fairly comprehensive data schemes have been developed, which match MARC in complexity. The Library of Congress has mapped 211 data elements in ONIX to MARC21 (Library of Congress, Network Development and MARC Standards Office, 2000), so interplay between ONIX and MARC21 will be possible. Yet ONIX is, of course, completely adapted to the bookseller's needs. From the learning community, some basic data elements can be picked out of Learning Object Metadata (LOM) (IEEE LTSC Learning Object Metadata Working Group, 2002):

Learning Object Metadata:
Title
Keyword
Description
Contribute – Refinements by roles: Author, Publisher, etc.
Contribute.date
Format
Identifier
Language
Relation
Coverage
Rights
Version
Learning resource type

The wording is a little different from the previous examples, but the basic content is very much the same. Today, a format has many functions apart from facilitating search and presentation. In various administrative processes, data from bibliographic records are used in contexts such as the acquisition of material and interlibrary loans. In connection with the ISO 8459 series on data elements, a number of bibliographic data elements are defined for use in relation to non-bibliographic registration:

ISO 8459 bibliographic data elements for non-bibliographic use:
Author or creator
Title
Series
Edition
Place of publication
Publisher
Other contributor
Date of publication
Date of creation
Form of contents
Number of volumes
Pagination
Size of item
Accompanying material
Format of item (technical specifications)
Subject or keyword
Classification
Abstract or description

A number of data elements are associated with circulation, ILL and acquisition. In this context, rights management plays a part. The DCMI-Libraries Working Group has explored various uses for the Dublin Core Metadata Element Set in library and related applications. They envisioned the following possible uses (a conversion of the kind mentioned in the fourth point is sketched below):
. to serve as an interchange format between various systems using different metadata standards/formats;
. to use for harvesting metadata from data sources within and outside of the library domain;
. to support simple creation of library catalogue records for resources within a variety of systems;
. to expose MARC data to other communities (through a conversion to DC); and
. to allow for acquiring resource discovery metadata from non-library creators using DC.
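A minimal sketch of such a MARC-to-Dublin-Core conversion, reduced to a tag-to-element table. The tag selection is illustrative; real crosswalks, such as the Library of Congress MARC-to-DC mapping, are considerably more detailed.

# Minimal MARC21-to-Dublin-Core crosswalk sketch; the tag selection is
# illustrative, not the full Library of Congress mapping.
CROSSWALK = {
    "245": "title",
    "100": "creator",
    "700": "contributor",
    "260": "publisher",
    "650": "subject",
    "520": "description",
    "020": "identifier",
    "041": "language",
}

def marc_to_dc(fields):
    """fields: list of (tag, value) pairs; returns DC element/value pairs."""
    return [(CROSSWALK[tag], value) for tag, value in fields if tag in CROSSWALK]

print(marc_to_dc([("245", "Sound and fury"), ("100", "Alterman, Eric")]))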

DC-Library Application Profile (draft):
Title – Refinements by: Alternative
Creator
Contributor
Publisher
Subject
Description – Refinements by: Abstract, Table Of Contents
Date – Refinements by: Created, Valid, Available, Issued, Modified, Copyrighted, Submitted, Accepted, Captured
Type
Format – Refinements by: Extent, Medium
Identifier – Refinements by: Identifier Citation
Source
Language
Relation – Refinements by: Is Version Of, Is Format Of, Has Format, Is Replaced By, Replaces, Is Part Of, Has Part, Requires, Is Referenced By, References
Coverage – Refinements by: Spatial, Temporal
Rights
Audience
Edition
Location
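As an illustration of how such refinements are expressed in practice, a qualified DC description carries the refinement as a qualified element. The following fragment, in the XML style used elsewhere in this article, is hypothetical (the record wrapper, the date and the values are invented; the dcterms element names are from the DCMI Metadata Terms vocabulary):

<record>
<dc:title>After MARC – what then?</dc:title>
<dcterms:created>2003-09-01</dcterms:created>
<dcterms:isPartOf>Library Hi Tech</dcterms:isPartOf>
</record>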

Conclusion The development of the Internet means that data from different sectors are more visible and accessible than before. A natural consequence is that the borders between sectors is being demolished. This also applies to libraries. It is necessary to be open to the world outside and to share standards and data with other sectors. Bibliographic records should be much easier to reuse – by other institutions as well as other libraries. The question, then, is: how to create descriptions with several layers, which can be used in various contexts and still keep the duplication of data to a minimum? One possible model could contain a division of data: . Core data. . Supplement: data generally necessary for libraries. . Supplement: national need for libraries. . Supplement: specific needs of the individual library. . Supplement: data necessary for governmental information.

The question of rights management turns out to be a major issue outside of library systems. This has not been the case for the libraries, since they, as far as book material is concerned, have been covered by a clause in the copyright law, which states that lending can take place without an agreement with the originator. When comparing the examples above, one suggestion for a common selection (in alphabetical order) could be:
. Creator;
. Contributor;
. Date (available/created/published);
. Description;
. Edition;
. Format (technical specifications);
. Identifier;
. Language;
. Publisher;
. Relation;
. Subject;
. Title; and
. Type (genre).

The Core data could be selected with inspiration from Dublin Core's 15 elements and from applications of Dublin Core in other sectors. The list in the previous section might act as inspiration, because it illustrates the necessity for a general schema to look further than Dublin Core, based as that is on Internet resources. For data generally necessary to the libraries, there are a great number of sources. FRBR is undoubtedly important, but data element standards like ISO 8459 can also be considered. Supplements should not be seen as a layer-upon-layer superstructure on the core. Output will be many-faceted and vary depending on context. Document descriptions could be embedded in a common structure that reflects relations between documents in relation to the FRBR model's expression and work. It must



offer the user the chance to navigate between different concrete documents as well as being able to see their mutual relations.

The transition from the present formats will of course cause some problems. But if a new format is sufficiently specific, various MARC formats can be created based on data from such a system, thereby ensuring that new data can be used in old systems. On the other hand, conversion from a MARC format is not always easy, since the specificity will often be lower in the recipient format. A comparison between MARC21 and danMARC2 shows that MARC21 has a relatively low specificity. One could consider a revision of MARC21 adapted to the online age. Although MARC21 contains many archaic traits, it would seem more reasonable to look forward rather than back to what might have been a good idea some ten years ago. Furthermore, a format transformation is quite costly.

Over the past few years XML has become a magic term. But embedding of MARC data in XML is not enough. Replacing ISO 2709 with a MARC XML format would be useful, but does not allow for data interplay with other sectors. It is obvious that an XML format must be developed in cooperation with other sectors. This XML format must be able to contain several layers in the description and at the same time accommodate different types of data. Embedding in XML is not perfect for all types of data, but for the most part an embedding with a trivial HTTP reference will be quite sufficient. It is important to get the balance right.

We need international development efforts, because the definition of core data has to be based on broad consensus in the international library world. This also includes the choice of XML. The library world will have to leave MARC for something different and more flexible. I have tried to suggest in which direction it ought to move. Yet there is a long way to go, and it is going to take time. XML presents some opportunities and it could certainly be developed further. Primarily, a development of the FRBR model and a resulting change of paradigm in cataloguing and registration practice is needed. This article does not claim to have given an exhaustive account of either FRBR or XML, but hopefully there are some useful pointers to possible developments for a future catalogue practice.

Ideally, the suggested solution should be for "the whole world", but not for all kinds of resources. The idea in this article is related to describing and discovering "information resources", which are primarily text-based. Metadata for the registration of other resources – from persons to nails and bricks – are also used, but the core metadata are completely different. Talking about core metadata for the whole world of information resources also tackles a very big objective. In the practical world, there will be different metadata sets for different communities, and not all of them will be connected. Yet it will be useful to work with this goal as a point of departure. Though not applying to the whole world, it will still be very valuable to bring the different MARC communities together. It should be possible to work out a broader solution. In any case, such a solution will have the potential to be the basis for interoperability between sectors.

Notes
1 The delimiter symbol is represented by an "*" in this article.
2 The letter "å" is a special Danish letter.
3 Extensible Markup Language (XML), available at: www.w3.org/XML/
4 MODS, Metadata Object Description Schema, available at: www.loc.gov/standards/mods/
5 ONIX, Online Information eXchange, available at: www.editeur.org/onix.html
6 Dublin Core Metadata Initiative, available at: http://dublincore.org/

References
Denton, W. (2003), FRBR and Fundamental Cataloging Rules, Miskatonic University Press, available at: www.miskatonic.org/library/frbr.html
DokForm-gruppen (2002), Metadata til udveksling mellem EDH/ESDH-systemer, IT- og Telestyrelsen, Copenhagen, available at: www.oio.dk/files/horingsex-efter.doc
IEEE Learning Technology Standards Committee (LTSC) Learning Object Metadata Working Group (2002), Standard for Information Technology – Education and Training Systems – Learning Objects and Metadata, IEEE P1484.12, available at: http://ltsc.ieee.org/wg12/
IFLA (1994), UNIMARC Manual: Bibliographic Format 1994, available at: www.ifla.org.sg/VI/3/p1996-1/sect.htm


IFLA Study Group on the Functional Requirements for Bibliographic Records (1998), Functional Requirements for Bibliographic Records: Final Report, K.G. Saur, München, available at: www.ifla.org/VII/s13/frbr/frbr.htm
International Organization for Standardization (ISO) (1996), ISO 2709:1996 Information and Documentation – Format for Information Exchange, ISO, Geneva.
International Organization for Standardization (ISO) (2003), ISO 15836:2003 The Dublin Core Metadata Element Set, ISO, Geneva, available at: http://dublincore.org/documents/dces/ (not ISO-style, but DCMI-style).
Jonsson, G. (2002), The Basis for a Record in Major Cataloguing Codes and the Relation to FRBR, IFLA, Glasgow, available at: www.ifla.org/IV/ifla68/papers/052-133e.pdf
Katalogdatarådet (1998), danMARC2. Edb-format til inddatering og udveksling af bibliografiske data i maskinlæsbar form, Biblioteksstyrelsen, Copenhagen, available at: www.kat-format.dk/danMARC2/default.html
Library of Congress, Network Development and MARC Standards Office (2000), ONIX to MARC 21 Mapping, available at: http://lcweb.loc.gov/marc/onix2marc.html
Library of Congress (2003), MARCXML: MARC 21 XML Schema, available at: www.loc.gov/standards/marcxml/
Open Archives Initiative (2002), The Open Archives Initiative Protocol for Metadata Harvesting, Protocol Version 2.0, available at: www.openarchives.org/OAI/2.0/openarchivesprotocol.htm
Die Deutsche Bibliothek (1961), "Statement of Principles", adopted by the International Conference on Cataloguing Principles, Paris, October, available at: www.ddb.de/news/pdf/paris_principles_1961.pdf


An introduction to the Metadata Encoding and Transmission Standard (METS)

Morgan V. Cundiff

The author
Morgan V. Cundiff is Senior Standards Specialist in the Network Development and MARC Standards Office (NDMSO), Library of Congress, Washington, DC, USA.
Keywords
Online cataloguing, Digital libraries, Standards
Abstract
This article provides an introductory overview of the Metadata Encoding and Transmission Standard, better known as METS. It will be of most use to librarians and technical staff who are encountering METS for the first time. The article contains a brief history of the development of METS, a primer covering the basic structure and content of METS documents, and a discussion of several issues relevant to the implementation and continuing development of METS including object models, extension schemas, and application profiles.
Electronic access
The Emerald Research Register for this journal is available at www.emeraldinsight.com/researchregister
The current issue and full text archive of this journal is available at www.emeraldinsight.com/0737-8831.htm
Received September 2003; revised November 2003; accepted November 2003
© Morgan V. Cundiff
Library Hi Tech, Volume 22, Number 1, 2004, pp. 52-64, Emerald Group Publishing Limited, ISSN 0737-8831

In the late 1990s several large university libraries (led by the University of California, Berkeley, and with the support of the Digital Library Federation) worked together on a project known as Making of America II (MoA II). One of the objectives of MoA II was:

. . . to create a proposed digital library object standard by encoding defined descriptive, administrative and structural metadata, along with primary content, inside a digital library object (University of California, 2001).

The project participants rightly articulated the need for such a standard encoding: to serve as a digital object transfer syntax, to function as a data format for use with digital libraries, and to function as a data format for use with digital repositories. Toward this end, MoA II produced an XML DTD that specified the data elements and encoding for a limited set of object types (only objects comprised of text and still image content files, including diaries, journals, photographs, and correspondence). Several institutions undertook projects to create testbeds of digital objects based on the DTD, and also to develop software tools that supported the creation and processing of the object documents (Hurley et al., 2000). In February 2001 the Digital Library Federation sponsored a meeting called "The Making of America II DTD Workshop" (McDonough et al., 2001). The meeting was planned and directed by Jerome McDonough of New York University for the purpose of considering the future of the MoA II DTD (McDonough, formerly of UC Berkeley, was the primary author of the MoA II DTD). The meeting was attended both by veterans of the Making of America II project and by representatives of other institutions involved with digital library projects or with digital preservation. The question that the group considered was: should the MoA II DTD be revised to address its identified shortcomings, or should it be deprecated in favor of one of the other emerging XML standards for digital object and/or metadata management such as MPEG-7, RDF, or SMIL? The outcome of the meeting was a consensus that


the library community would be best served with a new version of the MoA II DTD. It was further decided that the DTD should be recast as an XML Schema, and that the schema would not provide (as did MoA II) a vocabulary for expressing descriptive or administrative metadata. Rather, descriptive and administrative metadata would be (optionally) expressed using vocabularies specified by other appropriate schemas (either existing or yet to be created) and included in a given document instance by means of the XML namespace facility. By April of 2001 McDonough had produced a new draft schema for community review and it had been decided to call it METS (for Metadata Encoding and Transmission Standard). The ongoing development of METS continues to be an initiative of the Digital Library Federation. A METS editorial board was established with McDonough serving as chair. It was also arranged for the Library of Congress Network Development and MARC Standards Office (NDMSO) to be the official maintenance agency for the standard. The Library of Congress maintains the official METS Web site (www.loc.gov/standards/mets/) and listserv. Development work on METS has been continuous to the present. Development has been an open process with comments and suggestions coming from many different contributors. The current version of METS is 1.3 and is considered to be stable (Library of Congress, n.d.).

A METS primer
A definition of METS might go something like this: METS is an XML Schema designed for the purpose of creating XML document instances that express the hierarchical structure of digital library objects, the names and locations of the files that comprise those digital objects, and the associated descriptive and administrative metadata. (Note that, at present, METS is limited to digital objects comprised of text, images, audio, and video files.)

This definition immediately begs the question, "What is a digital library object?". This is a question that could lead to a very open-ended discussion. METS itself does not provide a strict definition. However, for the purpose of the examples that follow, a digital library object (and therefore a METS document as well) will have a one-to-one correspondence with a typical library item (e.g. a book, a photograph, a sound recording, a map, and so on). This is in fact the approach most often taken with the several METS projects to date. More will be said below about other possibilities.

Sections of a METS document
A METS document is most quickly understood by examining its basic structural components[1]. A METS document can have up to seven major subsections:
(1) a METS Header (metsHdr);
(2) a Descriptive Metadata Section (dmdSec);
(3) an Administrative Metadata Section (amdSec);
(4) a File Section (fileSec);
(5) a Structure Map (structMap);
(6) Structural Links (structLink); and
(7) a Behavior Section (behaviorSec) (see Figure 1).

[Figure 1: Sketch of METS with seven major subsections]

The Structure Map
At minimum, a METS document must have one major subsection, a Structure Map (structMap). The Structure Map is the backbone of a METS document. It is the means by which the hierarchical structure and the sequence of the components of a digital object


are expressed. This is achieved by use of nested division (div) elements. Typical structural divisions of a digital library object might be the chapters and pages that comprise a book or the separate tracks that comprise a compact disc. A simple sketch (Figure 2) illustrates this.

[Figure 2: Sketch of minimal METS document with Structure Map (structMap)]

Figure 3 shows a valid METS document for a compact disc with two tracks. Notice that the TYPE attribute indicates the type of div and the ORDER attribute indicates the sequence in which the tracks on the compact disc occur.

[Figure 3: A valid METS document with Structure Map (structMap)]
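Since the figure itself is not reproduced here, the following sketch approximates the kind of encoding Figure 3 illustrates; the labels and attribute values are invented for illustration:

<mets xmlns="http://www.loc.gov/METS/">
  <structMap TYPE="logical">
    <!-- the object as a whole -->
    <div TYPE="compactDiscObject" LABEL="Example compact disc">
      <!-- ORDER records the sequence of the tracks -->
      <div TYPE="track" ORDER="1" LABEL="Track 1"/>
      <div TYPE="track" ORDER="2" LABEL="Track 2"/>
    </div>
  </structMap>
</mets>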

The File Section
The document in Figure 3 is a valid METS instance, but it is not yet very useful. Though it is not required, a METS object will most often also include a File Section (fileSec) element, as illustrated in Figure 4. The File Section is where the names and locations of the files that comprise the digital object are listed.

[Figure 4: Sketch of METS document with File Section (fileSec) element]

Figure 5 shows the compact disc example with a File Group (fileGrp) added. Notice that the fileSec element contains one File Group (fileGrp) element and that the fileGrp element contains two File (file) elements, one for each of the two tracks on the compact disc. Each file element contains a File Location (FLocat) element which specifies the location and filename of a file. Notice also in the structMap that each of the two div elements for the compact disc tracks contains a File Pointer (fptr) element.

[Figure 5: METS document with File Section (fileSec) element]

Within a METS document it is often desirable to associate information in one section of the document with information in a different section. This internal linking is accomplished by means of XML ID/IDREF attributes. In this case, each of the fptr elements has a FILEID attribute (an IDREF attribute) which contains a value (e.g. "file01") that makes the association between the particular div and the corresponding file. Here, the corresponding file element has an ID attribute with the matching value "file01". The METS document is now much more useful. It is now possible to discern the structure of the digital object and also to locate the corresponding files[2].
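A sketch in the spirit of Figure 5 follows (the file locations are invented; the comments point out the ID/IDREF linkage):

<mets xmlns="http://www.loc.gov/METS/"
      xmlns:xlink="http://www.w3.org/1999/xlink">
  <fileSec>
    <fileGrp>
      <!-- each file carries an ID that a structMap fptr references -->
      <file ID="file01" MIMETYPE="audio/wav">
        <FLocat LOCTYPE="URL" xlink:href="http://www.example.org/audio/track01.wav"/>
      </file>
      <file ID="file02" MIMETYPE="audio/wav">
        <FLocat LOCTYPE="URL" xlink:href="http://www.example.org/audio/track02.wav"/>
      </file>
    </fileGrp>
  </fileSec>
  <structMap TYPE="logical">
    <div TYPE="compactDiscObject">
      <div TYPE="track" ORDER="1">
        <fptr FILEID="file01"/>
      </div>
      <div TYPE="track" ORDER="2">
        <fptr FILEID="file02"/>
      </div>
    </div>
  </structMap>
</mets>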

Descriptive Metadata
A METS document may optionally include a Descriptive Metadata Section (dmdSec), as illustrated in Figure 6.


The Descriptive Metadata Section does not provide a descriptive metadata vocabulary of its own. Rather, it provides a mechanism for pointing to metadata in external documents or systems using the Metadata Reference (mdRef) element, and also, alternatively, provides a mechanism for embedding descriptive metadata from a different namespace in the METS document using the Metadata Wrap (mdWrap) element. The sketch in Figure 7 shows the use of the mdRef element.

[Figure 7: Sketch of METS document with Descriptive Metadata Section (dmdSec) element]

Figure 8 shows a METS document with a Metadata Reference (mdRef) element. Notice that the xlink:href attribute of the mdRef element contains a URL which points to a MARC bibliographic record in an external bibliographic database.

[Figure 8: METS document with Metadata Reference (mdRef) element (points to descriptive metadata record in external system)]

An alternate method of managing descriptive metadata is to embed the bibliographic data in the METS document. This is accomplished by using the Metadata Wrap (mdWrap) element and either the XML Data (xmlData) or Binary Data (binData) subelements. The former is probably more typical and is illustrated here (Figure 9).

[Figure 9: Sketch of METS document with XML Data (xmlData) element]

The XML Data (xmlData) element acts as a kind of "socket" into which elements from a different namespace (i.e. elements defined in a different XML Schema) can be inserted.
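The two mechanisms might be sketched as follows; the URL, record identifier, and title are placeholders only:

<!-- pointing to an external record with mdRef -->
<dmdSec ID="dmd01">
  <mdRef LOCTYPE="URL" MDTYPE="MARC"
         xlink:href="http://catalog.example.org/record?id=12345"/>
</dmdSec>

<!-- embedding metadata from another namespace with mdWrap/xmlData -->
<dmdSec ID="dmd02">
  <mdWrap MDTYPE="MODS">
    <xmlData>
      <mods:mods xmlns:mods="http://www.loc.gov/mods/v3">
        <mods:titleInfo>
          <mods:title>Example title</mods:title>
        </mods:titleInfo>
      </mods:mods>
    </xmlData>
  </mdWrap>
</dmdSec>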


[Figure 10: METS document with Metadata Wrap (mdWrap) element]

Figure 10 shows a METS document with elements from the MODS descriptive metadata schema. In this case MODS is functioning as an "extension schema" to METS (more will be said about extension schemas below). Notice that the elements from the MODS schema (titleInfo, physicalDescription, etc.) are "namespace qualified" with the "mods:" prefix, indicating that these elements belong to the MODS vocabulary and not to METS. Notice also the pairs of ID/IDREF attributes. The fptr FILEID points to the file ID and the div DMDID points to the dmdSec ID. Making these associations explicit in this example may not really be necessary because there is only one div, one file, and one dmdSec. However, in cases where there are multiple div elements, file elements, and perhaps multiple sections of descriptive metadata, the associations are more important. Consider the example in Figure 11, in which there are two div elements with fptr subelements, two file elements, and two MODS relatedItem elements containing descriptive metadata for each of the two tracks on the compact disc. In a case like this ID/IDREF links are necessary to maintain the appropriate associations between different metadata sections. Notice the ID/IDREF links from fptr to file elements (as in Figure 9), but also notice the links from the div elements to dmdSec elements. The high-level div (TYPE="compactDiscObject") is associated with the entire dmdSec and the lower-level div elements (TYPE="track") are associated with track-level descriptive metadata. This ability to make associations between different levels of the Structure Map with the appropriate sub-parts


of the descriptive metadata is a very powerful and useful feature (see Figure 11).

[Figure 11: METS document with Metadata Wrap (mdWrap) element]

Administrative metadata
As with descriptive metadata, METS does not provide a vocabulary of its own for administrative metadata. The Administrative Metadata Section (amdSec) is further divided into four subsections:
(1) Technical Metadata (techMD);
(2) Rights Metadata (rightsMD);
(3) Source Metadata (sourceMD); and
(4) Digital Provenance Metadata (digiprovMD).


[Figure 12: Sketch of METS document with techMD, rightsMD, sourceMD, and digiprovMD elements]

Again, as with descriptive metadata, each of the four administrative metadata subsections may contain a Metadata Reference (mdRef) subelement to point to metadata external to the METS document, or a Metadata Wrap (mdWrap) subelement to embed administrative metadata in the METS document. And again, it is with the use of extension schemas that administrative metadata is expressed within a METS document (see sketch in Figure 12). Figure 13 shows a METS document for a single digital still image (as in Figure 10). This example uses the MIX schema to express technical metadata for still images.

[Figure 13: METS document with Technical Metadata (techMD) element]
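Since the figure is not reproduced here, a simplified sketch of such a section follows; the MIX content is reduced to a placeholder, and "tmd001" anticipates the ADMID linkage described next:

<amdSec>
  <techMD ID="tmd001">
    <mdWrap MDTYPE="NISOIMG">
      <xmlData>
        <!-- a real MIX record would carry detailed image capture metadata -->
        <mix:mix xmlns:mix="http://www.loc.gov/mix/"/>
      </xmlData>
    </mdWrap>
  </techMD>
</amdSec>
...
<file ID="file01" ADMID="tmd001" MIMETYPE="image/tiff">
  <FLocat LOCTYPE="URL" xlink:href="http://www.example.org/images/image01.tif"/>
</file>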


Including rights metadata, source metadata, or digital provenance metadata would all work the same way, though of course using different extension schemas. Notice that the file element ADMID attribute value ("tmd001") associates the file with the appropriate technical metadata.

METS Header
A METS document may optionally include a METS Header (metsHdr) section. The METS Header contains information about the METS document itself (primarily information about the creation of the document) as opposed to information about the source object (see Figure 14).

[Figure 14: METS document with METS Header (metsHdr) element]

Structural Links
A METS document may optionally include a Structural Links section. The Structural Links (structLink) element may contain one or more Structural Map Link (smLink) elements, each of which can express a link between any two div elements in the Structure Map (structMap) section. This facility was added to METS to provide a way to express the hyperlinks between Web pages in the case where the digital object in question is an archived Web site, or "Web site snapshot". A simple example of this is provided in Figure 15. The facility could also be used to create "alternative, horizontal arrangements of the divisions of the structMap" (Beaubien, 2003b) for any object type if there was a reason to do so.

[Figure 15: METS document with Structural Links (structLink) element]
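In sketch form, such a link might look like this (the div IDs are invented and would refer to page-level div elements defined in the structMap):

<structLink>
  <smLink xlink:from="div05" xlink:to="div08"
          xlink:title="hyperlink from page 5 to page 8"/>
</structLink>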

Behavior Section
A METS document may optionally include a Behavior Section (behaviorSec). A behavior section can be used to associate executable behaviors with content in the METS object. A behavior section contains one or more behavior elements, each of which has an interface definition element that represents an abstract definition of the set of behaviors represented by a particular behavior section. A behavior also has a mechanism element which is used to point to a module of executable code that implements and runs the behavior defined abstractly by the interface definition. Digital object behaviors can be implemented as linkages to distributed Web services[3].

The Behavior Section was added to METS at the urging of the FEDORA project. The FEDORA Web site (www.fedora.info) provides additional information about and examples of the Behavior Section. An example of a behavior might be something like page-turning functionality for a book object (see Figure 16). Notice that the behavior element STRUCTID attribute is used to associate the behavior with a particular div element.

[Figure 16: METS document with Behavior Section (behaviorSec) element]
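A behavior might be sketched as follows; the BTYPE value and the interface definition and mechanism URLs are invented for illustration:

<behaviorSec>
  <behavior ID="bhv01" STRUCTID="div01" BTYPE="page-turner">
    <!-- abstract definition of the set of behaviors -->
    <interfaceDef LOCTYPE="URL"
        xlink:href="http://www.example.org/interfaces/PageTurner.wsdl"/>
    <!-- executable module that implements the behavior -->
    <mechanism LOCTYPE="URL"
        xlink:href="http://www.example.org/services/pageTurner"/>
  </behavior>
</behaviorSec>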


Object models
In the examples given thus far there has been a simple relationship between one library item (a compact disc, a photograph, etc.) and one digital library object, i.e. one library item is expressed as one METS document. However, different approaches are possible. In the examples above the fptr element was used to associate a particular div element with a



particular file element within a single METS document. It is also possible to make associations between div elements and other METS documents using the METS Pointer (mptr) element. This makes it possible to create parent-child relationships and thus to create hierarchies of METS documents. There are several scenarios where this approach might be used. One is to encode a journal run using a top-down hierarchy of Title to Volumes to Issues to Articles. Each level of the hierarchy would be comprised of one or more METS documents. Another scenario might be where, in a particular repository, there is reason to create a METS document for every file and then to associate those files at the object level with a parent METS document. Another scenario is to use a METS document to aggregate any arbitrary group of digital objects into a collection. At the Library of Congress this approach was taken for a recent project called "Patriotic Melodies" (www.loc.gov/rr/perform/patriotic). Collection objects were created for a given work and then library item objects were specified as members of the collection object. The example in Figure 17 shows a collection object for The Library of Congress March. The mptr element is used to point to the three members of the collection: a set of score and parts, a sound recording, and a set of manuscript sketches.

[Figure 17: METS document functioning as a collection object (with METS Pointer elements)]
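In outline, such a collection object might look like the following sketch; the member labels follow the example, but the file names and URLs are invented:

<mets xmlns="http://www.loc.gov/METS/"
      xmlns:xlink="http://www.w3.org/1999/xlink">
  <structMap TYPE="logical">
    <div TYPE="collection" LABEL="The Library of Congress March">
      <div TYPE="member" LABEL="Score and parts">
        <mptr LOCTYPE="URL" xlink:href="http://www.example.org/mets/score.xml"/>
      </div>
      <div TYPE="member" LABEL="Sound recording">
        <mptr LOCTYPE="URL" xlink:href="http://www.example.org/mets/recording.xml"/>
      </div>
      <div TYPE="member" LABEL="Manuscript sketches">
        <mptr LOCTYPE="URL" xlink:href="http://www.example.org/mets/sketches.xml"/>
      </div>
    </div>
  </structMap>
</mets>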

Extension schemas, OAIS, and preservation
It was noted that METS is extended by the use of extension schemas for descriptive metadata and administrative metadata. It was also noted that administrative metadata is subdivided into four subcategories (technical, rights, source, and digital provenance). Technical metadata schemas created to date have been specific to either still images, text, audio, or video, and the same might be true for source metadata. So there is potential for a fairly long list of needed schemas. Though METS implementers are free to use any extension schemas, it is also clear that standardization will be beneficial for digital library and digital repository interoperability. Toward this end the METS editorial board endorses selected schemas for use with METS. To date, three extension schemas for descriptive metadata have been endorsed (MARCXML, MODS, and Simple Dublin Core) and two schemas for technical metadata (MIX for still images and TextMD for text) (Library of Congress, 2003b). Schemas for expressing descriptive metadata are fairly stable, but producing a complete suite of schemas for expressing administrative metadata remains one of the challenges for the METS community (Library of Congress, 2003c). It is also interesting to note that the development of administrative metadata extension schemas is intertwined with the issue of "preservation metadata". The Reference Model for an Open Archival Information System (OAIS) (ISO, 2002) articulates a set of abstract information categories that are part of an Archival Information Package (AIP). These conceptual categories are called Packaging Information, Content Information (including Representation Information) and Preservation Description Information (including Reference Information, Context Information, Provenance Information, and Fixity Information). In a study done for the Library of Congress it was argued that these conceptual categories would map to a METS document that made use of a full set of extension schemas (Library of Congress, 2003c). Many, though not all, of the preservation metadata elements would reside in the administrative metadata extension schemas. Thus, a METS document could be considered to be an implementation of an OAIS Information Package. There is, of course, still much work to be done in the area of defining preservation metadata, and consequently the task of creating definitive extension schemas for use with METS is very difficult. The PREMIS (PREservation Metadata: Implementation Strategies) Working Group II is:

. . . aimed at the development of recommendations and best practices for implementing preservation metadata.

The work of this group may be helpful in clarifying what preservation metadata elements need to be captured for a given object type (www.oclc.org/research/projects/pmwg/).


Application profiles

METS is intentionally designed to provide a great deal of flexibility. It is understood that different implementers, or communities of implementers, may have very different objectives and constraints, and that therefore local practices will vary widely. On the other hand, there are also compelling reasons to promote common practice, including interoperability of digital libraries, software tool creation, and easier entry for institutions just getting started with METS. The METS Board has recently proposed the creation and registration of METS Application Profiles. The METS Profile 1.0 Requirements document provides the following definition:

METS Profiles are intended to describe a class of METS documents in sufficient detail to provide both document authors and programmers the guidance they require to create and process METS documents conforming with a particular profile. Institutions creating profiles for their METS objects may register those profiles with the Library of Congress Network Development and MARC Standards Office after they are approved by the METS Editorial Board; such profiles will be made publicly available so that those wishing to produce METS documents in accordance with a particular institution's requirements may examine them (Library of Congress, 2003d).

An Application Profile contains a specified set of elements:
. A URI which has been assigned to the profile.
. A short title for the class of profiled METS documents.
. An abstract.
. A date and time specifying when the profile was created.
. Contact information.
. A date when the profile was registered with the Library of Congress.
. An indication of any other profiles which may be related to the current profile.
. An enumeration of all extension schemas.
. An enumeration of any rules of description, along with details of where those rules apply.
. An enumeration of any controlled vocabularies.
. A description of any structural requirements regarding the construction of the METS object itself.
. A detailed description of the allowable technical characteristics of content files or executable behaviors.
. A description of affiliated tools.

A METS Application Profile is expressed as an XML document itself. An XML Schema has been created for this purpose (Library of Congress, 2003e).

Conclusion
METS provides a method for aggregating all the metadata relevant to a given digital library object. A well-constructed METS object is then relatively easy to manage. Because METS is expressed in XML, there is a wide range of software, often free, that can be used to create, store, display, transform, navigate, query, or publish METS objects. The current frontier for METS work includes additional extension schema development, application profile development, and software tool development. There is also a need for thorough documentation and training material. A number of libraries and library organizations have already launched METS-based applications, including OCLC, RLG, the California Digital Library, the Library of Congress, the University of California at Berkeley, the University of Graz, Harvard, MIT, Stanford, and Oxford University (University of California, n.d.). The degree to which the use of METS continues to spread will be one of the interesting things to watch in 2004.

Notes
1 There are several prerequisites to gaining a good understanding of METS. Familiarity with the XML standard and the adjunct standards for XML Schema, XLink, and XML Namespaces is especially important.
2 The relationship between fptr and file elements may not be as simple as in the examples given (see Beaubien, 2003a).
3 This explanation of the behavior section is quoted from Library of Congress (2003b).
4 A pioneering project at the Library of Congress, the Digital Audio-Visual Preservation Prototyping Project, has proposed a complete set of METS extension schemas, available at: http://lcweb.loc.gov/rr/mopic/avprot/metsmenu2.html

References
Beaubien, R. (2003a), "BrainDumpRickBeaubien", available at: http://raymondyee.net/wiki/BrainDumpRickBeaubien
Hurley, B., Price-Wilkin, J., Proffitt, M. and Besser, H. (2000), "The Making of America II testbed project: a digital library service model", available at: www.clir.org/pubs/reports/pub87/contents.html
ISO (2002), Reference Model for an Open Archival Information System (OAIS), ISO, Geneva, January, available at: wwwclassic.ccsds.org/documents/pdf/CCSDS-650.0-B-1.pdf
Library of Congress (2003a), METS: An Overview and Tutorial, available at: www.loc.gov/standards/mets/METSOverview.v2.html
Library of Congress (2003b), METS Editorial Board Endorses Extension Schemas for Use with METS, available at: www.loc.gov/standards/mets/news061503.html
Library of Congress (2003c), Archival Information Package (AIP) Design Study, available at: http://lcweb.loc.gov/rr/mopic/avprot/AIP-Study_v19.pdf
Library of Congress (2003d), METS Profile 1.0 Requirements, available at: www.loc.gov/standards/mets/profile_docs/METS.profile.requirements.rtf
Library of Congress (2003e), METS Profile Schema Version 1.0, available at: www.loc.gov/standards/mets/profile_docs/mets.profile.v1-0.xsd
Library of Congress (n.d.), METS Schema, available at: www.loc.gov/standards/mets/mets.xsd
McDonough, J., Myrick, L. and Stedfeld, E. (2001), "Report on the Making of America II DTD Digital Library Federation workshop", available at: www.diglib.org/standards/metssum.pdf
University of California (2001), The Making of America II, available at: http://sunsite.berkeley.edu/MOA2/
University of California (n.d.), METS Implementation Registry, available at: http://sunsite.berkeley.edu/mets/registry


Pulling it all together: use of METS in RLG cultural materials service

Merrilee Proffitt

The author
Merrilee Proffitt is Program Officer for Access to Research Resources, RLG, Mountain View, California, USA.
Keywords
Online cataloguing, Standards
Abstract
RLG has used METS for a particular application, that is, as a wrapper for structural metadata. When RLG cultural materials was launched, there was no single way to deal with "complex digital objects". METS provides a standard means of encoding metadata regarding the digital objects represented in RCM, and METS has now been fully integrated into the workflow for this service.
Electronic access
The Emerald Research Register for this journal is available at www.emeraldinsight.com/researchregister
The current issue and full text archive of this journal is available at www.emeraldinsight.com/0737-8831.htm
Received 7 October 2003; revised 3 November 2003; accepted 7 November 2003
Library Hi Tech, Volume 22, Number 1, 2004, pp. 65-68, © Emerald Group Publishing Limited, ISSN 0737-8831, DOI 10.1108/07378830410524503

RLG cultural materials (RCM) is a virtual collection of cultural works and artifacts from RLG member institutions, in the form of an online database of digital multimedia representations (or "surrogates") of works and their corresponding textual descriptions. The world-class collections come from an expanding alliance of more than 60 RLG member institutions that includes museums, libraries, archives, and historical societies. Launched in 2001, this unique assemblage of materials is an exceptional database for use in teaching and learning. Examples of the types of materials represented in RCM include maps, photographs, objects, art, sound, and film[1]. Currently (October 2003), the service contains 205,723 works in 82 collections contributed by 26 institutions. Individual "works" can be an image, a video, a group of images, or a book – ranging from the simple to the highly complex. Some of the works represented in RCM are simple to express in digital form. Many works consist of a digital image of a single photograph – these are represented by a single digital file, and no structural metadata are required to model the work. This work may be accompanied by descriptive metadata, represented in a MARC record, by a Dublin Core record, by a record constructed according to the VRA Core, or by a home-grown or locally defined item-level description. These simple works can be represented in a straightforward manner, and no navigational apparatus is required. When many different digital files represent a work – that is, where structural metadata are required to model the object – a different approach is required. RCM contains not only works represented by a single file, but also works such as books, sheet music, and other page-turned objects. For example, 32,000 photographs represent 859 works from the Library of Congress's South Pacific Ethnographic Archives. The single work representing the Ellesmere Chaucer contributed by the Huntington Library consists of over 200 images of illuminated manuscript


pages. If one includes the individual components in these types of complex works, there are over half a million digital files represented in RCM. Although it is possible to represent these complex digital objects in a non-standard way, RLG wanted to use a standard to represent these works in RCM. RLG made the decision to use the Metadata Encoding and Transmission Standard (METS) in 2002, and in early 2003 RLG implemented a METS viewer for the RCM system to allow display and navigation of complex digital objects. METS is a standard for "encoding descriptive, administrative, and structural metadata regarding objects within a digital library"[2]. METS provides us with a standard to follow and an active community of colleagues with whom to confer about issues unique to this topic. Further, because METS is expressed in XML, it fits well into the RCM application[3]. Whenever possible, RLG prefers to use standards within our systems and services. When there is a ready standard available, organizations need not develop unique practices. RLG is also encouraging the use of

METS in the RCM Alliance Descriptive Guidelines[4]. Hopefully, this will encourage others to give METS a try.

A "complex" RCM work can appear in a search result alongside other, simpler descriptions. The work appears with a thumbnail image and brief description in multiple-work results, just like any other item (see Figure 1).

[Figure 1: The Canterbury Tales as shown in the list view for a search of Chaucer]

When the work is viewed in more detail, the structural details of the object, its description, and its related digital materials can all be consistently navigated by basing that process on the METS object. For example, a viewer for the METS object for the Ellesmere Manuscript of Geoffrey Chaucer's Canterbury Tales provides a detailed description of the work as a whole, a way to navigate the individual tales and individual folios through a table-of-contents interface, and tools to zoom in to examine page details closely (see Figures 2 and 3).

[Figure 2: The first leaf in The Knight's Tale in RLG's METS viewer]
[Figure 3: The user can choose to see a variety of image sizes]

The accompanying examples show a digital object that consists largely of page images. The same framework would function as well for works that include encoded text, video, sound, and other digital object


types. The viewer uses HTML and JavaScript to show the content described by a METS object. The METS object stored in the RCM object repository is responsible for:
. identifying descriptive data for the work as a whole;
. administrative information about how the digitized parts of the work were produced;
. an inventory of all the digitized objects associated with the work; and
. a "structural map" that shows a hierarchical structure for the work, making connections between digital objects in the work's inventory and that hierarchy.

To display a work the viewer receives a streamlined version of the METS object that is created by applying an XSLT (eXtensible Stylesheet Language: Transformation) stylesheet to the METS XML data.
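RLG's production stylesheet is not reproduced in this article; as a rough illustration of the kind of transformation just described, an XSLT fragment like the following could render the structMap divisions of a METS object as a nested HTML list for a table-of-contents display:

<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:mets="http://www.loc.gov/METS/">
  <!-- the structure map becomes the outermost list -->
  <xsl:template match="mets:structMap">
    <ul><xsl:apply-templates select="mets:div"/></ul>
  </xsl:template>
  <!-- each division becomes a list item; nested divs become nested lists -->
  <xsl:template match="mets:div">
    <li>
      <xsl:value-of select="@LABEL"/>
      <xsl:if test="mets:div">
        <ul><xsl:apply-templates select="mets:div"/></ul>
      </xsl:if>
    </li>
  </xsl:template>
</xsl:stylesheet>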

Conclusion
So far, METS has proved a good fit for both the RCM application and for the data that reside in RCM. We have successfully loaded a variety of different collections from a number of contributing institutions, and look forward to extending our METS viewer capability as we receive complex works with different structural models and various data types.

Notes
1 Available at: http://culturalmaterials.rlg.org
2 Available at: www.loc.gov/standards/METS
3 The RCM application is built on IBM's DB2 Universal Database, a current-generation, object-relational database management system. The descriptive data are stored as XML, and formatted for display using XSLT (Extensible Stylesheet Language: Transformation). The application servers run IBM's WebSphere and an Apache Web server.
4 Available at: www.rlg.org/culturalres/descguide.html


A preliminary crosswalk from METS to IMS content packaging

Raymond Yee and Rick Beaubien

The authors
Raymond Yee is Technology Architect, Interactive University Project, and Rick Beaubien is Lead Software Engineer, Research and Development, Library Systems Office, both at the University of California, Berkeley, California, USA.
Keywords
Extensible markup language, Standards, Education
Abstract
As educational technology becomes pervasive, demand will grow for library content to be incorporated into courseware. Among the barriers impeding interoperability between libraries and educational tools is the difference in specifications commonly used for the exchange of digital objects and metadata. Among libraries, Metadata Encoding and Transmission Standard (METS) is a new but increasingly popular standard; the IMS content-package (IMS-CP) plays a parallel role in educational technology. This article describes how METS-encoded library content can be converted into digital objects for IMS-compliant systems through an XSLT-based crosswalk. The conceptual models behind METS and IMS-CP are compared, the design and limitations of an XSLT-based translation are described, and the crosswalks are related to other techniques to enhance interoperability.
Electronic access
The Emerald Research Register for this journal is available at www.emeraldinsight.com/researchregister
The current issue and full text archive of this journal is available at www.emeraldinsight.com/0737-8831.htm
Received 26 August 2003; revised 17 September 2003; accepted 7 November 2003
The authors would like to acknowledge valuable feedback from David A. Greenbaum and generous financial support from the Andrew W. Mellon Foundation.
Library Hi Tech, Volume 22, Number 1, 2004, pp. 69-81, © Emerald Group Publishing Limited, ISSN 0737-8831, DOI 10.1108/07378830410524512

Research libraries hold vast and increasing amounts of high-quality digital content. Such content includes not just bibliographic records and born-digital materials, but also digital versions of the library's primary holdings such as archival manuscripts and pictorial collections. Libraries have traditionally provided access to their materials through interfaces created and hosted by the libraries themselves. With the increasing adoption of the tools of educational technology (learning management systems, specialized environments for the teaching of disciplines and presentation of research results) and Web-based collaborative and authoring tools in general, users begin to expect seamless access to library resources in multiple contexts outside library systems. The library and educational technology communities are only beginning to address jointly the challenges of making library and educational/instructional environments work together ("interoperate") well enough to facilitate seamless access for users[1]. Among the many barriers impeding interoperability between libraries and educational tools is the difference in specifications commonly used for the exchange of digital objects and metadata. Extensible Markup Language (XML) interoperability specifications proposed within the library/digital repository community and within the educational technology domain are geared to the exchange of content within that specific community. By themselves, these standards do not address the need for cross-community transmission of data. Metadata Encoding and Transmission Standard (METS), which is maintained in the Network Development and MARC Standards Office of the Library of Congress[2] and is being developed as an initiative of the Digital Library Federation[3], uses XML to provide a vocabulary and syntax for applying structural metadata to digital content, and for linking structure and content with pertinent descriptive


and administrative metadata. It attempts to provide a standard basis for managing digital materials within a digital library, for exchanging materials between digital libraries, and for disseminating the digital content and associated metadata to the end user. The IMS Content Packaging (IMS-CP) specification, one of several XML-based specifications created under the auspices of the IMS to facilitate interoperability among educational environments, plays a parallel role in the educational technology community:

The IMS Content Packaging (CP) Information Model describes data structures that are used to provide interoperability of Internet-based content with content creation tools, learning management systems (LMS), and run time environments. The objective of the IMS-CP Information Model is to define a standardized set of structures that can be used to exchange content. These structures provide the basis for standardized data bindings that allow software developers and implementers to create instructional materials that interoperate across authoring tools, LMSs, and run time environments that have been developed independently by various software developers[4].

Of the range of scenarios that need to be accounted for in full interoperability between library and educational tools, this paper focuses on the case of absorbing METS objects representing library resources into an IMS-CP learning package. The adoption of METS by a significant number of libraries, museums, and archives – a process already well under way – to package their digital content would result in tens of thousands of interesting, high-quality METS documents ready for incorporation into learning tools. Since IMS-CP is emerging as the leading contender for a standard file format for learning content, materials encoded in METS documents will likely find their way into learning environments via translation to IMS-CP. Techniques for such migration are the focus of this article.

Methodology
This article will start by examining a very simple METS document representing a digital version of primary source material – in this case, a manuscript dictation that is part of the Hubert Howe Bancroft Collection held by The Bancroft Library of the University of California, Berkeley. This examination will introduce the main components and features of METS, as these are likely to be applied by research libraries to digital versions of their holdings. Once the article has introduced the key features of METS, it will go on to look at a potential IMS-CP encoding of this same object. The simple METS and IMS-CP documents representing the same digital content will provide a starting-point for a more in-depth discussion of the high-level similarities between the two standards, as well as of some conceptual and practical differences. Finally, this article will propose an Extensible Stylesheet Language: Transformation (XSLT) crosswalk intended to serve as a preliminary tool for incorporating library METS objects into learning packages.

Readers should note that METS and IMS-CP are both fairly flexible standards, and this article does not attempt to deal with all possibilities for either a METS encoding or an IMS-CP encoding. The sample METS encoding represents options a research library is likely to use when representing digital versions of its holdings in METS. There is, however, no standard library METS encoding or profile, at least at this point, and, even within this relatively small community, practices are likely to vary fairly widely. Nor is there a standard encoding for learning packages, although various groups are exploring more narrowly-defined profiles and implementations of IMS-CP. When and if profiles emerge that narrow the range of METS and IMS-CP options and practices for certain applications, it should be possible to write profile-specific transformations between METS and IMS-CP objects that have fewer ambiguities. In the meantime, this article does not attempt to suggest a definitive encoding for library METS objects nor for IMS-CP objects. At best, it can suggest a preliminary, provisional and contingent means for absorbing library METS objects into IMS-CP packages.

This article assumes that METS and IMS-CP are not competing standards, and its purpose is not to appraise their relative merits. These are standards that attempt to meet the needs of different communities, and the main purpose here is to surface the concepts and to begin to establish a vocabulary that can facilitate communication and collaboration between the


learning and library communities, which after all share many common goals.

Simple encoding compared
The simple METS and IMS-CP documents presented below are much simpler than would be encountered in the real world, but will help to clarify high-level similarities between METS and IMS-CP. In particular, the descriptive and administrative metadata represented in the documents, which are encoded using the Metadata Object Description Schema (MODS)[5] and Metadata for Images in XML (MIX)[6] standards respectively, simply serve as placeholders for what would typically be much richer encodings. Also note that the samples omit some XML management data for the sake of clarity, and that consequently the documents will not validate as shown. (Full versions of the simple METS document and its simple IMS-CP version are available online[7]. A Web presentation of the sample METS document is also available[8].)

Simple METS document
Below are the top-level elements that can comprise a METS document (see Figure 1):
. [metsHdr (header)]
. dmdSec (descriptive metadata section)
. amdSec (administrative metadata section)
. fileSec (file section)
. structMap (structural map)
. [structLink (structural links section)]
. [behaviorSec (behavior section)]

Note that the sample METS document does not demonstrate the bracketed sections above (header, structural links section or behavior section). These have important functions for some applications but are not among the core sections of METS. Of the core sections, the structural map (structMap) is the most important, and is in fact the only required element of a METS document. This map analyzes the structure of the digital content represented into a hierarchically arranged sequence of divisions (div elements). In the case of the simple example in Figure 1, the structural map simply analyzes the dictation into a sequence of two physical pages. The structural map goes on to link these pages to the files in the file section that manifest their content. Since only one manifestation of each page is available, each page division points (by means of a file pointer or fptr element) to a single file. The file section, then, mainly provides an inventory of the files that comprise the content of the digital object represented. Each file element points to the external content file it represents via a URI. The optional descriptive metadata (dmdSec) and administrative metadata (amdSec) sections of a METS document record the descriptive and administrative metadata pertinent to the digital content. METS does not itself define a descriptive or administrative metadata element set, and the example records these metadata using elements defined in external standards maintained by the Library of Congress: MODS and MIX. The root division of the structural map points to the single descriptive metadata section present in the METS document, since these descriptive metadata pertain to all the content represented by the structural map. Each file in the file section points to the section of technical administrative metadata that pertains to it.

[Figure 1: Simple METS encoding]

Simple IMS-CP version
The simple IMS-CP version of the METS document could be a product of a simple transformation, as specified by an XSLT stylesheet (see Figure 2). This article will discuss such a possible transformation stylesheet below. Here the focus is on the high-level details of the content package itself. Below are the top-level elements which can comprise an IMS-CP document and are demonstrated by the transformed METS document:
. manifest
  . organizations
    . organization
  . resources
    . resource

The IMS-CP organizations element can contain different possible organizations of the content, one of which must be the default and each of which is represented by an organization element. The IMS-CP organization element corresponds to the METS structural map (which is also repeatable). Like the METS structural map, the IMS-CP organization element analyzes the structure of the learning package into a hierarchically-arranged sequence


identifierref attribute.) The resources section, then, provides an inventory of the resources that together comprise the content of the learning package and corresponds to the METS file section.

of divisions (items) – here representing the two pages of the manuscript dictation. Each of these items points to the resource in the resources section that manifests the content of the page it represents. (It does this by means of an 72
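For comparison, a correspondingly minimal imsmanifest sketch is given below. Again, this is an invented illustration rather than the published Figure 2, and the identifiers and file names are placeholders:

<manifest identifier="MANIFEST1"
          xmlns="http://www.imsglobal.org/xsd/imscp_v1p1">
  <organizations default="ORG1">
    <!-- One organization, corresponding to the METS structural map -->
    <organization identifier="ORG1">
      <title>Manuscript dictation</title>
      <item identifier="ITEM1" identifierref="RES1">
        <title>Page 1</title>
      </item>
      <item identifier="ITEM2" identifierref="RES2">
        <title>Page 2</title>
      </item>
    </organization>
  </organizations>
  <!-- Inventory of resources, corresponding to the METS file section -->
  <resources>
    <resource identifier="RES1" type="webcontent" href="page1.jpg">
      <file href="page1.jpg"/>
    </resource>
    <resource identifier="RES2" type="webcontent" href="page2.jpg">
      <file href="page2.jpg"/>
    </resource>
  </resources>
</manifest>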

Figure 2 Simple IMS-CP version

Like a METS document, a learning package can also present pertinent descriptive and administrative metadata. IMS-CP, however, does not segregate such metadata from the structure and content. In the sample document, therefore, the descriptive metadata pertaining to the manuscript appear directly within the organization element that analyzes its structure, and the technical metadata about each resource appear directly within the resource element to which the metadata pertain. Like METS, IMS-CP does not itself specify descriptive and administrative metadata elements, and an IMS-CP document must use elements defined in external standards to encode such metadata. The sample IMS-CP document uses the same standards as the METS sample: MODS and MIX. In the real world, as this article discusses later, an IMS-CP document would be more likely to use the IMS Metadata (IMS-MD) specification[1] or the very closely-related IEEE Learning Objects Metadata (LOM) standard[9].

High level comparison
As the simple METS and IMS-CP examples given in Figures 1 and 2 demonstrate, there is a great similarity between the two standards at the highest level. Both provide for:
. inventorying the resources or files that make up the content of a digital object;
. specifying how the resources all fit together into a coherent, hierarchically structured whole; and
. expressing descriptive and administrative metadata pertaining to the content.

Both METS and IMS-CP allow the content files comprising a digital object to be assembled in more than one way, thereby providing the user with different possible approaches to and experiences of the same content. Neither defines nor prescribes a particular vocabulary for describing the content of the digital object, nor for expressing associated administrative metadata such as information about how the content files were produced, or what rights restrictions pertain to their use. Both allow external standards to be used for these purposes.

Differences in provisions for content
Despite the high-level similarity between the two standards, certain conceptual and practical differences are likely to distinguish documents implementing them. These differences can pose challenges for absorbing METS objects representing library resources into an IMS-CP learning package. As demonstrated by the simple example, a METS document can enumerate the files comprising the digital version of library material, and express the structural relationship between these files. The METS document could also go on to specify how these content files should be presented to the end user – the library patron – by identifying an appropriate presentation program and specifying how this presentation program should be invoked. If the METS document did specify a presentation, however, this presentation would be clearly separated from the digital content. (It would be specified in the behavior section, not covered in the simple example.) As applied by many libraries, METS documents representing digital versions of primary holdings will probably not address presentation at all; rather, these METS documents will simply point to the raw digital content, and provide structural clues on which any number of presentations of this content might be based. While presentation of the content to library patrons is, of course, very important, different uses, audiences, times and available technologies are likely to require different presentations. A library is likely to be less concerned with using METS to pin down a particular presentation of digital library materials than with providing for long-term preservation of the content itself, and with ensuring its ability to communicate the structure and content of these materials to other times, places and institutions. While a Web-based presentation of the content represented by a library METS document is very likely, the METS document may do no more than provide the structural clues on which such a presentation could be based. In the library at the University of California at Berkeley, for example, a Java servlet-based program called GenView disseminates the content of library METS objects to end users; but the METS documents themselves make no reference to GenView, and they make no attempt to make their content files "GenView-ready" (by wrapping the content in HTML, for example). And GenView is just one of many potential programs that might disseminate the content of a UC Berkeley Library METS document.

Like METS, the IMS-CP standard does not dictate a particular presentation of content; nonetheless, it is common for IMS-CP documents to organize their content files into "Webcontent" resources. Such resources presuppose browser presentation and, in addition to raw content files (such as image, audio and/or video files), are likely to include an HTML file that controls the presentation of the content files in the user's browser. The HTML file is in turn likely to invoke auxiliary program files, with appropriate parameters, to assist with the presentation of the content. When an IMS-CP document organizes "Webcontent", it in fact does not distinguish strongly between presentation and content, and it considers the HTML file that manages the presentation an integral part of the content. Absorbing a METS document into a learning object representing "Webcontent" and presupposing browser presentation may therefore involve some challenges. Where the content files specified in the METS document are already Web browser-compatible (as would be the case with JPEG and GIF images), there is no problem: these can simply become "Webcontent" resources in the learning package. Where the content files are audio or video files, which the browser cannot present natively, however, the METS content must somehow be made browser-ready before it can be absorbed into the learning package. If the METS document analyzes its content into a structure that references just parts of files (a timed segment of an audio file, for example), or if the METS document makes available multiple versions of the same content (images at different resolutions, for example), then accommodating the METS content to IMS-CP "Webcontent" becomes more complex still.

Differences in provisions for associated metadata
Neither METS nor IMS-CP itself explicitly prescribes a standard vocabulary for expressing descriptive and administrative metadata. As noted above, both allow these metadata to be expressed using external standards. The communities that use METS and IMS-CP are, however, likely to favor certain standards for certain purposes. The learning community, for example, strongly favors the use of IMS-MD (or LOM) in IMS-CP documents. The favored auxiliary metadata schemata in METS are not always so clear-cut. Libraries using METS are likely to favor MODS, Dublin Core (DC) or even MARCXML for describing content; museum communities will probably favor the Visual Resources Association (VRA) standard (when an XML version of this standard becomes available). Most METS users will probably favor MIX, based on the NISO data dictionary (Z39.87), for capturing technical metadata about images. The choice for other administrative metadata categories, such as rights, is not so clear. When absorbing a METS object into an IMS-CP package, parsing out, say, MODS and MIX metadata into IMS-MD categories and elements would not be straightforward. For one thing, METS and IMS-MD tend to categorize descriptive and administrative metadata in ways that are not precisely aligned. Furthermore, IMS-MD tends to be descriptive-metadata-poor from a library cataloging practice standpoint, and technical-metadata-poor from a preservation standpoint. On the other hand, descriptive and administrative metadata in a library METS object are likely to lack many metadata elements pertinent to courseware and learning environments. While a translation from MODS and MIX to IMS-MD is likely to incur fairly substantial losses, it is not clear that such losses would be especially significant from a practical standpoint, insofar as the transformed METS object would not be a replacement for its original; rather, it would simply be a surrogate "rearranged" to perform in another – learning – environment. The losses would only be important if they affected the performance of the transformed METS object within this environment. The inability of the library METS object to provide certain metadata pertinent to learning environments might be the more serious deficiency. Given sufficient resources and expertise, IMS-MD metadata could of course be included in library METS objects in anticipation of their being absorbed into electronic course materials. Both IMS-CP and IMS-MD accommodate descriptive and administrative metadata elements from other standards, so any elements appearing in a source METS document for which IMS-MD provided no analogue could still be represented in an IMS-CP object, even within the context of IMS-MD metadata. Still, the extent to which advantage can be taken of such open policies is probably limited. Such policies can effectively dismantle the very standard that provides for them, and make it more difficult to manage the objects implementing the standards in their target environments.

A model XSLT-based crosswalk from METS to IMS-CP
Figure 3 presents a crosswalk from METS to IMS-CP written in XSLT[10]. A traditional crosswalk is "a table that maps the relationships and equivalences between two or more metadata formats"[11]. By expressing the crosswalk mapping as XSLT, one can use a server- or client-side XSLT engine to transform a METS document into a corresponding IMS-CP object. The crosswalk presented in Figure 3 is a preliminary one, aimed at transforming relatively simple METS documents such as that listed in Figure 1. That is, the XSLT captures the high-level correspondences between METS and IMS-CP, mapping the respective XML elements that comprise the inventory of files and their hierarchical relationships. If one uses the XSLT engine hosted by the World Wide Web Consortium (W3C) to transform the simple METS document (Figure 1), one will notice that the resultant IMS-CP document[12] is not the same as the IMS-CP document in Figure 2. The crosswalk presented here has some key limitations, the most important of which is that it makes no attempt to transform any of the metadata of the METS document. The IMS-CP document presented in Figure 2, on the other hand, was written to include descriptive and administrative metadata elements so as to illustrate the placement of these metadata elements in the IMS-CP structure. Again, the XSLT is presented in this article mostly as a proof of concept that XSLT is a convenient, expressive and powerful way to formalize a crosswalk from METS to IMS-CP.

Figure 3 METS to IMS-CP crosswalk
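As a rough indication of the shape such a stylesheet can take, the sketch below maps structMap to organization, div to item, and file to resource. It is a minimal illustration written for this summary, not the stylesheet of Figure 3; it assumes a single structural map and uses generated item identifiers:

<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:mets="http://www.loc.gov/METS/"
    xmlns:xlink="http://www.w3.org/1999/xlink"
    xmlns="http://www.imsglobal.org/xsd/imscp_v1p1">

  <!-- The METS root becomes an IMS-CP manifest -->
  <xsl:template match="/mets:mets">
    <manifest identifier="MANIFEST1">
      <organizations default="ORG1">
        <xsl:apply-templates select="mets:structMap"/>
      </organizations>
      <resources>
        <xsl:apply-templates select="mets:fileSec//mets:file"/>
      </resources>
    </manifest>
  </xsl:template>

  <!-- The structural map becomes an organization -->
  <xsl:template match="mets:structMap">
    <organization identifier="ORG1">
      <xsl:apply-templates select="mets:div"/>
    </organization>
  </xsl:template>

  <!-- Each division becomes an item; a file pointer yields an identifierref -->
  <xsl:template match="mets:div">
    <item identifier="{generate-id()}">
      <xsl:if test="mets:fptr">
        <xsl:attribute name="identifierref">
          <xsl:value-of select="mets:fptr/@FILEID"/>
        </xsl:attribute>
      </xsl:if>
      <xsl:apply-templates select="mets:div"/>
    </item>
  </xsl:template>

  <!-- Each file becomes a webcontent resource -->
  <xsl:template match="mets:file">
    <resource identifier="{@ID}" type="webcontent" href="{mets:FLocat/@xlink:href}">
      <file href="{mets:FLocat/@xlink:href}"/>
    </resource>
  </xsl:template>

</xsl:stylesheet>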

Future work
Refining crosswalks through analysis of specifications
The crosswalk between METS and IMS-CP is clearly at an early stage of development. The close comparison between METS and IMS-CP presented in this article suggests many needed refinements. Including descriptive and administrative metadata translations in the crosswalk would be one such refinement. A basic approach would be to copy the metadata directly from a METS document into the IMS-CP package, given that IMS-CP permits the incorporation of arbitrary XML-based metadata[13]. IMS-MD allows for extensions, resulting in yet another place to incorporate arbitrary XML-based metadata[14]. On the other hand, if one wants to express metadata in terms of core IMS-MD, the preferred metadata schema for IMS-CP, one is faced with the challenge of translating the arbitrary metadata schemata permissible in METS to IMS-MD. A crosswalk-writer might be saved from such a grim scenario by choosing to limit himself to the popular metadata specifications found in METS documents. Which metadata specifications are popular remains to be determined by the patterns of adoption of METS, which is in its early stages.

Crosswalks from IMS-CP to METS can be written to enable the migration of digital content in the opposite direction, for example, the archiving (preservation) of learning materials by a library or the hosting of learning materials through library-hosted digital repositories. The detailed comparison between METS and IMS-CP in this article should help jump-start such an effort. Tackling an IMS-CP to METS crosswalk, however, does bring up questions that might require input beyond analyzing the specifications per se. Does it make sense to attempt to translate the elements of an IMS-CP document into METS, or would it be suitable just to wrap the IMS-CP document in a METS document, and possibly expose some of the descriptive metadata at the top level of the METS wrapper? Is it possible to round-trip documents between the two specifications? If not, is modifying the specifications to allow round-tripping a worthwhile goal for which to aim?

Refinements that await practical experience in the community
Effective crosswalks able to translate real-world METS documents into viable IMS-CP objects need to account not only for the formal structures of METS and IMS-CP, but also for how METS and IMS-CP are actually used in practice. Although documentation about the two schemata is readily available, the authors have had access to only a limited range of real-world METS and IMS-CP documents. It is too early to expect to find a comprehensive library of representative documents, since the adoption cycles for both METS and IMS-CP are still in their early stages. Consequently, the full range of scenarios in which METS documents can be sensibly incorporated into IMS-CP objects remains unexplored. Both specifications are broad and flexible, able to encompass a wide range of uses. Indeed, both communities have applied the notion of application profiles to formalize specific applications of the specifications[15]. The METS community is currently working on a standard to govern METS profiles, and intends to establish a registry of profiles[16]. The Centre for Educational Technology Interoperability Standards (CETIS) has been sponsoring "codebashes" in which representative IMS-CP packages are being accumulated[17]. The Canadian Core Learning Resource Metadata Application Profile (CanCore)[18] and the Sharable Content Object Reference Model (SCORM)[19] are application profiles of IEEE LOM and of the IMS-CP/IMS-MD specifications, respectively. The development of profiles for METS and IMS-CP will make the task of writing crosswalks more focused and tractable. A crosswalk that translates between two given profiles of METS and IMS-CP does not need to handle the full range of possibilities that arbitrary METS and IMS-CP documents can encompass. Since no formal profiles for METS yet exist, the crosswalk presented in this article is based not on a profile of METS, but on a somewhat arbitrary narrowing of METS that highlights core features of the specification. As profiles develop, this crosswalk can be elaborated to handle them. Selecting profiles between which to translate will be greatly aided by discussion with members of the library and educational technology communities about promising profiles and the nuances of how various concepts in the communities can or cannot be translated.

With or without formal profiles, viewers, editors and other tools that actually work with the XML specifications make writing crosswalks much easier than having the specifications alone. Complex specifications such as METS and IMS-CP are abstractions, full of ambiguities. Working tools embody a set of practices around a specification, even if such practices are not formally documented. Being able to validate documents against the METS and IMS-CP schemas is an important first step; nonetheless, not being able to play a valid document in a viewer is roughly akin to writing grammatically correct English sentences without knowing whether they represent anything a native speaker would actually say. The GenView METS/Making of America II (MOA2) viewer implemented by the UC Berkeley Library Systems Office, which can be fed any Web-based METS document, was a tremendous aid to the authors in understanding and communicating the nature of the METS specification in theory and practice. Although there are open source tools for reading and editing IMS-CP documents (e.g. Reload[20]), the authors do not know of any publicly available Web-based viewer for IMS-CP. This deficiency hinders IMS-CP adoption and development. With a publicly available Web-based IMS-CP viewer, one could quickly learn about IMS-CP by creating a trial IMS-CP document and feeding it to the viewer. Writers of crosswalks could both present the XSLT crosswalk and send the product of the translation to such a viewer, thus demonstrating the crosswalk in action. A note of caution is in order: although working applications are invaluable, one must not read too much into how a specification is interpreted by any particular application, since that tool might be demonstrating an incomplete or incorrect interpretation of the specifications, or one particular interpretation when many others are legitimate.

Crosswalks in the context of continued dialogue between the library and educational technology communities
Ideally, this article will provide some encouragement and guidance to the developers of interoperability specifications in the library and educational technology communities to harmonize related specifications where possible, thus reducing the need for crosswalks in the first place. One primary goal of this article is to develop awareness in the two communities of the existence and development of METS and IMS-CP – activities that have gone on largely in parallel – and to offer tentative bridges between the two specifications through the mechanism of a crosswalk. One possible solution to the problem of METS and IMS-CP interoperability would be to abandon one specification in favor of the other, or to fuse the two into some greater specification that subsumes both. Pursuing such a strategy, however, is probably at once unrealistic and inappropriate. It is unrealistic because there is already momentum behind the two specifications in their respective communities. It is inappropriate because each specification reflects the needs of its own community. It is not obvious from comparing the two specifications that one can subsume the other. Although the library community and the educational technology communities might be expected to have major similarities, there is little reason to believe that all their needs can be met in one specification. There is still the possibility that METS and IMS-CP could be unified into a super-standard that fully encompassed the two specifications, along with a super-metadata specification that could subsume MODS, MIX, LOM, and all other significant metadata specifications. Even if such a synthesis were possible, would the complexity of such a super-specification be so great that it would never be adopted? (Norm Friesen provides a very helpful discussion of the issues raised here[21].)

Semantic interoperability in the large: moving beyond only METS and IMS-CP
It is not hard to imagine that library users would like to draw on multiple sources of digital content in seamless, integrated ways, regardless of underlying protocols and data/metadata-encoding schemes. Creating the full spectrum of interoperability required for such functionality remains an extremely challenging and multifaceted research problem. Among the various aspects of interoperability is the problem of semantic interoperability, "integrating resources that were developed using different vocabularies and different perspectives on the data" (Heflin and Hendler, 2000).

Writing crosswalks between METS and IMS-CP is a pragmatic approach to enhancing semantic interoperability among libraries and educational technology. Although this article has focused solely on interoperability between the two XML specifications of METS and IMS-CP, the authors are mindful that there are many other specifications used to encode content that many users will want translated into IMS-CP or METS. By focusing on a small number of XML-based interoperability specifications that are of importance in the various domains, this article has been able to propose direct crosswalks between the specifications, and thereby avoid the need for an abstract scheme that encompasses a myriad of XML specifications. Such an approach, obviously, does not scale as the number of XML specifications increases. It is important, therefore, also to consider other, more generically focused interoperability efforts in the library and educational technology communities. Godby and her group at OCLC have been working on a metadata schema translation service to apply crosswalks (such as the one produced in this article) to input documents via Web services (Godby et al., 2003). Semantic interoperability of metadata and information in unlike environments (SIMILE) is a recently funded, large-scale collaboration among the MIT Library (DSpace), the MIT Computer Science department, Hewlett Packard, and the W3C to leverage the connection among libraries, the semantic Web, and personal information management[22]. A practical question for writers of crosswalks is whether the crosswalks can be written in such a way as to ease the automated extraction of the knowledge implicit in the XSLT for projects such as SIMILE.

As users become interested in extracting sections, rather than the whole, of METS and other XML documents, the question of how crosswalks can be modified to enable the disaggregating and recombining of content from multiple documents will arise. Although this issue has not yet hit the library and educational technology communities, one can get a glimpse of how it may play out by studying the debates around Rich Site Summary or RDF Site Summary (RSS) and related formats among Weblogging technologists. RSS is a family of XML formats used to syndicate content ranging from news items and Weblog entries to "pretty much anything that can be broken down into discrete items"[23]. RSS has also been a quickly evolving testbed (and lightning rod) for a number of questions: Is it better to extend RSS via the addition of core elements or by modules? How important is the readability of the format to adoption and creative use of the format[24]? Are the benefits of casting RSS (or its successor) in the framework of the Resource Description Framework (RDF, part of the semantic Web stack) worth the increased abstraction[25]? Are the demands of "episodic Web sites" such as Weblogs better served by the creation of a new specification or by working with RSS[26]? What are the implications of RSS for the syndication of content already expressed as METS and IMS-CP? These questions merit further investigation.

Conclusions
Work to translate between METS and IMS-CP is at an early stage. This article illustrates how a simple METS document can be translated into an IMS-CP version via an XSLT crosswalk. Although there is a great similarity between the two standards at the highest level, certain conceptual and practical differences distinguish documents implementing the two standards. The close comparison between METS and IMS-CP presented in this article suggests many needed refinements to the crosswalk. Including descriptive and administrative metadata translations in the crosswalk would be one such refinement. The development of profiles for METS and IMS-CP will make the task of writing crosswalks more focused and tractable. Ultimately, there will be a need to synthesize such crosswalks with efforts to create semantic interoperability among the full range of XML specifications in use in the library and educational communities.

References
Godby, J., Smith, D. and Childress, E. (2003), "Two paths to interoperable metadata", Proceedings of DC-2003, Dublin Core Metadata Initiative, Seattle, WA.
Heflin, J. and Hendler, J. (2000), "Semantic interoperability on the Web", Proceedings of Extreme Markup Languages 2000, Graphic Communications Association, Alexandria, VA, pp. 111-20.

Notes
1 Available at: www.imsglobal.org/DLims_white_paper_publicdraft_1.pdf
2 Available at: www.loc.gov/standards/
3 Available at: www.diglib.org/
4 Available at: www.imsglobal.org/content/packaging/cpv1p1p3/imscp_infov1p1p3.html#1519896
5 Available at: www.loc.gov/standards/mods/
6 Available at: www.loc.gov/standards/mix/
7 Available at: http://iu.berkeley.edu/crosswalk/Filer/filetree/libraryhitech/figure1.xml and http://iu.berkeley.edu/crosswalk/Filer/filetree/libraryhitech/figure2.xml
8 Available at: http://sunsite.berkeley.edu/xdlib/servlet/archobj?DOCCHOICE=metstest%2FSimpleMETS.xml
9 Available at: http://ltsc.ieee.org/wg12/
10 Available at: http://iu.berkeley.edu/crosswalk/Filer/filetree/libraryhitech/figure3.xsl
11 Available at: http://library.csun.edu/mwoodley/dublincoreglossary.html
12 Available at: www.w3.org/2000/06/webdata/xslt?xslfile=http%3A%2F%2Fiu.berkeley.edu%2Fcrosswalk%2FFiler%2Ffiletree%2Flibraryhitech%2Ffigure3.xsl&xmlfile=http%3A%2F%2Fiu.berkeley.edu%2Fcrosswalk%2FFiler%2Ffiletree%2Flibraryhitech%2Ffigure1.xml&transform=Submit
13 Available at: www.imsproject.org/content/packaging/cpv1p1p3/imscp_bindv1p1p3.html#1520279
14 Available at: www.imsproject.org/metadata/imsmdv1p2p1/imsmd_bindv1p2p1.html#1184108
15 Available at: www.cancore.ca/faq.html#application%20profile
16 Available at: www.loc.gov/standards/mets/profile_docs/mets.profile.v1-0.xsd and www.loc.gov/standards/mets/profile_docs/METS.profile.requirements.rtf
17 Available at: www.cetis.ac.uk/content/20030702184848
18 Available at: www.cancore.ca/
19 Available at: www.adlnet.org/
20 Available at: www.reload.ac.uk/
21 Available at: www.cancore.ca/documents/semantic.html
22 Available at: http://web.mit.edu/simile/www/
23 Available at: www.xml.com/lpt/a/2002/12/18/dive-into-xml.html
24 Available at: www.kottke.org/03/08/030806refined_rss_.html
25 Available at: www.xml.com/pub/a/2003/08/20/dive.html and www.xml.com/pub/a/2003/07/30/practicalRDF.html
26 Available at: www.intertwingly.net/wiki/pie/Motivation and http://blogs.law.harvard.edu/tech/rss

An introduction to the Metadata Object Description Schema (MODS)
Sally H. McCallum

The author
Sally H. McCallum is Chief of the Network Development and MARC Standards Office, Library of Congress, Washington, DC, USA.

Keywords
Online cataloguing, Extensible markup language

Abstract
This paper provides an introduction to the Metadata Object Description Schema (MODS), a MARC 21 compatible XML schema for descriptive metadata. It explains the requirements that the schema targets and the special features that differentiate it from MARC, such as user-oriented tags, regrouped data elements, linking, recursion, and accommodations for electronic resources.

Received 17 September 2003; revised 15 October 2003; accepted 7 November 2003. © Sally H. McCallum.

Publications, even print but especially electronic, are growing in number, and catalogers must have viable options for speeding up the process of bibliographic control. The Metadata Object Description Schema (MODS), a MARC 21 companion, is one of those options. MODS is intended to address the need for a MARC 21 derivative that:
. takes advantage of the XML environment;
. gives special support to cataloging electronic resources;
. is less detailed; and
. most importantly, is highly compatible with MARC 21.

MODS fits together with other XML schemas and tools that are available through the MARC Web site[1]. These include MARC 21 in XML (MARCXML), which is completely round-trip compatible with MARC 21 in its ISO 2709 structure, i.e. the structure currently used for the format, along with transformations between the two (Corey Keith reports on MARCXML in this issue). Through MARCXML, transformations among MODS, ONIX and Dublin Core are also provided at the site, in addition to several tools. An initial version of MODS was developed by the MARC 21 maintenance agency at the Library of Congress with a group of MARC 21 users. The draft (Version 1.0) was made widely available for review and trial use for six months in 2002, and a new version that incorporated the changes suggested during review was made available in February 2003 (Version 2.0). The MODS listserv has provided an essential avenue for user input and discussion of solutions. Based on comments received during the period February-August 2003, Version 3.0 was issued for review and comment in September 2003. The schema and accompanying documentation are available from the MODS Web site[2]. Various transformations to and from MARC 21 and Dublin Core (DC) are also available from that site.

In this discussion, some key terms used by librarians will frequently be adapted to the new terminology of the environment in which MODS functions. A format will usually be called a "schema", because that is the word used to describe the specification for a tag set and its rules for use in an XML setting. Cataloging data will sometimes be called descriptive metadata, to relate it to and distinguish it from other types of "data about data", such as technical, administrative, and rights metadata, that play a prominent role with digital resources. Using this terminology, MODS is an XML schema for descriptive metadata.

Crafting a schema in XML presents a developer with many different options. The basic components of XML are elements and attributes associated with the elements, all of which behave according to a set of rules for XML established by the World Wide Web Consortium (W3C)[3]. XML is both the markup used for MODS records and also a formal language for specifying the MODS schema. In the following discussion and examples of MODS markup these conventions will be used: MODS tag names will be enclosed in angle brackets (< >); attribute names and values will be in italic; data content elements will be in bold. Any of the above may sometimes be underlined to highlight them. While the angle brackets for tags are XML specifications, the italic, bold, and underlining are not.

Basic requirements for MODS
MODS was designed with the following basic requirements as guidelines.

XML environment
One fundamental requirement was that MODS be developed for eXtensible Markup Language (XML) environments. The library community is both looking for and accustomed to long-term solutions. Library automation began with the start of the commercial computer era, in the 1960s, and libraries have built a robust data interchange and multisystem vendor environment. This has served libraries' advanced requirements (variable-length data fields when the norm was fixed-length, and extensive character encodings long before Unicode), but it required specialized software and hardware to accommodate those requirements.

Although XML may appear to be a sudden development because of its rapid take-up, it is grounded in work of at least 20 years, starting with the ISO standard for the Standard Generalized Markup Language (SGML). SGML became familiar to many through the SGML-conforming tag set HTML, the primary markup used for the early Web. XML followed SGML and HTML, drawing from experience with both, and was made more flexible yet consistent and predictable. It is maturing and appears to have good staying power. Thus it is a "methodology" that the library environment can usefully move into for new long-term solutions. Unicode is another positive aspect of the XML environment that MODS takes advantage of. XML accommodates variable-length data, explicit data tagging to multiple levels, hierarchical structure (even better than MARC), and "all possible" characters (through Unicode). It is likely that new-generation systems from the computer manufacturers will incorporate aspects of XML where useful and appropriate. Open source software is being built around the XML data structure, which makes manipulation of XML data easier (and cheaper), and enables what was formerly difficult. With XML, libraries would not always have to construct their own application tools. Therefore there was no question about using XML for MODS, just the options to be chosen for constructing the MODS schema.

Among the new XML tools of special interest to the library community are several protocols and broader metadata carriers. The search and retrieval protocol Z39.50, which is widely employed for search interoperability between dissimilar systems, can return several types of records. Although it is largely used for MARC 21 and other non-XML record types, MODS-formatted records would be an option. However, the Search/Retrieve Web Service (SRW), the "next generation" Z39.50, built on the Z39.50 foundation, is an XML protocol, and XML-formatted records are the natural record types for SRW[4]. The harvesting protocol of the Open Archives Initiative (OAI)[5] also transports metadata records and, being an XML protocol, XML records are the best fit. The records that the Library of Congress currently makes available for OAI harvesting may be obtained in any of three XML formats: MODS, MARCXML, and DC.

An important use for MODS will be for the descriptive data in Metadata Encoding and Transmission Standard (METS) packages[6]. METS is an XML framework schema that carries all types of digital object metadata – structural, administrative, technical, and rights, in addition to descriptive. The metadata itself may be part of the METS package or external, and XML is preferred. Either MODS or MARCXML records may be included, but one attractive possibility is for the METS package to contain the (sometimes) briefer MODS record and perhaps point to the fuller MARC 21 or MARCXML record. An additional reason for using XML is that electronic resources themselves are often XML, XML-compatible, or specially XML-receptive, so that metadata intended to accompany a resource is useful if encoded according to a recognized XML schema. While librarians would perhaps prefer a MARCXML record embedded in the header of a resource, MODS may be a more reasonable alternative, being simpler yet MARC-compatible.

Highly compatible with MARC 21
MARC 21 has been a long-term solution for the library community for a descriptive metadata format, and as a result there are over a billion MARC 21 records in networks and local systems worldwide. There are thousands of MARC-based installed systems that carry out complex operations from end-user services to a library's processing needs. There are also thousands of MARC-trained librarians who have responsibility for the organization of library resources. The community must therefore provide continuity and connectivity between the existing MARC 21 records and any new XML record. The two must be highly integrable, so that existing systems can use the new records and bibliographic specialists can construct and understand them. The difference in record structure (ISO 2709 vs XML) is not critical, but the compatibility of the semantics of the record content with the records already created is. MODS was therefore designed as a MARC 21 derivative, with clear recognition of the importance of supporting data constructed by the content rules currently used. For MODS it was also recognized that, while the primary target is compatibility with MARC data, MODS records should be able to derive data both from Dublin Core records, which are extremely simple, and from publishers' ONIX records, which are more complex and less compatible with library data.

Special accommodation of digital resources
Based on experience from several digital projects, there is a requirement for some special accommodations for electronic resources in MODS. This type of material currently presents the greatest challenge to librarians. The volume is enormous and growing, traditional selection procedures are not sufficient for selecting for inclusion in collections, "adding to a collection" is not even a well-understood concept for this type of material, and preservation is still an unknown. There is an added complexity in that the material is easily changed, so that even electronic versions of scholarly journals can introduce changes and corrections, and versioning for Web documents is especially daunting. All these issues pointed to special accommodation, even from the descriptive metadata viewpoint.

Simplicity
With the growth of print and especially electronic resources, all descriptive metadata will not be equal. There will be resources that should have full description using the detail of a MARC record; others need simpler, perhaps even temporary, cataloging that can be partially derived from the resources themselves or from other records such as ONIX or DC. As noted, all must be highly integrable with MARC 21 records. That is a target that MODS has tried to hit by specifying only core data and essential tagging. The top level of data elements in MODS has been kept modest, with 19 currently defined, although many of them have several subelements. A listing of those top elements nonetheless indicates the completeness of coverage.

Using the descriptive metadata categories defined in the Functional Requirements for Bibliographic Records (FRBR) (IFLA Study Group on the Functional Requirements for Bibliographic Records, 1998) (hierarchical from the top down: work, expression, manifestation, and item), the top-level MODS elements are:

Work: , , , , ,
Expression: , , ,
Manifestation: , , , , ,
Item: ,

The MODS elements <extension> and <recordInfo> are outside the basic FRBR framework. An accompanying concern was that MODS be simple enough for original description of a resource by non-professionals, with adequate guidance. This meant that, in addition to the relative simplicity of the schema, an accompanying document with use guidelines was needed (the creation and use of that document and experience with MODS record generation are described in the paper by Rebecca Guenther in this issue).
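To give a sense of how these top-level elements combine in practice, the following fragment sketches a brief record. It is invented for this summary; the bibliographic data are placeholders, and only a few of the 19 top-level elements are shown:

<mods xmlns="http://www.loc.gov/mods/v3" version="3.0">
  <titleInfo>
    <title>Sample title</title>
  </titleInfo>
  <name type="personal">
    <namePart>Smith, Jane</namePart>
  </name>
  <typeOfResource>text</typeOfResource>
  <originInfo>
    <dateIssued>2003</dateIssued>
  </originInfo>
  <language>
    <languageTerm type="code" authority="iso639-2b">eng</languageTerm>
  </language>
  <identifier type="uri">http://www.example.org/sample</identifier>
</mods>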

Features of MODS
The following features of MODS illustrate how the above requirements were fulfilled.

User oriented tags
For MODS the decision was made to use tagging that is more user-friendly, in the sense that the tags can be read and generally understood by the uninitiated. While there is efficiency in brief tagging such as that used by MARC (three-digit tags), the meaning of such tags must be learned or interpreted. The MODS tags are English-language and few abbreviations are used. Examples are <title> for the MARC 21 245 subfield a, <genre> for 008 genre information, and <publisher> for 260 subfield b. The following piece from a MODS record shows the top-level <titleInfo> tag and its subtags:

<titleInfo>
<title>Sound and fury :</title>
<subTitle>the making of the punditocracy</subTitle>
</titleInfo>

Thought has been given to writing transformations that would convert the tags to and from other languages. For this to work, the tags would have to be standard in the other languages, and a strict one-to-one correspondence would be necessary, as the tags are critical for efficient information sharing. Part of the intent with the use of word tags was to enable creation of records with less training on aspects of the format and content designation. With the great volume of electronic resources, technicians will be employed for minimal-level and initial bibliographic descriptions. In the university setting these are often university student assistants who come and go with the school years, and a schema with easy-to-understand tags is important.

Regrouped data elements
Some data elements that appear in various fields in MARC have been brought together in different ways in MODS. Placement in MARC sometimes reflects post-coordinated growth of the format for the various types of material, differences in cataloging rules over time, and data relationships as perceived at a certain time. While retaining basic correspondence to MARC, MODS has in some instances brought similarly coded values or data together, or separated data that is combined in MARC. For example, MARC fields 440, 490, 534, 700-711 (when they contain subfield t), 730-740 (when indicator 2 has value 2), 760-787, and 800-830 all contain information about an item related to the resource being described. They come together under repeated occurrences of the MODS <relatedItem> tag, differentiated with tag attribute values that indicate the relationship. The following example shows two related items for an item: the series and a component part.

An introduction to the Metadata Object Description Schema (MODS)

Library Hi Tech Volume 22 . Number 1 . 2004 . 82-88

Sally H. McCallum

Another place where this occurred is with genre, which is primarily in coded data fields in MARC 21. MARC information in various 007/01, 008/21, 008/24, 008/25, 008/26, 008/26, 008/30-31, and 008/33 bytes and in the 665 field are combined under one tag, .

Tutto in pianto il cor struggete; arr.1984.

Joseph I, Holy Roman Emperor, 1678-1711



Fewer coded values For data elements that have widely established coded values, MODS accommodates specification of the code, but also provides for the text form of the element. An example is under above, where both nyu and Ithaca, NY are recorded. In other cases where there are not universal lists, word value lists have been established. In the example above, under the value print is used. This comes from a list established in coordination with the MARC codes values for the form of an item, as found in 008/23 and 008/29.

Another interesting area where like information was usefully brought together was the MODS . Under this tag is place of publication information from MARC 008, 044, and 260 subfield a; publisher information from MARC 260 subfield b; date information from MARC 260 subfield c, 033, and 008; edition information from MARC 250; and issuance/frequency information from the MARC Leader and fields 310 and 321:

Electronic resource data Since the schema was drafted with electronic resources as a primary target, several small but important elements were included that do not have good equivalents in MARC. Under , one is an indication of , with values born digital and formatted digital; and another is the , with values access, preservation, and replacement. Provision is made for the date that a Web site was captured as part of harvesting activities and also for the date of last access for remote Web resources. MODS also provides for differentiating between the URL-like strings used as identifiers, and those that are ‘‘actionable.’’ The former are in the tag, while the latter are under .

nyu Ithaca, N.Y

Cornell University Press

c1999 1999

monographic

The MODS tag brings together a number of pieces of related information about a resource. Here, form of item information in MARC 008 is combined with the data in fields 256, 300, and part of field 856. A physical description might look like the following:

Linking MODS takes advantage of the linking flexibility of the XML environment by providing for all of the top-level elements with attributes such as xlink and ID. The attribute xlink allows the encoding of a relevant link for the information, similar to links MARC allows in selected fields. MARC abstract (520) and table of contents (505) fields both have subfields for URLs that

print

1 score (12 p.) + 2 parts ; 31 cm.
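An invented fragment along these lines (the identifier and URL values are placeholders, and dateLastAccessed is the url attribute assumed here for recording the date of last access):

<physicalDescription>
  <digitalOrigin>reformatted digital</digitalOrigin>
  <reformattingQuality>preservation</reformattingQuality>
</physicalDescription>
<identifier type="uri">http://hdl.example.org/2003/item1</identifier>
<location>
  <url dateLastAccessed="2003-09-01">http://www.example.org/item1.html</url>
</location>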

Linking
MODS takes advantage of the linking flexibility of the XML environment by providing all of the top-level elements with attributes such as xlink and ID. The attribute xlink allows the encoding of a relevant link for the information, similar to links MARC allows in selected fields. MARC abstract (520) and table of contents (505) fields both have subfields for URLs that link to relevant content. The attribute ID allows for linking internally.
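For instance, a table of contents hosted elsewhere might be referenced in this way (an invented sketch; the URL is a placeholder and the xlink namespace is assumed to be declared on the record):

<tableOfContents xlink:href="http://www.example.org/contents.html"/>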

Recursion
MODS takes advantage of the hierarchical elegance of XML to define the structure for related items in a recursive manner. Each related item may be described with the same tags as the resource targeted by the record, although typically only a small subset is needed. The example above for a <relatedItem> shows <titleInfo> and <name> elements being used recursively under the top-level <relatedItem> element.

Special attributes
Some XML attributes occur in several MODS elements. Obvious ones give the authority for the data content, e.g. for classification, whether the class information is ddc, lcc, or some other classification authority. Another gives information on the encoding pattern for an element like a date. But the more interesting attributes are the following four:
(1) lang;
(2) xml:lang;
(3) script; and
(4) transliteration.

MODS defines them for all top-level elements. MARC does not provide for identification of these attributes at the field level, because there are issues associated with their use and problems with coding techniques. While these problems were recognized, the decision was made to include the attributes, to see if they could be useful in actual practice and whether some of the issues could perhaps be settled.

XML tags
Two basic XML structure-related decisions involved the tagging rules to be used in the schema: the mixing of content within a tag, and the use of tags alone or of tags and attributes.

Mixed content
In early drafts, a number of MODS elements were defined as "mixed content". This mixing of subelements with content is allowed if the schema declares that it is to be used. For example, a mixed-content approach was considered for the MODS <title> element, to indicate that a portion of a title is an initial article:

<title><nonSort>The</nonSort>world of learning</title>

However, XML developers are reluctant to use the mixed-content approach because of the possibility of ambiguity and the difficulty of referencing the content for processing, so MODS eventually decided against it. Thus in the above example, <title> was enveloped in <titleInfo> and the individual elements were each tagged:

<titleInfo>
<nonSort>The</nonSort>
<title>world of learning</title>
</titleInfo>

Attributes vs elements
A decision had to be made on whether to use attributes and elements in the schema, or just elements. There is the question of whether attributes complicate the processing of searches, displays, or transformations. No clear answer to that question has emerged, since there is no real evidence that one or the other approach is better. Searching depends on the query language, and the development of XML query languages is still not mature. Another issue with attributes relates to the conversion of a MODS record to a format suitable for transmission via a Web protocol such as SOAP. This is an important issue because there is some thought that an XML attribute cannot be encoded as a SOAP parameter. However, a MODS record would be transmitted as a string, not with parameters visible to SOAP. In the final analysis, with no concrete evidence against the use of attributes, MODS defines attributes wherever they are a useful construct. The general rule used is: if the information is "content", it is made an element; if the information is used to aid in the processing of content, it is treated as an attribute. There are also two cases, however, where information must be cast as an element and not as an attribute: (1) if the information is repeatable; and (2) if it is structured (since XML elements may have subelements but attributes may not).
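A small invented fragment illustrates the rule: the date itself is content and so appears as element text, while its encoding pattern merely aids processing and so appears as an attribute (the w3cdtf value names the W3C date format):

<originInfo>
  <dateIssued encoding="w3cdtf">2003-09-17</dateIssued>
</originInfo>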

Round trip transformation with MARC 21
Since MODS contains tagging for a subset of MARC 21, and also has a few elements not contained in MARC, round-trip conversion without loss may not be possible in some instances. When this is important for a process or application, users could limit elements to those with one-to-one correspondences. In many cases the loss is slight, but data can always be preserved by using local tagging in MARC or the <extension> element in MODS.

Conclusion
MODS is viewed as providing an evolutionary pathway forward for libraries. It attempts to take into account the rapid increase in electronic resources, the community's economically deep commitment to MARC data elements, the proliferation of formats and schemas beyond library community control, and the rapidly growing XML tool environment. MODS derives from MARC 21 while taking advantage of XML and the flexibility, tool development, and transformation options it offers.

Notes
1 The MARC 21 Web site contains format information, including the MARCXML tools and schemas. It links to the MODS site, and is available at: www.loc.gov/marc
2 The MODS Web site contains information on the development of the schema, user guidelines, downloadable transformations, and the schema itself, available at: www.loc.gov/mods
3 On the World Wide Web Consortium Web site, available at: www.w3.org/XML/
4 For information on Z39.50 and links to the SRW Web site, available at: http://lcweb.loc.gov/z3950/agency/
5 Available at: www.openarchives.org/
6 Available at: www.loc.gov/mets/

Reference
IFLA Study Group on the Functional Requirements for Bibliographic Records (1998), Functional Requirements for Bibliographic Records: Final Report, K.G. Saur, München (UBCIM Publications, New Series, Vol. 19), also available on IFLANET at: www.ifla.org/VII/s13/frbr/frbr.pdf or www.ifla.org/VII/s13/frbr/frbr.htm


Using the Metadata Object Description Schema (MODS) for resource description: guidelines and applications
Rebecca S. Guenther

The author
Rebecca Guenther is Senior Networking and Standards Specialist, Network Development and MARC Standards Office, Library of Congress, Washington, DC, USA.

Keywords
Online cataloguing, Standards, Archives

Abstract
This paper describes the Metadata Object Description Schema (MODS), its accompanying documentation and some of its applications. It reviews the MODS user guidelines provided by the Library of Congress and how they enable a user of the schema to apply MODS consistently as a metadata scheme. Because the schema itself could not fully document appropriate usage, the guidelines provide element definitions, history, relationships to other elements, usage conventions, and examples. Short descriptions of some MODS applications are given, along with a more detailed discussion of its use in the Library of Congress' Minerva project for Web archiving.

Received 12 September 2003; revised 13 October 2003; accepted 7 November 2003. © Rebecca Guenther.

The need for a rich XML-based descriptive metadata standard such as the Metadata Object Description Schema (MODS) has been expressed by members of the digital library and related communities as they attempt to implement projects involving search and retrieval, management of complex digital objects, integration of metadata from library databases with other non-MARC sources, and other functions. There are several applications with a metadata component for which it is desirable to have metadata richer than that offered by the Dublin Core Metadata Element Set, but less complex and more user-friendly than full MARC 21 with its numeric tags (Dublin Core Initiative, 2003). Sally McCallum discusses some of these in her article in this issue. Users of MODS benefit from the energetic and systematic support of the Library of Congress staff and others who have developed the MODS fields based on vast and broad experience. LC has supplied detailed documentation, examples of usage, and other supporting materials such as guidelines and local policy to facilitate the use and adoption of MODS. It will remain well supported, maintained and documented because of the commitment of the Library of Congress to standards development and the contributions of outside participants.

89

Using MODS for resource description: guidelines and applications

Library Hi Tech Volume 22 . Number 1 . 2004 . 89-98

Rebecca S. Guenther

all ISBD punctuation, and it may be retained in the data after conversion. For new MODS records, the principle should be followed that punctuation between elements is generated by a stylesheet if ISBD display is desired.

(in XML using element names). It became clear early on that the schema itself could not fully describe each element and guide users to its application. Hints in the schema are given using the documentation tag, which points the user to the appropriate equivalent MARC element. This, however, was not sufficient for applying MODS in many cases where there was not a clear mapping or more information on usage was required. LC has therefore provided extensive guidelines.

Attributes used throughout the schema

There are a number of attributes that are referenced from different elements throughout the MODS schema. Because XML schema allows for complex types that can be reused, this brings a high level of consistency to the element definitions. (There are numerous places in MARC that define similar elements the same way, although constraints inherent in the limited number of subfields that may be included do not always allow for full consistency.) Many of the attributes used widely throughout the schema concern language issues regarding the metadata. The schema allows for attributes on each top-level element in MODS to indicate the language, script and transliteration scheme of the metadata in the given element. The guidelines elaborate on how these attributes should be used, and point out when they do not exist at certain levels in MARC 21.
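For example, a romanized Russian title might carry these attributes as follows (a hypothetical fragment; the attribute names are those defined in the schema, while the element content and attribute values are illustrative):

  <titleInfo>
    <!-- language, script and romanization scheme of the metadata itself -->
    <title lang="rus" script="Latn" transliteration="ALA-LC Romanization Tables">Voina i mir</title>
  </titleInfo>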

Detailed description of elements

The guidelines describe each element (including its subelements) and attributes in detail. They provide links to the MARC 21 documentation for equivalent elements, specify which MARC 21 fields are equivalents and which fields are collapsed into one MODS element (since MODS is a subset of MARC), indicate under what conditions an element is repeated, and give recommendations on best practice. The guidelines are particularly useful for elements that do not exist in MARC 21 or do not have a direct equivalent, since in these cases there is generally no other documentation to reference for definitions. For instance, the element <displayForm> under <name> does not have a direct equivalent in MARC. It is related to 245 $c (statement of responsibility), although this mapping is misleading, since that MARC subfield often carries other data than merely the display form of a name. (For this reason, the MARC to MODS mapping uses a note with a type attribute "statement of responsibility".) The entry for displayForm is as follows:

"displayForm" is used to indicate the unstructured form of the name. It includes a name as given on the resource. For some applications, contractual agreements require that the form of the name given on the resource be provided. This data is usually carried in MARC 21 field 245 subfield $c, although the latter may include other information in addition to the display form of the name. If part of an entire statement of responsibility, it may be indicated in a note (with type="statement of responsibility") along with any other text in the statement of responsibility and also repeated here if desired.

Many elements are packaged differently in MODS than they are in MARC 21. For instance, the MARC Leader/07 (bibliographic level) includes several different concepts, including issuance (whether serial or monographic), granularity (whether a collection or single item), and method of update (whether continually updated or static). In MODS an element <issuance> is used under <originInfo> with enumerated values "monographic" or "continuing". Whether or not the item is part of a collection is indicated as an attribute associated with <typeOfResource>, since any of the enumerated high-level resource types may be collections or single items (collection="yes" indicates that the MODS record describes a collection). In addition, MODS uses the attribute manuscript with a value of "yes" or "no" in <typeOfResource> to indicate whether the high-level resource type is in manuscript form (MARC has separate Leader/06 (type of resource) values for these). The guidelines relate these elements and attributes to the appropriate MARC element and describe how they are to be used.

Because MODS was developed particularly with electronic resources in mind, there are a number of elements that give information specific to these types of resources. For instance, under <physicalDescription> there is an element <digitalOrigin> which does not exist in MARC 21, but was felt to be useful for description of electronic resources. It designates whether the item is born digital or reformatted digital. The definition is as follows:

"digitalOrigin" designates whether a digital resource is born digital or reformatted digital. It does not exist in the MARC 21 formats; however, it is beneficial in the description of digital materials. The following values may be used:

born digital – A resource was created and is intended to remain in digital form.
reformatted digital – A resource was created in a non-digital form and was converted into a digital form.
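In a record for a scanned photograph, for instance, the element might appear as follows (a hypothetical fragment; the sibling <internetMediaType> element is included only to show the surrounding <physicalDescription> context):

  <physicalDescription>
    <internetMediaType>image/jpeg</internetMediaType>
    <!-- the image was converted from an analog original -->
    <digitalOrigin>reformatted digital</digitalOrigin>
  </physicalDescription>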


A few specific MODS elements are discussed below in order to illustrate the use of the guidelines.

MODS notes

In order to accommodate the wide range of types of note fields used by different applications, the MODS team at LC decided to allow for an open-ended list of note types. To enhance the likelihood of interoperability between applications, the Library of Congress (n.d., b) has made available a list of note types used by implementers, which is referenced in the guidelines. Much of the input to the list at the time of the writing of this article is from the University of California at Berkeley (n.d.), which used its own schema, generic descriptive metadata (GDM), in its database for digital objects, but is converting to MODS. The list will be continuously updated as users submit note types that have been input into their MODS records. Users are free to use any value for the note type attribute regardless of whether their types have been submitted to LC.
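Two hedged examples of what such notes might look like follow; the first type value comes from the MARC mapping discussed earlier, the second stands in for any implementer-defined type, and the note text is invented:

  <!-- type used by the MARC to MODS mapping for 245 $c -->
  <note type="statement of responsibility">edited by A.N. Editor</note>
  <!-- an implementer-defined type; any value is permitted -->
  <note type="performers">Chicago Symphony Orchestra</note>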

Related item

The <relatedItem> element is a powerful feature in MODS in that it allows for rich linking and description of related resources. The MODS guidelines are essential in describing the intent and features of this element because of its complexity. <relatedItem> includes an attribute "type" to indicate the type of relationship between the resource described in the body of the MODS record and the resource described in <relatedItem>. The guidelines explain that there are several MARC 21 fields that are mapped to <relatedItem>. These include the 7XX added entry fields, but only those that contain a title subfield (subfield $t), since the existence of a title subfield denotes an analytical author/title entry (i.e. the work or part of the work is contained in the resource). Thus, in these situations the type attribute has a value of "constituent" (i.e. contained in the work described). In addition, all the linking entry fields (76X-78X) map to <relatedItem>; in some cases multiple MARC linking entry fields are mapped to the same type. For example, field 770 (supplement/special issue entry) and field 774 (constituent unit entry) are both mapped to type="constituent", since 770 denotes a supplement contained in the whole, which is a special type of constituent. It was felt that this distinction was not necessary to retain in MODS. Series was considered yet another relationship between resources, so type="series" is used for series statements, mapped to the various MARC fields that include series information (440, 490, 760, 762, 800, 810, 811, 830).

The <relatedItem> element is fully recursive, so that any MODS element may be used within it. This allows for rich description and the ability to designate hierarchical relationships. The guidelines elaborate that a full description for constituent items (using type="constituent" and any MODS element) may be included, which might be suitable for complex digital objects that require specific descriptive information for constituent parts but that are considered intellectually one object (e.g. a CD with several tracks, digitized or analog). Some of the elements available in MODS in <relatedItem> are not available in the MARC linking entry fields, and further development of the latter is limited by the available subfields, which are restricted to the numerics 0-9 and one-character alphabetic subfield tags. In addition to all the defined MODS elements, in MODS version 3.0 <relatedItem> includes elements for parsing the details of citation numbering, to be used when the MODS record describes an article. This will enable the construction of an OpenURL or a formatted citation.
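A sketch of the CD scenario mentioned above might look like the following (hypothetical values, with only a minimal subset of the permissible MODS elements shown):

  <relatedItem type="constituent">
    <titleInfo>
      <title>Overture</title>
    </titleInfo>
    <physicalDescription>
      <extent>1 audio track (4 min.)</extent>
    </physicalDescription>
  </relatedItem>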

Identifier

The <identifier> element contains a standard number or code that identifies the resource. The guidelines explain that there is no controlled list of values for the type attribute, which designates the identifier scheme used for formulating the identifier recorded. They give a list of suggested values for the identifier scheme, however, and state that others may also be used. If the user desires to record a local identifier, either the type value "local" may be used or the specific named identifier scheme may be recorded, since this is not a controlled list.

Extension

Because the extension element has no substructure in the MODS schema and no clear equivalent in MARC (although it can be mapped to 9XX), there is little guidance in the schema on how to use it. The MODS guidelines describe situations that warrant its use and give some examples. The element may be used for local elements, similar to MARC 21 9XX fields, that the user wants to include but that are undefined in MODS. In addition, it is a mechanism for including elements from other XML schemas that may enhance the MODS record.

MODS Lite

Although MODS includes fewer data elements than MARC 21, one can look at the schema as fairly complex. Allowing for a simpler expression of MODS results in better crosswalking to other simple schemas, such as Dublin Core. Various transformations may be possible using stylesheets and software tools developed at the Library of Congress' Network Development and MARC Standards Office, which is part of the framework for working with MARC data in an XML environment. Corey Keith's article in this issue details MARCXML developments. A "MODS Lite" is described in the MODS guidelines to show how a simple MODS expression might be used. It shows how to express metadata using only MODS elements that are equivalent to the 15 elements of the simple Dublin Core Metadata Element Set (Dublin Core Metadata Initiative, 2003). The substructure defined in the MODS schema is used to allow for records that validate, but elements that do not have a direct equivalent in Dublin Core are not included (except that record information is given in MODS Lite but not in Dublin Core). Of course, users of MODS Lite can extend the structure to include other elements that they feel are essential, maintaining consistency with both MODS and Dublin Core. Various subsets of elements from MODS could define other MODS "Lites"; the Minerva project detailed below uses a different subset. In the future, the Library of Congress will publish other MODS Lites that can be used for other applications.
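A record restricted in this way might look like the following sketch (invented values; each element shown has a direct counterpart in simple Dublin Core, as indicated in the comments):

  <mods xmlns="http://www.loc.gov/mods/v3">
    <titleInfo>
      <title>Arctic ice cores</title>              <!-- dc:title -->
    </titleInfo>
    <name>
      <namePart>Smith, Jane</namePart>             <!-- dc:creator -->
    </name>
    <typeOfResource>text</typeOfResource>          <!-- dc:type -->
    <originInfo>
      <publisher>Example Press</publisher>         <!-- dc:publisher -->
      <dateIssued>2003</dateIssued>                <!-- dc:date -->
    </originInfo>
    <subject>
      <topic>Ice cores</topic>                     <!-- dc:subject -->
    </subject>
    <identifier type="uri">http://example.org/icecores</identifier>  <!-- dc:identifier -->
  </mods>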


Additional features

The guidelines include full record examples to assist the user in seeing the relationship between MODS elements. It is expected that over time this section will grow considerably as more implementations contribute full MODS record examples. In addition, there is an index by element name, regardless of its hierarchy in the element list, to facilitate finding information quickly within the guidelines.

Applications

MODS has been gaining some attention as a descriptive metadata scheme of particular interest for digital materials. The existence of both tools for transforming records between schemes and extensive guidelines should facilitate its further adoption. This section will describe some applications of the MODS schema. Although this article will focus on the application of MODS to the Minerva project (Mapping the INternet Electronic Resources Virtual Archive) (Library of Congress, n.d., c), it is worth noting some additional MODS metadata projects. A number of these are METS projects that use MODS as the descriptive metadata extension schema. This list is not comprehensive, but illustrates some applications of MODS.

- The University of Chicago Library (Olson, 2003) is crosswalking MARC to MODS for use as descriptive metadata embedded in METS objects for one of its digital library collections. In this case, all of the electronic objects were created from physical objects in its collection which already had detailed cataloging records.
- The LC Digital Audio-Visual Preservation Prototyping Project (Library of Congress, n.d., d) explores aspects of digital preservation for audio and video. It packages digital content and metadata using METS. Descriptions of objects and subobjects use MODS. Where previously cataloged descriptions exist in the Library of Congress catalog, the MARC records are transformed to MODS and loaded into the database used. Where they are not available, original MODS records are created.
- OAI harvesting. The Library of Congress incorporates MODS as an alternate format for its over 100,000 metadata records for items digitized for American Memory that are harvested; items represented include books, maps, photographs, early movies, sheet music, and printed ephemera (Library of Congress, n.d., e).
- The University of California's Bancroft Library is using an SQL Server database to gather descriptive metadata to be used in the context of METS objects. The database, originally designed around the MOA2.DTD, gathers all of the descriptive, administrative and structural metadata necessary to generate METS objects with MODS descriptive metadata and (minimal) MIX image technical metadata. Although the descriptive metadata was designed not around MODS but long before it existed, the elements have been mapped to MODS for use as descriptive metadata (Beaubien, 2003).
- Some users are experimenting with MODS for bibliographies and bibliographic citations generally. Version 3.0, which at this writing is undergoing review, adds additional elements to <relatedItem> to accommodate parsed citation information that will assist in coding bibliographic citations and generating an OpenURL link.
- The National Library of Australia is using MODS for several projects. It is being used as an exchange format to facilitate harvesting and conversion of data in multiple original formats/schemas to MARC and subsequent loading into Australia's National Bibliographic Database (NBD). This allows the NLA to include records from non-MARC, non-traditional contributors in its union catalog by converting them with XSLT stylesheets before loading. In addition, MODS will be used as a storage format for the MusicAustralia service, which involves gathering data from various contributors – libraries, archives, museums, the commercial sector – via the Harvester project, and loading all records into the NBD. The NLA extracts the records it wants from the NBD and converts them to MODS for the MusicAustralia service itself (Ayres, 2003).

MINERVA project

MINERVA is an experimental pilot developed to identify, select, collect and preserve open-access materials from the World Wide Web. The effort includes consensus building within the library; joint planning with external bodies; studies of the technical, copyright and policy issues; the development of a long-term plan; and coordination of prototypes. A multidisciplinary team of library staff representing cataloging, standards, legal, public services, and technology services is studying methods to evaluate, select, collect, catalog, provide access to, and preserve these materials for future generations of researchers. The project is a collaboration between the Library of Congress, the Internet Archive (Alexa), the State University of New York (SUNY), and the University of Washington. The latter groups have assisted in identifying content and in using tools of their design to assign metadata descriptions to the Web sites collected. This metadata database is then used to search, retrieve and analyze the archived collection of Web sites. The contractors collected and archived Web sites based on LC specifications focused on themes concerning the presidential Campaign 2000; 11 September, 2001; Winter Olympics 2002; Election 2002; and the 107th Congress.

Metadata is created for Web sites in the collection using the MODS schema because of its compatibility with MARC data. The data in the MODS records is used in the search and retrieval system for Minerva. The library is also beginning to experiment with METS to enable the inclusion of additional metadata, such as technical, provenance, and rights metadata. An important part of this work involves defining a METS application profile for Web sites. To support the project, LC's Network Development and MARC Standards Office is currently working on tools for the creation of MODS records and the conversion of MODS to MARC so that the records can be brought into LC's online catalog.

The MINERVA collection on the theme of Election 2002 is the first to provide full metadata records for each of the sites collected. It is a selective collection of nearly 4,000 sites archived between 1 July, 2002 and 30 November, 2002. The Election 2002 Web archive includes Web sites produced by congressional and gubernatorial candidates as well as party, interest group, press, government, civic, and other selected Web sites related to the 2002 national and statewide elections. For MINERVA collections, the Library of Congress creates collection-level cataloging records that are accessible in LC's integrated library system. These are being supplemented by records for each Web site within the collection (one record per "base URL", which covers multiple snapshots captured on different dates), which are used within the Minerva Web archive search system. LC decided to use MODS for its record format because of its flexibility in level of markup, its compatibility with existing MARC records in the ILS, and its usefulness as a descriptive metadata extension schema within a METS object in a future repository.

Metadata preparation for the Election 2002 Web Archive

The contractors used specifications supplied by the Library of Congress concerning which MODS elements to use and how to populate them. A small number of MODS elements are used for the resource descriptions. The fields included are listed below, along with descriptions; an XSLT stylesheet is used to present the data in the form of a catalog record that is easy to understand.

- Title. Title data is taken from text included within the title tag of the HTML source file of the resource at the URL being described. In addition, contractors constructed an alternative title based on information about the site such as candidate name, party, office, and state (or city), since HTML titles may not always reflect the true content. If the HTML title was unavailable, they used this constructed title as the main title. LC gave the contractors specifications for what data to include for each type of Web site. Categories include: candidate, citizen, civic and advocacy, government, political party, press, public opinion and miscellaneous.
- Name. A name is defined as a personal or corporate entity related to the resource; it may be the creator or issuing publisher. Contractors were instructed to include the name of the entity that appears to be primarily responsible for making the content of a Web page, as identified by reference to text or graphics on the captured page, or by reference to an "about us" page linked from the page. A name is encoded in a structured form (i.e. lastname, firstname) and the type of name is indicated (personal, corporate). Names are not authoritative in that they have not been validated against an authority file.
- Abstract. A brief description of the site associated with a Web page is given, generally generated from data collected about the site, referencing a possible site producer and identifying a possible purpose of the site. If available from a meta tag "description" within the source, it may be extracted.
- Date captured. This is the archived time associated with the Web page, which was archived between 1 July, 2002 and 30 November, 2002. This may be a range, with start and end dates (MODS allows for an attribute associated with the date to indicate whether it is a start or end date). The date of the first iteration and the date of the last iteration for each URL are included (Web sites were captured periodically over this period of time, so each has many iterations).
- Genre. Each URL is identified as "Web site". This data is generated, since it is always the same.
- Physical description/format. Contractors were instructed to provide a list of the distinct file formats, expressed as standard Internet media types (e.g. text/html, image/jpeg), of all archived objects associated with a Web page.
- Related item. The collection name of which the resource is a part (with type="host"), e.g. Election 2002 Web Archive, and its URL are included. This provides a link to the collection Web site as well as its metadata. This field is generated, since the data is the same for each record in the collection.
- Identifier. Two identifiers are given. One is the URL of the site which was collected, used with a display label "Active site (if available)". Because it is possible that a site could have disappeared since the election (and indeed many have), a hot link is not provided for this URL. A second identifier is included for the location of the archived site, with an attribute displayLabel "Archived site". A hot link is provided for the archived site URL associated with the dates of capture, which allows the user to go directly to the archive that contains the captured files.
- Language. The primary language of a Web page is identified using the ISO 639-2 bibliographic code (e.g. eng, fre). The XSLT stylesheet that generates the displayed record converts the code into text for the user display.
- Access condition. An identifier or statement of access terms and conditions provided by the library is given if there are access or use restrictions associated with a Web page. Otherwise the text "none" is generated.
- Subject. Library of Congress Subject Headings (LCSH) are used based on guidelines provided. In many cases these are generated using rules for each Web site type provided by LC and based on data already collected about each site. Some sites, such as citizen groups, do not have controlled subject headings, because it was not possible to provide consistent rules for supplying these. There are also cases where the rules were complex and the instructions resulted in headings that were not completely consistent with LCSH; these will be converted later, since they are predictable situations (an example is the use of postal codes for states instead of the abbreviations specified in LCSH). If available from metadata in HTML meta tags, keywords may be extracted and included as uncontrolled subject terms.
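A skeletal record following the specifications just listed might contain elements such as these (a hypothetical fragment, not taken from the archive itself; Figure 1 shows a real record):

  <mods xmlns="http://www.loc.gov/mods/v3">
    <titleInfo>
      <title>Jane Doe for Governor</title>
    </titleInfo>
    <genre>Web site</genre>
    <originInfo>
      <!-- first and last captures of this base URL -->
      <dateCaptured encoding="iso8601" point="start">20020701</dateCaptured>
      <dateCaptured encoding="iso8601" point="end">20021130</dateCaptured>
    </originInfo>
    <relatedItem type="host">
      <titleInfo>
        <title>Election 2002 Web Archive</title>
      </titleInfo>
    </relatedItem>
    <identifier type="uri" displayLabel="Active site (if available)">http://www.example.org/</identifier>
  </mods>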

Figures 1-3 show an example of the XML coding for one Web site from the Election 2002 Web archive (Figure 1), the user display of the same record (Figure 2), and the result of following the link to the archived Web sites (Figure 3).

Figure 1 MODS document for candidate Fran Ulmer Web site

Metadata for additional collections

Although the 11 September Web archive was collected between 11 September, 2001 and 1 December, 2001 and the archived Web sites made available, the metadata is currently in the process of being completed and a search system established. This collection required different specifications for the content of fields, particularly the title, subject headings, and languages. As with Election 2002, the contractor wanted to generate as much metadata as possible (for instance, the abstract and constructed title are generated from data concerning the producer and producer type for each site). The sites in the 11 September Web archive were much more diverse than the Election 2002 collection, so more individual analysis of sites was necessary and the metadata is not as rich. For instance, it is only possible to supply very general subject headings for each individual site without detailed subject analysis.


Figure 2 Record display for Fran Ulmer Web site

Figure 3 Resource page for archived Web site

The Library of Congress is planning to provide metadata using MODS for the 107th Congress Web archive by using tools developed in-house. The resulting records will be of high quality because LC staff will be providing the content and will apply controlled vocabulary (name and subject headings). METS objects will be created for the sites to include technical and administrative metadata as well as descriptive metadata.

Conclusions

MODS draws on the well-developed bibliographic description traditions that the library community has contributed to the organization of knowledge. By retaining some of the richness of the MARC element set and replacing the syntax with XML and more friendly language-based tags, MODS allows for rich resource description that is compatible with the huge numbers of MARC bibliographic records existing in library systems. In addition, as an XML descriptive standard, it provides the flexibility to be combined with other XML-based standards such as METS to satisfy the needs of the digital library environment. Because of the detailed documentation that the Library of Congress provides in the MODS guidelines, mappings, and examples, MODS can be deployed by users without their having to provide their own application instructions. In addition, tools are provided for transformations to and from MARC and to other metadata schemes. LC has a firm commitment to support, maintain and document its standards. As MODS is used more widely and for other applications within and outside LC, it will evolve to fill the need for a rich descriptive standard that is compatible with MARC and that benefits from the flexibility and power of the XML syntax.

References

Ayres, M.-L. (2003), Message to MODS list, e-mail message, 25 August.
Beaubien, R. (2003), Message to MODS list, e-mail message, 14 July.
Dublin Core Metadata Initiative (2003), Dublin Core Metadata Element Set, Version 1.1: Reference Description, Web document, available at: www.dublincore.org/documents/dces/ (accessed 12 September 2003).
Library of Congress (n.d., a), Metadata Object Description Schema: MODS User Guidelines, Web document, available at: www.loc.gov/standards/mods/modsuserguide.html (accessed 12 September 2003).
Library of Congress (n.d., b), Metadata Object Description Schema: MODS Types, Web document, available at: www.loc.gov/standards/mods/mods-notes.html (accessed 12 September 2003).
Library of Congress (n.d., c), MINERVA: Mapping the Internet Electronic Resources Virtual Archive, Web document, available at: www.loc.gov/minerva/ (accessed 12 September 2003).
Library of Congress (n.d., d), Digital Audio-Visual Preservation Prototyping Project, Web document, available at: http://lcweb.loc.gov/rr/mopic/avprot/ (accessed 12 September 2003).
Library of Congress (n.d., e), American Memory: Sample Requests Using the Open Archives Initiative Protocol for Metadata Harvesting, Web document, available at: http://memory.loc.gov/ammem/oamh/oai_request.html (accessed 12 September 2003).
Olson, T. (2003), Message to MODS list, e-mail message, 26 June.
University of California at Berkeley (n.d.), GDM: A Descriptive Metadata Extension Schema for the Metadata Encoding and Transmission Standard, Web document, available at: http://sunsite.berkeley.edu/mets/gdmSchema.html (accessed 12 September 2003).


About XML

Whither HTML?

Judith Wusteman

The author

Judith Wusteman is based at the Department of Library and Information Studies, University College Dublin, Dublin, Ireland.

Keywords

Hypertext markup language, Extensible markup language, World Wide Web, Development

Abstract

HTML has reinvented itself as an XML application. The working draft of the latest version, XHTML 2.0, is causing controversy due to its lack of backward compatibility and the deprecation – and in some cases disappearance – of some popular tags. But is this commotion distracting us from the big picture of what XHTML has to offer? Where is HTML going? And is it taking the Web community with it?

Received December 2003; revised December 2003; accepted December 2003.

Reinventing HTML

Once HTML 4 was published, the HTML Working Group closed for business. It was assumed that developments in XML and the increasing use of cascading style sheets (CSS) would make further maturation of the standard unnecessary. Six months later, the group was reformed and, nearly six years later, it is busy producing controversial new versions of HTML.

Where are we now?

What happened to reactivate the HTML Working Group was the realisation that HTML was going to be around for some time after all and that an XML version might be a good idea. It was perceived that such a version might make it easier to repurpose content for the variety of browser devices that was beginning to appear. And so XHTML began its evolution. As defined in the XHTML 1.0 specification:

XHTML is a family of current and future document types and modules that reproduce, subset and extend HTML 4 (W3C, 2000a).

It has already gone through several versions since XHTML 1.0 was initially published in January 2000. The latter was simply HTML 4 reformulated in XML. It could be used with XML software but would also operate in HTML 4-conforming user agents by following some simple guidelines. In other words, it was a route of entry to the XML world, but with backward compatibility. The three "flavours" of XHTML 1.0 were Strict, Transitional and Frameset. The Strict version aimed at separation of content and presentation and was designed to be used in conjunction with CSS. The Transitional version turned a blind eye to some presentational features, such as the bgcolor, text and link attributes in the body tag, and hence catered for those browsers that could not understand style sheets. The Frameset version was the least important of the three as, by then, the move away from the use of frames had begun in earnest.

The next stage was the modularisation of the elements and attributes into "convenient collections for use in documents that combine XHTML with other tag sets" (W3C, 2001a). XHTML Basic (December 2000) (W3C, 2000b) was a minimal build of such modules, targeted at mobile applications. On the other hand, XHTML 1.1 (May 2001) (W3C, 2001b) was a modular version of XHTML 1.0 Strict and was designed to "serve as the basis for future extended XHTML Family document types". In other words, separation of presentation and content, along with modular design, would be future hallmarks of HTML.

And now XHTML 2.0 is emerging; the first working draft (WD) appeared in August 2002 and the fifth and latest in May 2003 (W3C, 2003a). XHTML 1.1 introduced modularisation; XHTML 2.0 substantially updates it and, importantly, introduces some external modules in the form of XForms, XML Events and Ruby. According to Steven Pemberton, HTML Working Group chair, the aims of XHTML 2.0 include more "usability, accessibility, internationalisation and device independence" (Dyer, 2003).

XHTML 2.0: the changes

Unfortunately, what has made the news with XHTML 2.0 is not the bigger vision but the list of features that have been dropped or deprecated (in other words, "in the process of being removed" (W3C, 2003a)). Many of the changes can be grouped according to the two XHTML design aims of less scripting and less presentational markup.

Less scripting

In XHTML 2.0, the Working Group aimed to add "new functionality that people have been looking for, that they need and have been doing up to now with scripting" (Dyer, 2003). There are four main developments here: XForms, XML Events, Ruby and XFrames. All are to be welcomed, not least because they all reduce the dependency on scripting.

XForms, the next generation of forms for the Web, is a major advance, as I explained in my previous column (Wusteman, 2003). It relies on XML Events, which provides a generic method of specifying the actions that should happen when events occur. XML Events does not just support predefined events, such as mouse clicks; it also allows developers to define their own events and indicate what should happen when they are triggered (Chase, 2002). The inclusion of the Ruby module in XHTML 2.0 is of particular relevance to East Asian Web documents. Ruby annotations are short runs of text alongside base texts, most often used to provide pronunciation guides.

Finally in this category of changes is XFrames, not strictly part of the XHTML standard, although the first WD was released within 24 hours of the first XHTML 2.0 WD. This is a welcome relaunch of a feature that was always a great idea in theory but always problematic in practice. The usability problems of traditional frames – the unintuitive workings of the back and reload buttons and the difficulty of bookmarking and searching – can all be solved using JavaScript. XFrames solves them without scripting. The basis of this simple solution is to make the URI (Uniform Resource Identifier) for each frame's content part of the URI for the overall document. So when the user moves to a different document within one frame, the overall URI of the page changes.
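A minimal sketch of what this looks like, based on the first XFrames working draft (the element names, the compose attribute and the namespace URI are taken from that draft and may well change):

  <frames xmlns="http://www.w3.org/2002/06/xframes/">
    <group compose="horizontal">
      <frame id="nav" source="nav.xhtml"/>
      <frame id="main" source="welcome.xhtml"/>
    </group>
  </frames>

Populating the main frame with a different document is then expressed in the frameset's own URI, for example frameset.xfm#frames(main=page2.xhtml), so bookmarks, the back button and search engines all see the real state of the page.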

The move away from scripting is a very positive step for the Web. However, it does epitomise an approach to HTML and the Web in general that could be described as more purist than its predecessor and that will, as a result, alienate some. XForms and XML Events, in particular, although certainly more academically satisfying, are also more abstract in concept. In the case of XML Events, as with XSLT, the less common declarative programming paradigm is used rather than the more prevalent procedural one.

Less presentation, more structure

The purist theme is continued with a further move towards complete separation of presentation and structure. The changes that are most clearly identified with this process are:

- The deprecation of the br element and its replacement with the line element, l. br is a presentational element whereas l describes the content's function. But the semantic difference appears to be very slight for the inconvenience it will cause with legacy data.
- The removal of other presentational elements such as b, i, big, small and tt. This is controversial; as Glazman (2002), author of Netscape and Mozilla Composer, points out, "you don't always want to add semantics to a piece of text you want to see in bold-italic".
- An attempt to abandon the inline style attribute – and its reintroduction in WD 5 with the rider that its use is "strongly discouraged" (W3C, 2003a). This reinstatement may have something to do with Glazman's comment (2002) that dropping it was enough "to make XHTML 2.0 an (sic) harmful decision". So we will still be able to define styles directly within elements, for example <p style="font-weight: bold">.

- The supplementing of headings h1 to h6 with new elements h and section, representing unordered headings and sections. For example:

    <h1>Whither HTML?</h1>
    <h2>XHTML 2.0: The changes</h2>
    <h3>Less Scripting</h3>
    <h3>Less Presentation, More Structure</h3>

  could be rewritten as:

    <section>
      <h>Whither HTML?</h>
      <section>
        <h>XHTML 2.0: The changes</h>
        <section>
          <h>Less Scripting</h>
        </section>
        <section>
          ...
          <h>Less Presentation</h>
        </section>
      </section>
    </section>

An advantage of the new elements is that sections can now be self-contained and not just identified by the position of headings. So, for example, information connected to a section could appear before the heading, as in the section above headed Less Presentation. The disadvantage is that this complicates what can be very simple. For example, if the heading level can only be inferred from the depth of section nesting, what happens when you want to cut and paste?

h1 to h6 are not yet deprecated. "The working group has not yet addressed this suggestion" (W3C, 2003a), but no doubt they will be deprecated in the future. The introduction of h and section will not cause too much consternation until h1 to h6 are deprecated – and then people will get very upset.

The aim of all these changes is to enforce the marking up of content according to function and the use of style sheets for appropriate presentation. As the WD explains, "HTML is at heart a document structuring language. XHTML2 takes HTML back to these roots" (W3C, 2003a). I would not pretend that this argument does not have huge academic appeal. And, with devices such as voice-activated systems just around the corner, it is true that identifying the "structural meaning of content" is becoming commercially relevant for the first time (Chase, 2002). Unfortunately, the changes listed above, and a few other backward incompatibilities such as the deprecation of the img tag, are just too traumatic for some.

The disappearing img

The change that has hijacked discussion of XHTML 2.0 is the deprecation of the img element in favour of object. applet and the non-standard embed are similarly afflicted. object appeared in the HTML 4.0 specification but has only really been used for embedding multimedia and Java applets. The object element has the advantage that, if an object cannot be displayed, that object's contents will be displayed instead. In the following example, if a PNG image cannot be displayed, a GIF will appear. Failing both, a text description will be inserted. Notice that object also specifies the image's MIME type, another nice addition.



  <object src="cover.png" type="image/png">
    <!-- nested fallbacks: GIF if PNG fails, text if both fail; filenames are illustrative -->
    <object src="cover.gif" type="image/gif">
      Journal coverpage
    </object>
  </object>




But, as Pilgrim (2003) explains, object does not work properly in Internet Explorer (IE) because it was introduced as a Microsoft-specific tag in IE 3.0 for embedding ActiveX controls. Even if ActiveX is enabled in a particular browser, the image is rendered with "a nasty border and two sets of scroll bars". And IE does not allow nesting – it just renders all the images. Ironically, Pilgrim's clever workarounds to allow the use of object tags involve scripting.

Of course, the W3C could not have predicted the Eolas patent suit (W3C, 2003b). The latter covers the automatic launching of embedded objects such as Flash, Real Player and PDF readers. Hence, it affects certain uses of the object element. The full implications are not yet clear; although Microsoft has lost the case, it is appealing. It has also stated that, in its view, violation of the patent can be avoided by embedding active content using . . . scripting. It does not appear to be the ideal time to move to object.

Why did the authors of XHTML 2.0 choose to deprecate img? It is simple, it works and it has helped to make the Web a success. However, it has never met with the approval of SGML purists, maybe because it is one of the few HTML features that does not have a basis in SGML. But anyone who has ever tried to display images in SGML will understand why HTML did not follow the same model.

A theme in the evolution of HTML appears to be coercion. It would be great if the W3C could persuade everyone to write aesthetically pleasing HTML. But trying to force a philosophy on a very large installed market place is usually not successful. Pragmatism is necessary in the real world. A pragmatic approach would encourage people to use object without throwing away img. And it would extend the img tag to enable the features allowed in object, such as an attribute for MIME types. Why not? One of the attractive new features in XHTML 2.0 is that any element can now be a link; the href attribute has joined a group of attributes common to all elements. But if it was possible to add the href attribute to any element, why was it not possible to add a MIME type attribute to img? Of course, it is not perfect, but the img tag is perceived as a cornerstone of HTML; even if it is broken, people still want to use it.
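For instance, a navigation list needs no anchor elements at all (a sketch in XHTML 2.0 working-draft syntax; the filenames are invented):

  <!-- each list item is itself a link; no a element is required -->
  <ul>
    <li href="news.xhtml">News</li>
    <li href="about.xhtml">About us</li>
  </ul>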

The bigger vision

According to Chase (2002), the "big news" in XHTML 2.0 is that backward compatibility has been dropped. But that is to miss the bigger vision. At the end of the day, the removal or addition of individual elements may infuriate or inspire, but it is not going to have a lasting impact on the evolution of the Web. The features of XHTML that will have such an impact, if they are allowed to, are well-formed XML and modularisation. These are developments that could change the future of the Web.

Well-formed XML

"When we produced XHTML in 1999," says Pemberton, "many people didn't see the point. . . . The initial advantages were very small" (Dyer, 2003). But the long-term advantages of HTML as well-formed XML are enormous. Until XHTML, HTML was only human-readable; now it is, to some extent, machine-readable (even without the unlikely success of the Semantic Web project). XML was built to be processed automatically and to enable the easy creation of utilities using standard tools. The Web site production process is increasingly taking advantage of this. XSLT is now widely used to repurpose content for different output devices and for on-the-fly transformation of results from server-side XML databases. Other XML tools are coming to the fore as well; it is likely, for example, that XPath will become central in content management systems for the selection of fragments of content. And producing output from XML-based data does not even require delivery of XHTML. It can always be dumbed down to HTML 4, although the latter is not suitable for small form factor devices such as PDAs and mobiles.

The push towards XHTML will come from the big publishers, those maintaining and delivering huge amounts of data, because they will want to repurpose for different output devices. There is enormous growth in small form factor devices. But, while desktop computers and servers are becoming increasingly powerful, the same is not true for these small devices. The progress in battery technology has not mirrored that of chip technology. Mobile device resources will remain limited for the foreseeable future. And well-formed XML requires considerably fewer resources for parsing and display than HTML 4.

As Bray (2003) comments, people's home pages are not going to change to XHTML any time soon. But that does not mean that there are no advantages of HTML as well-formed XML for smaller sites and businesses. XML means standard tools and, hence, simpler creation of small as well as large applications. The move from HTML 4 to XHTML 1.0 is worth it for almost everyone.

Modularisation

Mature modularisation is probably the only significant advantage of XHTML 2.0 over its predecessor. But what an advantage! If modularisation works and is adopted, it could transform the way Internet applications are built and lead to added innovation in the market place. HTML is being used across an increasingly diverse range of devices, including small form factor devices, television devices, even fridges. Content needs to be tailored for all of these and more in an economically feasible way. Modularisation makes this possible because it provides a standard method for a device to indicate which elements it can support and for a Web site to specify which elements it requires for delivery of its content. Modularisation also means that developers are not constrained by the set of tags defined by the HTML Working Group; everyone is given the ability to incorporate domain-specific vocabularies in XHTML documents.

It may also provide a way out of the stranglehold that Microsoft has developed over browser technology. It is painfully obvious that a few key players control everything on the desktop at present. If they do not move, nothing changes. IE has developed little in the last few years and, if Microsoft's plans are realised, it will stay the same until some time "between late 2006 and mid-2008" (Kovar, 2003), when IE 7 is released along with Longhorn, the next version of the operating system. That means nothing will really change until between 2008 and 2010, when the new browser has been widely adopted. And IE 7 will not be available as a standalone browser; it will only be accessible as part of an operating system upgrade.

In the meantime, the evolution of HTML – as it is used, not as it is described in a W3C specification – is constantly held back as we wait for the big browsers to implement some feature or other. And many potential applications never get written because the only way to implement them would be to develop a new browser from scratch. To enable any sort of change, a way must be found for small groups of open source or commercial developers to build on what has been done already and to add their own new features and applications. This is possible only to a limited extent with plugins. XHTML modularisation should make the process much easier.

If the key browsers, particularly IE, Mozilla and Safari, support XHTML modularisation, the Web will be opened up to innovation. Mozilla has shown the way; it has been rebuilt around a modular core and a suite of extensions. For example, an extension has already been written to display MathML. And the Minimo (Mini Mozilla) project (www.mozilla.org/projects/minimo), in which a slimmed-down version of the Mozilla browser will be available for small devices such as palmtops, is a direct result of modularisation. Pemberton comments:

. . . there are some things coming up that make me think IE will evolve into a shell that won't be doing much processing itself. It'll just be a way of combining different markup languages via plugins (Dyer, 2003).
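Combining markup languages in a single document already works at the syntactic level through XML namespaces; the following sketch embeds a MathML expression for r squared in an XHTML page (whether it renders depends on native browser support or an extension such as the Mozilla one mentioned above):

  <html xmlns="http://www.w3.org/1999/xhtml">
    <head><title>Mixed namespaces</title></head>
    <body>
      <p>The exponent below comes from a different vocabulary:
        <!-- MathML lives in its own namespace inside the XHTML document -->
        <math xmlns="http://www.w3.org/1998/Math/MathML">
          <msup><mi>r</mi><mn>2</mn></msup>
        </math>
      </p>
    </body>
  </html>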

It is not clear whether Microsoft is planning to adopt the W3C version of modularisation for IE. Even if it is not, all is not lost. Obviously, the more native the browser support for a feature, the more possibilities there are. But much of XHTML modularisation can be implemented outside IE, using third party tools that use IE for rendering, such as the XForms plugin, FormsPlayer (www.formsplayer.com/).


Of course, modularisation is a double-edged sword; if you allow anyone to add any modules they like, they will – and HTML loses its status as the lingua franca of the Web. There has to be a balance between enabling new applications to be built and having so many implementations that there is no interoperability. XHTML 2 provides the wherewithal for developers large and small to write their own customised modules while also providing standard modules.

Could modularisation re-democratise HTML? That depends on how you define democracy. Watts (2000) sees it as:

. . . splitting the Web into the "haves" (those organisations who are XML-literate) and the "have nots" (who are frozen with using HTML and scripting languages).

To exploit XHTML modularisation in the near future "requires a fairly significant grasp of many XML concepts and technologies", which some small organisations would not have. This is only true if content authors hand-code HTML – and that was not even true in 2000. You do not need any knowledge of XML if your word processor generates it for you. Since Watts made these comments, XML has become the native file format of StarOffice and is also supported by Microsoft Office 11. Yes, the related standards in the XML family, such as XSLT, do involve a steep learning curve. But the application of modularisation does not require any XML standard except XML itself. Modularisation will enable tool providers and open source software developers to create new and powerful tools which will aid both the "haves" and the "have nots". Coercive purism for the sake of academic coherence is unnecessary. But modularisation need not be coercive, and it is not purist.

Pragmatism

The HTML WG is reported to be unconcerned about the lack of backward compatibility in XHTML 2.0 (W3C, 2003a):

Thanks to XML and stylesheets, . . . strict element-wise backwards compatibility is no longer necessary.

The assumption is that XHTML will only be used for new documents and applications; no one will bother to convert legacy data or update existing applications. However, this makes for a very uncomfortable transition phase. Old content that needs one set of applications exists in parallel with new content that needs another set. Thus, device code becomes longer and more complex, a major issue for mobile devices.

XHTML 2 is a "member of the XHTML Family of markup languages" (W3C, 2003a), but it is not, and is not intended to be, the same language as XHTML 1.1. At the same time, "XHTML 2 [has the] same name as the language it is not replacing, the same MIME type and the next major version number" (Pilgrim, 2003). This makes it almost impossible to tell what version of XHTML a document is written in until it has been downloaded and parsed. It also makes it impossible to enable HTTP content negotiation, in which different versions of the same document are supplied depending on the capabilities of the device.

Breaking backward compatibility hinders the adoption of new standards. The marketplace is largely pragmatic. If it perceives that it is getting something new, it may move towards it. But if the perception is that it is losing something important, this is much less likely. The perception with XHTML 2.0 is that some of the Web's cornerstone features are being lost. But the Web has never been solely defined by W3C standards; it is also defined by what is implemented. And all previous versions of HTML have been implemented in a way that is more pragmatic than the standard itself.

The marketplace will look for a pragmatic solution. For XHTML 2.0, this could include support for some of the features that have been dropped in the move from XHTML 1.1. The W3C could pre-empt unrest by proposing a transitional version of XHTML 2.0 which reintroduces some features of XHTML 1.1 whilst retaining the advantages of modularisation and well-formed XML.

The purist quest to remove all presentational aspects from HTML has been a distraction. XHTML 2.0 is in danger of being represented, not as the version in which modularisation became mature and hence opened the Web to innovation, but as the version that got rid of the img tag.

References

Bray, T. (2003), "Web architectures and XML", Trinity College, Dublin, 19 November.
Chase, N. (2002), "The Web's future: XHTML 2.0", IBM developerWorks, Web Architecture XML Zone, September, available at: www-106.ibm.com/developerworks/web/library/wa-xhtml/
Dyer, R. (2003), "The XML.com interview: Steven Pemberton", 21 May, available at: www.xml.com/pub/a/2003/05/21/pemberton.html
Glazman, D. (2002), "Comments on 2002-12-12 XHTML 2.0 WD", 18 December, available at: http://lists.w3.org/Archives/Public/www-html/2002Dec/0113.html
Kovar, J.F. (2003), "Gartner: Longhorn delays will affect Windows upgrades", InformationWeek, 11 December, available at: www.informationweek.com/story/showArticle.jhtml?articleID=16700197
Pilgrim, M. (2003), "The vanishing image: XHTML 2 migration issues", XML.com, 2 July, available at: www.xml.com/pub/a/2003/07/02/dive.html
W3C (2000a), XHTML 1.0: The Extensible HyperText Markup Language, 2nd ed., W3C Recommendation, 26 January 2000 (revised 1 August 2002), available at: www.w3.org/TR/xhtml1/
W3C (2000b), XHTML Basic, W3C Recommendation, 19 December, available at: www.w3.org/TR/xhtml-basic/
W3C (2001a), Modularization of XHTML, W3C Recommendation, 10 April, available at: www.w3.org/TR/xhtml-modularization/
W3C (2001b), XHTML 1.1 – Module-based XHTML, W3C Recommendation, 31 May, available at: www.w3.org/TR/xhtml11/
W3C (2003a), XHTML 2.0, W3C Working Draft, 6 May, available at: www.w3.org/TR/xhtml2/
W3C (2003b), "FAQ on US Patent 5,838,906 and the W3C", available at: www.w3.org/2003/09/public-faq
Watts, A. (2000), "HTML 4.0, XHTML 1.0 and XHTML 1.x – splitting the Web?", 4 October, available at: http://groups.yahoo.com/group/XHTML-L/message/1050
Wusteman, J. (2003), "Web forms: the next generation", Library Hi Tech, Vol. 21 No. 3.


On copyright

Copyright in a networked world: ethics and infringement

Michael Seadle

The author

Michael Seadle is Editor of Library Hi Tech and Copyright Librarian and Assistant Director for Systems and Digital Services, Michigan State University, East Lansing, Michigan, USA.

Keywords

Intellectual property, Copyright law, Statute law, Ethics

Abstract

The statutes themselves are not the only basis for deciding whether an intellectual property rights infringement has occurred. Ethical judgments can also influence judicial rulings. This column looks at three issues of intellectual property ethics: the nature of the property, written guidelines for behavior, and enforcement mechanisms. For most active infringers digital property seems unreal, the rules ambiguous, and the enforcement statistically unlikely.

Received 21 November 2003; revised 26 November 2003; accepted 5 December 2003.

Property is theft. (Proudhon, 1994)

The property which every man has in his own labour . . . is the most sacred and inviolable. (Smith, 1976)

The categorical imperative . . . is this: act according to a maxim that can at the same time be valid as a universal law. (Kant, 1965)

Introduction

What are the ethical rules for intellectual property? Is claiming exclusive ownership in creative works a sacred and inviolable right, or is it the theft of property that ought to be freely available? Should the privileges of copyright holders have the force of a universal law, or should that law give precedence to the rights of those using works for educational and scholarly purposes?

The statutes themselves are not the only basis for deciding whether an intellectual property rights infringement has occurred. Ethical judgments can influence judicial rulings about whether or not an infringement was sufficiently "willful" to add extra penalties. Ethical judgments about intellectual property can also influence jobs, grades, and the respect that one professional has for another. Plagiarism, for example, is a universally condemned form of intellectual property theft within academic society. Its rules allow no fair use exclusion for short passages like those quoted (and footnoted) above, and its duration has no limit. Authors like Proudhon, Smith, and Kant retain their rights to attribution well beyond the statutory life-plus-70 rule. Faculty members can lose jobs because of plagiarism, and plagiarism is an unexceptional reason for failing a student.

Ethical rules are socially constructed norms that can vary from segment to segment of contemporary society. Presumably the students who shared music using Napster and now KaZaA do not share the sentiments of Jack Valenti, president of the Motion Picture Association of America, who argued that:


. . . unless we control piracy, one of America's greatest creative and business enterprises could be in incredible peril (Green, 2003).

The students are not criminals who would take a physical CD from a music store. In their ethical world view, sharing is different.

This column will look at three issues of intellectual property ethics: (1) the nature of the property; (2) written guidelines for behavior; and (3) enforcement mechanisms.

Types of theft

Both law and society have well-developed ethical standards regarding physical objects. While the law has attempted to extend the same rules to digital versions, it remains difficult for many to believe that copying a stream of bits that in no way lowers the quality or availability of the original constitutes theft, since nothing has visibly been taken from the owner. In other words, theft for many means removal.

Ethically legitimate instances of copying are so ubiquitous that they seem invisible. Every art student begins by copying the drawings, paintings, or sculptures of others. Every musician begins by trying to imitate the sounds of teachers and performers. Every programmer begins by copying simple code, such as the famous "Hello World". Much of this copying is imperfect, not from an attempt to invest the new work with originality, but from simple inability to extract (for example) exactly the same sound from a violin as Itzhak Perlman. The copying is for educational purposes or personal use, not commercial gain, but the copy might well be shared via a music performance or, if a work of art, given away to friends. The owner of the original loses nothing through any of these copies.

The Sony decision reinforced this ethical rule by making it legitimate to make private copies of television shows for later use, repeated use, or even sharing with friends:

In summary, the record and findings of the District Court lead us to two conclusions. First, Sony demonstrated a significant likelihood that substantial numbers of copyright holders who license their works for broadcast on free television would not object to having their broadcasts time-shifted by private viewers. And second, respondents failed to demonstrate that time-shifting would cause any likelihood of non-minimal harm to the potential market for, or the value of, their copyrighted works. The Betamax is, therefore, capable of substantial non-infringing uses. Sony's sale of such equipment to the general public does not constitute contributory infringement of respondents' copyrights[1].

The broadcaster lost no physical copy and no ability to rebroadcast and, with reasonably well-functioning video-recorders, the copy resembled the original closely enough that almost no one cared whether they saw a live broadcast or a taped version some days or hours later. TiVo technology offers a situation even closer to copying music, because the recording takes place in digital form on a hard disk, not as an analog stream on easily-stretched tape, where the deterioration with each generation of copying becomes evident. In fact broadcasters themselves use tools similar to TiVo to queue and manage programming. If saving broadcasts on TiVo is ethical, and if showing a saved program to a group of friends is legitimate, the boundary between that and sharing a copy with a friend not in the room at the time quickly grows indistinct.

In the TV series Star Trek: The Next Generation the computer on the Enterprise routinely produced plates of food. While the food lacked copyright protection, the design of the plates and cups would meet the originality minima for US copyright protection today. Perhaps Star Fleet had an unmentioned site-license for copying that design, but the implication seemed to be that anyone could make as many copies of the dishes as they wanted without legal or ethical consequences. Nor was it clear what limits the Enterprise's computer had on copying. It could probably make a paper copy of a Klingon book or a CD of a hit Vulcan song. The apparently unrestricted copying by the Enterprise computer served as subliminal propaganda for a generation growing up in a network-connected world.


Written guidelines


Many colleges establish written ethical rules and require students to sign a social contract that specifies rules of conduct. Plagiarism is traditionally the form of intellectual property violation that gets the most attention. At George Mason University (1994), for example, the Honor Code defines plagiarism as:
1. Presenting as one's own the words, the work, or the opinions of someone else without proper acknowledgement.
2. Borrowing the sequence of ideas, the arrangement of material, or the pattern of thought of someone else without proper acknowledgement.

As the risk of liability grows, institutions may also begin to detail ethical rules for copyright. One class at George Mason has added this extension to the honor code that deals specifically with copyright:

I understand that I need written permission from the copyright holder for any graphics, photos, animations, or other materials that I put on my Web site. I understand that I must attribute the source of all graphics and that citing the source of a graphic does not absolve me from the copyright rules (George Mason University, 2003).

The Simmons Graduate School of Library and Information Science (2003) also has an explicit reference to copyright in its honor code:

Violating copyright law (Title 17, United States Code, Section 101, et seq.). Students should pay particular attention to section 107, which allows photocopying of copyrighted materials under the guidelines of "fair use" and to section 108, which describes some of the photocopying regulations in academic libraries.

These are just a few examples that appeared at the top of a Google search for "honor code" and "copyright". Including copyright explicitly in honor codes is still relatively rare. Because no consensus exists on how the copyright rules should explain what is, and is not, fair use, institutions within the USA tend to fall back on the language of the fair use statute itself. The preamble to Section 107 suggests a very broad scope for the ethical use of copyright-protected materials:

. . . the fair use of a copyrighted work, including such use by reproduction in copies or phonorecords or by any other means specified by that section, for purposes such as criticism, comment, news reporting, teaching (including multiple copies for classroom use), scholarship, or research, is not an infringement of copyright (Cornell University, 2003).

Students, faculty, and others who read this preamble outside the context of the statute's four tests that form the basis of most case law generally feel fully justified in any use that fits into those broad exclusions. This includes putting protected materials on a freely accessible Web site, as long as the intention arguably falls within the list, regardless of whether that use has consequences for the rights holder.

Consequences do, however, matter for ethics, and the fourth test in the statute offers a way to measure the consequences of fair use. It says that, for a use to be fair, consideration must be given to:

. . . the effect of the use upon the potential market for or value of the copyrighted work (Cornell University, 2003).

The judges in the Michigan Documents decision held that this test "is at least primus inter pares" and weighed it first in deciding the case[2]. As an ethical guideline, the fourth test appears to empower the use of works either that are out of print and have little apparent potential for being republished, or that have never been sold and seem unlikely ever to be sold. Since a work with no known market value cannot suffer a loss, no one should be losing money.

But proving that no potential market exists can be hard. An angry rights holder could argue that new uses such as online databases have created vast new untapped markets for unsold, unpublished, and out-of-print works, and a court could (potentially) agree. The fact is that a reasonable fair use in ethical terms could still be an infringement in strict legal judgment. Whether the risk is worth taking depends on enforcement.

Enforcement mechanisms

The enforcement of copyright law is even less certain than being caught for speeding on the highway, since there are no official, state-authorized copyright police out looking for violators. The law leaves enforcement to the rights owners or those owners' legal representatives. Lately some rights owner groups in the recording industry have begun a kind of vigilante process to find violators:

Last spring industry groups sued individuals and specific institutions in an attempt to force administrators to provide the names of college students who had used campus networks to download or distribute digital content – both music and movies (Green, 2003).

Only well-organized groups can afford to hire staff to undertake this kind of action to find instances of infringement, and to bring suit in a Federal Court. While it can be difficult for individual consumers to pay the court costs to defend themselves, they have some advantages when they can reasonably claim a noncommercial fair use, because (according to the Appeals Court in the Michigan Documents case) the:

. . . burden of proof as to market effect rests with the copyright holder if the challenged use is of a "non-commercial" nature[2].

While attempts at legal enforcement have captured most of the headlines, it is peer pressure based on ethical judgments that enables or discourages most intellectual property infringement. Peer pressure varies significantly depending on whether the use is educational or purely personal.

At most educational institutions, faculty favor a very aggressive form of fair use over more risk-averse administrative guidelines. Any educational or scholarly use seems like justifiable fair use, especially when the prices that rights owners charge for access seem outrageously high. Administrative pressures have made faculty more aware of copyright issues, but two problems frustrate efforts to increase compliance. The first is the difficulty of getting permission for an educational or scholarly use. The Copyright Clearance Center can help with standard, currently published materials from major publishing houses, for a cost that is far from trivial. Non-standard materials require more research, patience, and time to track down than most people possess. The second is the lack of an acceptable economic model for paying the royalties. Universities already license large numbers of electronic resources (over 30,000 titles at Michigan State University, for example), and few funds remain to cover other items. Paper-based course packs have long shifted the burden of paying royalties to students, but faculty balk at charging for something they could legally have placed, in paper form, on library reserve – and students balk equally at paying.

Within a password-protected course management system like Blackboard or Angel, the potential damage to any rights holder seems small, and the likelihood of the infringement being discovered seems even smaller. In other words, enforcement tends to be limited to a small group of colleagues and students with similar viewpoints and economic motives, and to personal ethical judgments. Harper's (2001) Copyright Crash Course warns that penalties for willful infringement can reach $150,000 per infringement, but the reality is that no one knows of any cases where faculty members or students were convicted of infringement for a genuine educational use involving no monetary gain. What happens inside the password-protected space of a course management system seems much like what happens in the privacy of one's home, where Americans feel anything goes.

Private use of entertainment works is different. University administrations have made strenuous efforts to make students aware of penalties for music and video file sharing, and have cut off network access to first-time violators, which seriously disadvantages them in their course work. Other penalties can include dismissal from the university. Many students remain unconvinced that file sharing is an ethical (or legal) problem, and few educational institutions want, or have mechanisms, to invade the privacy of network use to find out what students are doing with their own private machines. The likelihood of being caught has increased dramatically recently, but still remains fairly low.

The number of people with Internet access outside the academic community has grown enormously, and ways of influencing their ethical judgments and altering their behavior are harder to find. University-based infringers are currently a tempting target for rights holders but, if speeding on the highway is any indication, public behavior will change little as long as people believe that file sharing does no real harm, except to an industry famous for its wealth.


Conclusion

At best ethical judgments provide rules for navigating between the lee shore of literal copyright compliance and the wild seas of unrestrained fair use. "Sail near the shore" is the universal law of those making money from intellectual property, and "brave the seas" is the mantra of intellectual property consumers. Each side can invoke Kant's categorical imperative in its own defense. The challenge for libraries and academic communities is to find a middle way or, in Hegelian terms, a synthesis.

In social terms the debate has less to do with the law than with ethical understandings about the nature of intellectual property, the written rules, and the social mechanisms for enforcement. For most active infringers digital property seems unreal, the rules ambiguous, and the enforcement statistically unlikely. Such attitudes are difficult to change. Since enforcement is expensive and its invasiveness contrary to the principles of US society, those wanting change may find it more cost-effective:
• to concentrate on altering ideas about the real nature of intellectual property; and
• to offer balanced rules-of-thumb that do more than merely repeat an already ambiguous legal code.

Notes

1 Sony Corporation of America et al. v. Universal City Studios (17 January 1984), United States Supreme Court.
2 Princeton University Press v. Michigan Document Services (8 November 1996), US Court of Appeals for the Sixth Circuit.

References

Cornell University (2003), United States Code, Title 17, Chapter 1, Section 107, available at: www4.law.cornell.edu/uscode/17/107.html (accessed November).
George Mason University (1994), Faculty Handbook, Appendix D, available at: www.gmu.edu/facstaff/handbook/aD.html (accessed November).
George Mason University (2003), Honor Code and Copyright Pledge, available at: http://classweb.gmu.edu/nclc348/f03/pledge.html (accessed November).
Green, K. (2003), "There's too much hyperbole in the download debate", Chronicle of Higher Education, available at: http://chronicle.com/weekly/v50/i13/13b01601.htm (accessed November).
Harper, G.K. (2001), Copyright Crash Course, available at: www.utsystem.edu/OGC/intellectualproperty/cprtindx.htm#top (accessed November).
Kant, I. (1965), The Metaphysical Elements of Justice, Library of the Liberal Arts, New York, NY, p. 26 (originally published in 1797).
Proudhon, P.J. (1994), Qu'est-ce que la Propriété?, tr. by Kelley, D.R. and Smith, B.G., Cambridge University Press, New York, NY (originally published in 1840).
Simmons Graduate School of Library and Information Science (2003), Honor Code, available at: www.simmons.edu/gslis/students/honorcode.html (accessed November).
Smith, A. (1976), An Inquiry into the Nature and Causes of the Wealth of Nations, Book 1, Cannan, E. (Ed.), University of Chicago Press, Chicago, IL, p. 136 (originally published in 1776).


Note from the publisher

During 2003 Emerald developed its corporate publishing philosophy. We did this through discussion with readers, contributors and editors and we would like to share it with you. We believe that our approach to quality makes us different and unique amongst scholarly publishers. It is based on six core principles, which together form our distinctive philosophy.

1 We put quality at the centre of our approach to scholarly publishing

All papers published by Emerald go through a quality-assured peer review system; in many cases this takes the form of double-blind peer review. All papers published by Emerald are expected to make, in some way, an explicit original contribution to the existing body of knowledge. All papers published by Emerald are accessible to a wide range of students, scholars and practitioners in the fields in which we publish. All papers published by Emerald are beneficial in some way, to researchers, practitioners, or both:
• In 2001 we were audited and certified as "Committed to Excellence" following a European Foundation for Quality Management self-assessment exercise.
• We retained our status as an ISO 9000-certified organisation, and our Investors in People (IIP) certification.
• More than 30 Emerald journals are listed in the ISI Citation Index.

2 Continuous improvement of reader, author and customer experience

We continue to invest in enabling technology to increase efficiency and effectiveness in content provision, customer service and management. We benchmark against others and against our own standards. We are as clear as possible in our policies, measures, targets and achievements and we do not hide shortfalls, but confront and learn from them:
• Emerald papers go through a further post-publication "review", which assesses them on readability, originality, and implications for further research and practice. We publish this information, and it can be used as a search criterion, on the Emerald database.
• In 2002 we were judged as providing Best Customer Support by the scholarly library publication The Charleston Advisor.
• We provide high levels of dissemination of our authors' work – nearly one million papers per month are downloaded and read by subscribers to the Emerald online portfolio.
• In 2004 we will be introducing an online submission and peer review system, which will speed up the publication process.

3 Internationality

We operate in a transnational world of scholarly ideas and we believe that this should be reflected within our publications. Working with our authors, we set targets for international representation of authors and editorial team members, and measure against them:
• In the past six months we have published more than 50 themed issues with a specific international focus.
• In the first half of 2003, papers from 60 different countries were published.


4 An interdisciplinary approach

We set targets, and ask, for papers and special issues on interdisciplinary approaches, and new/emergent themes. This gives us better, stronger, and more vibrant journals, and a clear leadership position in our industry:
• In the first half of 2003 we published more than 20 themed issues dealing with interdisciplinary approaches to a subject or industry.
• We encourage themed issues on leading edge and innovative research topics, and in the past six months published 35 such issues.

5 Supporting scholarly research: the Literati Club

We help remove the barriers to publication. We conduct workshops for researchers on publishing issues. We provide help and advice to new researchers. We offer a service for authors whose first language is not English. Our staff regularly present papers at conferences on scholarly publishing themes:
• Our scholarly community Web site, the Literati Club, disseminates information about how to write for publication more successfully – we seek to make the process more transparent.
• The Emerald Research Register, an online forum for the circulation of pre-publication information, is designed to help researchers gain advanced recognition among their peers by publicising their research at the earliest opportunity.
• Each year, we distribute grants to researchers working on improving the scholarly publishing dissemination process, and to encourage scholarship in the developing world.
• We conducted research workshops at 11 universities and conferences, had papers accepted at nine academic and other conferences, and supported six academic conferences world-wide.

6 Integration of theory and practice

We ask editors and review board members to focus where applicable on application, and beneficial implication for practice. We do so because this gives a clear message to our core supplier and consumer markets – the applied researcher, the reflective practitioner, faculty and students:
• All our journals will publish a majority of papers that have a direct application to the world of work.
• More than 1,000 university libraries world-wide subscribe to the Emerald portfolio, including 97 per cent of the Financial Times Top 100 Business Schools.

John Peters
Director of Author Relations
Emerald
