
ISBN 1-84544-876-6
ISSN 0737-8831
Volume 23 Number 4 2005

Library Hi Tech
Open source software
Theme Editors: Scott P. Muir and Mark Leggott

www.emeraldinsight.com

CONTENTS

Access this journal online  463
Editorial advisory board  464

GUEST EDITORIAL
An introduction to the open source software issue
Scott P. Muir  465

THEME ARTICLES
Experiments in academic social bookmarking with Unalog
Daniel Chudnov, Jeffrey Barnett, Raman Prasad and Matthew Wilcox  469

Archimède: a Canadian solution for institutional repository
Rida Benjelloun  481

dbWiz: open source federated searching for academic libraries
Calvin Mah and Kevin Stranack  490

Open Journal Systems: an example of open source software for journal management and publishing
John Willinsky  504

Using open source to provide remote patron authentication
Jackie Wrosch  520

Creating and managing XML with open source software
Eric Lease Morgan  526

Creating digital library collections with Greenstone
Ian H. Witten and David Bainbridge  541

OTHER ARTICLES
Taking pro-action: a survey of potential users before the availability of wireless access and the implementation of a wireless notebook computer lending program in an academic library
Hugh A. Holden and Margaret Deng  561

A statewide metasearch service using OAI-PMH and Z39.50
Joanne Kaczmarek and Chew Chiat Naun  576

Similar interest clustering and partial back-propagation-based recommendation in digital library
Kai Gao, Yong-Cheng Wang and Zhi-Qi Wang  587

Lessons learned from analyzing library database usage data
Karen A. Coombs  598

Using screen capture software for web site usability and redesign buy-in
Susan Goodwin  610

Awards for Excellence  622
Note from the publisher  623

Access this journal electronically
The current and past volumes of this journal are available at: www.emeraldinsight.com/0737-8831.htm
You can also search more than 100 additional Emerald journals in Emerald Fulltext (www.emeraldinsight.com/ft) and Emerald Management Xtra (www.emeraldinsight.com/emx). See the page following the contents for full details of what your access includes.

www.emeraldinsight.com/lht.htm
As a subscriber to this journal, you can benefit from instant, electronic access to this title via Emerald Fulltext and Emerald Management Xtra. Your access includes a variety of features that increase the value of your journal subscription.

How to access this journal electronically
To benefit from electronic access to this journal you first need to register via the internet. Registration is simple and full instructions are available online at www.emeraldinsight.com/admin. Once registration is completed, your institution will have instant access to all articles through the journal's Table of Contents page at www.emeraldinsight.com/0737-8831.htm. More information about the journal is also available at www.emeraldinsight.com/lht.htm. Our liberal institution-wide licence allows everyone within your institution to access your journal electronically, making your subscription more cost-effective. Our web site has been designed to provide you with a comprehensive, simple system that needs only minimum administration. Access is available via IP authentication or username and password.

Key features of Emerald electronic journals
Automatic permission to make up to 25 copies of individual articles. This facility can be used for training purposes, course notes, seminars etc. This only applies to articles of which Emerald owns copyright. For further details visit www.emeraldinsight.com/copyright
Online publishing and archiving. As well as current volumes of the journal, you can also gain access to past volumes on the internet via Emerald Fulltext and Emerald Management Xtra. You can browse or search these databases for relevant articles.
Key readings. This feature provides abstracts of related articles chosen by the journal editor, selected to provide readers with current awareness of interesting articles from other publications in the field.
Reference linking. Direct links from the journal article references to abstracts of the most influential articles cited. Where possible, this link is to the full text of the article.
E-mail an article. Allows users to e-mail links to relevant and interesting articles to another computer for later use, reference or printing purposes.
Emerald structured abstracts. New for 2005, Emerald structured abstracts provide consistent, clear and informative summaries of the content of the articles, allowing faster evaluation of papers.

Additional complimentary services available
Your access includes a variety of features that add to the functionality and value of your journal subscription:
E-mail alert services. These services allow you to be kept up to date with the latest additions to the journal via e-mail, as soon as new material enters the database. Further information about the services available can be found at www.emeraldinsight.com/alerts
Connections. An online meeting place for the research community where researchers present their own work and interests and seek other researchers for future projects. Register yourself or search our database of researchers at www.emeraldinsight.com/connections
User services. Comprehensive librarian and user toolkits have been created to help you get the most from your journal subscription. For further information about what is available visit www.emeraldinsight.com/usagetoolkit

Choice of access
Electronic access to this journal is available via a number of channels. Our web site www.emeraldinsight.com is the recommended means of electronic access, as it provides fully searchable and value-added access to the complete content of the journal. However, you can also access and search the article content of this journal through the following journal delivery services:
EBSCOHost Electronic Journals Service: ejournals.ebsco.com
Informatics J-Gate: www.j-gate.informindia.co.in
Ingenta: www.ingenta.com
Minerva Electronic Online Services: www.minerva.at
OCLC FirstSearch: www.oclc.org/firstsearch
SilverLinker: www.ovid.com
SwetsWise: www.swetswise.com

Emerald Customer Support
For customer support and technical help contact:
E-mail: [email protected]
Web: www.emeraldinsight.com/customercharter
Tel: +44 (0) 1274 785278
Fax: +44 (0) 1274 785204


EDITORIAL ADVISORY BOARD

Morell D. Boone, Professor, Interdisciplinary Technology, Eastern Michigan University, Ypsilanti, MI, USA
Michael Buckland, University of California, Berkeley, CA, USA
May Chang, North Carolina State University, Raleigh, North Carolina, USA
Susan Cleyle, Associate University Librarian, QEII Library, Memorial University of Newfoundland, Canada
Timothy W. Cole, Mathematics Librarian and Associate Professor of Library Administration, University of Illinois at Urbana-Champaign, USA
Dr Colin Darch, Centre for Information Literacy, University of Cape Town, South Africa
Professor G.E. Gorman, School of Communications & Information Management, Victoria University of Wellington, New Zealand
Charles Hildreth, Associate Professor, Long Island University, Brookville, NY, USA
Larry A. Kroah, Director, Trenton Free Public Library, NJ, USA
Karen Markey, University of Michigan, Ann Arbor, MI, USA
Joe Matthews, EOS International, Carlsbad, CA, USA
Steve O’Connor, Chief Executive Officer, Caval Collaborative Solutions, Bundoora, Victoria, Australia
Ed Roberts, Head of Information Systems, University of Washington Health Sciences Libraries, USA


Ilene Rockman, Manager, Information Competence Initiative, The California State University, Hayward, CA, USA
Professor Jennifer Rowley, Lecturer, School for Business and Regional Development, University of Wales, Bangor, UK
James Rush, Consultant, PA, USA
Dr Hildegard Schaffler, Head of Serials and Electronic Media, Bavarian State Library, Munich, Germany
Axel Schmetzke, Librarian/Assistant Professor, University of Wisconsin-Stevens Point, WI, USA
Steven Sowards, Head, Main Library Reference, Michigan State University, MI, USA
Dr Judith Wusteman, Department of Library and Information Studies, University College Dublin, Ireland
Sandra Yee, Dean of University Libraries, David Adamany Undergraduate Library, Wayne State University, Detroit, MI, USA


GUEST EDITORIAL


An introduction to the open source software issue
Scott P. Muir
Eastern Michigan University, Ypsilanti, Michigan, USA

Received 14 July 2005. Revised 1 September 2005. Accepted 2 September 2005.

Abstract
Purpose – To introduce the Library Hi Tech theme issue on open source software.
Design/methodology/approach – At the Hackfest before Access 2004 (a Canadian library technology conference) several people started to code open source software (OSS) solutions.
Findings – Some groups estimated they were close to 25 percent done – in just a few days, while attending Access Conference sessions all day.
Originality/value – Developments in the OSS library community should encourage you to experiment with these applications, or maybe even develop your own.
Keywords Libraries, Computer software
Paper type Viewpoint

Every so often one needs to take a slightly different path and see how the other half lives and works. That is what I did when I attended Access 2004, a Canadian library technology conference. I had heard good things about this conference, and adding to my interest was the preconference on institutional repositories, which tied in with a project I was to address in my current job. On the same day as the preconference was something called a Hackfest, which, unfortunately, I had to miss. Little did I realize how fascinating the Hackfest could be!

At a Hackfest, a group of people get together and review a list of technology problems or needs that have been submitted by libraries. The participants select a problem they want to work on and break out into small groups, where they spend the remainder of the day focusing on solutions. I was stunned at the end of the conference to hear the reports from the Hackfest and to see that several people had actually begun to code an open source software (OSS) solution to their topic. Some groups estimated they were close to 25 percent done, in just a few days, while attending Access Conference sessions all day. Additionally, throughout the conference I heard the names of other programs, such as CUFTS/GODOT, that had been developed and were in use in Canadian libraries, some of them outcomes of previous Hackfests. An article about the Hackfest is available online (http://wiki.uwinnipeg.ca/index.php/LoomWareWiki:Hackfest).

I was awed at what I saw and heard, and I wondered why I did not know what was going on in the libraries of our neighbors to the north. I was also fascinated that, instead of griping about a problem or complaining about a vendor not delivering the ideal solution, these librarians and programmers had set about solving the problem themselves. This led to the idea that an issue of Library Hi Tech describing what is happening with OSS applications in universities and libraries (particularly Canadian institutions) was worth sharing.

In addition to developments in Canada, we also have examples from the USA and one from New Zealand.

There is a wealth of information on OSS applications in libraries and in other settings, so I will cover it only briefly. The actual beginning of OSS is hard to define, but it is at least 20 years old (Cervone, 2003). A key characteristic of OSS is access to the actual source code, often made available under the GNU General Public License, which allows programmers to alter the software and redistribute it, with the requirement that they make these changes available to other developers. The licenses associated with OSS prevent commercial entities from making these products proprietary (available at: www.gnu.org/copyleft/gpl.html). OSS is sometimes called "free" software, but as one author explains, "free" is used as in the phrase "free speech" (a right we covet), rather than the phrase "free beer" (always too good to be true) or "free kitten" (which sounds good, but has a high overhead) (Phipps, 2004). The GNU web site offers the following definition of free software:
• The freedom to run the program, for any purpose (freedom 0).
• The freedom to study how the program works, and adapt it to your needs (freedom 1). Access to the source code is a precondition for this.
• The freedom to redistribute copies so you can help your neighbor (freedom 2).
• The freedom to improve the program, and release your improvements to the public, so that the whole community benefits (freedom 3). Access to the source code is a precondition for this (www.gnu.org/philosophy/free-sw.html).

One of my earliest experiences with an OSS-like environment was as a customer of NOTIS. While it does not meet many of the criteria of today's OSS environments, there are some similarities. Because of the early history of Northwestern University giving away the software, including the source code, NOTIS Inc. continued the practice of supplying the source code when the software was purchased. When we as customers encountered a feature that did not work the way we liked, many of us simply rewrote the code to get the desired functionality. People shared their locally developed code with their fellow customers and assisted with its implementation at other sites. In a few cases, NOTIS Inc. contracted with customers to develop specific enhancements to the code. However, unlike today's OSS, NOTIS rarely incorporated its customers' code into the base program. This created the challenge of reintroducing the local coding changes with each new release and testing to see that they still worked.

So why has OSS taken off now? Perhaps one of the key reasons is that the internet allows for the rapid dissemination of such programs, communication about the products, and easy sharing through various internet communication tools. A recent article in First Monday states that "Software is rapidly becoming one of the most fundamental building blocks of human interaction and activity" (Klang, 2005), and OSS ties in nicely with the not-for-profit concepts of most school, academic, and public libraries.

There are a number of reasons why a library might choose to develop or implement an OSS application. One of the most obvious would be little or no upfront cost. Additionally, there would be little or no maintenance money paid to a vendor. As in the NOTIS environment, with the proper expertise a site could readily modify the code to meet local practice or requirements.

OSS often provides greater compliance with standards, as commercial vendors may modify those standards to better market other proprietary products that they distribute. In some cases an OSS product can develop faster because multiple sites work on various enhancements without the supervisory overhead of a vendor hierarchy. Local code developers are also closer to the end users and work directly with them to enhance and improve the product. Debugging and troubleshooting are spread across a large number of sites, providing real-world problems to test against, unlike most vendor settings.

Some of the possible negatives associated with OSS applications raise the question of who officially provides support: after all, there is no vendor to complain to, and this could make it harder to ensure that improvements and fixes are made. OSS may require more technological sophistication to install and support than commercial software. Without someone clearly directing the development, there may be duplication of effort if multiple sites are working on the same problem. There is an assumption that sites using the software will contribute back to the code base, and this might be especially difficult for sites with fewer staff resources or less expertise. Finally, like the free kitten with its food and veterinary bills, there are hidden costs, since your staff is now spending time supporting, tailoring, and enhancing the software. For more information, I direct you to Brenda Chawner's extensive bibliography of OSS and libraries (www.vuw.ac.nz/staff/brenda_chawner/biblio.html).

As mentioned earlier in this introduction, this issue of Library Hi Tech focuses on some of the many OSS applications that have been, or are still being, developed in the library environment. One of the most recent trends in libraries has been social bookmarking, enabling users to share information more readily about web sites and articles; the article on Yale's Unalog gives insight into the work and progress on that product. Another new interest in libraries is federated searching, building on the technology of OpenURL link resolvers; the article on dbWiz, developed by Simon Fraser University Library in British Columbia, describes the process of building an open source federated searching tool. Access to databases often requires authentication of some sort, especially for remote users; Jackie Wrosch's article details what one Detroit-based consortium did when it decided that the vendor's product was not adequate for its needs. Libraries of many types are digitizing parts of their collections, and college libraries are working with their campuses to archive and distribute locally generated information. The article on Greenstone, developed by the University of Waikato in New Zealand, offers information on that particular tool. There is also an article on an institutional repository system, Archimède, developed for the French language by Université Laval in Québec. A third companion piece details Open Journal Systems, developed at the University of British Columbia, which manages the submission and acceptance of scholarly articles for publication.

I hope you enjoy this issue on the various developments in the OSS library community. Perhaps it will encourage you to experiment with one of these applications, or maybe even develop your own.


References

Cervone, F. (2003), "Open source software: what can it do for your library?", The Electronic Library, Vol. 21 No. 6, pp. 526-7.

Klang, M. (2005), "Free software and open source: the freedom debate and its consequences", First Monday, Vol. 10 No. 3, available at: www.firstmonday.org/issues/issue10_3/klang/index.html (accessed March 9, 2005).

Phipps, S. (2004), "Free speech, free beer, and free software", CNET News.com, available at: http://news.com.com/2010-1071-954384.html?tag=fd_nc_1 (accessed May 31, 2005).


THEME ARTICLE

Experiments in academic social bookmarking with Unalog

Daniel Chudnov, Yale Center for Medical Informatics, Yale University School of Medicine, New Haven, Connecticut, USA
Jeffrey Barnett, Integrated Library Technical Services, Yale University Library, New Haven, Connecticut, USA
Raman Prasad, Manuscripts and Archives, Yale University Library, New Haven, Connecticut, USA
Matthew Wilcox, Yale School of Public Health, New Haven, Connecticut, USA

Received 13 June 2005. Revised 17 July 2005. Accepted 26 July 2005.

Abstract
Purpose – The purpose of this paper is to introduce the Unalog software system, a free and open source toolkit for social bookmarking in academic environments.
Design/methodology/approach – The history, objectives, features, and technical design of Unalog are presented, along with a discussion of planned enhancements.
Findings – The Unalog system has been very useful for information sharing among members of the digital library community and a group of beta testers at Yale University, leading its developers to plan several new features and to capitalize on opportunities for integration with other campus systems.
Originality/value – This paper describes a freely available toolkit, which can be used to provide new services through libraries to academic communities, and how those new services might be enhanced by merging the potential they offer for easier information sharing with long-standing practices of librarianship.
Keywords Internet, Information retrieval, Library and information networks, Information management
Paper type Viewpoint

Background
At the Access 2003 pre-conference Hackfest in Vancouver, British Columbia, one Hackfest project group examined the question: "if we could offer a personalized tracker of OpenURL requests, what sorts of services could we build on top of that?" (Chudnov, 2003). The group imagined a simple screen attached to OpenURL resolver systems that would enable users to quickly "log" their linked articles, to save bibliographic metadata about the article, and optionally to share the reference with other users or groups of colleagues. The group mocked up screenshots of what that screen might look like, and of what resulting pages showing long lists of references saved by a variety of users might look like as well.

Library Hi Tech, Vol. 23 No. 4, 2005, pp. 469-480. © Emerald Group Publishing Limited, 0737-8831. DOI 10.1108/07378830510636274


After the conference, several Access 2003 attendees contacted project members to indicate their interest in such a system, and a quick working prototype that simplified the sharing of arbitrary web links through a weblog was developed and used by several members of the code4lib online community in November 2003. The prototype was quickly outgrown, and the Unalog system was developed as an open source, free software project to serve larger communities that might want to adopt such a system. Around the same time, and independently, the developers of the del.icio.us and Furl social bookmarking systems were working out roughly the same ideas and creating systems of their own. Less than two years later, venture capital now funds del.icio.us, a large internet media concern has bought Furl, and a number of similar systems have arisen (such as CiteULike and Connotea) which also serve the goals of saving information about articles and web pages online, and sharing that information in an easy way that supports large, distributed communities (Hammond et al., 2005).

The Unalog system is now in use at multiple sites, including a customized instance at Yale University, and is partially supported by internal funds at that institution to enhance it to support a wider range of metadata types and to integrate it with campus courseware solutions. This paper describes how Unalog works, its features, its technical design, and immediate plans for expanding the scope of metadata management in Unalog.

Using Unalog
Unalog is a software system that provides a simple way for anyone on the web to record and share information about what they are reading. Visitors to a Unalog site find a current awareness-style list of links shared by all users, and can browse through or search for links shared over time. Shared links can point to any arbitrary web resource. The guiding objectives of Unalog are:
• to make the process of sharing and saving links as simple as possible;
• to let users easily choose whether to share links or keep them private;
• to let users easily choose whether to share links with all users or only with certain groups;
• to make all data saved in Unalog readily available in a variety of common formats;
• to investigate ways that institutions might enable personal digital library services; and
• to optimize and integrate how these objectives are achieved to match the needs of users in an academic community.

The home page of a running Unalog site provides a view of the most recently added links, much like a weblog. Figure 1 is a screenshot of the home page at the Unalog site running at Unalog.com (the first Unalog instance, which is available for free to anyone in the world). Along the top set of links, a logged-in user can quickly jump to their own entries, keyword tags, groups, and preferences, log out, or access administrative functions. Below the top set of links, the second menu provides quick access to a list of all public users in the system, a list of keyword tags from all users, a list of all public groups in the system, pages describing how to use the site and how to contact the administrators, and a search box.


Figure 1. The Unalog.com home page, showing recently added links

Among the listed links, which are grouped by day, each link entry indicates the name of the user who added it, the title of the link (which is also a hyperlink to the linked page), and any keyword tags, group names, or comments annotating that entry. Additionally, for a logged-in user, any link entries created by that user will be highlighted in bold, along with quick links to edit or delete that entry. Similarly, any entries from other users that are assigned to groups in which the logged-in user is a member will be highlighted in bold as well.
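To make the pieces of an entry concrete, here is a minimal sketch of the fields a single Unalog-style entry appears to carry, based on the elements described above and in the next section (owner, title, URL, tags, groups, an optional comment, and a privacy flag). The class and attribute names are illustrative assumptions, not the actual Unalog data model.

```python
# Hypothetical sketch of the data a single entry carries, based only on the
# fields described in the text; names are illustrative, not Unalog's model.
from dataclasses import dataclass, field
from datetime import date
from typing import List


@dataclass
class LinkEntry:
    user: str                                         # who added the link
    title: str                                        # title of the linked page (editable)
    url: str                                          # the shared link itself
    tags: List[str] = field(default_factory=list)     # uncontrolled keyword tags
    groups: List[str] = field(default_factory=list)   # groups the entry is assigned to
    comment: str = ""                                 # optional annotation
    private: bool = False                             # hide the entry from other users
    added: date = field(default_factory=date.today)   # entries are listed by day


example = LinkEntry(
    user="joey",
    title="Asparagus on the grill",
    url="http://example.org/asparagus",
    tags=["recipe", "asparagus"],
    groups=["cooking-club"],
)
print(example.user, example.title, example.tags)
```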


Figure 2. Adding a link to Unalog

Sharing links and tags
To add links to Unalog, users first create an account at the Unalog site and log in. (Default registration and login at a Unalog site are similar to getting an account at many other web sites, with an e-mail confirmation step to verify a user's identity. Other authentication means may be substituted; these are discussed later.) Once logged in, a user can install a bookmarklet, a specially formulated web bookmark that, when clicked, sends the title and URL of any web page into the Unalog "add a link" page. Figure 2 shows a user adding a link, in this case a link the user has already added. The Unalog system warns the user that they have already added the link, but allows them to save the link again if they wish.

Other than this warning, Figure 2 shows all the data fields available when a link is added. The page title and URL, copied into Unalog by the bookmarklet, are available to be edited (HTML titles often do not accurately reflect the content of a page). The "Tags" box lets users optionally add one or more keyword tags describing that resource. These keywords are not from controlled vocabularies, but users can choose to use the same keyword tags repeatedly. This model of letting users easily tag their items with uncontrolled keywords has become known as "folksonomy" and has been popularized in large part by the growth of social bookmarking systems like Unalog (Wikipedia, 2005a). The "Is this entry private?" checkbox lets users hide a link entry from other users (privacy options in Unalog are discussed further below), and an available text area allows optional comments. If the user belongs to one or more user groups, a multi-select list appears so that entries may be assigned to any of those groups (groups are discussed further in the next section). Editing an already-saved link works just like adding a new link, with already-specified tags, comments, and group assignments available to change, augment, or remove.

Several views and formats are available throughout the system. For the whole site, links from all users and individual users, all public groups, and all tags can be viewed separately. A list of tags is available for all users, each group, and each user; these lists show the tags used by all users, any single group, or any user, accordingly. Different views of tags are also available, showing all keyword tags, or which keywords are used most. Figure 3 shows the "weighted view" of tags from one user; larger text for a given tag means it has been used often. All of the views are also available in multiple formats: PDA, a simplified display for small "personal digital assistant" devices such as handhelds or phones; RSS, for syndication to other sites (Wikipedia, 2005c); XBEL, for export to other bookmarking tools (XBEL, 2005); and both MODS and XOBIS, two nascent bibliographic standards. MODS is a general-purpose XML schema published by the US Library of Congress as a midway point between the complexities of MARC and the overly simple Dublin Core (Library of Congress, 2005). More extensive use of MODS will appear in upcoming releases, as discussed below. XOBIS is a more experimental descriptive standard, still in an early stage of development, and Unalog is one of the first applications to implement XOBIS outside of its institutional home, the Lane Medical Library at Stanford University (Lane Medical Library, 2005). Providing this variety of formats gives users and other software developers multiple options for integrating Unalog services with other resources and applications. An example of such integration is the Bookmark Synchronizer developed by Art Rhyno, which uses Unalog support for the XBEL format to enable live updates of bookmark lists in remote web browsers (Rhyno, 2004). Incorporating these nascent standards also provides the Unalog developers with an opportunity to understand how best to support their use and provide a more flexible tool.
All Unalog entries are indexed immediately when they are first saved or edited. The indexing process includes all aspects of the entries, including URLs, titles, tags, comments, user names, and dates, all of which are indexed both by their individual field names and in one catch-all default category. The search algorithm ranks matches according to relevance, using an augmented term frequency computation. A variety of common search techniques are available using well-known syntax, including Boolean operators ("AND", "OR", "NOT") with nesting using parentheses, fielded searching (restricting matches to, for instance, keyword tags), proximity searching, and value range queries.


Figure 3. Weighted keyword tags for a Unalog user

When combined, these allow users to formulate queries such as:

((user:joey OR user:sally) AND tag:recipe) asparagus date:[2004-05 TO 2004-07]

Even so, it is typically easier to search, instead, for:

asparagus recipe

Indeed, given the success of easy-to-use search interfaces elsewhere on the web, users now expect searches like this to yield useful results. To meet user expectations, Unalog takes the extra step of indexing all values both with their field names and in a default catch-all field. This ensures that user searches like the second example above will most likely return the same hits as the more complicated search.
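To illustrate the catch-all strategy just described, the following is a small, self-contained sketch of the idea: every value is indexed both under its field name and under a default field, so an unfielded query such as "asparagus recipe" can match the same entries as an explicitly fielded one. This is a simplified stand-in for illustration, not the actual PyLucene-based Unalog code.

```python
# Simplified, hypothetical illustration of field + catch-all indexing.
# The real system delegates this to PyLucene; here a plain dict stands in
# for the index so the idea is easy to see end to end.
from collections import defaultdict

index = defaultdict(set)  # maps "field:term" -> set of entry ids


def index_entry(entry_id, fields):
    """Index every value under its field name and under a default field."""
    for field_name, values in fields.items():
        for value in values:
            for term in value.lower().split():
                index[f"{field_name}:{term}"].add(entry_id)
                index[f"default:{term}"].add(entry_id)   # catch-all copy


def search(*terms):
    """AND together bare terms (default field) or field:term pairs."""
    hits = None
    for term in terms:
        key = term if ":" in term else f"default:{term}"
        matches = index.get(key, set())
        hits = matches if hits is None else hits & matches
    return hits or set()


index_entry(1, {"user": ["joey"], "tag": ["recipe asparagus"],
                "title": ["Asparagus on the grill"]})
index_entry(2, {"user": ["sally"], "tag": ["recipe"], "title": ["Soup basics"]})

print(search("user:joey", "tag:recipe"))   # fielded query   -> {1}
print(search("asparagus", "recipe"))       # catch-all query -> {1}
```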

Working with groups
Unalog groups are sets of people with common interests. These might be students working together on a class project, members of a research team or a committee, or friends and family members living far apart who might otherwise send each other links over e-mail. Any registered user can create a group, give it a name, and tell their colleagues about it. By default, new groups are public, which means anyone can join, and all public groups are listed on a single page to make joining new groups easy. Any member of a group can leave the group at any time. Because groups have their own views of entries assigned to them, as well as their own RSS feed and tag lists, a group can be a very convenient way for a set of people to get a more focused view of their data than by searching or browsing through the main page. Users can be members of as many groups as they want, and can start as many groups as they want. A user starting one group can be a member of others, and vice versa. When saving an entry in Unalog, every group to which a user belongs will appear in a list, which allows easy posting to many groups at once.

The primary functions of groups are to let a set of people with shared interests "carve out" a smaller space for themselves, with more narrowly focused views of their own shared links than would otherwise be possible through the main page, user pages, and tag pages, and to allow a group to share links in private (discussed further below). Aside from these benefits, there is no functional difference for data saved in groups. Indeed, public entries saved in one or more groups are visible on the main page, the page of the user who added the entry, and any tag pages for keywords the user assigned, as well as the group page. The entry itself is the atomic unit stored in Unalog, and the group association adds the benefits described in this section; an entry assigned to a group is not "stored in another place".

Private links and groups
Any Unalog user, entry, or group can be "private". A private user can save as many entries as they wish in Unalog, but their entries will not be visible to other users, save for a few special circumstances. When a private user is logged in, they will be able to see all of their own entries and their tags, and search all of these, but no other users will be able to see the same pages (unless, of course, they are sitting at the same terminal together). A Unalog user can choose to change their account status at any time, from public to private or from private to public. Links from a public user who switches to private status will become invisible to other users immediately; similarly, links from a private user who switches to public status will be visible to other users immediately. These changes will also be reflected immediately in tag lists and search results.

Whether or not a user is private, users can specify that individual link entries are to be private. Private entries allow otherwise public users to hide an occasional entry from other users, without making their other entries private. Public users can also specify that their entries should "default to private"; in this mode, new entries are private by default, and can only be made visible to other users if they are specifically changed from private to public. This lets a user be "mostly private" but still share an occasional entry.
Like users themselves, entries can be later edited from public to private, or from private to public, and those changes will be immediately effected throughout the system.
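One way to read the privacy rules in this and the following section is as a single visibility check per entry and viewer. The sketch below is an interpretation under those assumptions (owners always see their own entries; otherwise a private entry, or an entry from a private user, is visible only through a shared group), not Unalog's actual implementation; all names are hypothetical.

```python
# Hypothetical sketch of the visibility rules described in the text:
# owners always see their own entries; a private entry (or an entry from a
# private user) is otherwise visible only to fellow members of a group the
# entry was shared with.
from dataclasses import dataclass, field
from typing import Set


@dataclass
class User:
    name: str
    private: bool = False
    groups: Set[str] = field(default_factory=set)


@dataclass
class Entry:
    owner: User
    url: str
    private: bool = False
    groups: Set[str] = field(default_factory=set)


def visible_to(entry: Entry, viewer: User) -> bool:
    if viewer.name == entry.owner.name:
        return True                      # owners always see their own entries
    if entry.private or entry.owner.private:
        # only visible through a shared group membership
        return bool(entry.groups & viewer.groups)
    return True                          # public entry from a public user


admin = User("quiet_admin", private=True, groups={"ir-committee"})
colleague = User("colleague", groups={"ir-committee"})
outsider = User("outsider")

note = Entry(owner=admin, url="http://example.org/report", groups={"ir-committee"})
print(visible_to(note, colleague))  # True: shared private group
print(visible_to(note, outsider))   # False: owner is private, no shared group
```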


Groups can also be made private or public, although group privacy does not have as wide an impact as user or entry privacy. One benefit of group privacy is to restrict membership: existing members must invite users before new users can join a private group, and invited members can choose to accept or decline invitations. Another benefit of group privacy is that private users who join private groups may share entries with other members of their private groups. This combination supports a pattern whereby any individual who does not wish to advertise their participation in a Unalog community, an administrator perhaps, can still participate in sharing links with colleagues through private groups. The developers believe this feature is a unique advantage of Unalog compared to other social bookmarking systems.

Unalog on campus
The focus on group and privacy functions in Unalog results from the objective of meeting the anticipated needs of users in academic communities. On an academic campus, a wide variety of group work occurs at many levels, sometimes with a standing charter and occasionally within very short timeframes. Students can benefit from the ease of creating groups to support their group projects for courses, but might prefer to share links publicly. Similarly, faculty taking on new, unproven research might wish to share references privately while developing ideas, and administrative committees might wish to keep all their work private at all times. The option to have private users, entries, and groups supports a wide range of these arrangements, and multiple combinations of arrangements for users requiring flexibility.

Yale University has been running an early beta version of Unalog called "links" since late fall 2004. The objective in running a local version of Unalog at Yale is to support the information sharing needs of the community in an easy-to-use system that fits into other campus services. In particular, the "links" instance of Unalog has been integrated with both the campus-wide web authentication and directory services. This integration allows users already familiar with the single web sign-on service to log in to "links" and to recognize that it is a trustworthy system. The directory integration saves users from having to re-key their account information: it uses an LDAP connection to load user information such as unique campus identifiers, names, and e-mail addresses into "links" (Wikipedia, 2005b). In this way, any Yale affiliate who signs into "links" immediately has a full account and can begin adding entries. The developers also intend to augment this campus integration further by connecting "links" to the library OpenURL resolver (more on this below).
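As a rough illustration of the kind of directory lookup described above, the sketch below uses the python-ldap library to pull an identifier, name, and e-mail address for a campus account. The server address, search base, and attribute names are placeholders for illustration, not Yale's actual directory configuration or the code used by "links".

```python
# Hypothetical sketch of loading account details from a campus LDAP
# directory at first sign-on; server, base DN, and attributes are
# placeholders rather than any real institution's configuration.
import ldap                               # python-ldap
from ldap.filter import escape_filter_chars


def load_campus_account(netid):
    conn = ldap.initialize("ldap://directory.example.edu")
    conn.simple_bind_s()  # anonymous bind; a service account could be used instead
    try:
        results = conn.search_s(
            "ou=People,dc=example,dc=edu",
            ldap.SCOPE_SUBTREE,
            "(uid=%s)" % escape_filter_chars(netid),
            ["uid", "cn", "mail"],
        )
    finally:
        conn.unbind_s()
    if not results:
        return None
    dn, attrs = results[0]                # attribute values come back as lists
    return {
        "id": attrs["uid"][0],
        "name": attrs["cn"][0],
        "email": attrs.get("mail", [""])[0],
    }


print(load_campus_account("abc123"))
```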
Because a primary scenario for information sharing with a system such as Unalog is the course setting, the designers have also simplified the integration of Unalog entries with external systems like courseware servers. In addition to students collaborating on course projects, faculty might wish to send items to their students to read in a less formal manner than adding readings to a syllabus. Professors can use private groups in Unalog to allow peers, teaching assistants, or students to share entries. They can then use a special JavaScript call to Unalog that can be configured to include their private group reading list in an external system. At the Yale School of Medicine, where no local programming is done to customize Blackboard, this provides an easy way for faculty to quickly "assign readings" with a single click of a bookmarklet, without logging in to Blackboard. Figure 4 is a screenshot of data from a private group in the "links" system at Yale appearing in a course in Blackboard at the Yale School of Medicine.


Figure 4. Integrating Unalog with courseware at Yale with Blackboard and “links”

As of April 2005, over 100 users have participated in the "links" beta system, sharing over 500 entries with colleagues both publicly and in a variety of private groups. Even though there has been very little marketing of the system on campus during its beta phase, many of these users, who come from a variety of campus departments and schools, have found the system to be useful, and several have contacted the developers to express interest and to request enhancements and fixes.


Yale University is participating in the development of the Sakai suite of tools for course management and related services (Sakai, 2005). Because Sakai provides its own support for campus authentication and authorization as well as close integration of functions for courses, groups and faculty, and because the classroom setting is a primary scenario for Unalog, the link-sharing functions of Unalog will be further integrated into Sakai during summer 2005, and if successful may become a formally-supported version of the service through Sakai for the fall 2005 semester.

Unalog system architecture
The Unalog software is written entirely in the Python programming language. Python is a high-level scripting language supporting rapid development without sacrificing scalability of design or performance. The web-based user interface for Unalog is written using the Quixote framework and its Page Template Language (PTL) (Quixote, 2005). The storage backend used by Unalog is ZODB, the object database initially developed for the Zope object publishing system (ZODB, 2005). The ZODB environment also supports rapid development and scales sufficiently well to handle a campus setting (with at most hundreds to thousands of users) with little or no difficulty. Although an object database with few built-in query functions, such as ZODB, can be a disadvantage compared to an RDBMS and the SQL standard, the Unalog design uses the PyLucene version of the excellent Lucene information retrieval library to offload all search and retrieval functions outside of the object persistence environment (PyLucene, 2005). This simplifies the Unalog codebase and lets Unalog developers leverage complementary functions of each tool, the object database for persistence and the IR library for searching, without adding unnecessary complexity in design.

Nonetheless, because the ZODB storage index is essentially a large hash table, wherein retrieving any item inside the database requires navigating from a top-level object down to the desired item, additional design abstractions are necessary to simplify access to entries, groups, and tag data. In Unalog, a series of similar "Indexes" provides common functions for access to stored information about users, groups, entries, and tags. A system called the "Collector" supports this need with a set of derived classes optimized for each Index type. Depending on the type of page request (for example, a request for a tag index, or for a group's recent entries), the appropriate type of Collector is called, with parameters for the number of items to retrieve or for a specific date, if specified in the request. The Collector fetches items accordingly and ensures that the requesting user is shown only retrieved items they are allowed to see. Finally, the Collector returns an EntrySet, which is passed to one of several Formatters, which then renders the EntrySet in the requested format, depending on whether the request was for a web page, an RSS feed, or perhaps an XBEL export. This model has made it easy to experiment with new types of indexes and formats, and one may anticipate that the Collector interface definition will be easily ported to the Sakai environment, even though its underlying implementation will need to pass queries to a relational database backend instead of the ZODB backend.

The Unalog codebase is made available as free software under an MIT-style license, with copyright assigned to the Yale University School of Medicine.
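The request flow described here (a Collector fetches an EntrySet, which a Formatter renders) can be sketched in miniature. The class names below mirror the terms used in the text, but the method signatures and behaviour are simplified assumptions rather than the actual Unalog code, which sits on ZODB and PyLucene.

```python
# Miniature, hypothetical sketch of the Collector -> EntrySet -> Formatter
# flow described in the text; the real code uses ZODB storage and PyLucene.


class EntrySet:
    """An ordered set of entries the requesting user is allowed to see."""

    def __init__(self, entries):
        self.entries = list(entries)


class Collector:
    """Fetches entries from storage and filters out what the viewer may not see."""

    def __init__(self, store):
        self.store = store  # stand-in for the object database in the real system

    def collect(self, viewer, count=20):
        allowed = [e for e in self.store if e.get("public") or e.get("owner") == viewer]
        return EntrySet(allowed[:count])


class HTMLFormatter:
    def render(self, entry_set):
        items = "".join(
            '<li><a href="%(url)s">%(title)s</a></li>' % e for e in entry_set.entries
        )
        return "<ul>%s</ul>" % items


class RSSFormatter:
    def render(self, entry_set):
        items = "".join(
            "<item><title>%(title)s</title><link>%(url)s</link></item>" % e
            for e in entry_set.entries
        )
        return "<rss><channel>%s</channel></rss>" % items


store = [
    {"title": "Lucene in Action", "url": "http://example.org/lucene", "public": True},
    {"title": "Private note", "url": "http://example.org/note", "owner": "dan", "public": False},
]

entry_set = Collector(store).collect(viewer="anonymous")
print(HTMLFormatter().render(entry_set))
print(RSSFormatter().render(entry_set))
```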

Next steps and conclusions
In the coming months we intend to enhance the Unalog software by adding more flexible metadata support to accommodate a range of entry types beyond simple web links. We believe the academic community would benefit if Unalog could also share references to journal articles and books, as well as audio and moving image resources, and we are therefore experimenting with more extensive use of the MODS specification to add support for these distinct item genres, without placing too much of a descriptive burden on Unalog users. Additionally, we intend to support OpenURL-based linking into and out of Unalog entries, and to utilize new OpenURL-driven techniques like metadata autodiscovery to save users time in resolving shared references to full-text resources and related services (Chudnov et al., 2005).

Another area of future development involves bridging the "folksonomy" features of social bookmarking systems like Unalog with well-established practices, standards, and databases providing bibliographic authority control. To explore this area, the developers will enhance Unalog to support optional "control" of subject keywords and other terms, such as personal or corporate names and resource titles, by connecting the system to established authority files using web services and similar techniques. We hope to bring the wealth of resources created over decades of extensive cooperative cataloging work to Unalog users in a manner that is easy for all to use, without sacrificing the benefits of bibliographic control.

The development team at Yale expects that the combination of features Unalog offers, when enhanced with better metadata and linking features and optional support for authority control, and when closely integrated with courseware environments, will provide a compelling solution for academic communities requiring better tools to support information sharing. Readers are invited to try Unalog at Unalog.com, or to experiment themselves with the Unalog software and share feedback with the Unalog developer community.
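To give a sense of what OpenURL-based linking out of an entry might involve, the sketch below builds an OpenURL 0.1-style query string from article metadata and appends it to an institutional link resolver address. The resolver URL and the field mapping are assumptions for illustration only; they are not part of the current Unalog code.

```python
# Hypothetical sketch of building an OpenURL 0.1-style link from entry
# metadata; the resolver base URL and field mapping are illustrative only.
from urllib.parse import urlencode

RESOLVER = "http://resolver.example.edu/openurl"  # placeholder link resolver


def openurl_for(entry):
    params = {
        "sid": "unalog",                  # source identifier
        "genre": "article",
        "atitle": entry.get("title", ""),
        "title": entry.get("journal", ""),
        "issn": entry.get("issn", ""),
        "volume": entry.get("volume", ""),
        "spage": entry.get("spage", ""),
        "date": entry.get("year", ""),
    }
    # drop empty fields so the query string stays compact
    params = {k: v for k, v in params.items() if v}
    return "%s?%s" % (RESOLVER, urlencode(params))


entry = {
    "title": "Social bookmarking tools (I): a general review",
    "journal": "D-Lib Magazine",
    "volume": "11",
    "year": "2005",
}
print(openurl_for(entry))
```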


References

Chudnov, D. (2003), Access 2003 Hackfest Report, Project: OpenURL Harvesting, available at: http://curtis.med.yale.edu/dchud/talks/20031003-access/img19.html

Chudnov, D., Cameron, R., Frumkin, J., Singer, R. and Yee, R. (2005), "Opening up OpenURLs with Autodiscovery", Ariadne, No. 43, available at: www.ariadne.ac.uk/issue43/chudnov/

Hammond, T., Hannay, T., Lund, B. and Scott, J. (2005), "Social bookmarking tools (I): a general review", D-Lib Magazine, Vol. 11 No. 4, available at: www.dlib.org/dlib/april05/hammond/04hammond.html

Lane Medical Library (2005), XOBIS – The XML Organic Bibliographic Information Schema, available at: http://xobis.stanford.edu/

Library of Congress (2005), Metadata Object Description Schema (MODS), available at: www.loc.gov/standards/mods/

PyLucene (2005), available at: http://pylucene.osafoundation.org/

Quixote (2005), available at: www.mems-exchange.org/software/quixote/

Rhyno, A. (2004), Unalog Browser Integration, available at: http://webvoy.uwindsor.ca:8087/artblog/librarycog/1095861901

Sakai (2005), available at: http://sakaiproject.org/

Wikipedia (2005a), Folksonomy, available at: http://en.wikipedia.org/wiki/Folksonomy

Wikipedia (2005b), LDAP, available at: http://en.wikipedia.org/wiki/LDAP

Wikipedia (2005c), RSS, available at: http://en.wikipedia.org/wiki/RSS_(protocol)

XBEL (2005), The XML Bookmark Exchange Language (XBEL), available at: http://pyxml.sourceforge.net/topics/xbel/

ZODB (2005), Zope Object Database (ZODB) Development Wiki, available at: www.zope.org/Wikis/ZODB/FrontPage


THEME ARTICLE


Archimède: a Canadian solution for institutional repository
Rida Benjelloun
Université Laval Library, Université Laval, Québec, Canada

Received 7 June 2005. Revised 17 July 2005. Accepted 31 August 2005.

Abstract
Purpose – The purpose of this paper is to present the main features of Archimède, the institutional repository system developed by Université Laval to address its specific needs.
Design/methodology/approach – These needs include the availability of a multilingual interface, the ability to index metadata and full text simultaneously, and compatibility with multiple technological infrastructures. The chosen approach relied on open source software and on automatic code generation tools in order to lower development costs and time. This led the Université Laval team to create an institutional repository system that is based on Java technology and is not OS-specific.
Findings – The system offers: document management functionality; dissemination mechanisms compatible with OAI-PMH 2 (Open Archives Initiative Protocol for Metadata Harvesting v.2.0); an indexing and searching framework (LIUS) that can index more than ten document formats; and a selective dissemination of information service. Archimède and LIUS are now distributed under a GPL licence. Further developments will extend the range of metadata formats supported by Archimède and will include archive management functionality.
Originality/value – This experience shows that the development of an institutional repository system built on open source software, frameworks and application program interfaces can lead to impressive results in a short amount of time and with minimal investment.
Keywords Document management, Generation and dissemination of information, Search engines, Java, Archiving, Canada
Paper type Technical paper

Library Hi Tech, Vol. 23 No. 4, 2005, pp. 481-489. © Emerald Group Publishing Limited, 0737-8831. DOI 10.1108/07378830510636283

An institutional repository for Université Laval
An institutional repository is a digital archive. It is used to gather and disseminate the scholarly publications produced by an institution's faculty and research staff, in order to make them accessible to users within and outside the institution. The goals of such systems are to improve scholarly communication and to disseminate research results to the community. The concept of "e-prints" archives has been around for a good while, as shown by the example of the Los Alamos Physics Archive, which has been active since 1991. However, by the end of 2002 we witnessed a growing trend towards the development and implementation of institutional repositories in universities. This phenomenon is directly related to the exponential production of academic, born-digital "grey literature". During that period, many of Université Laval's research communities were also beginning to develop their own dissemination systems for their prepublications. We came to a point where we had to face a proliferation of web sites, each with its own strengths and weaknesses. This situation gave the impetus to create Archimède.


Our primary objective was to give access to all this information by providing users with an integrated search tool capable of finding content in all of Université Laval's research communities' systems. Users would no longer have to browse a great number of web sites to gather the information they need. The goal of this article is to present Archimède and to explain how it was made. We also want to highlight the fact that this kind of web development is accessible and cost-effective, thanks to the use of open source software and automatic code generation tools, which simplify and greatly accelerate the development process.

Why develop our own system?
Before starting to develop Archimède, we made a thorough analysis of the available open source software solutions and decided to create our own customized application. This choice does not mean that the other repository systems were not good; quite the contrary, we took them as models. The main reason for developing Archimède was essentially that, among the open source packages analyzed at the time, none seemed easily adaptable to our organizational context, and hence none met all our needs. A comparison of the different open source repository systems (including Archimède) is available on the Soros Foundation web site in the OSI Guide to IR Software, 3rd edition.

Since Université Laval Library is a francophone institution, its first criterion was to meet the needs of its French-speaking users. The software we were looking for had to offer an easy way to integrate additional languages, such as French, into its interface without having to rewrite the interface code. However, none of the analyzed solutions offered this possibility. The second important criterion for our library was the ability to index metadata and full text from many document formats. Our surveys revealed that most authors were providing very minimal metadata about their documents. Frequently, the metadata contained only a title and an author. This led us to ask the following question: if the title is not relevant, how could one find the document? Most of the tools analyzed did not offer the possibility of indexing both metadata and full text, and those presenting such features were limited in the document formats they supported. With Archimède we have taken into account as many formats as possible, ensuring maximum flexibility for our repository system. Finally, as our third criterion, we were looking for a system that could run on Windows as well as on Linux. At that time, Université Laval Library had only Windows servers (and expertise), and we wanted a product that could adapt easily to the technological infrastructure in place, without buying a dedicated server for the software. All the solutions we investigated during our analysis were designed for Linux or UNIX, so it was obvious that we had to take a cross-platform approach while developing our application.

Aside from these three criteria, we had to work within a tight budget to develop our application. To make this possible, we chose to work with well-known and proven open source software, and we used automatic code generation tools for everything related to the database and the persistence layer.

Archimède
What is it?
The name "Archimède" was chosen for its recognizability and its ability to convey an image of scholarly research and discovery. The name is also meaningful for Université Laval scholars because the institution already owns systems named after Greek mythology and history, such as "Ariane". As institutional repository software, Archimède is aimed at hosting self-managed research communities from Université Laval. The users of these communities are invited to upload their publications, with the appropriate description (metadata), through a user-friendly and secure interface. In addition to its document management functionality, Archimède offers mechanisms for ensuring the dissemination of content. Dissemination is possible through Archimède's browsing and search features and through compatibility with the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH 2), thanks to the use of the Dublin Core metadata element set. Archimède also offers a "selective dissemination of information" (SDI) service that sends registered users new items in their fields of interest. All this information is protected by an access control system based on privileges, which makes it possible to limit access to the system's resources and functionality.

Archimède Core: an effective, cheap and fast development
Archimède is a web application developed entirely in the Java language, using J2EE (Java 2 Enterprise Edition). Archimède was designed according to the MVC2 (Model View Controller) design pattern. The MVC paradigm is an approach to programming that breaks an application into three distinct parts: the model, the view and the controller (http://ootips.org/mvc-pattern.html). The model constitutes the logic of the application; it manages data and functionality. The view represents the user side of the application, that is to say the graphical interface. The controller acts as a synchronizer between the model and the view: it accepts input from the user and then communicates to the model the operations to perform, and the result is sent back to the view. The advantages of this kind of architecture are simpler maintenance and updates of the system, as well as the easy addition of new functionality. Figure 1 illustrates the MVC model for Archimède.

The Archimède model includes two modules. The first is represented by the JavaBeans (objects that contain or store information) and the second by the DAOs (Data Access Objects, which handle common operations on the database such as insertions, deletions and updates). All of the DAOs and JavaBeans were automatically generated by the persistence framework, Jakarta Torque (http://db.apache.org/torque/; see Figure 2):

A persistence framework moves the program data in its most natural form (in-memory objects) to and from a permanent data store, the database. The persistence framework manages the database and the mapping between the database and the objects (www.roseindia.net/enterprise/persistenceframework.shtml).

Once the relational database model is designed, it is possible to use persistence frameworks that play the role of an ORM (Object Relational Mapping) layer to generate all the code needed for the database. The ORM allows all data to be manipulated as objects; these models hide the entire SQL layer. Most of the common SQL requests,


Figure 1. Archimède MVC model using the Struts framework

Figure 2. Archimède architecture

such as select, insert, update and delete, were automatically generated (the rate reached 80 percent). Other requests aimed at more specific tasks, such as selecting a deposit with all its metadata (a request spanning 16 tables), are built with a very simple object query language named Criteria. Hence, the majority of the code for the "database model" part was generated automatically. For Archimède, more than 150 Java classes were created this way. Approximately 60 further classes were coded manually for the controller and the views; this represents only one-third of Archimède's classes and, consequently, saved roughly two-thirds of the development time. The view pages, for their part, use the JavaBeans transmitted by the controller via request or session scope to display the data.
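To make the generated persistence layer more concrete, the following is a minimal sketch of how a Torque-style peer class and its Criteria query language are typically used. The Deposit and DepositPeer classes and the column names are hypothetical illustrations standing in for Torque-generated code, not code taken from Archimède.

```java
import java.util.List;
import org.apache.torque.Torque;
import org.apache.torque.util.Criteria;

// Hypothetical use of a Torque-generated peer class: the Criteria object
// replaces hand-written SQL, and doSelect() returns ready-to-use JavaBeans.
// DepositPeer and Deposit stand in for classes Torque would generate.
public class DepositQuery {
    public static void main(String[] args) throws Exception {
        Torque.init("torque.properties"); // database settings for the repository

        Criteria criteria = new Criteria();
        criteria.add(DepositPeer.COMMUNITY_ID, 42);            // WHERE community_id = 42
        criteria.addAscendingOrderByColumn(DepositPeer.TITLE); // ORDER BY title

        List deposits = DepositPeer.doSelect(criteria);        // generated SQL + mapping
        for (Object row : deposits) {
            Deposit deposit = (Deposit) row;
            System.out.println(deposit.getTitle());
        }
    }
}
```

In Archimède's case the peer and bean classes were produced by Torque from the database schema, which is why the hand-written portion of the data layer could stay so small.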

Archimède uses the Jakarta Struts framework to manage the web application. In Figure 1, the dotted line represents the Struts part. For the programming of the controller, Archimède relies on Struts actions that catch the user's request, process it through the DAOs and JavaBeans (the model part of the MVC), and then forward to the view side for display. All the pages of the view are made with JSP (JavaServer Pages).
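As an illustration of this controller role, here is a minimal sketch of the kind of Struts 1 action class such an application defines; the action, form field and forward names are hypothetical and are not drawn from the Archimède source.

```java
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;
import org.apache.struts.action.Action;
import org.apache.struts.action.ActionForm;
import org.apache.struts.action.ActionForward;
import org.apache.struts.action.ActionMapping;
import org.apache.struts.action.DynaActionForm;

// Hypothetical Struts 1 action: it receives the request, delegates the work
// to the model layer, places the resulting JavaBean list in request scope and
// forwards to a JSP view declared in struts-config.xml.
public class SearchDepositsAction extends Action {
    public ActionForward execute(ActionMapping mapping, ActionForm form,
                                 HttpServletRequest request,
                                 HttpServletResponse response) throws Exception {
        String query = (String) ((DynaActionForm) form).get("query");

        // DepositSearchService stands in for the DAO/model layer.
        java.util.List results = new DepositSearchService().search(query);

        request.setAttribute("results", results); // consumed by the JSP view
        return mapping.findForward("success");
    }
}
```

The mapping from URL to action, and from the "success" forward to a particular JSP, lives in struts-config.xml, which is what keeps the action class itself this small.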


Software and architectural components
Archimède is the result of the combination of several open source software packages, frameworks and APIs. Figure 2 illustrates the architecture of Archimède and the interrelations between the system components. When a user performs an operation or consults a page in Archimède, the request is directed to the web server (Apache web server or Internet Information Server). As this server cannot execute Java server-side programs, the request is redirected to a servlet container; in our case, Apache Tomcat executes the Java server-side code. If the user wants to see a public page, that is to say one without any access restriction, the request is directed to the Jakarta Struts framework. If the user asks for a page with restricted access, the system first uses the security framework to manage the authentication procedure and then passes the information to the Struts framework. Struts then completes the operation initiated by the user, following the MVC pattern. As we saw in Figure 1, the Struts controller communicates with the model, represented in Figure 2 under "Archimède API". The task of the model is to execute the operations of the system each time a software component is called upon: creating user accounts, collections and document deposits, indexing, searching, etc. For instance, creating a user account involves the Torque framework and the database; indexing is performed by the LIUS framework; and so on.

The system was developed over a period of seven months. The project required the collaboration of one architect (half time) and one programmer (full time). Prior to the first distribution, we also asked someone external to the project to test the product, to ensure that it could be configured and used easily. System maintenance is also simple and only requires a system administrator to create the research community administrators and to back up the database on a regular basis. The first releases of Archimède were already robust and stable. However, an important factor to take into account is the server that will run the system. A "deposit" is composed of metadata and files; the files represent the largest part of the deposit in terms of storage space and are managed by the operating system's file system. This aspect has to be carefully planned to ensure a successful and scalable Archimède implementation.

System characteristics
Safety
Archimède offers a security module based on privileges. This makes it possible to restrict access to resources and functionalities according to the logged-in user. There are five user types with corresponding privileges:
(1) Visitor: each visitor of the institutional repository can browse all the public deposits of the research communities. Visitor status authorizes browsing and permits simple or advanced search.


(2) Community user: a user of a research community can browse all of that community's deposits and the public deposits of other research communities.
(3) Community member: a member of a research community can browse all of that community's deposits and the public deposits of other research communities. A member may also add or remove documents in the community's collections.
(4) Community administrator: an administrator can browse all of the community's deposits and the public deposits of other research communities, create or delete collections in the community, upload documents, and delete any deposit created by community members. Only the community administrator can create the collections where deposits are uploaded.
(5) System administrator: the system administrator has all rights – creation, modification and deletion – and has full access to system management functionalities such as importation, exportation, re-indexing, etc.

Flexibility and adaptability
The entire graphical interface of Archimède can be adapted for specific uses. The system is based on the internationalization (i18n (www.w3.org/International/)) principle; all textual content is independent from the code, making it very easy to add support for new languages to the interface without rewriting the code (see Figures 3-5). Moreover, Archimède does not require a particular operating system (OS) to run. The system is based on Java technology, so it can equally be installed on a Linux or Windows server. This OS independence allows easy integration of Archimède into the existing technological infrastructure and reduces adaptation time. Finally, Archimède is compatible with any database having a JDBC driver.
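The i18n principle described above – user-visible text looked up by key and locale rather than hard-coded – can be illustrated with a standard Java resource-bundle lookup. This is a generic sketch; the bundle and key names are invented for the example and are not Archimède's own.

```java
import java.util.Locale;
import java.util.ResourceBundle;

// Illustrative i18n lookup: each language lives in its own properties file
// (e.g. Messages_en.properties, Messages_fr.properties), so adding a language
// means adding a translated file, not touching the application code.
public class InterfaceText {
    public static String label(String key, Locale locale) {
        ResourceBundle bundle = ResourceBundle.getBundle("Messages", locale);
        return bundle.getString(key);
    }

    public static void main(String[] args) {
        // Hypothetical key; the French file might map it to "Recherche".
        System.out.println(label("search.title", Locale.FRENCH));
        System.out.println(label("search.title", Locale.ENGLISH));
    }
}
```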

Figure 3. Archimède search interface


Figure 4. An Archimède search results example

Figure 5. An Archimède research group homepage

Information structuring
Archimède is a decentralized system arranged around research communities. The communities are responsible for managing their documents within their collections. With the secure deposit system, users can upload their documents into the right collections and describe them with the appropriate metadata using the Dublin Core elements. Archimède also provides a workflow system that allows the community administrator to validate deposits. The workflow can be enabled or disabled, depending on the needs of the community.


Information dissemination mechanisms
Search engine. Archimède uses LIUS (www.bibl.ulaval.ca/lius/) (Lucene Index Update and Search), an indexing and searching framework developed at Université Laval Library and based on Apache Lucene. LIUS allows the indexing of XML, HTML, PDF, RTF, MS Word, MS Excel, PowerPoint, OpenOffice suite, TXT and Java objects (JavaBeans), plus mixed indexing, which integrates in the same index entry the XML metadata and the full text of formats such as PDF, HTML, etc. These characteristics enable users to perform both metadata searches (creator, title, subject, etc.) and full-text searches. LIUS offers innovative and powerful XML indexing capabilities. LIUS can perform indexing based on namespaces, so it is possible to define a set of indexing properties and let the application do the work. Furthermore, one can define indexing properties according to Dublin Core, ETDMS (Electronic Thesis and Dissertation Metadata Schema) or any other XML document type, and specify to LIUS which namespace to use, so that it indexes the document with the right properties. All of the XML indexing properties rely on XPath to select the elements. If the document does not contain any namespaces, LIUS uses the default property values specified by the user of the framework. LIUS can also index a document by "XML nodes", so if a document contains 50 metadata elements, it is possible to ask LIUS to store each one as a new index entry. Furthermore, LIUS offers advanced HTML indexing possibilities; this is the reason why each indexed document is converted into XML: with XPath, we can select whichever HTML tags we want. Finally, LIUS allows "boosting" of specific document fields, so one could make the "title" and the "abstract" of a document more important than the "full text". It is also possible to boost a given document format. For example, one may decide that the PDF file in a deposit is more important than the Excel file; consequently, the PDF file will come first when the user performs a search on the repository. All these features are configured in the framework with a simple XML file. By default, Archimède uses the LIUS index fields shown in Table I, which can be customized to suit an organization's specific needs.

Browsing system. Archimède offers browsing by research communities, collections, titles, creators, etc. It also features a sorting option that ignores stop words, so a title like "The Book" will be correctly classified under the letter "B", not "T".

OAI. Archimède is an OAI data provider, thanks to OCLC's (Online Computer Library Center) open source OAICat software. It is compatible with OAI-PMH 2 and the metadata are in the Dublin Core format.

Table I. The LIUS index fields

Full text, Title, Author, Subject, Description, Publisher, Contributor, Bibliography, Source, Language, Coverage, Rights, and any other custom field.
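Field boosting of the kind described above is ultimately handled by Lucene, on which LIUS is built. Purely as an illustration – this uses the plain Lucene 2.x-era API rather than LIUS's own XML-configured interface, and the field names and boost values are invented – indexing a deposit with a boosted title might look like this:

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

// Illustrative Lucene indexing with a boosted field: matches in "title"
// count for more than matches in "fulltext" when results are ranked.
public class BoostedIndexer {
    public static void main(String[] args) throws Exception {
        IndexWriter writer = new IndexWriter("repository-index",
                new StandardAnalyzer(), true); // true = create a new index

        Document doc = new Document();
        Field title = new Field("title", "On Floating Bodies",
                Field.Store.YES, Field.Index.TOKENIZED);
        title.setBoost(2.0f); // hypothetical weight favouring title matches
        doc.add(title);
        doc.add(new Field("fulltext", "...extracted document text...",
                Field.Store.NO, Field.Index.TOKENIZED));

        writer.addDocument(doc);
        writer.close();
    }
}
```

In LIUS itself, the fields, XPath selectors and boosts are declared in the configuration file rather than written in code, as noted above.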

SDI. A selective dissemination of information (SDI) service keeps users informed of new items in their fields of interest. All system users can subscribe to a research community or a collection; they are notified immediately on their personal web page and by a weekly e-mail when new content is added.

URI. All the resources in the system are identified by a URI (Uniform Resource Identifier), and we ensure that it is persistent. This allows authors to cite documents in Archimède without worrying about links changing.

We are meeting with faculty individually to present Archimède and to encourage them to participate in the project. All of the faculty we met during the pilot project have shown an interest in participating, and there are already seven research groups in the system. This autumn, we will launch a communication campaign across the campus, and we expect the user base to grow over the coming months.

Polyvalence and portability
Archimède can import and export metadata in many formats using XSLT transformations that map between Dublin Core elements and the import/export elements. Furthermore, all the data uploaded to Archimède are independent from the application. Each deposit is created in a directory structure and is associated with its corresponding metadata, so it is possible to migrate these data to another system if desired.

Conclusion and future work
The indexing and search capabilities of Archimède have caught the attention of the open source community. This led us to package LIUS separately and distribute it as a distinct tool. LIUS thus appears to be another contribution of Université Laval Library to the open source community; it takes only a few lines of code to add it to any Java application and expand its document indexing capabilities. It also offers search capabilities and returns results in XML format.

We are now considering enlarging Archimède's scope by providing a new module to support additional metadata formats. This could make Archimède more flexible, because it would respond to specific needs regarding document description. We are planning the integration of the Java Content Repository (JSR 170) using Jackrabbit (www.jcp.org/en/jsr/detail?id=170) to extend the database capacities with versioning, enhanced data references and added security. Furthermore, we will replace the "upload" module, currently based on HTTP, with WebDAV (www.webdav.org/). Finally, we are about to add archive management functionalities to Archimède through the addition of a "classification plan" and a "records retention schedule". We will also implement Java Portlets (JSR 168) and integrate LDAP capabilities for institutional authentication. These developments are planned for June 2005. With the forthcoming new version of Archimède, we are planning to put in place a developer community around the project.

The Université Laval Archimède implementation can be found at http://archimede.bibl.ulaval.ca. The packages can be downloaded from www.bibl.ulaval.ca/archimede/index.en.html



THEME ARTICLE

dbWiz: open source federated searching for academic libraries


Calvin Mah and Kevin Stranack, Simon Fraser University Library, Burnaby, Canada

Received 7 June 2005 Revised 17 July 2005 Accepted 26 July 2005

Abstract
Purpose – The purpose of this paper is to describe the experiences of developing an open source federated searching tool. It is hoped that it will generate interest not only in dbWiz, but in the many other open source projects either completed or in development at the Simon Fraser University Library and at other libraries around the world.
Design/methodology/approach – The methods used in this paper include a review of related literature, analysis of other federated search tools, and the observation and description of the development process at the Simon Fraser University Library.
Findings – The paper discusses the benefits and challenges faced in developing an open source federated searching tool for libraries. As a case study, it demonstrates the strength of the collaborative, open source development model. The paper also describes the key features required of any federated searching tool.
Originality/value – Federated searching is becoming an important new product for both academic and public libraries, with several commercial products to choose from. This paper describes the development of an open source federated search tool that provides a low-cost, yet highly functional alternative for the wider library community.
Keywords Information retrieval, Library systems, Academic libraries, Computer software
Paper type Technical paper

Introduction
Faced with the choice between multiple subscription databases, each with a different interface and search functions, and the simplicity of Google, college and university students are increasingly finding their research materials on the open web. Federated searching provides one way that academic libraries can begin to win back some of these novice users and ensure they are finding the highest quality information available. dbWiz is an open source federated searching tool currently being developed at the Simon Fraser University Library, funded by nine partner post-secondary institutions. The SFU Library has been developing open source software for several years, including the reSearcher software suite, which features a link resolver (GODOT), a serials management knowledge base (CUFTS), electronic resource management tools, and more. This article provides an overview of the dbWiz development process and the functionality of the software, and discusses some of the benefits and challenges faced by the project.

Library Hi Tech Vol. 23 No. 4, 2005, pp. 490-503, © Emerald Group Publishing Limited, 0737-8831, DOI 10.1108/07378830510636292

Federated searching
Federated searching, also known as metasearching, broadcast searching, cross searching, and a variety of other names, is the ability to search multiple information resources from a single interface and return an integrated set of results. Although aspects of this kind of shared searching have existed for some time (especially with Z39.50 catalogue searching), the explosion of online content and the rise of Google as the dominant web-based search tool have made the development of this kind of searching more important than ever. Google has set a new standard for fast, easy-to-use searching that brings back "good enough" results almost every time. Increasingly, many library users are relying solely on this powerful search engine to the neglect of the valuable subscription collections libraries provide (Luther, 2003). By paying attention to what Google is doing right, by "breaking down the information silos" (Webster, 2004a), and by examining the technology that is now available to make systems interoperate more efficiently than ever before, libraries can make sure that "good enough" results get even better through the use of federated search tools such as dbWiz.


Project background
dbWiz was originally developed to help students determine the best starting point for their research. Within the Council of Prairie and Pacific University Libraries (COPPUL) consortium, there was concern that novice library users would be overwhelmed by the growing number of online resources available and would require an online tool to assist them in the absence of a librarian. Building on the Simon Fraser University Library's experience with Z39.50 searching, gained in the development of the GODOT link resolver and interlibrary search and request system, we were able to develop dbWiz version one. By entering a search term into dbWiz version one (see Figure 1), users would be presented with a list of databases with the number of results that would be found by

Figure 1. dbWiz version one search interface


searching the native interface. For example, a dbWiz search for "memory" would return a ranked list of resources with a hit count and the availability of full text (see Figure 2). The link would then take the user to the search page for that resource, where they would re-enter their search term to view the results. Although we had successfully met the challenge posed to us the previous year, we were somewhat disappointed by the response from the wider community. While we may have thought that this would be a valuable addition to the online research toolkit, our users did not. We quickly realized that what people really wanted were the search results themselves, from one search interface, delivered in a unified result set – in short, full federated searching.

Moving to version two
Based on a review of several available federated searching tools (including dbWiz version one and commercial products), we developed a seven-point list of key features that any successful federated search tool we could create would need to have:
(1) Retrieve and display integrated search results.
(2) Split the results coming from ProQuest by database (ABI/Inform, CBCA, and others); dbWiz version one was unable to do this.
(3) Provide direct linking to the full-text content or to the citation.
(4) Sort results.
(5) Allow searching by author, title and subject (when available for that resource).

Figure 2. dbWiz version one search results

(6) Limit the results to academic materials, full text, and/or by date.
(7) Add Boolean searching.

While this was an ambitious step up from the original dbWiz project, we were confident that, with funding, we would be able to meet the challenge. Based on the estimated time to implement each feature, we developed a budget for moving the project forward. The next step was to raise the money.

Proposal
With dbWiz version one successfully completed as a prototype, together with a logical list of functionality enhancements and a clear cost analysis, we approached other post-secondary libraries in western Canada for support in developing this new product. The result was the creation of the dbWiz partnership in the summer of 2004, with nine member institutions each contributing funds determined by the size of their organization. Each partner library is consulted throughout the development process, will have its own dbWiz profile set up, and will receive one year of support. As part of the partnership, each member library will also pay a modest annual fee to ensure the ongoing maintenance of the project.

Development team
With funding in place, we were able to create a dbWiz development team to see the project through to completion. The team consists of five members, including programmers working full-time on the software and librarians taking on project responsibilities in addition to their other duties. The members include: a systems coordinator, responsible for overall project management; a lead programmer, responsible for the overall programming of the software, but focusing on the search processes and interface design; a second programmer, working primarily on the development of the many search plugins; a systems librarian, providing research and analysis; and a support librarian, working with the partner libraries, gathering feedback, providing updates, and developing documentation and implementation assistance. The total time to move from the research stage to software implementation will be about 18 months. The diverse backgrounds and expertise of the team members have kept the dbWiz project on track for development target dates and budget requirements, while rapidly building a highly functional product.

Resource list
The initial list of priority resources for dbWiz version two consisted of the databases shared by all nine members. These included the major aggregators, e-journal resources, important periodical indexes, and e-book collections. As all of the partners were members of the COPPUL consortium, there were many resources that overlapped. Once these were identified, smaller databases shared by the majority of partners were included, and finally important unique or local databases were added, such as library catalogues, local institutional repositories, Simon Fraser University Library's Editorial Cartoons database, the Vancouver Public Library's Historical Photograph collection, and others.


Search plugins
Once we had decided upon a target set of resources, we were able to begin programming for version two, focusing first on building the many search plugins. Written as small Perl modules, these plugins are necessary for dbWiz to communicate with the diverse range of online databases and to understand the results that are returned from the resources. Although the current plugins are written in Perl, it would also be possible to create new plugins in any language, such as Java or Python.

There are two main access mechanisms that vendors support which allow dbWiz to interact with their databases. The first is web access only, which describes the majority of the commercial resources we subscribe to: the vendor provides no access to the database other than through a web browser. In order for dbWiz to search this type of database, the search plugin simulates a person searching the database, but instead of the results going to a web browser, they are parsed by dbWiz. Although this method works, it requires maintenance: if the vendor changes the way the search results are displayed, the dbWiz plugin will require modification. The second access mechanism is through an application program interface, commonly known as an API. An API is a set of instructions on how to write your own software to connect to the vendor's database. These APIs are usually, but not always, Z39.50. This access method is the most reliable, since it uses a documented interface from which to search and retrieve results. dbWiz crafts the search according to the rules of the API and retrieves the results likewise. By 2005, over 100 search plugins had been written for dbWiz version two.

Parasearch
dbWiz uses a parallel searching engine, called Parasearch. In order for the dbWiz user interface to run efficiently, the work of performing the actual searches using the plugins must be offloaded to a separate process. dbWiz communicates with the Parasearch server via the simple object access protocol (SOAP), a standard method of exchanging XML information over computer networks. SOAP is used because it is lightweight, based on XML, and uses HTTP, the same protocol as the world wide web, as its transport. A typical search goes through the following steps (a sketch of the underlying pattern follows the list):
(1) A user types a search into the dbWiz user interface and clicks on "search".
(2) dbWiz sends the search that the user entered, along with the resources that the user wishes to search, to the Parasearch server via a SOAP call.
(3) The Parasearch server looks at each resource that it is asked to search and calls the appropriate search plugin.
(4) Parasearch returns the search results to the dbWiz web server and logs out of the resource when only a limited number of concurrent users is permitted.
(5) dbWiz collects the search results and presents them to the user.
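dbWiz's actual plugins are Perl modules and Parasearch is its own server process, so the following is only an illustrative sketch, in Java, of the general pattern: a common plugin contract dispatched in parallel, with slow resources dropped after a timeout (the timeout strategy itself is described under "Search process" below). All class and method names here are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

// Hypothetical contract every per-resource search plugin fulfils.
interface SearchPlugin {
    List<String> search(String query); // returns parsed result records
}

public class ParallelSearch {
    // Dispatch the query to every selected plugin at once; results from
    // resources that do not answer within the timeout are simply dropped.
    public static List<String> searchAll(List<SearchPlugin> plugins,
                                         final String query,
                                         long timeoutSeconds) throws InterruptedException {
        ExecutorService pool =
                Executors.newFixedThreadPool(Math.max(1, plugins.size()));
        List<Callable<List<String>>> tasks = new ArrayList<Callable<List<String>>>();
        for (final SearchPlugin plugin : plugins) {
            tasks.add(new Callable<List<String>>() {
                public List<String> call() { return plugin.search(query); }
            });
        }

        List<String> merged = new ArrayList<String>();
        // invokeAll cancels any task still running when the timeout expires.
        for (Future<List<String>> future
                : pool.invokeAll(tasks, timeoutSeconds, TimeUnit.SECONDS)) {
            try {
                merged.addAll(future.get()); // completed resources contribute results
            } catch (Exception dropped) {
                // timed out or failed: leave this resource out of the result set
            }
        }
        pool.shutdown();
        return merged;
    }
}
```

In dbWiz the equivalent work is done by Perl plugins running under the Parasearch server, with SOAP rather than in-process calls carrying the request and the results.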

Authentication
A typical installation of dbWiz searches licensed databases and resources. These resources are limited to the IP addresses of the subscribing institution. dbWiz uses EZproxy, a URL-rewriting proxy server, to search resources on another institution's behalf.

dbWiz uses IP address authentication to determine which partner institution a dbWiz user belongs to. If a dbWiz user's IP address does not match an address that belongs to a partner institution, the user will only be allowed to search databases and resources that are not licensed or restricted. A user searching dbWiz from a home internet connection, which will yield an IP address that dbWiz does not recognize, will have to connect through their own institution's proxy server to search dbWiz.
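As a rough illustration of this IP-based site resolution – a simplified sketch rather than dbWiz's actual logic, with invented institution names and documentation-only address ranges – the lookup amounts to matching the client address against each partner's registered ranges and falling back to a guest profile:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Simplified sketch of mapping a client IP address to a partner site.
// A real deployment would use full CIDR ranges and a database, not prefixes.
public class SiteResolver {
    private final Map<String, String> prefixToSite = new LinkedHashMap<String, String>();

    public SiteResolver() {
        prefixToSite.put("192.0.2.", "Partner Library A");     // hypothetical range
        prefixToSite.put("198.51.100.", "Partner Library B");  // hypothetical range
    }

    // Returns the partner site for the address, or "guest" for unknown addresses,
    // which would only be allowed to search unrestricted resources.
    public String resolve(String clientIp) {
        for (Map.Entry<String, String> entry : prefixToSite.entrySet()) {
            if (clientIp.startsWith(entry.getKey())) {
                return entry.getValue();
            }
        }
        return "guest";
    }

    public static void main(String[] args) {
        SiteResolver resolver = new SiteResolver();
        System.out.println(resolver.resolve("192.0.2.17"));   // Partner Library A
        System.out.println(resolver.resolve("203.0.113.7"));  // guest
    }
}
```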


Search process
As a federated search tool, dbWiz needs to translate the search syntax of a single interface into the search syntaxes of multiple interfaces, and these syntaxes differ greatly between resources. In order to support a minimum set of search functions in dbWiz, we had to decide which features were common to enough of the resources we wanted to search. As a result, dbWiz searches the following indexes: keyword, author, title and subject. In cases where a search target does not support searching by author, title or subject, dbWiz uses the keyword index as a default.

System response time is also a crucial consideration. For dbWiz to be a useful search tool, it must return results in a reasonable amount of time. dbWiz searches all the resources simultaneously and uses a timeout strategy to determine which results to include in the final set. Resources that take too long to respond are dropped from the result list. Needing to choose between completeness and responsiveness, we decided that the priority for our target users was speed.

dbWiz is built on the same shared-host model as the other SFU Library open source products. In our initial implementation, dbWiz is hosted at SFU for nine partner institutions. dbWiz uses an IP address database to determine which site an incoming connection belongs to, and a configuration database stores the customized settings and templates that each site has set. The user is presented only with the profiles and resources available to their site.

Search interface
By February 2005, a new user interface had been created. Also written in Perl, the search interface runs in an Apache/mod_perl environment. A key consideration in the development of dbWiz is customization. Since dbWiz was initially hosted at Simon Fraser University Library for all of the partners, the interface design was an immediate priority. To address the issue of customization, the Perl Template Toolkit was selected as the user interface creation tool. Using the Template Toolkit allows each institution to customize the elements of the dbWiz user interface to its own liking.

In the current state of development, dbWiz can be searched in three different ways. First, users may select from a predefined category of resources (see Figure 3). Choosing the category "Computer Science", for example, would allow for the simultaneous search of a variety of resources on the subject, including journal article collections, e-books, online encyclopaedias and dictionaries. By focusing on subject-specific collections whenever possible, libraries help their users increase the relevance of their searching. Second, users are able to select individual resources to search (see Figure 4). This feature will be of interest to a moderately experienced searcher who may have become familiar with particular resources from previous searching.


Figure 3. Search categories

Figure 4. Resource selection

The Advanced interface consists of guided search fields and a list of searchable resources. While there is no theoretical limit on the number of resources that can be searched at once, response time will degrade as selections are added. Finally, it is also possible to embed a dbWiz search box directly into any web page. By placing dbWiz directly on a subject guide page, for example, users can quickly search the best resources for that discipline (see Figure 5).


Figure 5. Embedded searching

This method also allows libraries to create targeted searches, taking the search tool directly to the user (Abram, 2005), by placing a dbWiz link on web pages outside of the library web site, such as university department pages or individual courseware pages, with categories created for a specific discipline, course, or even assignment. In all three search methods, users are able to use Boolean operators (AND, OR, NOT) whenever these are supported by the native interface. For any federated search tool, the functionality is always limited to what is available from the resources being searched. While this has led to some concern about "dumbed down" searches (Luther, 2003), federated searching provides a service that is central to what libraries have always done – bringing resources together into one place and making them easy to find (Webster, 2004b).

Results interface
Once the search has been initiated, a set of integrated results is returned (see Figure 6). A library can determine how many records (10, 20, or more) are brought back from each resource. The greater the number of records, however, the longer the search may take, and the longer the list of results to work through. The results are displayed by whichever sorting option a library prefers (often date), but can be reordered by title, source, resource, and date. Determining relevance can be a significant challenge, as what is relevant to one user may not be relevant to another (Tennant, 2003). Currently, we are working on a relevance algorithm based on a count of keywords found in the records, as our target users have become used to results sorted in this manner. Google does not return its results by date, source, or any format other than relevance, and we need to ensure we are developing a system that works the way our users expect. Each record in a set of search results will display (when available) a brief amount of information, including title, author, date, journal title, and resource.


Figure 6. Result list

Search limits currently include date range; academic, non-academic, or both; and resources with full text, without full text, or both. dbWiz also provides a direct link to the original record in the native interface, or a link to an OpenURL resolver when a direct link is not provided. These links are crucial for users looking for fast and easy access to their research material. Ideally, each dbWiz record would contain a link directly back to the native record; however, for resources that allow only a limited number of concurrent users, we have opted to have dbWiz disconnect from the resource upon retrieving the results. This frees the resource for other users. In these cases, the OpenURL link provides the most direct route to the original record. Search results also provide a link to retrieve the next set of records for that resource ("get more records like this one") as well as a link to the native search interface for that resource (e.g. "search SFU Library Editorial Cartoons"), allowing for direct and more in-depth searching. dbWiz also maintains a search history, allowing users to quickly re-create earlier searches. The search history also serves an instructional function, displaying the names of and number of results from each resource searched. Although the results may be sorted in dbWiz in a variety of ways, the search history presents the original resources in a prominent and consistent manner. Because the interface has been created with templates and style sheets, each library has the ability to redesign its own results pages, allowing for a list that more closely resembles either a simplified search engine or a detailed subscription database.

Administrative interface
The administrative interface allows libraries to create their own search categories and add or delete resources quickly and easily from their dbWiz profile. The administrative interface borrows source code from GODOT, Simon Fraser University Library's open source link resolver. The parameters and settings of the administrative interface are stored in a PostgreSQL database.

dbWiz needs to be fast and easy to use not only for students, but also for the library staff who will be adding new resources, creating and maintaining the search categories, and customizing the dbWiz interface. After logging into the administrative interface, library staff can access the global list of dbWiz resources and activate the ones to include in their local collection (see Figure 7). Creating categories, whether based on a subject, a course, or even an assignment, can be done by simply entering a new category term (see Figure 8) and adding the related resources from the local resource listing (see Figure 9). Finally, the templates and style sheets (see Figure 10) can all be edited from within the administrative interface, allowing libraries to extensively customize the appearance of their dbWiz interface, including the ability to change colours, fonts, headers, logos, wording, and the location of information on the screen. A development "sandbox" is also available, allowing local administrators to test any changes they have made before transferring them to their active version of dbWiz.


Communication strategy
As a development project that involves partners from across a large geographical area, we needed to ensure that we maintained effective communication from the beginning. We initially relied upon traditional methods such as a web site, an e-mail

Figure 7. Activating local resources

Figure 8. Creating search categories


Figure 9. Adding resources to a category

Figure 10. Configuring templates

list, presentations at individual libraries, consultations, discussions at consortial library meetings, and conference presentations. Two new means of communication have since been added. The first is a web-based video screencast, which allows partner libraries to get a look at the software in action without requiring an on-site visit; the latest screencast can be viewed at http://theresearcher.ca

We also created a project wiki for dbWiz (see the Appendix). This provides an interactive web site where we post project documents, gather user feedback, and have begun to create a wish list for a potential third version of dbWiz. The wish list has been very important in allowing us to remain open to innovations and new ideas without feeling the need to incorporate every suggestion or to delay our committed release date. Unlike a traditional web site, which generally offers only one-way communication, the wiki provides a truly collaborative space for sharing ideas and

comments. The wiki is open for anyone to create a free account and begin posting their own ideas or responding to comments.

Implementation
Development of dbWiz will continue into the summer of 2005 based on feedback received from the partner libraries. Usability testing with our target audience of novice users will also be an important part of the refinement process. We will begin implementing the product in July, in anticipation of the beginning of a new academic year. In the fall of the same year, we will release the source code under the GNU General Public License, allowing anyone to download, install, and modify their own copy of dbWiz. We are very interested in seeing dbWiz move beyond the original partners and outside of academic libraries, and believe strongly that the product will prove beneficial to the wider library community.

Ongoing challenges
One of the most significant challenges in developing a federated search tool is creating and maintaining a local system that needs to interoperate with so many diverse systems. For this reason, the importance of the National Information Standards Organization (2005) Metasearching Initiative cannot be overstated. By bringing together the major stakeholders in federated searching, including libraries, content providers, and developers, NISO is facilitating the discussions that need to happen between these players with both common and divergent interests. Their work in standardizing access management, collection description, and search and retrieval processes will make federated searching easier, faster, and more efficient for everyone, whether operating in a commercial or an open source environment.

Another key challenge will be the ongoing maintenance of the search plugins for the different resources. As Hane (2003) notes in "The truth about federated searching", if there are more than 100 resources being searched, and each one changes an average of two or three times a year, that averages out to an update needed almost every day. We are currently developing an automated search system that will run overnight and report any errors discovered, allowing us to rapidly update any plugin that fails due to an unexpected change in a resource. In addition to maintaining the existing resource plugins in dbWiz, we are also challenged by the number of resources that our partner libraries will continue to acquire and the need to create new plugins for all of these. In both cases, the maintenance of the project requires an ongoing commitment beyond the initial development of the software.

Robust relevance ranking is one of the key factors that put Google at the top of web searching. By crawling and indexing an ever larger part of the web, Google is able to apply its relevance algorithm to the collected data and produce highly relevant results. For federated searching, however, creating our own index is not an option, due to our limited access to the vendor data. Again, efforts at standardization may help with this problem, but for the current version of dbWiz we will be applying some fairly simple keyword counting over the returned records to determine basic relevance.
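As a rough sketch of what such keyword counting can look like – an illustration under our own assumptions, not dbWiz's actual scoring code – each returned record might simply be scored by how many times the query terms appear in its visible fields:

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

// Illustrative keyword-count relevance: records with more query-term hits
// sort first. The choice of fields and any weighting are invented here.
public class KeywordRelevance {
    public static int score(String record, String query) {
        String text = record.toLowerCase();
        int hits = 0;
        for (String term : query.toLowerCase().split("\\s+")) {
            int from = 0;
            while ((from = text.indexOf(term, from)) != -1) { // count every occurrence
                hits++;
                from += term.length();
            }
        }
        return hits;
    }

    public static void sortByRelevance(List<String> records, final String query) {
        Collections.sort(records, new Comparator<String>() {
            public int compare(String a, String b) {
                return score(b, query) - score(a, query); // descending by hit count
            }
        });
    }

    public static void main(String[] args) {
        List<String> records = Arrays.asList(
                "Collective memory and public history",
                "Working memory in children: memory span and recall",
                "Library usability testing");
        sortByRelevance(records, "memory");
        System.out.println(records); // the two-hit record comes first
    }
}
```

Weighting certain fields or normalizing by record length would be natural refinements, but the principle stays the same: rank by query-term frequency in whatever the vendors return.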


Deduplication is another issue all federated search products need to deal with. Due to the lack of standards in how the data in the different resources are structured, determining duplicates programmatically is very difficult. Also, as federated searching only downloads a portion of the search results (e.g. the first ten from each resource), deduping among the complete results is "virtually impossible" (Hane, 2003). We are anticipating some progress from the NISO Metasearch Initiative before we begin to explore the options for deduplication in any future versions of dbWiz.

Project significance
Despite these challenges, the benefits of developing dbWiz are significant. The value of introducing our many Google users to the rich content in our libraries – and allowing them to see for themselves what our subscription resources have to offer – cannot be over-estimated. Based on their experience with dbWiz, these novice users may even begin to try some of those resources directly, making it an important information literacy tool that allows users to "learn by doing".

The dbWiz project also highlights the importance of library collaboration. A project of this scale would be too ambitious for any but the largest of libraries to attempt on their own. It would also be prohibitively expensive to maintain and keep up to date. Through working together, however, several medium and small academic libraries have been able to fund this project and see a successful, supported product emerge.

The dbWiz project also reveals the power of the open source development model. The tools used to build dbWiz, including Apache for the web server, MySQL and PostgreSQL for the databases, and Perl as the programming language, are all open source and are among the most robust and stable available. If dbWiz is adopted by other institutions or consortia, there is an opportunity for participatory development within the open source model, which further complements and enhances the collaboration mentioned above.

Finally, commercial federated searching products are beyond the budget of many medium or small academic or public libraries. Through the licensing of dbWiz under the GNU General Public License, these libraries will now have an entry point into the world of federated searching.

Conclusion
Although it is a very new product, dbWiz will soon become a mature component of the open source library environment. Based on a clear set of development objectives and secure funding from several partner institutions, dbWiz has moved from a proof-of-concept prototype to a fully functional federated searching tool in just over a year. Its success reflects the benefits of federated searching, the open source model, and library collaboration.

References
Abram, S. (2005), "The Google opportunity", Library Journal, Vol. 130 No. 2, pp. 34-5.
Hane, P. (2003), "The truth about federated searching", Information Today, Vol. 20 No. 9, p. 24.
Luther, J. (2003), "Trumping Google? Metasearching's promise", Library Journal, Vol. 128 No. 16, pp. 36-9.
National Information Standards Organization (2005), Metasearching Initiative, available at: www.niso.org/committees/MetaSearch-info.html (accessed 17 March 2005).

Tennant, R. (2003), "The right solution: federated search tools", Library Journal, Vol. 128 No. 11, pp. 28-9.
Webster, P. (2004a), "Breaking down information silos: integrating online information", Online, Vol. 28 No. 6, pp. 30-4.
Webster, P. (2004b), "Metasearching in an academic environment", Online, Vol. 28 No. 2, pp. 20-3.

Appendix. Additional resources
Apache Software Foundation (2005), Apache, available at: www.apache.org/ (accessed 17 March 2005).
Free Software Foundation (2005), The GNU General Public License, available at: www.fsf.org/licensing/licenses/index_html#GPL (accessed 17 March 2005).
MySQL AB (2005), MySQL, available at: www.mysql.com/ (accessed 17 March 2005).
Perl Foundation (2005), The Perl Directory, available at: www.perl.org/ (accessed 17 March 2005).
PostgreSQL Global Development Group (2005), PostgreSQL, available at: www.postgresql.org/ (accessed 17 March 2005).
Simon Fraser University Library (2005), dbWiz Project Wiki, available at: http://lib-cufts.lib.sfu.ca/twiki/bin/view/DbWiz/ (accessed 17 March 2005).
Simon Fraser University Library (2005), The reSearcher, available at: http://theresearcher.ca/ (accessed 17 March 2005).
Useful Utilities (2005), EZproxy by Useful Utilities, available at: www.usefulutilities.com/ (accessed 17 March 2005).
Wardley, A. (2004), Template Toolkit, available at: www.template-toolkit.org/ (accessed 17 March 2005).


THEME ARTICLE

Open Journal Systems: an example of open source software for journal management and publishing
John Willinsky

Received 7 June 2005 Revised 17 July 2005 Accepted 20 July 2005

Faculty of Education, University of British Columbia, Vancouver, Canada

Abstract
Purpose – To provide an insider's review of the journal management and publishing software, Open Journal Systems (OJS), from the Public Knowledge Project, which the author directs at the University of British Columbia.
Design/methodology/approach – The paper outlines the history, development, and features of OJS, including some of the experimental aspects, as well as early research results and work underway, on which it is based.
Findings – OJS (http://pkp.sfu.ca/ojs) is an open source solution to managing and publishing scholarly journals online, which can reduce publishing costs compared to print and other traditional publishing processes. It is a highly flexible, editor-operated journal management and publishing system that can be downloaded for free and installed on a local web server.
Originality/value – OJS has been designed to reduce the time and energy devoted to the clerical and managerial tasks associated with editing a journal, while improving the record keeping and efficiency of editorial processes. It seeks to improve the scholarly and public quality of journal publishing through a number of innovations, from making journal policies more transparent to improving indexing.
Keywords Academic libraries, Electronic publishing, Serials
Paper type Technical paper

Library Hi Tech Vol. 23 No. 4, 2005, pp. 504-519, © Emerald Group Publishing Limited, 0737-8831, DOI 10.1108/07378830510636300

Introduction
Open Journal Systems (OJS) was originally developed as part of the research program of the Public Knowledge Project (PKP), which I direct at the University of British Columbia[1]. It is one of a number of open source journal management and publishing systems available today, and much of the functionality described below applies to other open source systems such as Hyperjournal, eFirst XML, and the forthcoming DPubS, as well as to proprietary systems such as AllenTrack and Bench.Mark[2]. PKP had its origins during the mid-1990s in research efforts to design and create knowledge management systems that would increase the contribution that educational research made to the lives and work of teachers, administrators, policymakers, and the public. In a series of projects, PKP represented an early effort to take advantage of the initial, heady days of the internet, when this brave new world wide web promised to open the doors to all of the knowledge that had previously been available only in research libraries.

In the course of developing a number of experimental systems for making research more widely available and for integrating that research into a range of related materials as part of publicly available web sites, we found that such initiatives were able to secure the cooperation of the media in, for example, conducting demonstration projects that linked press coverage of an educational issue to research on the relevant topics. We were also able to secure government cooperation in setting up policy review sites that were informed by access to the relevant research (Klinger, 2001). What we could not secure was widespread access to the research literature that was needed to make these media and government ventures in public knowledge work. The problem with enhancing the quality of public knowledge was not that educators were too busy with teaching, or that policymakers were too caught up in local politics, or that the public was simply indifferent to research in its endless thirst for infotainment. No, the problem lay at the very source of the knowledge in question. The problem was the academic community, and its failure to make what it had learned publicly available. I felt I had little choice, at that point, but to turn my attention to whether access to research could be increased and improved.

Soon after I began to direct my work toward the study of how this access could be improved, whether by having authors self-archive their work or by moving journals into open access publishing, I was confronted by the question of what it costs to run a journal online and whether the savings on online management and publishing, if any, could form the basis of running the journal under some form of open access model. How could I ask my colleagues to consider the open access journal if I had no idea what it might cost? I only had to be asked that question twice in presentations before I decided that I had to determine an answer to it. I hired Larry Wolfson, a graduate student research assistant with an economics background, to scour the emerging literature on online publishing for costs, as well as to run a small survey among editors of online journals on this matter. It was not hard to find answers to the question, although that gave rise to a new problem: there were far too many answers, with huge differences among them.

Our inquiry certainly got off to a good start. Larry sent off e-mails to editors of electronic journals asking about their costs, while he started to scour the literature in search of published figures on online journal costs. However, before he had sent out more than a handful of e-mail queries, he had an answer back from Gene Glass, who had founded Education Policy Analysis Archives (EPAA) in 1993 as a "born digital" peer-reviewed journal. Glass was blunt and multilingual about his business model when it came to describing his operating costs: "Zero, nada, no budget, no grad assistant, no secretary" (personal communication, October 21, 2001). EPAA, I should add, is an online peer-reviewed journal that receives some 2,500 unique visitors a day from 70-80 countries (Glass, 2003). As you might imagine, we were greatly encouraged by how easy Glass made it all seem, both in gathering cost figures and in convincing others what a sensible, viable idea open access is for scholarly publishing. We were still in the early stages of our efforts to determine publishing costs, and, of course, we did not see anything even close to Glass' figure again. And in Glass' case, it turned out that he had institutional support covering a portion of his own time, which is not all that unusual for a journal editor, as well as being able to tap into the university's bandwidth and other infrastructure. But then the most successful of the automated repository models, the


arXiv.org Eprint Service, in which authors file their own papers and there is no reviewing or editing, operates with expenses that, according to its founder Paul Ginsparg, run to $9 a paper (Glanz, 2001). We went on to identify a small group of electronic journals that were spending in the area of $20,000 a year. For example, the Electronic Journal of Comparative Law had had its books reviewed by the accounting firm PricewaterhouseCoopers, which calculated that the Dutch open access quarterly was costing $20,084 annually (Bot et al., 1998; also see Fisher, 1999; Integration, 2002). A similar annual figure comes up for the BioMed Central journals when one adds up the author fees of $525 per published article that it collects (for most of its 100 or so open access journals, although a few charge more). Some journals contract out their e-journal edition, and HighWire Press, at Stanford University Library, was initially charging between $35,000 and $125,000 to set up electronic journals, with ongoing operating fees for the e-journal of several thousand dollars a month (Young, 1997). Additional figures are to be found in the report on e-journals from Donald W. King and Carolyn Tenopir, who put the cost of an electronic edition of a journal at $368 per page, or about $175,000 per year for a typical journal (1998). Then there was the Electronic Publishing Committee at Cornell University, which estimated that it would take $2,700,000 to establish an electronic publishing program at the university, serving a number of journals, although a member of the team at Cornell later told me that what had been spent was more like $600,000 (Electronic Publishing Steering Committee, 2003). Finally, Reed Elsevier estimates that it has spent $360 million developing ScienceDirect, which hosts electronic editions of its 1,800 journals, with a continuing investment of $180 million for "developing new technologies", and that is apart, of course, from the editorial costs of running the journals (Davis, 2004).

The different methods of calculation meant that there was no basis for comparing costs, but the breathtaking range of the figures spoke to nothing but the risks of moving a journal online. How could we ask editors and scholarly societies to consider open access as a viable option when we could not provide a reliable picture of what it cost to run an online journal? Well, we could tell those skeptical editors that it might cost them nothing, or more likely $20,000 a year, although it might run to more than $100,000, especially if a number of journals were involved. This seemed to leave the entire open access journal publishing movement with a less than credible case to make with editors, scholarly associations, and funding agencies. The question of what it could end up costing to move a journal online would seem to discourage all but the diehard risk-takers and do-it-yourself adventurers from considering the open access model in making the move from print to online publishing. While Stevan Harnad (2003) has argued more than once that complete open access to the research literature can be achieved by having authors self-archive their published work in institutional repositories, even he acknowledges a place for open access journal publishing in achieving the goal of greater access.

What if, we wondered, we could control one part of publishing's financial model by reducing the journal's software design and development costs to close to zero?
After all, Tenopir and King (2001) use this software development point to argue that electronic publishing does not lead to great savings: “Electronic access avoids these costs [of printing and distribution], but has a substantial additional fixed cost – putting up full text on the web, staffing, software and other technology issues including design,

functionality, searchability and speed”. If we were going to provide support for open access publishing, and more generally make the case for containing the cost of access, we needed to provide a way to reduce costs. Only by sharply containing costs could journals begin to look at reduced revenues, whether by offering open access to their online edition or by simply making their back issues free (forsaking reprint revenue). We could do this by creating open source software that was specifically developed to manage and publish journals online. The software could be designed so that it called for no greater technical skills on the part of journal editors than were commonly found among university faculty today, namely word-processing, e-mailing, and web-browsing. This software could also keep publishing costs down by taking advantage of the technical infrastructure and server capacity already in place in most university libraries, which might well be willing to host such a system, given that as more and more libraries undertook this support (whether at a fee to journals or as a public service) would contribute to increased access to the research literature, and ultimately reduce their subscription costs. The open source model was, after all, proving itself with the software Eprints.org, developed at the University of Southampton, which a good number of institutions has installed for their faculty members to self-archive their research. Open source was proving itself the well-established alternative route with the operating system Linux, otherwise known as “the impossible public good” (Smith and Kollock, 1999, p. 230). The academic community continues to play a vital role in open source software development, following on Linus Torval’s beginnings with Linux in his work as a graduate class project in Finland. More recently, the Sakai cooperative has been formed among 44 institutions and is devoted to developing open source course management software, with the support of the Mellon Foundation and Hewlett Foundation (Young, 1997). So the Public Knowledge Project gradually switched gears, away from developing knowledge management web sites that increased and enhanced public access to educational and policy research. It moved into developing an open source, easily configurable, easily installable, software for managing and publishing journals. It sought new grants to do this, hired three undergraduate computer science students, and cut its teeth in the year 2000 on developing an open source conference system that would create an open access archive of the proceedings, as well as manage the conference web site. In November of 2002, 18 months after software development began on the journal software, Open Journal Systems (1.0) was launched in St. John’s, Newfoundland. OJS was built with support from the Social Sciences and Humanities Research Council of Canada and the Pacific Press Endowment at the University of British Columbia, with further support coming from the Max Bell Foundation, and the Catherine and John D. MacArthur Foundation. The funding was provided in the context of research and development, with the software development following a range of related research projects, from policymakers’ use of open access research to the potential of open access to contribute to the research capacities of universities in developing countries (Willinsky, 2005). 
The programming of OJS was supported as part of this larger research program and was conducted by part-time undergraduate computer science students over an initial 18-month period, resulting in the delivery of OJS 1.0 in November 2002 at a cost of $45,000, with another $110,000 spent over the next 31 months leading up to the release of OJS 2.0 in May 2005. (These figures supersede, thanks to improved accounting procedures, previous figures presented on OJS costs.) In January 2005, UBC's Public Knowledge Project entered into a partnership with the Canadian Centre for Studies in Publishing, led by Rowland Lorimer, and the Simon Fraser University Library, directed by Lynne Copeland, with the aim of providing ongoing support for OJS, as well as Open Conference Systems and the PKP Harvester. Simon Fraser University Library is providing a hosting and publishing support facility for journals wishing to subscribe to such services, while the Canadian Centre for Studies in Publishing will provide editorial training for systems such as OJS. While this is unlikely to make the ongoing development of OJS self-sustaining – at roughly $50,000 a year – the ongoing funding from institutions and grants has to be weighed against the benefits of this public good, including its ability to reduce publishing costs across all users, producing net savings to journals, and thus in principle to libraries, that are well ahead of OJS's ongoing costs.

The development costs serve as a reminder that open source software is not free. The better part of that expense has gone into creating a system that is more than user friendly. It was designed to offer journal editors all of the necessary options required by the varying editorial standards followed by different disciplines, from journals in which authors select the editor to whom they wish to submit, to journals where multiple rounds of review by the same reviewers are standard. OJS is also carefully set up to assist those who have little experience with journal publishing. Establishing a new journal, or helping a fledgling one find its feet, can, after all, support the development of local research and review capacities in areas of higher education where that has not been part of the academic culture because of a lack of opportunities to participate. Too often, universities foster the attitude that work must appear in the highest ranked journals to count for anything. But without a series of intermediary steps up that steep academic ladder, and without journal experience with reviewing and editing, scholarly publishing can become an all-or-nothing career game that does little to foster opportunities for a new order in the global circulation of knowledge. The easy portability and use of OJS is intended to serve that larger global goal.

Now that OJS has been in use for over two years, we have drawn on the experiences of many editors to continue to increase the flexibility and possible configurations of the system. It is currently being used, in whole or in part, in its original or modified form (it is open source), by over 250 journals to manage and publish their content online. OJS is also supported by contributions coming in from around the world, in the form of bug fixes, translated files (it is now available in five languages), and a subscription module, with an active Support Forum with close to a hundred registered members. The journals using OJS range from subscription journals looking to reduce their expenses to open access journals in the humanities that follow Gene Glass's zero-budget tradition of scholarly publishing, relying on skilled volunteers for all of the critical roles in the publishing process (editor, copyeditor, layout editor, and proofreader), roles which are not about to be automated by systems like OJS.
Installation
OJS is designed to cover all aspects of online journal publishing, including the setting up of a journal web site; the handling of the author's submission through peer review and editing; the management of issues and archives; and the indexing and search capacities of the journal. The software can be downloaded from the Public Knowledge Project web site and installed on a web server with a Linux, Windows, or Unix operating system, running Apache, PHP and the MySQL database. This download-and-install approach is intended to enable local control of journal publishing, while still operating within a distributed system for indexing and system development. Most journal management systems provide centralized hosting as part of their service contract, adding to the cost of operating the journal. More than a few of the journals using OJS have the software hosted on a university library or other institutional machine, in light of the benefits the institution gains from the growth of open access to research and scholarship. In the case of Africa, for example, UNESCO has agreed to host African journals using OJS as part of its African Network of Scientific and Technological Institutions program, located in Nairobi.

Once OJS is installed on a local server, it can be used to generate any number of journals from that site. Once a journal is created on the server, it is ready to be configured by the journal manager or editor, who can do this by simply filling in a series of templates in the Setup section of the journal. The templates cover the journal's basic details (title of the journal, principal contact, sections of the journal, etc.), as well as providing a place to post and manage journal policies, processes, and guidelines. Through this process, OJS creates a customized web site for managing and publishing the journal. With the web site in place, authors can submit their work directly to the site; editors can drop in to the journal's workspace at the airport, using their laptops to oversee the review process; reviewers can pick up assigned papers and post their reviews; and accepted papers are edited, laid out, published, and indexed, all on the site. OJS is designed to enable a single editor to manage a journal and the journal's web site. It can also support an international team of editors, with shared responsibilities for a journal's multiple sections.

The web site that OJS sets up serves as an editorial office for the journal, while the system sees to the labeling, filing, and tracking of all submissions; provides a work space for editors, reviewers, copyeditors, layout editors and proofreaders; and supplies a workflow process that moves submissions through each of the necessary steps, ensuring that they land on the right desktop at the right time in the editorial process. So when it comes to calculating the savings from using such a system, one can begin with real estate, and the prospect of not having to maintain an editorial office, with all of the associated furniture and overhead. Or, if one already has such an office, there is the prospect of sub-let revenue. There may be no bottle of wine in the OJS cupboard, but the virtual online editorial office is always open, always available with a complete set of records and materials, and can be reached from any computer with an internet connection.

The editorial process
OJS is intended not only to assist with journal publishing, but is also designed to demonstrate to editors how the cost of journal publishing can be reduced to the point where providing readers with "open access" to the contents of the journal may be a viable option. OJS reduces the clerical, management, and publishing costs of journals (see Table I).
This was a necessary first step, of course, if there was to be any hope of journals being able to make their contents free for readers through some form of open access.


Table I. E-journal management systems savings (in relation to print journals) based on OJS

Stage (Agent): Automated and assisted journal management
Submission (Author): a) Manuscript, appendices, data, instruments, etc. are uploaded to the journal in a variety of file formats; b) templates provided to assist the author in indexing the work
Submission (Editor): a) Author is notified of submission receipt; b) submission is dated and queued for review; c) file can be readily modified (e.g., remove author's name)
Peer review (Editor): a) Reviewer contacts, interests, and record maintained; b) reviewer contacted with title, abstract, date, etc.; d) review due date, with reminders, thanks, available; e) review progress tracked (and viewable by author)
Peer review (Reviewer): Comments managed and editor contacted
Editor review (Editor): a) Author notified with reviews (complete or excerpts) and access provided to marked copies; b) complete archival record of review process maintained
Revisions (Editor): a) Back and forth with author and submission facilitated; b) paper re-circulated among reviewers, as needed
Editing (Copyeditor, Proofreader): a) Link to editor and author re submission queries; b) link to layout for proofreading changes
Layout (Layout Editor): Manages multiple formats (HTML, PDF, PS) with previews
Publishing (Editor): a) One-click scheduling and ordering of articles and sections; b) volume and number, special issue, assignment
Distribution (Editor): Automated e-mail notification of contents to readers, authors, and editors
Indexing (Readers): a) Automated harvesting of article metadata by Open Archives Initiative engines, including citation indexes; b) articles linked to relevant items in open access databases, based on the article's keywords
Interchange (Readers/authors): Comments on articles can be posted, and an online forum maintained for continuing exchange on a range of themes
Archiving (Host library or society): Web host provides server maintenance, backup, and content migration to new systems
Upgrading (OJS): Open source community continues to develop the system
Savings (in relation to print journals): clerical time, copying, postage, courier, stationery, editor time; printing services, time; postage, packaging, time; third-party indexing services; cataloguing storage; software; not otherwise available

OJS is structured around the traditional journal workflow required to move a submission through reviewing and, if accepted, editing and publishing, with records maintained of who is doing what and when (see Figure 1). OJS uses a prepared set of e-mails to contact the necessary people at each step, whether author, editors (managing, section and layout), reviewer, copyeditor, or proofreader. These e-mails, which are used to coordinate processes among editors, authors, reviewers, etc., contain the necessary information for each submission, filled in automatically. An e-mail can be personalized by an editor prior to sending, except in such cases as automated reminders.

Figure 1. Editorial workflow process for OJS

To take an example of how a journal management system such as OJS works in action, consider the most common task of an editor, namely, assigning two or more reviewers to evaluate a manuscript for possible publication. The editor logs onto OJS through her internet browser, whether at the office, home, or airport (a cell-phone version of the program has yet to be created). On entering the journal's web site, the
editor first comes to a table that sets out the current state of her assignments, with some submissions awaiting an overdue peer review, and others that have just arrived and need to have peer reviewers assigned to them. With the new submissions, the system has already notified the authors with a standard e-mail indicating that the manuscript was successfully uploaded to the journal, and inviting them to log in to check the progress of their submission. The editor goes to the Submission Review page for one of the new submissions and takes a look at the paper by downloading it, to see if it is suitable for the journal and ready for review. Once satisfied on that count, the editor clicks a Select Reviewer button. This takes the editor to a list of reviewers that indicates their areas of interest, the date their last review was assigned and completed, as well as how many reviews they have completed. The editor scrolls or searches for a suitable reviewer, or decides to enter a new name, before clicking the Assign button. The Assign button causes a window to appear, containing a prepared e-mail addressed to the reviewer from the editor. This e-mail presents the paper's title and abstract and invites the reviewer to visit the site and download the paper (or, if the editor chooses, the submission is sent out as an e-mail attachment). Once the editor sends the e-mail, the name of the reviewer, along with the date the invitation was issued and the deadline date for the review, are recorded on that submission's Review page. All this can be accomplished in the time it might otherwise take to ask an editorial assistant to check when a certain colleague had last reviewed for the journal. The editor then moves on to select a second and possibly a third reviewer for the submission. And while the editor will devote whatever time is saved, and then some, to assessing the reviews and providing helpful advice to the authors, the process outlined here needs to be compared to Fytton Rowland's (2002) determination that the current average cost of the peer review process for journals is $400 per published paper.

In the example presented in Figure 2, which is drawn from a demonstration journal we have set up, the section editor (Rory Hansen) is conducting the peer review of a submission entitled "Understanding in the Absence of Meaning: Coming of Age Narratives of the Holocaust". In this case, Reviewer A (Simon Casey) has pasted in a Review and submitted a Recommendation for the section editor to consider (in a rather unrealistic turnaround time). The section editor has just selected a Reviewer B (Eunice Yung), but has yet to send out the Request e-mail inviting Reviewer B to enter the journal web site and conduct the review. When both Reviews and Recommendations are in, the section editor can import the Reviews into the Editor/Author Correspondence box, edit them, and add an explanation of the editorial decision arrived at for this submission. If revisions are invited, the author is able to upload a revised version of the paper, which can be entered into a second round of reviews if the section editor decides that the submission should be resubmitted. OJS maintains a log of all e-mails sent, reviews filed, and selections made as part of its record of the editorial process.

Figure 2. A screenshot of a submission review page with OJS
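The prepared-e-mail step just described can be pictured with a short sketch. This is an illustrative assumption rather than OJS's actual code or templates (OJS is written in PHP, but its internal API is not reproduced here): the placeholder names, the wording, and the use of PHP's mail() function are invented, while the reviewer, editor, and title are taken from the demonstration journal above.

<?php
// Hypothetical review-request template; OJS's real templates and field
// names will differ.
$template = "Dear {REVIEWER_NAME},\n\n"
          . "I believe you would serve as an excellent reviewer of the manuscript\n"
          . "\"{ARTICLE_TITLE}\", submitted to {JOURNAL_NAME}.\n\n"
          . "Abstract: {ABSTRACT}\n\n"
          . "Please log in to the journal web site by {RESPONSE_DUE_DATE} to indicate\n"
          . "whether you will undertake the review.\n\n"
          . "{EDITOR_NAME}, Section Editor";

$fields = array(
    '{REVIEWER_NAME}'     => 'Eunice Yung',
    '{ARTICLE_TITLE}'     => 'Understanding in the Absence of Meaning: Coming of Age Narratives of the Holocaust',
    '{JOURNAL_NAME}'      => 'Demonstration Journal',
    '{ABSTRACT}'          => '...',                               // pulled from the submission record
    '{RESPONSE_DUE_DATE}' => date('j F Y', strtotime('+2 weeks')),
    '{EDITOR_NAME}'       => 'Rory Hansen',
);

// The editor may still edit the body before sending (automated reminders excepted).
$body = strtr($template, $fields);
mail('reviewer@example.org', 'Article review request', $body);

// The assignment would then be logged against the submission's Review page:
// reviewer name, date of invitation, and review due date.

In OJS itself the editor sees and can edit such a message in the browser before it is sent; the sketch is only meant to show why this step costs the editor so little time.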
The publishing options for the journal using OJS include the full range of article formats, including PDF, HTML, and Postscript. The careful formatting and layout of these articles is not something, as noted above, that OJS has automated. The preparation of the galleys in one or more publishing formats must be done by someone who has the appropriate skills and access to the software (e.g., Adobe Acrobat). As with copyediting and proofreading, there are no shortcuts for these steps when it comes to producing quality copy for the journal. What OJS does, however, is allow resources to be concentrated on such tasks, by taking good care of the ordering, alerting, and organizing of these processes.

OJS can publish the contents of the journal in a standard issue format, with 10-12 items, or the editors can decide to publish each article as soon as it has completed the editorial and layout process. This continuous publishing approach is something journals are doing more often now, taking advantage of the new technology rather than slavishly following what are becoming the anachronisms of the earlier form (when it made economic sense to bind articles together and issue them in a set). We are also addressing the issue of journal preservation through the use of Stanford University Library's LOCKSS (Lots of Copies Keep Stuff Safe) system, which provides "a persistent access preservation system" involving a number of cooperating libraries. This also speaks to the approach, mentioned above, of research libraries cooperating on the distribution of journal hosting and publishing responsibilities, an idea that needs to be explored further in terms of its potential ability to reduce overall costs to


the libraries. Elsewhere, I present the closely related case for libraries and scholarly associations entering into an open access publishing cooperative, while the economic feasibility of publishing cooperatives is also being investigated by Raym Crow on behalf of SPARC (Willinsky, 2005).

Journal indexing
On submitting a paper to the journal's web site, the author is asked to provide the appropriate indexing information or metadata. This does mean additional work for the author, but compared to the old days of just a few years ago, when an author making such a submission had to make multiple copies, prepare a letter, and post it to the journal, it results in a saving of time, energy and cost (if somewhat offset in developing countries by the price of using an internet café, which faculty members often have to do). The principle at issue is again one of moving energy from clerical tasks to those that contribute to the quality of the published work. Thinking about the indexing of one's work does that, compared to photocopying it, as it gets authors to think about how they position their work within the larger field. Of course, professional indexers and cataloguers would do a far better job of classifying a work than most authors. However, increasing access to the research literature entails increasing access to indexes, and in light of how much indexing services charge libraries, there exists a need for an alternative to professional indexing, especially for universities in the developing world (Willinsky, 2005).

The actual extent of author indexing is a somewhat experimental aspect of OJS. The editors can determine which indexing elements or metadata to include in their journal, and they can provide authors with relevant examples from their own field (with links to classification systems or a thesaurus) to guide the indexing process. The indexing in OJS adheres to the Open Archives Initiative Protocol for Metadata Harvesting, which is based, in turn, on the Dublin Core metadata set of 15 elements. OJS supports an extended form of the Dublin Core, allowing journals to have authors index, for example, characteristics of research subjects (such as age and gender), as well as index the research methodology or method of analysis used by the work (see Table II). As the web grows, and the research literature along with it, greater precision of indexing can provide some protection against the threat of sheer information overload. One reason for thinking that research libraries are good places to have journal systems like OJS hosted is that the library is also the home of indexing and information science expertise, which could contribute to this aspect of publishing, if only by occasionally reviewing authors' indexing patterns and providing useful advice and guidance. The goal is to afford more readers accurate searching among electronic research resources, without completely eliminating serendipity. It is also a way to create more inclusive and immediate indexing than is otherwise available from commercial indexing services (Willinsky and Wolfson, 2001).
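As a rough sketch of what this author-supplied metadata amounts to once it is exposed for harvesting, the following PHP fragment builds an unqualified Dublin Core record of the kind an OAI-PMH harvester would collect. The array of submission values and the helper code are illustrative assumptions, not OJS's own implementation; the oai_dc and dc namespace URIs are the standard ones.

<?php
// Hypothetical submission metadata, echoing the elements in Table II.
$submission = array(
    'title'    => 'Understanding in the Absence of Meaning',
    'creator'  => 'A. Author, Example University, author@example.edu',
    'subject'  => array('Holocaust literature', 'coming-of-age narratives'),
    'type'     => 'Peer-reviewed article',
    'language' => 'en',
    'rights'   => 'Author retains copyright, granting first publication rights to the journal',
);

$doc = new DOMDocument('1.0', 'UTF-8');
$doc->formatOutput = true;

// Root element in the OAI Dublin Core container namespace.
$dc = $doc->createElementNS('http://www.openarchives.org/OAI/2.0/oai_dc/', 'oai_dc:dc');
$doc->appendChild($dc);
$dc->setAttributeNS('http://www.w3.org/2000/xmlns/', 'xmlns:dc', 'http://purl.org/dc/elements/1.1/');

foreach ($submission as $element => $values) {
    foreach ((array) $values as $value) {
        $child = $doc->createElementNS('http://purl.org/dc/elements/1.1/', 'dc:' . $element);
        $child->appendChild($doc->createTextNode($value)); // createTextNode handles escaping
        $dc->appendChild($child);
    }
}

echo $doc->saveXML();

The extended, journal-specific elements described above (research sample, method, and so on) would travel as additional elements alongside these in whatever richer format the journal chooses to expose.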

Table II. The use of the Dublin Core metadata in the indexing of materials published in journals using OJS

Dublin Core: OJS indexing for scholarly journals
1. Title: Title of article, book review, item, etc.
2. Creator: Author's name, affiliation, and e-mail
2. Creator: Biographical statement
3. Subject: Academic discipline and sub-disciplines
3. Subject: Topics or keywords
3. Subject: Disciplinary classification system, if available
4. Description: Abstract of article
5. Publisher: Publisher or sponsoring agency (name, city, country)*
6. Contributor: Funding or contributing agencies to the research
7. Date: When paper was submitted to journal*
8. Type: Peer-reviewed, non-refereed, invited; article, book, review, etc.*
8. Type: Research method or approach
9. Format: HTML, PDF, PS (file formats)*
10. Identifier: Universal Resource Indicator*
11. Source: Journal title, volume (issue)*
12. Language: Language of the article
13. Relation: Title and identifier for document's supplementary files (e.g., research data, instruments, etc.)*
14. Coverage: Geographical and historical coverage
14. Coverage: Research sample (by age, gender, ethnicity, class)
15. Rights: Author retains copyright, granting first publication rights to journal (default version)*
Note: * items are automatically generated by Open Journal Systems, with all other items entered by the author on submission of the article, and later reviewed by the editor

Reading tools
A second experimental aspect of OJS has been focused on improving the design of the reading environment which online journals create for the content they publish. It is true that the most common way of reading articles found online is still to slide the cursor over the print button. However, readers are slowly discovering the advantages of reading online, even as the quality of screens and the portability of the machines improve. Our goal is to take advantage of online resources and tools to improve the quality of critical engagement with this literature while it is online. These improvements have to be made, however, without adding significantly to the journal's costs or the editor's workload – given the exigencies of open access publishing and archiving – and they cannot get in the way of the primary readership of the journal, the researchers themselves, even as these tools provide additional support for less experienced readers of this research (which was the original impetus of the Public Knowledge Project).

In seeking to improve the reading environment, we have turned to the research on learning how to read, and we set out to build on the excellent model established by HighWire Press, PubMed and other sources, by extending the typical set of links that these systems provide for each article, with the aim of creating a richer context for reading journal articles. The HighWire journals, for example, provide support for expert readers, whether with links to related articles in the same journal or to works by the same authors. We set out to build Reading Tools, as we call them, which would assist the wider range of readers who will follow on the heels of open access[3]. The Reading Tools sit just beyond the margins of the article, looking much like a traditional paper bookmark (see Figure 3). At this point, we have developed 20 sets of Reading Tools to cover as many of the academic disciplines and broad fields as possible, depending on the availability of open access resources and databases. Each set typically provides readers with 10-15 links to other open access sites and databases. The journal's editors can reconfigure the Reading Tools to direct readers to further
relevant sources. Figure 3 presents one of the current prototypes for the Reading Tools, using the article introduced earlier from the field of education as its example.

Figure 3. Reading tools for use with OJS journals in the field of education

While we have only begun testing whether such tools can help a wide range of readers read research, the initial responses to the tools from readers in the design phase have been positive (Willinsky, 2004). Our studies are focusing on whether the tools can contribute to comprehension, evaluation, and utilization of research among the public, related professions (such as teachers and physicians), policymakers, and researchers. The Reading Tools in the design shown here start off by answering a question that troubles many readers of information online, as they identify whether the article being read is peer-reviewed or not, with a hyperlink to an explanation of what the peer-review process is about. Also close to the top of the Reading Tools is a link that reads "View the item's metadata". A click on it reveals the study's indexing information, including, as discussed above, its discipline, keywords, coverage, method, and sponsor. This addresses another concern identified in the research on reading, namely that inexperienced readers have difficulties identifying the significant concepts – separating core ideas from the noise – around which to associate related points and arguments (Alexander et al., 1994). Then, moving down the Reading Tools, "To look up a word" enables readers to double-click on any word and send it to one of two free online dictionary services. There is also a set of links for finding items that are related or relevant to the article being read. These include Author's Other Works, Research Studies, and Online Forums. Clicking on one of these presents the reader with a choice of relevant open access databases. With Author's Other Works, for example, the author's or authors' names are fed into an open access database, such as ERIC (the US federal government's Education Resources Information Center) in the field of education, which lists these other works, with abstracts or articles, in a window for the reader to consider consulting. With Research Studies, and Discussions and Forums, the relevant open access databases that we have identified in advance are searched using the first two keywords provided by the author of the article, to ensure relevant materials come up. Before any search, the reader can change the keywords provided by the author to further focus the search. The reader can then use the articles that come up from a search for related studies or the author's other works as points of comparison, or as studies to pursue in themselves. Through the Press and Media Reports and Government web site links, readers are also led to see that the context for reading research is not always other research, but can be other relevant public materials that give a contemporary and applied context to the work being read.

Now the risk with such reading tools is that the reader will be overwhelmed, or at least sufficiently distracted that the value of access to this research will be diminished. This may be all the more so for those with little experience reading this material, while the expert will see them as no more than another nuisance associated with online reading. Our preliminary investigations with policymakers and complementary healthcare workers suggest that the tools provide them with a greater sense of the research's value and contribution to their understanding. Still, as we say in this business, more research is needed on the reading of research, especially in light of this new openness. What should be clear is that reducing publishing costs and enhancing publishing efficiencies is only part of the case for a system such as OJS, just as toll-free access should be only part of the case for open access to research and scholarship. What is no less important in both cases is using what we know about reading and publishing, about access and learning, to extend the circulation of this knowledge.

Conclusion
In terms of where OJS is headed, development of the program continues apace, with the 11th upgrade, version 2.0.1, released in July 2005. OJS can now support multiple journals from a single site, as well as offer PDF searching, a complete help manual, multiple rounds of reviewing, automated reminders, reviewer ratings, and a host of other features. The community of journals deploying OJS continues to grow, with over 140 registered users on the PKP Support Forum, and further translations of OJS are underway within that community. While Simon Fraser University Library has taken over the technical development and support of OJS, in conjunction with its journal hosting service, we remain committed to OJS and the related Open Conference Systems as open source software for use worldwide. Our attention continues to be focused on ways of improving the contribution of such systems to university research capacities and research cultures in developing countries, as well as supporting the public quality of open access. To that end, we are working with universities and organizations in Ghana, South Africa, India, and Pakistan on publishing initiatives.
We are also looking into ways of increasing the use of XML in the publishing process, in collaboration with our user community, for layout, citation checking, and multiple output formats, as well as improving compatibility with related systems, such


as institutional repositories. We hope to see our work with the open access Reading Tools move beyond OJS, by exploring how they might work as part of a general browser or a library application. Finally, this work has led us to explore, in association with Mikhail Gronas at Dartmouth College, new possibilities for the increasingly popular blog as a dynamic and responsive space for faculty and graduate students to develop rough research ideas and working papers, prior to formal submission to peer-reviewed journals. At every point, the goal of this continuing program of research and development is to increase the scholarly and public quality of research. Certainly, the Public Knowledge Project's own research program will remain focused on the impact and contribution of increased access to knowledge, in its efforts to better understand the potential of this new publishing medium.

Notes
1. For more information, see the Public Knowledge Project (http://pkp.ubc.ca) and OJS (http://pkp.sfu.ca/ojs).
2. See Hyperjournal (www.hjournal.org/), eFirst XML (www.openly.com/efirst/), DPubS (http://dpubs.org), AllenTrack (www.allentrack.net), and Bench>Press (benchpress.highwire.org).
3. A working version of the Reading Tools, integrated into OJS, is available at: http://pkp.sfu.ca/ojs/demo/present/index.php/demojournal/issue/current

References
Alexander, P.A., Kulikowich, J. and Jetton, T.L. (1994), "The role of subject-matter knowledge and interest in the processing of linear and nonlinear texts", Review of Educational Research, Vol. 64 No. 2, pp. 201-52.
Bot, M., Burgemeester, J. and Roes, H. (1998), "The cost of publishing an electronic journal: a general model and a case study", D-Lib Magazine, p. 27, available at: www.dlib.org/dlib/november98/11roes
Davis, C. (2004), Scientific Publications: Uncorrected Transcript of Oral Evidence To Be Published As HC 399-I, House of Commons Minutes of Evidence Taken Before the Science and Technology Committee, United Kingdom Parliament, London, available at: www.publications.parliament.uk/pa/cm200304/cmselect/cmsctech/uc399-i/uc39902.htm (accessed April 8, 2004).
Electronic Publishing Steering Committee (1998), Electronic Publishing Steering Committee Report on Electronic Publishing Strategies for Cornell University, Cornell University, Ithaca, NY, available at: www.library.cornell.edu/ulib/pubs/EPSCFinalReport1998.htm (accessed September 29, 2003).
Fisher, J.H. (1999), "Comparing electronic journals to print journals: are there savings?", in Ekman, R. and Quandt, R.E. (Eds), Technology and Scholarly Communication: The Institutional Context, University of California Press, Berkeley, CA, pp. 95-101.
Glanz, J. (2001), "Web archive opens a new realm of research", New York Times, May 1.
Glass, G. (2003), "Education policy analysis archives activity", paper presented at the Annual Meeting of the American Educational Research Association, Chicago, IL.
Harnad, S. (2003), "On the need to take both roads to open access", BOAI Forum Archive, available at: http://threader.ecs.soton.ac.uk/lists/boaiforum/130.html (accessed September 28, 2003).

Integration of IDEAL with ScienceDirect (2002), "Integration of IDEAL with ScienceDirect", Scholarly Communications Report, Vol. 6 No. 6, p. 3.
Klinger, S. (2001), "Are they talking yet? Online discourse as political action in an education policy forum", PhD dissertation, University of British Columbia, Vancouver.
Rowland, F. (2002), "The peer-review process", Learned Publishing, Vol. 15 No. 4, pp. 247-58.
Smith, M.A. and Kollock, P. (Eds) (1999), Communities in Cyberspace, Routledge, London.
Tenopir, C. and King, D.W. (2003), "Lessons for the future of journals", Nature, Vol. 413 No. 6857, available at: www.nature.com/nature/debates/e-access/Articles/tenopir.html (accessed September 29, 2003).
Willinsky, J. (2004), "As open access is public access, can journals help policymakers read research?", Canadian Journal of Communication, Vol. 29 Nos 3/4, pp. 381-94.
Willinsky, J. (2005), The Access Principle: The Case for Open Access to Research and Scholarship, MIT Press, Cambridge, MA.
Willinsky, J. and Wolfson, L. (2001), "The indexing of scholarly journals: a tipping point for publishing reform?", Journal of Electronic Publishing, Vol. 7 No. 2, available at: www.press.umich.edu/jep/07-02/willinsky.html
Young, J.R. (1997), "HighWire Press transforms the publication of scientific journals", The Chronicle of Higher Education, May 16.

Further reading
King, D.W. and Tenopir, C. (1998), "Economic cost models of scientific scholarly journals", paper presented to the ICSU Press Workshop, Keble College, Oxford, March 31-April 2, available at: www.bodley.ox.ac.uk/icsu/ (accessed September 29, 2003).


THEME ARTICLE

Using open source to provide remote patron authentication


Jackie Wrosch Detroit Area Library Network, Detroit, Michigan, USA

Received 7 June 2005 Revised 17 July 2005 Accepted 7 August 2005

Abstract Purpose – To develop an open-source remote patron authentication system to replace a problematic, proprietary vendor product. Design/methodology/approach – Functional requirements were developed using the vendor product as a model, with additional requirements determined by the libraries planning to use the application. Using PHP on an Apache web server with a connection to our ILS database on Sybase, a flexible system was created that can be configured to the local libraries' requirements. Findings – Overall, the new system has been welcomed, and the most widespread problems we encountered have been resolved. Most importantly, though, using an in-house system empowers libraries to introduce enhancements and bug fixes as soon as possible and not rely on a vendor's schedule for doing so. Research limitations/implications – A project like this would not be possible if the ILS database were proprietary and inaccessible from open source technologies like PHP, or if the data structures were not published. Practical implications – The remote patron authentication system is only one possible use of these technologies. Other applications using ILS data could be developed. Originality/value – Using PHP with Apache and a connection to the ILS database, the necessary functionality was retained and other features were added that improved reliability, configurability and cross-browser usage. By embracing this approach, the authors also retained control over its future development and improvement. Keywords Library automation, Library networks Paper type Technical paper

Library Hi Tech Vol. 23 No. 4, 2005 pp. 520-525 © Emerald Group Publishing Limited 0737-8831 DOI 10.1108/07378830510636319

Introduction
Using open source technologies to improve traditional services, as well as to develop alternatives to proprietary vendor products, can empower libraries by giving them more control over a product's functionality. The Detroit Area Library Network (DALNET) has taken advantage of the HTML-embedded scripting language PHP to develop web applications to do just that. Most notably, a locally developed remote patron authentication system replaced the one provided by our ILS vendor.

Established in 1985, DALNET (www.dalnet.lib.mi.us) is a 22-member, multi-type library consortium. Academic, public, medical, museum and law libraries of varying sizes are all members of the consortium. Its primary purpose has traditionally been to provide an ILS to each member library and staff to support its operation. DALNET employs a staff of five, including a Director, Systems Analyst, two Systems Librarians and an Office Manager, and is governed by a Board of Directors which includes a representative from each member library.

Since 2001, DALNET has offered a patron authentication service for its member libraries. This service, provided by an add-on product, Remote Patron Authentication (RPA), purchased from our ILS vendor, was first used to offer remote access to subscription databases for library patrons. Beginning in 2002, the product was also used to provide authentication and borrower information to a user-initiated resource sharing system known as Michigan Library Exchange (MiLE), running URSA from Dynix. Although it filled a critical need, we experienced many problems with the product, including the instability of the processes running on our server, conflicts between leftover cookies and subsequent logins, JavaScript functions behaving differently in different browsers and browser versions, as well as an ongoing defect that allowed expired patrons to log in successfully. It is due to these issues that we replaced our purchased proprietary product with an in-house one that uses the open source technologies PHP and Apache.

Defining requirements
Even with its problems, RPA did provide some essential capabilities that we did not want to lose. It allowed each library to configure what constituted a successful login. Libraries could define which borrowers were permitted to access a resource; for example, a patron who owed over a certain amount in fines, had too many items out, or had a history of lost items could be blocked. Although each parameter was configured by individual libraries, the vendor selected the possible criteria. The one unconfigurable aspect that we found especially problematic, particularly for compliance with subscription database licenses that only permit access to current students and faculty, was that we could not block expired patrons from logging in. In addition to borrower-specific login criteria, RPA also permitted configuration of internal IPs that would skip the login process altogether and proceed with IP authentication by the vendor.

As part of a successful login, RPA uses a "success URL". This is the URL to which the patron is redirected after a successful login. It could be a subscription database URL that validates the referrer URL for authentication on the vendor side, or it could be a URL with embedded login information or a vendor-provided login script. We could certainly debate the merits, or lack thereof, of these vendor methods of authentication; however, they are offered by many of our libraries' subscription database vendors and provide a method of access we did not want to lose. Finally, our vendor product provided ILS-stored borrower information, beyond that supplied by the user at login, to our resource sharing system. After a successful login, a patron's name, type, e-mail and barcode were all added as name-value pairs to the resource sharing system success URL. Our in-house system would need the ability to extract this information from our ILS database (Dynix Horizon on Sybase) and add it to the success URL.

Using both our vendor product and our list of issues with it, we were able to determine exactly what our DALNET Patron Authentication service would do: we wanted to reproduce our vendor authentication product by validating against our patron database, while improving reliability, configurability and cross-browser functionality. For these reasons, we chose to use PHP on Apache with a direct connection to our ILS Sybase database. Apache is a famously stable product which we have used to provide many other web-based services, and we have never experienced any


stability problems with its processes. Any other web server that supports PHP could also be used. PHP allows us to do server-side processing and return straight HTML to the user. This eliminates many of the cross-browser functionality issues that occurred when RPA used client-side JavaScript for a large part of the validation process. It is also flexible enough to allow us to configure for individual libraries and their particular setups. And although Sybase is not an open source technology, it is one of the database platforms used by our ILS vendor. We have also used our PHP/Apache setup with connections to both SQL Server and MySQL to provide authentication for other ILS databases; what is important is understanding how the borrower information is stored within the database.

Application functionality
The user comes to the authentication web page, where five major tasks happen. First, the library is determined from a parameter passed in by the user; this determines which configuration files to use and which database to validate against. Next, the IP address the request is coming from is checked. If it is an internal IP as defined by the library, this is indicated and stored as a variable in the session. If there is an incoming barcode passed by the user, it is validated against the ILS database and the successful-login configuration determined by the library. This information is then stored in another variable in the session. It is important to do both the barcode and IP checks, because not all resources, like our MiLE system, allow IP authentication. Of course, if no defined resources use IP authentication, this step could be bypassed. After this, if the patron request contains a selected resource, the success URL is pulled from a configuration file and any special processing required is done, for example appending a barcode or e-mail address to the URL. Finally, the page displays the appropriate content depending on what happened in the previous steps (Figure 1).

The functionality of this system is contained in the PHP code in the file login.php and additional configuration files that define aspects of each library's setup. The first of these files is libs.conf, which defines the name of the library/institution and the directory in which the rest of the configuration files are stored. The first request to login.php must contain a library= parameter or no other processing will occur and the user will be prompted to choose a library from a list of those available. Once the library is received through a request, it is stored in a session variable and is subsequently accessed through the session. In the next step, login.php will retrieve the IP from the request via the $_SERVER['REMOTE_ADDR'] variable and compare it to the IPs found in the ips.conf file. IP ranges can be stored as regular expressions in this file and are compared accordingly. A

Figure 1. Data flow

session variable will then be set and accessed when determining what to display later in the session. If a request contains a barcode value, login.php will then proceed to check this against the parameters set up for the particular library. It will connect to the ILS database and query it to see if the barcode exists. It will then check that the borrower type corresponds to a valid one found in the btypes.conf file. It will also check that the borrower is not expired, does not have too many items out, and does not owe too much money, based on data retrieved from the ILS database. Of course, any one of these checks could be removed and others added. These are all defined through the PHP file, which can be configured by the local library to validate on what it feels is important. After completing this series of checks, a session variable will be set and, if needed, an error message returned to be displayed to the user.

If either the IP check or the barcode login has succeeded and login.php receives a resource value in the request, the following step will proceed. The resource value corresponds to a unique identifier defined by the library in the rs.conf file. In addition to the unique identifier, rs.conf includes a name for the resource, a success URL and a vendor authentication type code that is used during the display. At this point, if the incoming resource parameter is found, the success URL is stored in a variable; if not, an error indicating an invalid selection is returned to the user. If any special processing is needed to complete the success URL, for example if the patron's name needs to be added, this will occur now. This special processing is something that would be configured by the local library for each resource that requires it.

After all of these steps, login.php uses the responses it has received to determine what to display. If there is an error of any kind, it is displayed first. If no library has been selected or the library was not found in the libs.conf file, a dropdown menu of available libraries is displayed. Next, if the barcode login has not been successful, login.php will display a login box. If either the IP or barcode check has succeeded and a valid resource has been selected, that resource will open in a separate window. Finally, login.php will display any other available resources, based on the level of authentication, in the current window. It uses the vendor authentication type code in the rs.conf file to determine how exactly to do this. For example, a code of "0" means that a barcode login is required, and only a successfully logged-in barcode user will receive a link to the resource. An IP-authenticated user will see the resource name with a message that a barcode login is required. A vendor authentication type code of "R" means that the vendor uses a referrer URL, and the link will display the success URL in the page. Any other code will resubmit the resource code to login.php, revalidate, and then open another window with the resource success URL.

Assessment
DALNET libraries began using this service in September 2004, and we have yet to receive our first report that the service is unavailable. Previously, with RPA, we needed to run a script to validate that all the required processes were running and had not mysteriously stopped. This script is no longer required. Even with the checking script, there were times when connections would drop and a manual restart was necessary. Now, if our database connection fails, the next request to login.php will re-establish it.
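The checks described under Application functionality can be condensed into a greatly simplified sketch. The configuration file formats, the session keys, the borrower table and column names, and the use of PDO are all assumptions for illustration, not DALNET's actual login.php (which, as described, talks to Sybase through PHP's own interface).

<?php
// Hypothetical condensation of the library/IP/barcode checks described above.
session_start();

// ips.conf: one regular expression per line describing an internal IP range.
$patterns = file('ips.conf', FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
$_SESSION['internal_ip'] = false;
foreach ($patterns as $pattern) {
    if (preg_match('/^' . $pattern . '$/', $_SERVER['REMOTE_ADDR'])) {
        $_SESSION['internal_ip'] = true;
        break;
    }
}

// Barcode validation against the ILS database (table and column names invented).
if (!empty($_REQUEST['barcode'])) {
    $db   = new PDO('dblib:host=ils.example.edu;dbname=horizon', 'user', 'secret'); // hypothetical DSN
    $stmt = $db->prepare('SELECT btype, expiration_date, amount_owed, items_out
                          FROM borrower WHERE barcode = ?');
    $stmt->execute(array($_REQUEST['barcode']));
    $b = $stmt->fetch(PDO::FETCH_ASSOC);

    $validTypes = file('btypes.conf', FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);

    // Each check below is one of the library-configurable criteria; any could be
    // removed or others added, limited only by the data available in the ILS.
    $_SESSION['barcode_ok'] = $b
        && in_array($b['btype'], $validTypes)
        && strtotime($b['expiration_date']) > time()   // block expired patrons
        && (int) $b['items_out'] <= 50                 // hypothetical threshold
        && (float) $b['amount_owed'] <= 10.00;         // hypothetical threshold
}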
The stability of our in-house product has been much appreciated by all our libraries and on-call staff who no longer receive calls late at night that the service is down.


We experienced no real issues with connecting to our vendor ILS database or using the data from it. There are two reasons for this: PHP has an interface to connect to the Sybase product used by our ILS vendor for the backend database, and the structure of the data is not proprietary. We did not receive any assistance from our vendor during this project; however, the data structures that have always been available to us were used extensively. With a vendor that used a proprietary database of its own and/or did not publish its data structures, a project like this would not be possible.

The leftover cookies that conflicted with subsequent logins have been removed, and each session begins cleanly. Users experiencing these problems in the past were led through a complicated series of steps that required them not just to use the "Delete Cookies" function of their browser, but to find the actual cookie files in their temporary directories and remove them from there as well. The session and the login will stay active for the time the browser window is open, but if a user logs out, closes the browser window or enters the authentication system by passing in a library parameter, all session variables are removed.

We have experienced a few issues with different browsers and browser versions behaving differently, but not nearly the number we did previously. A small amount of JavaScript is used for opening a new window for the resources; a small tweak to it resolved most of the complaints we received. The expired patron issue that was so problematic for many libraries complying with subscription database licenses is no longer a problem. All libraries have configured this option as part of their successful login definition. Ironically, most complaints we received from patrons that the system was not working were because the patron record was expired or not yet entered into the system. Allowing the library to completely define the successful login makes this setup particularly powerful: libraries are no longer dependent on a vendor-selected list of criteria; they are limited only by the data available in their ILS.

Two issues related to the opening of a new window for the selected resource have surfaced. First, patrons who use pop-up stoppers required instructions on how to permit pop-ups for this particular site or how to use keyboard commands to select resource links. The other problem was that, when using this setup with virtual reference (VR) software, the opened resource window could not be co-browsed through the VR software. To resolve this, a special version of the authentication system was added that opens the selected resource in the current window. To receive the list of additional available resources, users need to return to the authentication page; however, this version does allow both the reference librarian and users to access the resource through the VR software. This version of authentication is not publicized; reference librarians switch to it when needed. Both use the same configuration and display files. The VR version has also been used by the few patrons unable to configure their pop-up blockers to allow the new window to open.

Additional uses
After configuring this system for 13 DALNET libraries, we were asked to configure it for another local library participating in the MiLE project. The library ran the same ILS but with a SQL Server database as the backend.
We were able to configure the same setup for them with no changes to the login.php code; it was completely portable between the databases.
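One hypothetical way to picture that portability: if the connection details live in a small per-library configuration file, only the driver entry changes between backends, while the lookup code and its SQL stay the same. The db.conf file, its keys, and the table name here are invented for illustration and are not part of the system described above.

<?php
// db.conf might read, for one library:  driver=dblib  (Sybase/SQL Server via FreeTDS)
// and, for another:                     driver=odbc   (or whichever PDO driver is available)
$c = parse_ini_file('db.conf');

$db = new PDO(
    sprintf('%s:host=%s;dbname=%s', $c['driver'], $c['host'], $c['database']),
    $c['username'],
    $c['password']
);

// The borrower lookup is plain SQL and identical on either backend.
$stmt = $db->prepare('SELECT btype, expiration_date FROM borrower WHERE barcode = ?');
$stmt->execute(array($_REQUEST['barcode']));
$borrower = $stmt->fetch(PDO::FETCH_ASSOC);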

Another academic library that was unable to use its vendor authentication product or directly query against its ILS database also used our application. It is able to import data from its ILS database into a MySQL database, which the system validates against. And although this is not the most efficient method, because it requires moving data from one source to another and not querying against the most current data, the import step can be automated and run as frequently as the library feels is necessary.

Conclusion
Open source technologies give libraries the opportunity to extend traditional services in innovative ways, as well as to reduce their dependency on vendor products. They empower libraries to add features and correct flaws that might otherwise never be addressed. DALNET has used this approach to successfully replace its vendor remote patron authentication product with one developed locally. Using PHP with Apache and a connection to our ILS database, we retained the necessary functionality and added other features that improved reliability, configurability and cross-browser usage. By embracing this approach, we also retain control of its future development and improvement.


THEME ARTICLE

Creating and managing XML with open source software

Received 7 June 2005 Revised 17 July 2005 Accepted 30 July 2005


Eric Lease Morgan
Digital Access and Information Architecture Department, University Libraries of Notre Dame, University of Notre Dame, Notre Dame, Indiana, USA

Abstract Purpose – To review a number of open source XML applications and systems, including editors, validators, native XML databases, and publishing systems; to describe how some of these tools have been combined by the author to create a specific system for a specific need. Design/methodology/approach – An overview of XML is provided, a number of open source XML applications/systems are reviewed, and a system created by the author using some of these tools is described. Findings – The open source tools for working with XML are maturing, and they provide the means for the library profession to easily publish library content on the internet, using open standards. Originality/value – XML provides an agreed-upon way of turning data into information. The result is non-proprietary and application independent. Open source software operates under similar principles. An understanding and combination of these technologies can assist the library profession in meeting its goals in this era of globally networked computers and changing user expectations. Keywords Extensible Markup Language, Computer software Paper type General review

Library Hi Tech Vol. 23 No. 4, 2005 pp. 526-540 © Emerald Group Publishing Limited 0737-8831 DOI 10.1108/07378830510636328

Introduction
In a sentence, the eXtensible Markup Language (XML) is an open standard facilitating a means to share data and information between computers and computer programs as unambiguously as possible. Once transmitted, it is up to the receiving computer program to interpret the data for some useful purpose, thus turning the data into information. Sometimes the data will be rendered as HTML. Other times it might be used to update and/or query a database. Originally intended as a means for web publishing, XML has proven useful for things never intended to be rendered as web pages.

It is helpful to compare XML to other written languages. Like other languages, XML has a certain syntax. On the one hand, the syntax is very simple. You really only need to know six or seven rules in order to create structurally sound – oftentimes called "well-formed" – XML documents. On the other hand, since XML is also intended to be read by computers, the rules are very particular. If you make even the slightest syntactical error the whole thing is ruined. Here are the rules:
(1) XML documents always have one and only one root element.
(2) Element names are case-sensitive.
(3) Elements are always closed.
(4) Elements must be correctly nested.
(5) Elements' attributes must always be quoted.
(6) There are only five entities defined by default, representing the characters <, >, &, ", and '.
(7) When necessary, namespaces must be employed to eliminate vocabulary clashes.
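A quick, stand-alone way to see rules like these enforced (this is a hedged illustration of my own, not the article's Figure 1) is to hand a parser two versions of the same fragment, one missing its closing tag. PHP's DOMDocument, a wrapper around the libxml2 library discussed later, rejects the broken one outright.

<?php
// Two versions of the same content: the first violates rule 3 (elements are
// always closed); the second is well-formed.
$samples = array(
    'broken'      => '<greeting>Hello, world!',
    'well-formed' => '<greeting>Hello, world!</greeting>',
);

libxml_use_internal_errors(true);   // collect parse errors instead of emitting warnings

foreach ($samples as $label => $xml) {
    $doc = new DOMDocument();
    $ok  = $doc->loadXML($xml);
    echo $label . ': ' . ($ok ? 'parsed' : 'rejected') . "\n";
    libxml_clear_errors();
}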

Below is a "well-formed" XML document in the form of an XHTML file. It illustrates each of the seven rules outlined above and serves as an example only; elaborating on each of the rules is beyond the scope of this article (Figure 1).

Figure 1.

Creating structurally sound – syntactically correct – XML is only part of the picture. In order to make sense, XML documents also need to be semantically correct. The elements of XML documents must be combined with each other and the data they encode in a manner that makes sense and is understood. There are many sets of semantic rules, and they can be encoded in at least a few different forms. XML grew out of the SGML world, and consequently Document Type Definitions (DTDs) are a popular and well-supported way of encoding the semantic structure of XML documents. DTDs have their pluses and minuses. On the plus side, they are common and rather simple to understand. On the minus side, DTDs are not written as XML documents and consequently require a different set of tools to process. Ironic. Additionally, DTDs are not very good at data typing. There is no way to differentiate between numeric data and character data. Furthermore, there is no way to dictate the shape of these data as dates, ranges, values embodying specific patterns, etc. In reaction to these limitations, a few other methods for describing the semantic structure of XML have been introduced. The most notable of these are W3C Schema files and RelaxNG schema files. Suffice it to say there are advantages and disadvantages to both, but they are both based on XML and they resolve the deficiencies of DTDs. That being said, below is a very simple XML file that includes a DTD. By reading the file it should not be too difficult for you to discern what it describes. Moreover, since the file is grammatically correct in terms of XML – it is well-formed as well as valid against a DTD or schema – it should not be too difficult for a computer to read and process as well (Figure 2).

Figure 2.

The final part of this introduction is rendering and transformation. While XML documents are readable by humans, they are not necessarily reader friendly, especially considering certain devices and outputs. Furthermore, it may be desirable to analyze, summarize, rearrange, and extract parts of an XML file. This is where Cascading Style Sheets (CSS) and the Extensible Stylesheet Language (XSL) come into play. The strengths of CSS lie in its presentation abilities. It excels at layout, typography, and color. By associating styling characteristics (called "declarations") with XML elements and combining them with XML files, the results are presentations of the original XML document that are easier to read. These presentations can be designed for various web browsers, printing, or even devices intended for speaking. For

The final part of this introduction is rendering and transformation. While XML documents are readable by humans, they are not necessarily reader-friendly, especially considering certain devices and outputs. Furthermore, it may be desirable to analyze, summarize, rearrange, and extract parts of an XML file. This is where Cascading Style Sheets (CSS) and the Extensible Stylesheet Language (XSL) come into play. The strengths of CSS lie in its presentation abilities. It excels at layout, typography, and color. By associating styling characteristics (called “declarations”) with XML elements and combining them with XML files, the results are presentations of the original XML document that are easier to read. These presentations can be designed for various web browsers, printing, or even devices intended for speaking. For example, CSS provides the means to align text, insert text in boxes, dictate the spacing between paragraphs, specify the use of various fonts, etc.
Through a combination of supplementary technologies, most notably XSL Transformations (XSLT), it is possible to implement all of the functionality of CSS plus manipulate XML, sort XML, perform mathematical and string functions against XML, and thus “transform” XML into other (Unicode/“plain text”) files. This means it is possible to take one XML file as input and, through XSLT, convert the file into a Formatting Objects (FO) document designed for printing, an HTML document designed for display in a browser, a comma-separated file destined for a spreadsheet application, an SQL file for importing into or querying a relational database, etc. The cost of all this extra functionality is a greater degree of complexity. Implementing XSL and its supplementary technologies is akin to programming. As an example, below is a simple XSLT file creating a rudimentary HTML stream summarizing the contents of the pets XML file (Figure 3).
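The stylesheet the article showed as Figure 3 is not reproduced either; a sketch of such a transformation, written against the hypothetical pets file above, might read:

    <?xml version="1.0" encoding="UTF-8"?>
    <xsl:stylesheet version="1.0"
                    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
      <xsl:output method="html"/>
      <!-- build an HTML page and process each pet in turn -->
      <xsl:template match="/">
        <html>
          <head><title>My pets</title></head>
          <body>
            <h1>My pets</h1>
            <ul><xsl:apply-templates select="pets/pet"/></ul>
          </body>
        </html>
      </xsl:template>
      <!-- summarize one pet as a list item -->
      <xsl:template match="pet">
        <li><xsl:value-of select="name"/> (<xsl:value-of select="species"/>)</li>
      </xsl:template>
    </xsl:stylesheet>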


Equipped with a text editor, a relatively modern web browser (one that knows how to do standardized CSS and XSLT), and the knowledge outlined above, it is entirely possible to implement a myriad of XML-based library collections and services. For example, it would be possible to mark up sets of public domain documents using TEI (Text Encoding Initiative) and make them available as browser-friendly HTML documents on the web. The sets of TEI documents could be described with finding aids encoded in EAD (Encoded Archival Description) in order to guide users in the collection's use. The TEI or EAD files could be transformed into SQL (Structured Query Language) or MARCXML files and then imported into databases for maintenance and/or searching.

Open source tools for processing XML
While using just a text editor, a web browser, and your knowledge is a great way to learn about XML, it is not very scalable. Using just these tools it would be difficult to create collections and services of any significant size. Developers understand this, and that is why a bevy of applications have been created to facilitate the creation and maintenance of XML data/information on a large scale. Some of these tools are described in the following sections.

Parsers
Of all the things in the XML toolbox, I find XML parsers (validators) to be the most useful. These tools check your documents for well-formedness and make sure they validate against a DTD or schema. They are sort of like the spell checkers and grammar checkers of word processing applications, and, fortunately, they are much more accurate because the rules of XML are much simpler than the “rules” of purely human written or spoken languages.
Xmllint. Xmllint is an XML parser built from a C library called libxml2 and used as a part of the Linux Gnome project (http://xmlsoft.org/). (Gnome is a user interface for Linux.) Because libxml2 is written in C, and because great care has been taken to implement no operating-system-specific features, it is known to work on just about any computer. You can acquire libxml2 in source form or as pre-compiled binaries. Xmllint is run from the command line. Assuming a DTD is specified in the root of your XML document, a command like the following will validate your XML: xmllint --valid pets.xml. If the XML is well-formed and validates, then the XML will be returned in a parsed form. If the XML does not validate, then the XML will be returned in parsed form and a message will describe what the parser found and what it expected. Sometimes these messages can be cryptic, but all the validators return rather cryptic results. Libxml2 (and consequently xmllint) will also validate XML against external DTDs and XML schemas (both W3C and RelaxNG schema files). If your XML takes advantage of XInclude functionality, then xmllint will process these as well. Finally, a number of other libraries/modules have been written against the libxml2 library, allowing people to use libxml2 functionality in Perl, Python, or PHP scripts. XML::LibXML is the Perl module implementing just this sort of thing. Of all the tools listed in this article, xmllint is the tool providing the biggest bang for its buck.
MSV. While not necessarily open source, but freely available for download and complete with source code, Sun Microsystems' Multi-Schema XML Validator (MSV) is a decent XML validator (www.sun.com/software/xml/developers/multischema/). Written in Java, this tool will validate against a number of grammars: RELAX NG, TREX, and XML DTDs. Once you've gotten Java installed and your CLASSPATH environment variable set up, you can type java -jar msv.jar pets.dtd pets.xml to validate a file named pets.xml. Simple and straightforward. Since it is written in Java it should run on any number of operating systems. MSV does not seem to have been updated for a couple of years.
xerces-c. Like libxml2, xerces-c is a C library for parsing XML. It is supported by the Apache Foundation (http://xml.apache.org/xerces-c/), and provides the means for creating binaries for specific purposes as well as creating hooks into interpreted languages such as Perl. XML::Xerces is an example. The process of building the libraries is non-standard but functional. The same is true for building the sample

applications. One of the more useful is StdInParse. By piping an XML file to StdInParse, the application will read the data and report any errors it finds. An example invocation is ./StdInParse < ./pets.xml. With a number of optional flags it will check namespaces, schemas, and schema constraints. Even considering these features, the xmllint application is still more functional. At the same time, it should be kept in mind that StdInParse is an example application. Xerces-c is really a library for the C programmer, not a set of binaries. XML::Xerces, a Perl module, is built against this library. Implementers need to know how to read the C instructions in order to make the most of the module.

Editors
There are quite a number of XML editors, but most of them are not open source. I don't know why. Maybe it is because a “real” XML editor not only has to provide the ability to do basic text editing tasks but also needs to integrate itself with XML, and if this is the case, then you might as well piece together the tools you need/desire instead of implementing a one-size-fits-all solution. Furthermore, XML is not necessarily intended for display, so it is not going to work well in a WYSIWYG environment.
jEdit. JEdit is a pretty nifty XML editor. Written in Java, it should run on just about any computer with a Java Virtual Machine installed (www.jedit.org/). The interface is a bit funky, but you can't hold that against it since it is trying to play nice with three different user interfaces: Windows, Linux, and Macintosh. Given a DTD or schema, jEdit is smart. It will examine the given XML grammar, and as you start typing it will create a list of possible elements available at that location in your document. If your element includes attributes, it will create a list of those as well. You can then select the elements and attributes in order to reduce the amount of typing you need to do as well as the number of possible mistakes you can make. If you import a file for editing and/or when you save a file you are editing, jEdit will validate your document and report any errors you have created. Very handy, and it gets you away from using a parser like xmllint or Saxon (described below). JEdit also supports XSLT. Given an XSLT stylesheet, jEdit will transform your document, again without using xsltproc or Saxon. JEdit is really a text editor, not only an XML editor. Therefore, it contains functions for creating markers in your text, wrapping text, extensive find/replace functions, macros, word counting, etc. It is able to provide the XML functions through sets of “plug-ins”, and there are dozens of other plug-ins to choose from. JEdit is an example of what Java was originally designed to do. Write once. Use many. Editing XML with jEdit can be a joy if you are willing to use an interface that you may not be accustomed to.
emacs. As you may or may not know, emacs is more like a computing environment than just an editor. If you use emacs regularly, then you probably think it is great. On the other hand, most people find the innumerable key-stroke combinations, modes, and buffers very difficult to learn and remember. For those who like emacs and want to edit XML, first there is psgml (http://sourceforge.net/projects/psgml/). Psgml will read DTDs and allow you to edit XML files against them. Like jEdit, it includes options for collapsing and expanding elements, selecting elements for insertion, and overall


validation. Most, if not all, of the operations psgml can perform are located in the menus. This makes things easier to remember, but also makes the interface more cumbersome. Nxml is another XML editing mode for emacs (www.thaiopensource.com/nxml-mode/). By default it knows how to validate against the DocBook, RDF, XHTML, and XSLT DTDs/schemas. You can also specify a RELAX NG schema of your own design. It too will list elements that are valid in specific parts of your document. Unlike psgml, it color-codes your XML. Nxml can be configured to constantly validate your documents. When errors occur you can place your cursor over the highlighted error and nxml will give you hints on how to fix it. Nice. If you specifically need/want to edit TEI files, and you are an emacs fan, then consider TEI Emacs (www.tei-c.org/Software/tei-emacs/). Put together by the folks of the TEI Consortium, this (RPM and/or Debian) package will install psgml and nxml, as well as the TEI schema and some XSLT stylesheets.

Databases
Because of the highly structured nature of XML files, the use of XML as a technique for storing data and information is often explored. These explorations eventually become “native” XML databases – database applications storing XML files as whole chunks and providing database-like functions against them. These functions include global find/replace routines, reporting, importing, and exporting. These are interesting explorations, but they do not seem to have caught on in a really big way with the computing community.
eXist. EXist is a native XML database written as a set of Java archive (.jar) files (http://exist.sourceforge.net/). These files are combined together in a number of ways to create the database application. For example, there is a rudimentary web interface, but there is a nice windowed/desktop interface too. This means you can install eXist as a Java servlet, start the client and access it through a web interface on your own host, or fire up the windowed client application. To use eXist you create “databases” in the form of directories on your file system. You then import XML files, of any type, into the directory. Once there, you are expected to write and run XQuery scripts against the database. (XQuery is, more or less, an enhancement of XPath with the addition of conditional statements such as if-then constructions. In this regard it is similar to XSLT, but XQuery scripts are not XML files.) After you find things in your database (which can be entire XML files, text within elements, or the returned results of functions) you have the opportunity to export the data or transform it into something else. Like a lot of windowed Java things on my machine, eXist was not very snappy, and it requires you to have a good understanding of XPath and/or XQuery in order to use it effectively. As long as Java is installed on your computer there is no reason why it shouldn't run on your machine, and because of its graphical nature it would be an excellent tool for learning XPath and XQuery.
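As a taste of what querying looks like – and assuming the hypothetical pets file sketched in the introduction has been loaded into such a database – an XQuery expression might read:

    for $pet in doc("pets.xml")//pet
    where $pet/species = "dog"
    return $pet/name

In eXist the document would normally be addressed by its location within the database rather than as a file on disk, but the shape of the query is the same.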

XSLT processors
I find XSLT processors to be the second most useful tools in the XML developer's toolbox. As mentioned in the introduction, these tools convert (“transform”) XML files into other plain text files, whether they be other XML files, delimited files, or text files with no apparent structure. Creating XSLT files is akin to writing computer programs in the form of XML.
Xsltproc. Xsltproc is an XSLT processor based on the libxml2 C library mentioned above (http://xmlsoft.org/XSLT/). This processor implements all of the standard XSLT functions as well as a few extensions such as the document and include functions. To use xsltproc you feed it one or more options, an XSLT stylesheet, and your source XML file, something like this: xsltproc pets.xsl pets.xml > pets.html. Such a command will combine the pets.xsl and pets.xml files and save the resulting transformation as pets.html. Using the --stringparam option you can define the values of XSLT parameters and simulate configurations for your stylesheet. Since xsltproc is an implementation created against a library, other languages can take advantage of this library and include its functionality in them. XML::LibXSLT is a Perl module doing just that, and it allows the programmer to include XSLT transformation functions in her applications. As an example, the following snippet of code combines an XML file with an XSLT stylesheet, stores the result in a variable called $results, and prints it. The process is confusing at first, but very handy once understood (Figure 4).
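Figure 4 itself is not reproduced. The sketch below follows the documented usage of the XML::LibXSLT module and assumes the same hypothetical pets files used above:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use XML::LibXML;
    use XML::LibXSLT;

    # parse the source document and the stylesheet
    my $parser    = XML::LibXML->new();
    my $xslt      = XML::LibXSLT->new();
    my $source    = $parser->parse_file('pets.xml');
    my $style_doc = $parser->parse_file('pets.xsl');

    # compile the stylesheet, apply it, and print the result
    my $stylesheet = $xslt->parse_stylesheet($style_doc);
    my $results    = $stylesheet->transform($source);
    print $stylesheet->output_string($results);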


xalan. Like xsltproc, xalan is a set of C libraries (and a subsequent command-line application) providing the means for the programmer to include XSLT functionality in their programs (http://xml.apache.org/xalan-c/). Like xsltproc, it requires a previously installed XML parser, in this case xerces-c. Building the libraries is not difficult – just follow the instructions – and be sure to make the sample applications in order to use the command-line application. Transforming a document was as simple as XalanTransform pets.xml pets.xsl pets.html. The transformation is rudimentary, but only because it is a sample application. Unless you are a C programmer, for day-to-day transformations you will probably want to use Saxon or xsltproc. The xalan distribution also comes with an Apache module allowing you to transform XML documents with XSLT on the fly. The functionality is much like AxKit, described below. Compile the module. Install it and the necessary libraries. Configure Apache. Restart Apache. Write XML and XSLT files, saving them in the configured directory, and when files are requested from that directory they will be created by transforming the XML with the XSLT. Xalan is a member of the large Apache Foundation suite of software. If you ever have the need for open source software, and the Apache Foundation has an application fulfilling your need, then consider using that application. The Foundation's software has a good reputation.
Saxon. Saxon is a Java-based XSLT processor written and maintained by the primary editor of the XSLT specification, Michael Kay. Saxon comes in two flavors (http://saxon.sourceforge.net/). One, Saxon-SA, is a commercial product and supports XML Schema. The other, Saxon-B, is open source and does not support XML Schema. Both support the latest version of the XSLT standard (version 2.0), XQuery 1.0, and XPath 2.0; Saxon-SA and Saxon-B seem to be the first processors to implement these standards. Like most of the tools here, Saxon is intended to be incorporated into other applications, but it can be run from the command line as well. A command like this transforms an XML document with an XSLT stylesheet to produce an HTML file: java -jar saxon8.jar -t twain.xml tei2html.xsl > twain.html. Works for me. Saxon supports a number of extensions – functions not specified in the various standards. From the Saxon documentation, some of the more interesting extensions are decimal-divide(), which performs decimal division with user-specified precision; format-dateTime(), which formats a date, time, or dateTime value; max(), which finds the maximum value of a set of nodes; parse(), which parses an XML document supplied as a string; sum(), which sums the result of computing an expression for every node in a sequence; and try(), which allows recovery from dynamic errors. Implementing these extensions in your applications may be helpful, but they will also limit the portability of your system if you need to migrate later. Complete with plenty of source code and complete documentation, Saxon is well worth your time if your programming language of choice is Java.

Publishers
There exist entire systems for publishing XML. These systems take raw XML as input, combine it with XSL on the fly, and deliver it to the user. Cocoon is one such system (http://cocoon.apache.org/). If you are into TEI then TEI Publisher is another (http://teipublisher.sourceforge.net/docs/). I prefer AxKit.
AxKit. AxKit is a mod_perl module allowing the XML developer to incorporate XSLT processing into the Apache HTTP server (www.axkit.org/). It provides the means of transforming XML upon request and delivering it to HTTP user-agents in the formats most appropriate for the end-user. In other words, when a user-agent requests an XML file, your web server can be configured to transform the XML with an XSLT stylesheet (as well as any input parameters) and output text in the desired format or structure. There is no need to transform the XML ahead of time and save many documents. Thus, AxKit implements the epitome of the “create once, use many” philosophy. Installing all of the underlying infrastructure required by AxKit is not trivial. First, you need mod_perl, and installing it alongside something like PHP can be confusing. AxKit then relies on the libxml2 and libxslt libraries, and consequently these must be installed too. Finally, your Apache (HTTP) server needs to be configured in order to know when to take advantage of AxKit functionality. If you can configure all these technologies, then you can configure just about anything.
As an exercise, I implemented a webbed version of my water collection using MySQL, PHP, and AxKit[1]. To implement this collection I first created a relational database designed to maintain an “authority list” of water collectors as well as information about specific waters. The database includes a BLOB field destined to contain a photograph of each bottled water. I wrote a set of PHP scripts allowing me to use an HTML form-based interface to do database I/O. In addition, I wrote another PHP script creating reports against the database. There are really only two types of reports. One report exports the contents of the BLOB fields and saves the results as JPEG images on the file system. The other report is an XML file of my own design. Each record in the XML file represents one water and includes the date, collector, name, and description of the water. I then wrote an XSLT stylesheet designed to take specific types of input, like collector name IDs or water IDs. Finally, I configured my HTTP server to launch AxKit when user-agents access the water collection. The result is a set of dynamically created HTML pages allowing users to browse my water collection by creator or water name. Something very similar could be done for sets of text files (prose or poetry) or even sets of metadata such as MARCXML or MODS files. AxKit is an underused XML implementation, probably because it is hard to install.

Building a system – my TEI publisher
I make no bones about it. I'm not a great writer. On the other hand, over the years, I have written more stuff than the average person. Furthermore, I certainly don't mind sharing what I write, whether it be prose, the handout of a presentation, or the code to a software program. I've been practicing “green” open access and open source software long before the phrases were coined. As a librarian, it is important for me to publish my things in standard formats complete with rich metadata. Additionally, I desire to create collections of documents that are easily readable, searchable, and browsable via the web or print. In order to accomplish these goals I decided to write for myself a rudimentary TEI publishing system, and this section describes that system (http://infomotions.com/musings/tei-publisher/). Ironically, this isn't my first foray into this arena. When the web was still brand new (as if it still isn't), I wrote a simple HTML editor using HyperCard called SHE (Simple HTML Editor). Later, I wrote a database program with a PHP front-end. Both systems created poorly formatted HTML, and both of those systems worked for a while. I suspect my current implementation will not stand the test of time either, but the documents it creates are not only well-structured, they also validate against the TEI and XHTML DTDs. The system also supports robust searching capabilities and dissemination of content via OAI.

MySQL database
The heart of the system is a MySQL database, and the code I've written simply does I/O against this database. The database's structure is simplistic, with tables to hold authors, subjects, templates, stylesheets, and articles. There are many-to-many relationships between authors and articles as well as subjects and articles. There are simple one-to-many relationships between templates and articles as well as stylesheets and articles. The schema also includes a rudimentary sequence table in order not to mandate the use of MySQL's auto_increment feature.
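The article does not reproduce the table definitions; the following sketch is one plausible rendering of the schema just described – the table and column names are guesses, not the author's actual design:

    -- core tables
    CREATE TABLE authors     (author_id     INT PRIMARY KEY, name VARCHAR(255));
    CREATE TABLE subjects    (subject_id    INT PRIMARY KEY, term VARCHAR(255));
    CREATE TABLE templates   (template_id   INT PRIMARY KEY, tei_skeleton TEXT);
    CREATE TABLE stylesheets (stylesheet_id INT PRIMARY KEY, xslt TEXT);
    CREATE TABLE articles    (article_id    INT PRIMARY KEY,
                              title         VARCHAR(255),
                              template_id   INT,
                              stylesheet_id INT);

    -- many-to-many relationships between authors/subjects and articles
    CREATE TABLE articles_authors  (article_id INT, author_id  INT);
    CREATE TABLE articles_subjects (article_id INT, subject_id INT);

    -- rudimentary sequence table, used instead of auto_increment
    CREATE TABLE sequence (last_id INT);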

use an HTML form-based interface to do database I/O. In addition, I wrote another PHP script creating reports against the database. There are really only two types of reports. One report exports the contents of the BLOB fields and saves the results as JPEG images on the file system. The other report is an XML file of my own design. Each record in the XML file represent one water and it includes date, collector, name, and description of the water. I then wrote an XSLT stylesheet designed to take specific types of input like collector name IDs or water IDs. Finally, I configured my HTTP server to launch AxKit when user-agents access the water collection. The result is a set of dynamically created HTML pages allowing users to browse my water collection by creator or water name. Something very similar could be done for sets of text files (prose or poetry) or even sets of metadata such as MARCXML or MODS files. AxKit is an underused XML implementation, probably because it is hard to install. Building a system – my TEI publisher I make no bones about it. I’m not a great writer. On the other hand, over the years, I have written more stuff than the average person. Furthermore, I certainly don’t mind sharing what I write whether it be prose, the handout of a presentation, or the code to a software program. I’ve been practicing “green” open access and open source software long before the phrases were coined. As a librarian, it is important for me to publish my things in standard formats complete with rich meta data. Additionally, I desire to create collections of documents that are easily readable, searchable, and browsable via the web or print. In order to accomplish these goals I decided to write for myself a rudimentary TEI publishing system, and this section describes that system (http:// infomotions.com/musings/tei-publisher/). Ironically, this isn’t my first foray into this arena. When the web was still brand new (as if it still isn’t), I wrote a simple HTML editor using Hypercard called SHE (Simple HTML Editor). Later, I wrote a database program with a PHP front-end. Both systems created poorly formatted HTML, and both of those systems worked for a while. I suspect my current implementation will not stand the test of time either, but the documents it creates are not only well-structured but validate against TEI and XHTML DTD’s. The system also supports robust searching capabilities and dissemination of content via OAI. MySQL database The heart of the system is a MySQL database, and the code I’ve written simply does I/O against this database. The database’s structure is simplistic with tables to hold authors, subjects, templates, stylesheets, and articles. There are many-to-many relationships between authors and articles as well as subjects and articles. There are simple one-to-many relationships between templates and articles as well as stylesheets to articles. The scheme also includes a rudimentary sequence table in order to not mandate the use of MySQL’s auto_increment feature. Musings.pm While the database is the heart of the system, a set of object oriented Perl modules reduces the need to know any SQL. The Perl modules make life much easier, and I call them Musings.pm. Each module in the set corresponds to a table in the database, and each module simply sets and gets values, saves and deletes records, and supports

Managing XML with open source software 535

LHT 23,4

536

global find routines. After all that, is all you can do with databases: create records; find records; edit records; and delete records. Administrative interface Once the modules were written I was able to write an administrative interface in the form of a set of CGI scripts. Like the modules, there is one CGI script for each of the tables in the database. For example, the authors.cgi script allows me to add, find, edit, and delete author information – an authority file in library parlance. The subjects.cgi script allows me to manage a set of controlled vocabulary (subject) terms used to classify my articles. The templates.cgi file facilitates the maintenance of TEI skeletons. These skeletons contain tokens like ##AUTHOR## and ##TITLE##, and they are intended to be replaced by real values found in the other tables in order to create valid TEI output. The articles.cgi script is the most complex. It allows me to enter things like title, date created, abstract, and changes information. It also allows me to select via pop-up menus subject terms, authors, templates, and stylesheets to associate with the article. Once all the necessary information is entered, I use article.cgi’s “build” function to amalgamate the meta data and content with the associated template. The resulting XML is then saved locally. I then use the script’s “transform” function to change the saved XML into XHTML through the use of an XSLT stylesheet. (The stylesheet, like the template, is managed through a CGI script.) The XHTML files are complete with Dublin Core meta tags. Throughout this entire process I am careful to validate the created documents for not only well-formedness but validity as well. It is important to note the administrative interface is more of a publishing system and is in no way an editor. Each of the parts of the system (authors, subjects, templates, stylesheets, articles) are expected to include XML mark-up. It is up to me to mark-up the content before it gets put into the system. To accomplish this, I use a text editor (BBEdit) on my desktop machine, and BBEdit allows me to create a “glossary” or set of macros to mark-up the documents easily. The administrative interface simply glues the parts of the system together and saves the result accordingly (Figure 5). As the size of any collection grows so does the need for search functionality, but free text searching against relational databases is as pain. Such functionality is not really supported by relational database applications. Creating an index against the content of the database (or a set of files) makes searching much easier. Consequently, I provide search functionality through an indexer. Currently, my favorite indexer is swish-e (www.swish-e.org/). It supports all features librarians love: phrase searching, Boolean logic, right-hand truncation, nested queries, field searching, and sorting. Swish-e excels at indexing HTML files and/or XML files. During the indexing process you can specify what HTML elements are to become possibilities for field searches. I specify the Dublin Core meta tags. Thus, after I create my XHTML documents, I index the entire lot using swish-e. User interface I have now created a set of stand-alone, well-formed, valid TEI and indexed XHTML documents. What is then needed is a user-interface. Because all of the documents have been described with a set of controlled vocabulary terms, I can create a list of these terms and then list the articles associated with each term – a subject index. Since each


Figure 5. Here is a sample of screen shots from the administrative interface


User interface
I have now created a set of stand-alone, well-formed, valid TEI documents and indexed XHTML documents. What is needed next is a user interface. Because all of the documents have been described with a set of controlled vocabulary terms, I can create a list of these terms and then list the articles associated with each term – a subject index. Since each article is associated with a date, I can list my articles in reverse chronological order. Since each article has a title, I can list them alphabetically – a title index. Since the entire corpus is indexed, I can provide a search interface to the content. To make this happen I wrote one more CGI script, the system's home page. This page includes an introduction to the collection, a search box, and a list of three links to the title, date, and subject indexes. These indexes are created dynamically, taking advantage of Musings.pm. If I got fancy, I could count the number of times individual articles were read and provide a list of articles ranked by popularity. Similarly, I could watch for the types of searches sent to the system and create lists of “hot topics” as well.
Now here's a tricky thing. I know the subject terms of each article. I know the content has been indexed with swish-e. Therefore, I know the exact swish-e query that can be used to find these subject terms in the corpus of materials. Consequently, in the footer of each of my documents, I have listed each of the article's subject terms and marked them up with swish-e queries. This allows me to “find more articles like this one”.
As you may or may not know, an index is simply a list of words associated with pointers to documents. Swish-e provides a means of dumping all the words to standard output. By exporting the words and feeding them to a dictionary program, I can create a spell-checker. I use Aspell for this purpose (http://aspell.sourceforge.net/). Consequently, when searches fail to produce results, my user interface can examine the query, try to fix misspellings, reformat the query, and return it to the end-user, thus providing a “did you mean” service a la Google.
Lastly, I was careful to include the use of cascading stylesheet technology in the XHTML files. Specifically, I introduced a navigation system as well as a stylesheet for printed media. This provides the means of excluding the navigation system from the printed output as well as removing the other web-specific text decorations. I think my documents print pretty (Figure 6).
The documents of my collection are mere reports written against the underlying database. There is no reason other reports cannot be written as well, and one of those report types is OAI streams. Again, using the Musings.pm module, I was able to write a very short program that dumps all of my articles to sets of tiny OAI files. These files are saved to the file system and served to OAI harvesters through a simple, Perl-based OAI server called OAI XMLFile (www.dlib.vt.edu/projects/OAI/software/xmlfile/xmlfile.html).


Figure 6. Here are a number of screen shots from the user interface


Conclusion
Creating the infrastructure to publish my documents was rather time-consuming, but once this infrastructure was in place, it made it very easy to publish a great number of documents consistently and accurately. Here is the process I use to publish things:
(1) Have an idea;
(2) Write it down;
(3) Mark it up in TEI;
(4) Assign subject terms;
(5) Make sure the terms are in the database;
(6) Add the TEI to the database; do data entry;
(7) Build the TEI file;
(8) Check it for validity;
(9) Transform it into XHTML;
(10) Check it for validity;
(11) Index the entire corpus;
(12) Create OAI reports;
(13) Go to Step 1.

Given this system, I am able to spend most of my time articulating my ideas and writing them down. Steps 3 through 12 require only about 30 minutes. Sometimes I feel like Ben Franklin. He was a writer (a much better writer than myself). He owned his own printing press. (In fact he owned many of them across the colonies.) He also designed his own typeface. Not only that, he was Postmaster for a while. In short, he had control of the entire distribution process. With the advent of the web, much of that same distribution process is available to people and institutions like myself. All that needs to be done is to design systems that fit one's needs and implement them. Such things represent enormous opportunities for cultural heritage institutions such as libraries, museums, and archives – as well as for individuals.
Note
1. Yes, I collect water. There are about 200 items in the collection and it represents natural bodies of water from all over the world. The webbed version of the collection only includes a sample of the entire thing, but my offices are littered with bottles of water from strange and wonderful places. See: http://infomotions.com/water/


THEME ARTICLE

Creating digital library collections with Greenstone
Ian H. Witten and David Bainbridge
Department of Computer Science, University of Waikato, Hamilton, New Zealand


Received 7 June 2005; revised 30 July 2005; accepted 8 August 2005

Abstract
Purpose – The purpose of this paper is to introduce Greenstone and explain how librarians use it to create and customize digital library collections.
Design/methodology/approach – Through an end-user interface, users may add documents and metadata to collections, create new collections whose structure mirrors existing ones, and build collections and put them in place for users to view.
Findings – First-time users can easily and quickly create their own digital library collections. More advanced users can design and customize new collection structures.
Originality/value – The Greenstone digital library software is a comprehensive system for building and distributing digital library collections. It provides a way of organizing information based on metadata and publishing it on the Internet or on removable media such as CD-ROM/DVD.
Keywords Digital libraries, Collections management, User interfaces
Paper type Technical paper

1. Introduction Digital libraries are organized, focused collections of information. They concentrate on a particular topic or theme – and good digital libraries will articulate the principles governing what is included. They are organized to make information accessible in particular, well-defined, ways – and good ones will include a description of how the information is organized (Lesk, 2005). The Greenstone digital library software is a comprehensive suite of software for building and distributing digital library collections (Witten and Bainbridge, 2003). It provides a new way of organizing information and publishing it on the Internet or on removable media (e.g. CD-ROM/DVD). It is widely used in a large number of countries: see www.greenstone.org for a representative selection of example sites. Greenstone is produced by the New Zealand Digital Library Project at the University of Waikato, and distributed as open source, multilingual software in cooperation with UNESCO and the Human Info NGO. The dissemination of educational, scientific and cultural information, and particularly its availability in developing countries, is central to UNESCO’s goals, and appropriate, accessible technology such as Greenstone is seen as a vital tool in this context. The Greenstone digital library software has grown out of the stimulating research environment of the New Zealand digital library project, and the authors gratefully acknowledge the profound influence of all project members. In particular, John Thompson pioneered the librarian interface described in this paper.

Library Hi Tech, Vol. 23 No. 4, 2005, pp. 541-560. © Emerald Group Publishing Limited, 0737-8831. DOI 10.1108/07378830510636337


Aim and scope
Greenstone aims to enable users, particularly in universities, libraries, and other public service institutions throughout the world, to build their own digital library collections in the fields of education, science and culture. UNESCO hopes this will encourage the effective deployment of digital libraries to share information and, where appropriate, place it in the public domain. The key points that Greenstone makes it its core business to support include: design and construction of collections; distribution on the web and/or removable media; customized structure depending on available metadata; an end-user collection-building interface for librarians; reader and librarian interfaces in many languages; and multiplatform operation. The Appendix summarizes some relevant facts about Greenstone, grouped under “Technical” and “User base”.
The liaison with UNESCO and Human Info has been a crucial factor in the development of Greenstone. Human Info began using Greenstone to produce collections in 1998, and provided extensive feedback on the reader's interface. UNESCO wants to empower developing countries to build their own digital library collections – otherwise they risk becoming read-only societies in the information revolution. UNESCO selected Greenstone in 2000, and arranges user testing, helps with internationalization, and mounts courses. Internationalization is a central goal: today the Greenstone reader's interface is available in 35 languages, and the librarian's interface, including all documentation, is available in four (English, French, Spanish, Russian).

Software distribution and development
Greenstone is issued under the terms of the GNU General Public License. It originated in 1996 (Witten et al., 1996), and the current production version (Greenstone2) was designed about seven years ago, although it is continually being extended. A complete redesign and reimplementation, Greenstone3, has been described (Bainbridge, 2004) and released, informed by experience with the current system and the problems and challenges faced by users, international collection developers, and practicing librarians. Greenstone3 allows documents to be dynamically added to collections; provides more flexible ways to dynamically configure the run-time system by adding new services; lowers the overhead incurred by collection developers when accessing this flexibility to organize and present their content; and modularizes the internal structure. The design is based on widely accepted standards that were unavailable when Greenstone2 was designed. The production version, Greenstone2, is recommended for end-user librarians, while Greenstone3 is an emerging system currently intended for experimental use by computer scientists and information technologists. Greenstone3 is fully compatible with its predecessor, and can run old collections without any modifications whatsoever and make them indistinguishable from the original both visually and in terms of interaction. Librarian-level users can adopt Greenstone2 today, secure in the

knowledge that though the software is developing their collections will still run in exactly the same form tomorrow. The remainder of this article focuses exclusively on Greenstone2. Standards Many popular document and metadata standards are incorporated into Greenstone. As we shall see, it can deal with documents in HTML, Word, PDF, PostScript, PowerPoint, and Excel formats (amongst others); images in TIFF, GIF, PNG, and JPEG formats (amongst others); and metadata in Dublin Core, MARC, CDS/ISIS, and ProCite formats (amongst others). It can deal with multimedia formats such as MP3, MIDI, and QuickTime. Greenstone’s catholic approach to document and metadata standards creates many demands for conversion facilities. For example, users can change metadata elements from one metadata scheme to another by making choices interactively as they drag documents from one collection to another, or in other circumstances they can either use a default mapping to convert, for example, MARC records to Dublin Core, or define their own crosswalk file. Emerging digital library standards are also supported. Greenstone incorporates web mirroring software, so that whole sites can be downloaded using the HTTP protocol, to a pre-specified depth, and ingested into a collection. Metadata (and documents, if appropriately referenced) from an Open Archives Initiative (OAI) server can easily be ingested too, and any Greenstone collection can be served over the OAI protocol for metadata harvesting (OAI-PMH). Greenstone collections can be exported into the METS metadata encoding and transmission standard, and METS collections can be imported into Greenstone. (The particular form that Greenstone uses has been submitted to the METS Board as a proposed METS Profile.) An option has recently been added that allows end users – typically librarians, not computer specialists – to export a collection from Greenstone and import it into DSpace, and vice versa (Witten et al., 2005). The Greenstone librarian interface Greenstone users employ the “Librarian” interface to create and maintain digital library collections. This is intended to help librarians (and others who compile electronic anthologies) expedite the construction and organization of digital information collections. Only a few minutes of the user’s time are needed to set up a collection based on a standard design and initiate the building process, assuming that documents and metadata are already available in electronic form. More than a few minutes may be required to actually build the full-text indexes and browsing structures that comprise the collection, and compress the text. Some collections contain gbytes of text; millions of documents. Additionally, even larger volumes of information may be associated with a collection – typically audio, image, and video, with textual metadata. Once initiated, the mechanical process of collection-building may take from a few moments for a small collection to several hours for a multi-gbyte one that involves many full-text indexes. The librarian interface monitors all this and provides visual feedback over progress. Naturally, customized collections that have their own idiosyncratic requirements – as most substantial collections do – take longer to set up, and the design and debugging process can take days, weeks if iterative usability testing is involved. The


Greenstone designers wholeheartedly endorse Alan Kay's maxim that “simple things should be simple, complex things should be possible” (Davidson, 1993). The facilities that Greenstone provides, and the user interface through which library readers access them, are highly customizable at many different levels. Even librarians who need to produce new collections in just a few minutes can dictate what document formats (e.g. HTML, Word, PDF, PostScript, PowerPoint, Excel) or image formats (e.g. TIFF, GIF, PNG, JPEG) will be included, what forms of metadata (e.g. MARC records, OAI archives, ProCite, BibTex or Refer files, CDS/ISIS databases) are available, what searchable indexes will be provided (e.g. full text, perhaps partitioned by language or other features, and selected metadata such as titles or abstracts), and what browsing structures will be constructed (e.g. list of authors, titles, dates, classification hierarchy). Advanced users can control the presentation of items on the screen, personalizing each and every page that Greenstone serves up. All these facilities can be controlled through the Librarian interface.
There are many additional features of Greenstone that lie outside the Librarian interface. Users can translate the interface into different natural languages. If they know HTML they can hook into Greenstone widgets like the full-text search mechanism or browsers from their own web pages. If they know JavaScript they can incorporate browsing mechanisms such as image maps, and using Perl they can add entirely new browsing facilities, such as stroke-based or Pinyin-based browsing for Chinese. Some new requirements are best met by altering the Greenstone “receptionist” program, written in C++, to add new facilities at runtime.
The Greenstone Librarian interface is targeted at four different levels of user.
(1) Assistant librarians gain access to the basic features of the Librarian interface: adding documents and metadata to existing collections, creating new collections whose structure mirrors existing ones, and rebuilding collections to reflect changes.
(2) Librarians, the regular or default users of the Librarian interface, perform all the Assistant Librarian tasks above, and can also design new collections – adding, for example, new document types, new full-text indexes, and new metadata browsing features. They typically design a new collection by identifying an existing one that closely matches their needs and adapting its structure as necessary.
(3) Library systems specialists can perform all the functions of Librarians, and in addition customize collections in more complex ways, such as those that involve defining and using regular expressions – for example, partitioning collections based on filename or directory structure.
(4) Expert users are those who are experienced with Greenstone and are familiar with running Perl scripts and examining their output. These users can access all features of the Greenstone Librarian interface.

The role and structure of metadata
A digital library's organization is reflected in the interface it presents to users. Much of the organization rests on metadata – structured information about the resources (typically documents) that the library contains. Metadata is the stuff in the traditional card catalogs of bricks-and-mortar libraries (whether computerized or not). It is

“structured” in that it can be meaningfully manipulated without necessarily understanding its content. For example, given a collection of source documents, bibliographic information about each document would be metadata for the collection. The structure is made plain, in terms of which pieces of text represent author names, which are titles, and so on. The notion of “metadata” is not absolute but relative: it is only really meaningful in a context that makes clear what the data itself is (Lagoze, 2000). For example, given a collection of bibliographic information, metadata might comprise information about each bibliographic item, such as who compiled it and when. The use of metadata as the raw material of organization is really the defining characteristic of digital libraries: it is what distinguishes them from other collections of online information. It is metadata that allows new material to be sited within a library and hooked into existing structures in such a way that it immediately enjoys first-class status as a member of the library. Adding new material to ordinary online information collections requires manually linking it in with existing material, but the only manual work needed when adding new items to a digital library is to determine metadata values for each one. If a standard metadata scheme is used, even that may be unnecessary: the information may already be available from another source. In Greenstone, one or more metadata sets are associated with each collection. There are a few pre-prepared sets, of which Dublin Core is one. Modifications to existing sets and new ones can be defined using an auxiliary Greenstone application called GEMS (Greenstone Editor for Metadata Sets). One important set is the extracted metadata set, which contains information extracted automatically from the documents themselves (e.g. HTML Title tags, meta tags, or built-in Word author and title metadata). This is always present behind the scenes, though it may be hidden from the user. The system keeps metadata sets distinct using namespaces. For example, documents can have both a Dublin Core Title (dc.Title) and an extracted Title (ex.Title); they do not necessarily have the same value. Behind the scenes, metadata in documents, and metadata sets themselves, are represented in XML. In order to expedite manual assignment of metadata, the Librarian interface allows metadata to be associated with document folders as well as with individual documents. This means that users can take advantage of existing document groupings to add shared metadata in one operation. Within the interface users can organize the document hierarchy by dragging items around and creating new sub-hierarchies, which may expedite joint metadata assignment. Metadata values assigned to a folder remain with that folder and are inherited by all files nested within it. If the user subsequently selects a file and changes an inherited metadata value, a warning appears that doing so will override the inherited value. (Of course, these warnings can be turned off: for experienced users they soon become annoying.) Metadata in Greenstone can be a simple text string (e.g. title, author, publisher). Or it can be hierarchically structured, as with hierarchical classification values, in which case new values can be placed in the classification tree. In addition, it is multivalued: each element can have more than one value. This is used, for example, for multiple authors. 
The Librarian interface allows existing metadata values to be reused where appropriate, encouraging consistency in metadata assignment by eliminating the need to retype duplicate values.
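As a purely illustrative aside, namespaced, multivalued metadata of this kind can be expressed in XML along the following lines (a sketch, not necessarily the exact file format Greenstone uses internally):

    <Description>
      <!-- manually assigned Dublin Core title vs. automatically extracted title -->
      <Metadata name="dc.Title">Gardening in the Tropics</Metadata>
      <Metadata name="ex.Title">garden.html</Metadata>
      <!-- metadata elements can be multivalued, e.g. two authors -->
      <Metadata name="dc.Creator">A. Author</Metadata>
      <Metadata name="dc.Creator">B. Author</Metadata>
    </Description>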


2. Working with the librarian interface Within the Librarian interface, users collect sets of documents, import or assign metadata, and build them into a Greenstone collection. It is an interactive platform-independent Java application that runs on the same computer that operates the Greenstone digital library server[1]. It is closely coupled to the server, and tightly integrated with Greenstone’s collection design and creation process. It incorporates various open-source packages for such tasks as file browsing, HTML rendering, web mirroring, and efficient table sorting. The Librarian interface supports six basic activities, which can be interleaved but are normally undertaken in this order: (1) If required, download any documents from the web that need to be included in the collection. This optional step is only relevant when the collection will contain material that is sourced from the web. (2) Bring documents into a collection – whether to populate a new collection or update an existing one. Metadata files may also be brought in. Users browse the the computer’s file space to find documents to include, and drag and drop them into place. Any documents imported from existing Greenstone collections come with existing metadata attached. (3) Enrich the documents by adding metadata to them manually. Documents can be grouped into folders (it is easy to retain existing folder structures when dragging documents in under step 1), and any metadata assigned to folders is inherited by all documents nested within them. (4) Design the collection by determining its appearance and the access facilities that it will support: full-text search indexes, browsing structures, the format of items on the pages that Greenstone generates, etc. This design facility is not available at the Assistant Librarian user level. (5) Build the collection using Greenstone. This work is done by the computer; users are presented with a progress bar. This is the point where Expert users might examine the output of Perl scripts, which are presented in a scrolling window, to determine if anything is going wrong. (6) Pass the newly-created collection to the Greenstone digital library server for previewing. The collection is automatically installed as one of those in the user’s personal digital library, and a web page is opened showing the collection’s home page. To convey the operation of the Librarian interface we work through a small example. Figures 1-12 are screen snapshots at various points during the interaction. This example uses documents in the Humanity Development Library Subset collection, which is distributed with Greenstone. For expository purposes, the walkthrough takes the form of a single pass through the steps listed above. A more realistic pattern of use, however, is for users to switch back and forth through the various stages as the task proceeds. Assembling source material To commence, users either open an existing collection or begin a new one. Novice users (“Assistant Librarians”) generally work with existing collections, adding documents

Digital library collections

547

Figure 1. Starting a new collection

Figure 2. Exploring the local file space

LHT 23,4

548

Figure 3. Importing existing metadata

Figure 4. Filtering the file trees

Digital library collections

549

Figure 5. Assigning metadata using the Enrich view

Figure 6. Viewing all metadata assigned to selected files

LHT 23,4

550

Figure 7. Designing the collection

Figure 8. Specifying which plug-ins to use

Digital library collections

551

Figure 9. Configuring the arguments to a plug-in

Figure 10. Adding a full-text-search index

LHT 23,4

552

Figure 11. Adding a cross-collection search facility

Figure 12. Getting ready to create the new collection

and/or metadata. However, they can begin a new collection by copying the structure of an existing one, effectively creating an empty shell exactly like an existing collection, and adding documents and metadata to it. Collection design involves more advanced skills. Figure 1 shows the process of starting a new collection. Having selected New from the file menu, the user fills out general information about the collection – its name and a brief description of the content – in the popup window shown. The name is a short phrase used to identify the collection throughout the digital library: existing collections have names like Food and Nutrition Library, World Environmental Library, and so on. The description is a statement about the principles that govern what is included in the collection, and appears under the heading About this collection on the collection’s home page. At this point, the user decides whether to base the new collection on an existing one, selecting from the menu pulled down in Figure 1, or design a new one. In this example we will design a new collection, and now it is necessary to select one or more metadata sets for it. We choose Dublin Core from a popup menu (not shown in the Figure). Now the remaining parts of the interface, which were grayed out before, become active. The Gather panel, selected by the eponymous tab near the top of Figures 1-4, is active initially. It allows the user to explore the local file space and existing collections, gathering selected documents into the new collection. The panel is divided into two sections, the left for browsing existing file structures and the right for organizing the documents in the collection. Users navigate the existing file structure hierarchy in the usual way. They can select files or directories, drag them into the collection on the right, and drop them there. Entire file hierarchies can be dragged and dropped, and files can be multiply selected in the usual way. Users can navigate around the collection on the right too, adjusting the file hierarchy by dragging items around, creating new sub-hierarchies, and deleting files if necessary. Another possible source of documents is the web itself, and the Download panel can be used for this. This panel has many options: mirroring depth, automatically download embedded objects like images, only mirror from the same site, etc. URLs are entered into the panel (typically cut and pasted from a browser), and the system maintains a queue of items to download. The actual download operation is accomplished by a widely-used open-source mirroring utility. The resulting files appear as another top-level folder, called downloads, on the left-hand side of the Gather panel. In Figure 2 the interactive file tree display is being used to explore the local file system. At this stage the collection on the right is empty; the user populates it by dragging files of interest from the left-hand panel and dropping them into the right-hand one. Such files are copied rather than moved, so as not to disturb the original file system. Existing collections are represented by a subdirectory on the left called “Greenstone Collections”, which can be opened and explored like any other directory. However, the documents therein differ from ordinary files because they already have metadata attached, which the Librarian interface preserves when it moves them into the new collection. 
Conflicts may arise because their metadata may have been assigned according to a different metadata set from the one attached to the new collection, and the Librarian interface helps the user resolve these. In Figure 3 the user has selected some documents from an existing collection and dragged them into the new one. The popup window explains that the metadata element Organization cannot be automatically imported, and asks the user to either select a .


metadata set and press Add to add the new element to that set, or choose a metadata set and element, and press Merge to effectively rename the old metadata element to the new one by merging the two. Metadata in subsequent documents will be imported in the same way automatically. When large file sets are selected, dragged, and dropped into the collection, the copying operation may take some time – particularly if metadata must be converted too. The Librarian interface indicates progress by showing which file is being copied and what percentage of files has been processed. The implementation is multi-threaded: users can proceed to another stage while copying is still in progress. Special mechanisms are needed for dealing with large file sets. For example, the user can filter the file tree to show only certain files, using a dropdown menu of file types displayed underneath the trees. In the right-hand panel of Figure 4, only HTM and HTML files are being shown (and only these files will be copied by drag and drop). In fact, the left-hand panel is showing the same part of the file space without filtering, and you can see the additional .png and .jpg files that are present there. Adding metadata to documents The next phase of collection-building is to enrich the documents by adding metadata. This is where the Librarian users spend most of their time: enhancing the collection by selecting individual documents and manually adding metadata. We have already discussed two features of the Librarian interface that help with this task: (1) Documents that are copied during the first step come with any applicable metadata attached. (2) Whenever possible, metadata is extracted automatically from documents. The Librarian implements two further features that expedite manual metadata assignment: (1) Metadata values can be assigned to several documents at once, either by virtue of them being in a folder, or through multiple selection. (2) Previously-assigned metadata values are kept around and made easy to reuse. The Enrich tab brings up the panel of information shown in Figure 5. On the left is the document tree representing the collection, while on the right metadata can be added to individual documents, or groups of documents. Users often want to see the document they are assigning metadata to, and if they double-click a document in the pane on the left, it is opened by the appropriate viewing program. Here, the user has selected a document and typed “new creator” as its dc.Creator metadata. The buttons for appending, replacing and removing metadata become active depending on what selections have been made. Values previously assigned to Creator metadata are shown in the pane labeled “All previous values”. At any time the user can view all the metadata that has been assigned to the collection. The popup window in Figure 6 shows the metadata in spreadsheet form. For large collections it is useful to be able to view the metadata associated with certain document types only, and if the user has specified a file filter as mentioned above, only the selected documents are shown in the metadata display.

Designing a collection
All except "Assistant Librarian" users of the Librarian interface have the ability to design new collections, which involves specifying the structure, organization, and presentation of the collection being created. The result of this process is recorded in a "collection configuration file", which is Greenstone's way of expressing the facilities that a collection requires in a machine-readable form.

Collection design has many aspects. Users might review and edit collection-level metadata such as title, author and public availability of the collection. They might define what full-text indexes are to be built. They might create sub-collections and have indexes built for them. They might add or remove support for predefined interface languages. They will need to decide what document formats will be included. In Greenstone, document types are processed by modules called "plug-ins", and each plug-in may need to be configured by specifying appropriate arguments. The collection designer will need to specify what browsing structures will be constructed – in Greenstone, these are built by modules called "classifiers", which also have various arguments. It will also be necessary to specify the formatting of various items in the collection's user interface. Sensible, generally applicable defaults are supplied for all these features.

Users accomplish this process with the Design panel illustrated in Figures 7-10. It has a series of separate interaction screens, each dealing with one aspect of the collection design. In effect, it serves as a graphical equivalent to the process of editing the raw collection configuration file manually.

In Figure 7 the user has clicked the Design tab and is reviewing general information about the collection, which was entered when the new collection was created. On the left are listed the various facets that the user can configure: Document Plug-ins, Search Types, Search Indexes, Partition Indexes, Cross-Collection Search, Browsing Classifiers, Format Features, Translate Text, and Metadata Sets. For example, clicking the Document Plug-in button brings up the screen shown in Figure 8, which allows you to add, remove or configure plug-ins, and change the order in which the plug-ins are applied to documents.

Both plug-ins and classifiers have many different arguments or "options" that the user can supply. The dialog box in Figure 9 shows the user specifying arguments to a plug-in. The grayed-out fields become active when the user adds the option by clicking the preceding tick-box. Because Greenstone is a continually growing open-source system, the number of options tends to grow as developers add new facilities. To help cope with this, Greenstone has a "plug-in information" utility program that lists the options available for each plug-in, and the Librarian interface automatically invokes this to determine what options to show. This allows the interactive user interface to keep pace automatically with developments in the software.

In Figure 10 the user is adding a new full-text-search index to the collection, in this case based on both dc.Creator and dc.Description metadata. In Figure 11 she is adding a "cross-collection search" capability so that other collections are searched whenever this one is.
Building the collection
The next step is to construct the collection formed by the documents and assigned metadata. The brunt of this work is borne by the Greenstone code itself. The user observes the building process through a window that shows the text output generated by Greenstone's importing and index-building scripts – filtered for brevity in all but the Expert user level – along with a progress bar that indicates the overall degree of completion. Figure 12 shows the Expert (i.e. most detailed) version of the Create view through which users control collection building. On the left are groups of options that can be applied during the creation process: Import, Build, and Message Log. The user selects values for the options if necessary, and clicks Build Collection.
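The choices made through the Design panel correspond to lines in the collection configuration file mentioned above. As a rough illustration only, the sketch below uses directive names from typical Greenstone 2 examples; the e-mail address, plug-in list, index choices and wording are invented for the illustration and are not taken from the article.

  # collect.cfg -- illustrative Greenstone 2 collection configuration (not from the article)
  creator        librarian@example.org
  maintainer     librarian@example.org
  public         true

  # full-text indexes, including ones built on Dublin Core metadata
  indexes        document:text document:dc.Title document:dc.Creator
  defaultindex   document:text

  # plug-ins, applied to source documents in this order
  plugin         ZIPPlug
  plugin         GAPlug
  plugin         HTMLPlug
  plugin         RecPlug -use_metadata_files

  # browsing structures ("classifiers")
  classify       AZList -metadata dc.Title
  classify       Hierarchy -hfile org.txt -metadata Organization

  collectionmeta collectionname  "Demonstration collection"
  collectionmeta collectionextra "A short description shown on the collection's About page."

Because the Design panel is, as described above, a graphical equivalent of editing this file, hand-editing it and rebuilding the collection achieves the same effect for users who prefer to work outside the Librarian interface.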


3. Beyond the Librarian interface
Most of the customization that non-programming users perform in Greenstone takes place in the collection configuration file, which the Librarian interface creates. It depends crucially on the availability of metadata, and the structures defined are only produced if appropriate metadata is provided. However, Greenstone has more advanced customization features. Our philosophy is to target the most common features and make them accessible to librarian-level users without particular training in computer science. But users who are prepared to dig deeper can accomplish more.

Macros
Greenstone incorporates a macro facility, expressed as an extension of HTML. It includes the ability to define macros and perform textual substitution. Currently, for example, there are interfaces in over 35 languages, from Arabic to Turkish, Bosnian to Ukrainian, Chinese to Vietnamese. To accommodate these variants, and to allow the language interfaces to be updated when new facilities are added, all web pages are passed through a macro expansion phase before being displayed. This means that a new language can be added by providing a new set of language-specific macros, a task that has been performed many times by people with no expertise in Greenstone.

The digital library functionality is hooked into the user interface through "dynamic macros" whose expansions are determined by the system (in terms of other macros). For example, the search widget is generated by a dynamic macro. Users can incorporate this widget into their own web pages, provided they go through the macro expansion phase. A total of about 20 dynamic macros provides access to Greenstone's full user-interface functionality. Users who work with Greenstone can capitalize on the macro system to radically alter the style of the pages generated, and some have produced attractive new designs for the Greenstone user interface (Zhang, 2003).

Altering the run-time system
The part of Greenstone that serves collections to users is called the "receptionist", and one sometimes has to resort to changing this program to achieve a desired level of customization. This rarely involves large changes, but it creates software management difficulties in dealing with different parallel versions. Our system development strategy is to accept the inevitability of occasionally having to build a special-purpose, collection-dependent receptionist to achieve some desired features, and to note what is required with a view to incorporating it as an option within the standard Greenstone code.

4. Conclusions
A general-purpose digital library system like Greenstone must cater for a wide range of users. We have targeted the Librarian interface at four different user levels:

assistant librarians, who can add to existing collections and create new ones with the same structure; librarians, who can, in addition, design new collections; library systems specialists, who can customize collections in more complex ways; and expert users, who can deal with every aspect of the system.

A digital library may be customized in a wide variety of different ways, and virtually every collection has its own idiosyncratic requirements. Although a basic Greenstone collection of new material with a standard look and feel can be set up in just a few minutes, most users want more personalization. As the number of collections grows and the variety of styles increases, it becomes more likely that some existing collection will match new requirements.

It is always difficult to produce good, up-to-date documentation for a richly functional software system. In fact, from a user's point of view the chief bottleneck in customization is documentation, not the facilities that are provided. Consequently, collection builders need access to advice and assistance from others in order to continue to learn how to tailor the software to meet ever-changing requirements. There is a lively Greenstone e-mail discussion group; participants hail from around 70 countries.

Digital libraries have the advantage over other interactive systems that their user interfaces are universally based on metadata. Metadata is the glue that allows new documents to be added and immediately become first-class citizens. It is also the key to user interface customization, and Greenstone incorporates a range of mechanisms at different levels to capitalize on this.

Note 1. There is also an applet version that allows collections to be constructed remotely.

References
Bainbridge, D., Don, K.J., Buchanan, G.R., Witten, I.H., Jones, S., Jones, M. and Barr, S.I. (2004), "Dynamic digital library construction and configuration", Proceedings of the European Digital Library Conference, Bath.
Davidson, C. (1993), "The man who made computers personal", New Scientist, 1978, June, pp. 32-5.
Lagoze, C. and Payette, S. (2000), "Metadata: principles, practices and challenges", in Kenney, A.R. and Rieger, O.Y. (Eds), Moving Theory into Practice: Digital Imaging for Libraries and Archives, Research Libraries Group, Mountain View, CA, pp. 84-100.
Lesk, M. (2005), Understanding Digital Libraries, Morgan Kaufmann, San Francisco, CA.
Witten, I.H. and Bainbridge, D. (2003), How to Build a Digital Library, Morgan Kaufmann, San Francisco, CA.
Witten, I.H., Cunningham, S.J. and Apperley, M. (1996), "The New Zealand digital library project", D-Lib Magazine, Vol. 2 No. 11, available at: www.dlib.org/dlib/november96/newzealand/11witten.html
Witten, I.H., Bainbridge, D. and Tansley, R. (2005), "StoneD: a bridge between Greenstone and DSpace", paper presented at the Joint Conference on Digital Libraries, Denver, CO.
Zhang, A. (2003), Customizing the Greenstone User Interface, Washington Research Library Consortium, Washington, DC, August, available at: www.wrlc.org/dcpc/UserInterface/interface.htm

Appendix

Figure A1.


OTHER ARTICLE

Taking pro-action: a survey of potential users before the availability of wireless access and the implementation of a wireless notebook computer lending program in an academic library


Received 10 October 2004. Revised 26 October 2004. Accepted 23 November 2004.

Hugh A. Holden Monmouth University Library, West Long Branch, New Jersey, USA, and

Margaret Deng
Union County College, Elizabeth, New Jersey, USA

Abstract
Purpose – The purpose of the article is to gauge reaction to the implementation of a wireless laptop lending program in a university library before it actually became operational and before wireless access itself became available.
Design/methodology/approach – This online survey consisted of 22 multiple-choice questions that all Monmouth University students and employees were invited by e-mail to answer.
Findings – The vast majority of responses came from students, and most of them were ready for wireless access in the library and across campus. Several re-emphasized in text their desire to log on to the network with their own laptops.
Research limitations/implications – The survey ran for only two weeks; and yet, because tabulation was done by hand, a response rate ten times greater would have made our method impracticable.
Practical implications – This kind of survey is comparatively easy and fast to implement. It lends itself to follow-up surveys to measure the success of a wireless computer program or other technological development, including the possible effects on user attitude.
Originality/value – This study was original in that it took place just before a wireless laptop-lending program was activated. Tightly focused online surveys with a limited number of questions can help librarians anticipate issues not considered or sufficiently emphasized earlier, or quickly assess the impact wireless access is having.
Keywords Laptops, Surveys, Wireless
Paper type Research paper

Library Hi Tech Vol. 23 No. 4, 2005, pp. 561-575 © Emerald Group Publishing Limited 0737-8831 DOI 10.1108/07378830510636346

Introduction
A search of the literature showed that library user surveys are mainly conducted post facto to gauge the intended audience's reaction to a change that has occurred, whether as a result of a deliberate action, such as a new resource or service (e.g. wireless access), or a cutback of some kind. It appears that far fewer surveys are conducted pro-actively, in which the survey focuses upon a change that has not yet occurred, such as a new service that the library is considering or developing (Pitkin, 2001). To be sure,
the internet has not eliminated other ways of doing surveys, but it has greatly expanded the concept, allowing for types of surveys that could not be done cheaply before (or by so many people), including very narrow “targeting” within a massive potential respondent base (“everyone connected to the internet”). Consequently, libraries can benefit from small, quickly implemented, targeted surveys. That’s what the authors have attempted here.

Purpose and rationale
The objective of this survey was to gather quickly, directly from the intended user population, information that could prove useful in planning and managing a wireless notebook computer lending program in the academic library of a small university (FTE 4,381, plus about 1,100 employees at the time). This relatively short online (or "web-based") survey was not intended for the deep study of a population but, rather, to help the staff of the Guggenheim Library (hereafter also "the library") quickly develop a better sense of what the people of Monmouth University who use the library are thinking when they think about what is variously called "wireless fidelity" ("wi-fi"), "wireless access", "wireless computing", "ubiquitous computing", or "pervasive computing". Though many shades of meaning can be teased out of these different phrases (and their use in the literature seems to vary by author), for the purpose here all these terms refer to people using battery- (or A/C-) powered notebook computers ("laptop" is equally accepted as denoting ready portability) that are capable of accessing the library's online resources and the internet via a wireless local area network (WLAN), connected to the University's network through access points (APs) placed strategically throughout the public areas of the library. (From here forward, the phrase "wireless notebook computer" and its acronym, WNC, will be used interchangeably.)

The library stood to become the first point of wireless network access for students anywhere on the campus of this institution. (There were no other means of such access at the time of this study.) But, until this survey, Monmouth librarians had had to base discussions and decisions mainly on their individual impressions and speculations, on what had been learned (mainly through trade journals) about other universities' experiences, and on what was understood about the technology. It occurred to the authors that this period of time – that is, the time prior to actual public access to wireless notebook computers in the library or wireless access anywhere on campus – was both the last and a very good opportunity in which to examine the prospective users of WNCs. Doing so could help the library anticipate the needs, concerns, and expectations that people will have for this new library service, thereby allowing the librarians to anticipate difficulties or even make substantial changes to the wireless notebook lending program before it is actually set in operation.

In the process of looking, however, we saw that, on a much larger scale, surveys of professionals (including librarians) responsible for the network or computing technology of organizations seem to support the common impression that colleges are well on their way toward universally embracing wireless technology for the full spectrum of college-related activities. According to a survey of 632 campus officials conducted in 2002 by the "Campus Computing Report", 67.9 percent of the institutions surveyed have "functioning wireless LANs" on their campuses (Green, 2003). Yet, this same survey data "suggest that wireless services cover just under a fifth (18.3 percent) of the physical campus at those institutions reporting wireless networks . . ." (Green, 2003). Green does not take this suggestion to the logical next step and infer that wireless computing on most campuses that have it is (as of the time of the survey) far from campus-wide or ubiquitous. If most campuses are indeed taking an area-by-area approach to establishing wireless connectivity, then Monmouth University ("MU") is not out of the norm in piloting or experimenting with the library first. Monmouth may be unusual in its wireless pre-preparation prior to actual implementation.

Methodology and survey instrument
Why a simple web page format?
In doing this survey online, the authors followed what has become an accepted operating procedure for online survey distribution. They were targeting a group already largely accustomed to using the library's services and resources online, and it can be assumed that at least the students, if not all staff and faculty, would be quite familiar with interacting over the internet. Everything from e-mail and personal server space to online registration for courses is the norm. A number of MU students each year take full courses online, and many faculty members manage their courses with a proprietary program (Educator, by Ucompass, at the time of this survey). Given what seemed a generally favorable climate for an online survey, it was hoped that participation would be encouraged by keeping the survey relatively short and easy to do. (All but one item was multiple-choice.) To this was added the promise of anonymity – though demographic information was sacrificed – to further boost the rate of response.

The above factors, alone, could have been sufficient for preferring web-based to traditional paper-and-pen survey methods. The savings in cost – in terms of both time and labor – are another major reason why web-based surveys now flourish. And, in this case, expediency was critical because of the very small window of opportunity.

How was the survey form organized?
The survey instrument was built, using Dreamweaver, as a form on a single web page in HTML with JavaScript. The survey questions were arranged in three sections: four questions (A), 18 questions (B), and a text box for comments (C). All questions were multiple-choice, using "radio buttons" set to make it impossible to choose more than one answer. People were neither encouraged nor discouraged from identifying themselves in the text box in Part C. The instructions only stated that, for a personal answer to a question or concern, respondents must write a separate e-mail to the address supplied.
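For illustration, one multiple-choice item of the kind described might have been marked up roughly as in the sketch below. The field names, the CGI script path and the small completeness check are hypothetical; the article does not reproduce the actual page.

  <!-- One multiple-choice item; radio buttons sharing a name allow only one answer. -->
  <form name="survey" method="post" action="/cgi-bin/survey-mailer.cgi"
        onsubmit="return checkComplete(this);">
    <p>B1. "Wireless computing" will eventually become as "ordinary" as cell phones.</p>
    <label><input type="radio" name="B1" value="1"> Strongly agree</label>
    <label><input type="radio" name="B1" value="4"> Neutral</label>
    <label><input type="radio" name="B1" value="7"> Strongly disagree</label>
    <p>Part C (optional): <textarea name="C1" rows="4" cols="60"></textarea></p>
    <input type="submit" value="Submit survey">
  </form>

  <script type="text/javascript">
  // Refuse submission unless every multiple-choice group has a selection,
  // mirroring the caution that incomplete answers "would not otherwise register".
  function checkComplete(form) {
    var groups = ["B1"];                      // in practice, all 22 question names
    for (var i = 0; i < groups.length; i++) {
      var radios = form[groups[i]], chosen = false;
      for (var j = 0; j < radios.length; j++) {
        if (radios[j].checked) { chosen = true; }
      }
      if (!chosen) { alert("Please answer question " + groups[i]); return false; }
    }
    return true;
  }
  </script>

On submission, a server-side CGI script of the kind mentioned in the next section would receive the name/value pairs and forward them by e-mail.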


How did the survey operate?
The survey web page was linked to the library's home page, where attention was drawn to it by an open invitation that changed color on mouse-over and a thumbnail image of an open notebook computer. The link remained in place and the survey remained active for the last two weeks of the spring 2003 semester (14 days). A customized CGI script on the server side collected the data from each completed survey form and sent it to the authors as e-mail.

Attention was drawn to the survey
Immediately after the survey became active, the same e-mail invitation was sent to all students and employees of the university. This e-mail message described the location of the link and also included the exact URL. In the survey's introduction, people were encouraged to complete all the multiple-choice questions (excluding the optional Part C) by the caution that their answers would not otherwise register.

Results and analysis
Because processing of the raw data was neither automated nor assisted by special software, considerable handwork was required. The JavaScript applet tagged each datum with the name of the question (or part of the question, in the cases of B5 and B16). When the submit button was clicked by a survey respondent, that data set was sent directly, unsorted, to the e-mail of the authors. MS Excel was used to organize the data and crunch the numbers. The resulting figures were copied into tables in MS Word. (Note that, in the discussion that follows, some questions are slightly altered in wording to fit the space available.)

Though the absolute numbers are low (see Table I), the proportions roughly mimic the populations of these groups on campus. In fact, the student response rate outpaced that of faculty by a ratio of about 7 to 1. Even if the authority of this survey is limited by the small total of returns, the authors believe that the results support the common knowledge on the MU campus that the students, as a group, are far more accepting of, comfortable with, and even casual about the interactions made possible by the so-called communications revolution of the past quarter century. (Though no conclusions can be rightfully drawn from the fact, it might be noted that the only people who wrote separately, by e-mail, to say they had encountered technical or conceptual problems with the survey were both faculty.) Generational differences generally mean that students and faculty approached this survey differently. For example, though most students, by virtue of their youth, are more accustomed than any other group to being asked for feedback online, they have also been conditioned more than any other cohort to ignore many such requests.
(The authors considered, in planning the survey, encouraging participation by offering some small material incentive, but this perennially popular technique would have meant logistical problems that could not be managed on such short notice.)

Table I. Survey, Part A1. Which one of the following best describes your relationship to Monmouth University?
  Student: 143 (76.47%)
  Staff: 10 (5.35%)
  Faculty: 20 (10.70%)
  Administrator: 12 (6.42%)
  Alumni: 0 (0.00%)
  A member of the public not affiliated with MU: 2 (1.07%)
  Total respondents: 187

There is necessarily a fudge factor built into a survey question such as A2 (see Table II), where people are asked to "estimate your average" time spent in the library. It is entirely possible – probable, in fact – that the figures the respondents chose would contrast markedly with those an observational study would produce. But the latter is hardly necessary for the purpose here. As with most user surveys, it was necessary to trust that most people are honest most of the time and, even if guessing, guess with consistent degrees of precision and accuracy. The single largest number, 70 (37.43 percent), spent an average of less than 5 hours per week in the library. Another study would need to be done to confirm or refute the impression that many MU students regard the library as a pit stop for information (or other activities, of course) and not as a place to anchor their information gathering and organizing activities. Nevertheless, a major impetus for implementing a WNC lending program is, indeed, to make the library more attractive to potential users.

Table II. Survey, Part A2. How many hours do you spend in the library? Please choose the answer nearest to what you estimate your average to be.
  20 or more hours per week: 9 (4.81%)
  More than 10 but less than 20 hours per week: 14 (7.49%)
  More than 5 but less than 10 hours per week: 30 (16.04%)
  Less than 5 hours per week: 70 (37.43%)
  No idea or too small to measure: 64 (34.22%)
  Total respondents: 187

If graphed, the data in Table III create a gentle wave curve, suggesting that people are almost evenly divided across the board. Though this data does not prove that the library has become little more than another computer lab for most students, for just over half of these respondents (51.54 percent) the library appears to have acquired, or nearly acquired, that role. This information places an extra emphasis on that acquired from Question A2.

Table III. Survey, A3. When in the library, how much of your time is spent using a computer? (Again, please estimate.)
  Less than 40% of my time: 51 (27.27%)
  About 50% of my time: 40 (21.39%)
  Between 60 and 80% of my time: 50 (26.74%)
  My time is almost exclusively spent using the computers: 46 (24.60%)
  Total respondents: 187

In addressing the online dimension of library use, Question A4 (see Table IV) seemed to support the "stop and shop" model of current trends in library use. Yet, there appeared to be a measurable if small number who turned to the library online as a major tool of their research efforts.
It is tempting to ask, "Where are the rest of the University's students going for their research needs?" In fact, the figures of this survey seem to fall in line with those of other studies, which paint an image of modern college students as increasingly impatient with traditional research methods – though "traditional" now refers to online databases that are pure bibliographic indices, abstracts, or otherwise not fully full-text.

Table IV. Survey, A4. How much time per day, on average, do you spend using the library's web pages . . . no matter where you are physically?
  Less than 20 minutes: 90 (48.13%)
  More than 20 minutes but less than 1 hour: 45 (24.06%)
  1-2 hours: 31 (16.58%)
  More than 2 hours: 21 (11.23%)
  Total respondents: 187
Part B
Part B of the survey is the longest section; it contains 18 multiple-choice questions. Questions 1, 2, 4, and 17 asked the respondent to place himself on a scale between "strongly agree" and "strongly disagree".

On the "agree" side of "neutral" in Question B1 (see Table V), 166 respondents (88.77 percent) felt that "wireless computing" will eventually become the norm at all levels of education. Given the makeup of the respondents (Question A1), this was not at all surprising. Only 9 (4.81 percent) disagreed. (Part C was used by some who chose "Strongly disagree" to detail their concerns.)

Table V. Survey, B1. How much do you agree or disagree with the following statement? "Wireless computing" will eventually become as "ordinary" as cell phones and it will be used by students in all levels of education. (Scale: 1 = Strongly agree, 4 = Neutral, 7 = Strongly disagree.)
  1: 122 (65.24%)   2: 31 (16.58%)   3: 13 (6.95%)   4: 12 (6.42%)   5: 2 (1.07%)   6: 3 (1.60%)   7: 4 (2.14%)
  Total respondents: 187

In Question B2 (see Table VI), 161 (86.09 percent) of the respondents agreed, to various degrees, that wireless notebook computing would be a welcome addition to the library. Keep in mind that, at this point, no qualifications had been placed on this idea. So, in its purest form, so to say, wi-fi is very welcome.

Table VI. Survey, B2. How much do you agree with this statement? I would like to see the library offer "wireless mobile computing". (Scale: 1 = Strongly agree, 4 = Neutral, 7 = Strongly disagree.)
  1: 123 (65.78%)   2: 23 (12.30%)   3: 15 (8.02%)   4: 18 (9.63%)   5: 2 (1.07%)   6: 2 (1.07%)   7: 4 (2.14%)
  Total respondents: 187

Curiously, the responses to Question B3 (see Table VII) would draw a shallow, valley-shaped line graph. Those who thought that there should be no time limit for a checked-out WNC (55, or 29.41 percent) and those who chose the lowest limit offered, two hours (53, or 28.34 percent), show less than 1 percent difference.

Table VII. Survey, B3. Choose which of the following you think would be the best borrowing time limit for the WNCs. (Assume that there will be a limited number of these computers available.)
  2 hours: 53 (28.34%)
  3 hours: 44 (23.53%)
  4 hours: 35 (18.72%)
  There should be no time limit: 55 (29.41%)
  Total respondents: 187
The question asked people to assume that "there will be a limited number of these computers available". Note that, at this point, respondents had not been told directly that lending would be confined to the library: no WNC would be allowed to leave the building. (Of course, one could certainly infer from this and later questions that the library was weighing this issue.)

Question B4 (see Table VIII) was intentionally designed with the expectation of strong reactions. On the contrary, the largest number of responses was "Neutral" (75, or 40.11 percent).

Table VIII. Survey, B4. How much do you agree or disagree with the following statement? "A WNC program is of limited value to me if I cannot take the computer anywhere and keep it for an extended period of time." (Scale: 1 = Strongly agree, 4 = Neutral, 7 = Strongly disagree.)
  1: 36 (19.25%)   2: 17 (9.09%)   3: 15 (8.02%)   4: 75 (40.11%)   5: 11 (5.88%)   6: 9 (4.81%)   7: 24 (12.83%)
  Total respondents: 187
However, the next highest general group (68, or 36.36 percent) agreed to some degree (scale points 1-3) that the lack of complete mobility and long-term borrowing made a WNC borrowing program of "limited value to me". In other words, respondents seemed to be more pragmatic than dogmatic on this issue. Most would not immediately ignore a program that may have serious limitations compared to their ideal.

Answers to Question B5, the borrowing-priority ranking, could have been coordinated with group-identifying information, but determining how many students said that students should have highest priority was not the point of this question. Rather, the authors wanted to see whether the response distribution accorded with the sense of priorities of library personnel. And indeed it did (see Table IX). A roughly straight line can be drawn from the top-left to the bottom-right cells in the table to show that students are given highest priority (163 first-place choices), with "Non-affiliated" ranked sixth (153 last-place choices).

Table IX. Survey, B5. Who should have highest priority when it comes to borrowing a library WNC? Who should have second priority? and so on. Respondents clicked a number from 1 (highest) to 6 (lowest) for each group; cells are counts of respondents, and each row totals 187.
  Priority:          1     2     3     4     5     6
  Student          163    12     2     2     1     7
  Staff              8    54    47    50    14    14
  Faculty           26    87    45    11    10     8
  Administrator     12    27    54    52    18    24
  Alumni             6     7    16    24   102    32
  Non-affiliated    10     2     3     4    13   153

Question B6 (see Table X) would also yield a graph with a very steep curve. It could not fairly be inferred from this question that these same people actually would avail themselves of the opportunity. Nevertheless, the response was so overwhelmingly in favor of the idea that it is safe to infer from it a high degree of "conceptual preparedness" for wireless computing.

Table X. Survey, B6. Would you like to be able to bring to the library your own notebook computer (equipped with a wireless network card) and connect to the university network?
  Yes, very much: 132 (70.59%)
  Yes: 37 (19.79%)
  Not sure: 11 (5.88%)
  No: 5 (2.67%)
  Strongly no (given the ability to do this, I would not): 2 (1.07%)
  Total respondents: 187
Question B7 (see Table XI) was intended to add back into the picture of portable wireless computing one of the complications that were purposely left out of B6: 58.82 percent would not expect hands-on assistance. This suggests a rather high degree of independence – or, at least, an assumption thereof. Personal notebook PCs commonly appear in the library, and many people today think of PCs much as they do cell phones. Yet, to those who know computer and network technology well, nothing is assumed and "trouble-free" is an oxymoron. Experience argues that the library should be prepared to offer, or direct people toward, every necessary level of technical assistance.

Table XI. Survey, B7. Referring to the previous question (B6), would you expect the library to help you prepare your notebook PC for use in the library?
  Yes: 77 (41.18%)
  No: 8 (4.28%)
  No, but the library should have some printed instructions for us: 84 (44.92%)
  No, but the library should refer us to a department which can help: 18 (9.63%)
  Total respondents: 187

Though given little thought by most borrowers, it is an (almost) universally understood library rule that if you borrow an item, you are responsible for its safe return. Because of their cost and relative fragility, portable computers up the "anxiety ante" considerably – at least for some. In fact, 32 respondents (17.11 percent) said that they would not borrow a WNC if they were responsible for repair or replacement of a lost or damaged unit, a fee of up to $2,000 (see Table XII). Of course, it would require more questions (a more complex survey) to probe the thinking behind these responses, and perhaps an instrument can be designed to test what is a common understanding among those who manage technology, namely that the end users of high technology usually do not take into account the difference between the organization's cost for a hi-tech item and the personal retail cost for what appears to be the exact same item. Even if the actual replacement cost for one of these WNCs is substantially less than the original cost, individuals commonly fail to consider the organization's extended cost for an original or replacement piece of equipment. Library personnel may find themselves explaining why a student will be charged $2,000 for a lost, stolen, or destroyed notebook PC when that person knows that it, or its functional equivalent, is available for less than $1,000.

Table XII. Survey, B8. Would you hesitate to borrow one of the WNCs if the cost of repair or replacement is your responsibility?
  No; it's only fair that the borrower be held responsible: 55 (29.41%)
  No, but I would "fight it" if I thought I didn't deserve the charge: 54 (28.88%)
  Yes, it would make me very nervous: 46 (24.60%)
  Yes; in fact I wouldn't borrow one; it's not worth the risk: 32 (17.11%)
  Total respondents: 187

Question B9 (see Table XIII) is informative because it helped the library to gauge the level of expectation of those who might borrow a WNC at the library. These findings would seem to clash with the library's original decision that the WNCs were to be stripped down to essentials (browser, PDF file viewer, MS Word) – such was the library's emphasis on their primary use as research and writing tools. It was decided that the sound cards would be disabled to prevent the playing of music (or the full enjoyment of DVDs), but concessions were immediately made in light of the fact that, today, many college courses assume or require access to various types of media
(e.g. streaming video news coverage). The library staff will re-enable the sound card upon request, given an academic justification.

Table XIII. Survey, B9. Do you expect to be able to do everything on the WNCs that you normally do on the desktop PCs?
  Yes, the WNC should be just like any PC here for student use: 111 (59.36%)
  Yes, but I think the WNCs should be able to do MORE because they are newer than the desktop PCs: 50 (26.74%)
  No, as long as the WNCs have essential programs and I can "connect" to the Net: 26 (13.90%)
  Total respondents: 187

Although a solid majority of respondents claimed to have regular use of a notebook PC, almost 89 percent had never used a library-owned notebook computer in a library (see Tables XIV and XV). In other words, to a large majority of respondents, this was a new concept. These patrons were – we can safely assume – very familiar with borrowing library items, and with portable computing, but they appear unsure how the two concepts could work together. In turn, this suggests that library staff would need to be sure that new borrowers understand all the ins and outs of borrowing a WNC.

Table XIV. Survey, B10. How much experience do you have with notebook computers? (Choose the one that comes closest.)
  None: 10 (5.35%)
  Some (less than 25 hours): 29 (15.51%)
  Moderate: I have used one many times: 33 (17.65%)
  Great experience: I own and/or use a notebook PC regularly: 115 (61.50%)
  Total respondents: 187

Table XV. Survey, B11. Have you used a library-owned notebook computer in any other library setting?
  Yes: 21 (11.23%)
  No: 166 (88.77%)
  Total respondents: 187

The 32.62 percent who had no previous experience with wireless computing could require substantial, even a disproportionate, amount of the time of the Circulation Department personnel (see Table XVI). (Very early in the development of this program, employees in the Circulation Department expressed serious concern about the amount of technical support that laptop borrowers might expect them to provide.) The Campus Technology Department of MU worked with the library (as its own test bed) to make wireless access as simple as "Turn it on, enter your network ID and password, and go" – just as all Monmouth University people can do on any university desktop PC. (Campus Technology was particularly concerned with the security issues involved.)

Table XVI. Survey, B12. How much experience do you have with "wireless" computing?
  None: 61 (32.62%)
  Some (less than 25 hours): 55 (29.41%)
  I use (or have used) wireless computers often or regularly: 56 (29.95%)
  I'm a Computer Science major. Need I say more?: 15 (8.02%)
  Total respondents: 187
It is practically impossible to design all possible problems out of a network before it opens for business, but the library now had grounds on which to argue that technical support needs to be very robust if the wireless laptop computers were to provide a meaningful addition to library services.

Question B13 (see Table XVII) may sound like a quiz question, but its intent is not malicious. The terminology of wireless networking (wi-fi, etc.) has been working its way into the vernacular for the past several years as the technology has taken on the aura of the next "killer app". The survey figures suggested that borrowers should be made aware that the library's wireless network will not be the best choice for downloading or uploading very "fat" files or for intensive interaction online (such as real-time video conferencing), especially since 11 Mb is the nominal maximum speed of 802.11b and, like the 56k transfer rate of V.90 dial-up modems, is rarely achieved in practice. Laptop borrowers should be informed about signal drop-off, interference, and other characteristics of wireless networks.

Table XVII. Survey, B13. What is the maximum network speed that a Wireless Local Area Network, using the 802.11b standard, is capable of?
  1 Gb (gigabits): 9 (4.81%)
  11 Mb (megabits): 35 (18.72%)
  50-100 Mb: 9 (4.81%)
  Don't know: 134 (71.66%)
  Total respondents: 187

Table XVIII. Survey, B14. Would you like the option of being able to plug the WNC into the University's wired network (LAN)?
  Yes: 149 (79.68%)
  No, I wouldn't use it: 11 (5.88%)
  Don't know: 27 (14.44%)
  Total respondents: 187

The numbers in Tables XVIII and XIX suggested that library staff should be prepared to offer at least a "quick and clean" tutorial (something more than "Here is the power button") to those who feel they need it. PCs have long come loaded with indexed help screens and, commonly, with fair-to-excellent tutorials, but people have tended to ignore them.

Table XIX. Survey, B15. Would you appreciate basic instruction in using and trouble-shooting WNCs before you borrowed one?
  Yes: 55 (29.41%)
  Yes, if it was very brief: 78 (41.71%)
  No, but give us a user's manual AND access to the HelpDesk: 39 (20.86%)
  No, just give me the computer: 15 (8.02%)
  Total respondents: 187
The authors, in consultation with Guggenheim Circulation staff, have built a set of integrated web-based modules on topics such as "Borrowing a library laptop" and "Getting help", all tailored to the notebooks purchased and the people (students) who will be borrowing them. Yet staff were realistic enough to concede that all this online assistance may receive little or no independent use by borrowers.

We might conclude from Table XX that a large majority of the students who would be borrowing these WNCs would be intent on doing research (presuming all were honest). But note that "Research using the internet", which 117 (62.57 percent) of respondents gave top priority, nosed out "Research using library resources" at 112 (59.89 percent). These figures might suggest that MU people placed less value on the research resources purchased than on those that they have been able to find on the net (whether free or fee-based). Due to the design of the question, however, it cannot be made certain that the respondents uniformly made the distinction between subscription databases that are accessed via the internet and web sites unassociated with the university. Working that distinction into the question would have made for a longer and perhaps perilously complex survey. This discussion notwithstanding, the question relates to a WNC program because it may not be a safe assumption that students will see this additional computing power – with the advantage of mobility within the library – as improving their research and work environment.

Table XX. Survey, B16. What will you use the WNC for in the library? Respondents clicked a number from 1 (most important) to 7 (least important) for each activity; cells are counts of respondents, and each row totals 187. (The form reminded respondents: "REMEMBER & relax: we cannot learn who you are.")
  Priority:                          1     2     3     4     5     6     7
  Chat                              10    10    16    17    11    40    83
  E-mail                            22    23    30    31    40    23    18
  Games                              9     5     8     8    15    38   104
  WebCT                             46    35    34    26    13    19    14
  Research using the internet      117    35    14     6     3     4     8
  Working with applications         61    37    29    27    17     5    11
  Research using library resources 112    16    18    16    10     8     7

Casting aside cynical interpretations ("We want whatever is new and cool"), it could be said that the respondents in Table XXI wanted whatever may enhance the library as a learning environment and provide them more comfortable access to the information resources that they need.

Table XXI. Survey, B17. Rate your level of agreement with the following statement: "Wireless computing belongs in the library as one of the technologies offered to students." (Scale from 1 = Strongly agree to 6 = Strongly disagree; counts of respondents.)
  1: 110   2: 28   3: 17   4: 25   5: 4   6: 3
  Total respondents: 187

Question B18 (see Table XXII) is the most complicated in appearance. It asks respondents to imagine themselves in a scenario: a library where they want to use a computer but no desktop PCs are available, or the computing area (what the library calls the "Information Commons") is crowded and noisy. Five people (2.67 percent) agreed that it would not be worth the time to go through the check-out procedure and then be concerned about returning the laptop on time, with no assurance that they would be allowed to check it out again immediately. In fact, it had already been decided that two distinct forms of photo ID would be required and that borrowers would need to sign an agreement accepting full responsibility for the laptop. Aware that acceptance of a new service hinges as much on the smooth workings of the associated bureaucracy as on that of the machinery itself, the experienced circulation staff of the library, in reviewing drafts of the laptop policy and procedures, soon suggested ways to streamline check-out/check-in and maintenance.

Table XXII. Survey, B18. If the desktop PCs in the library are all in use or the computing area is crowded, would you ask for a wireless notebook computer? (Choose the closest fit.)
  Yes, I would definitely ask if a wireless notebook was available: 132 (70.59%)
  Maybe; it would depend on other factors such as how much time is left till closing: 50 (26.74%)
  No; the time it would take to check one out and set it up would be better spent doing other work while waiting for a computer to become available: 5 (2.67%)
  Total respondents: 187

For a laptop lending program – like many other library services – any library would hope to achieve just the right level of popularity, so that the service is neither under-used nor over-taxed. Whether or not this is a realistic expectation when lending laptops, the ability to make adjustments with the growth of experience appears to be a necessity and should be built into policy and procedures. The responses to these questions suggest that the expectations and preparedness of wireless laptop borrowers are not entirely predictable. And, from experience with computers and networks, library personnel should know that neither is the behavior of these technologies.
Part C
This section contained only a text box and an invitation for respondents to leave comments or questions. People were assured that this was optional. Also, participants were told that, to receive a personal answer, questions needed to be sent by separate e-mail to the e-mail address given. Only two people did that.

The 42 comments received represent a fairly large piece of a small pie.
Of these 42, 36 (87 percent) had identified themselves as students in Part A, while 4 (9.5 percent) claimed to be faculty, and all other groups combined came to 2 (4.8 percent). The comments offered were often insightful, and all were worth reading. Several ideas were repeated often enough to allow for quantification. Sorting the comments revealed the following.

A total of 37 of the voluntarily given comments (88 percent) strongly favored a WNC program. Their stated reasons fell into four groups:
(1) A WNC program is another step in modernizing, so that Monmouth University can keep pace with other universities (the implicit understanding being that "everyone is doing it").
(2) WNCs will provide an alternative to the PC lab (in the library), which can be crowded, noisy, and uncomfortable.
(3) Printing will be made easier, because you will not need one of the desktop PCs in order to print from the lab printers. (This observation is based on the reasonable assumption that the wireless computers will have printer access.)
(4) WNCs will help individuals better optimize their time in the library.

The next largest number of comments sharing a thought were the 15 (35.7 percent) who liked the idea but thought that wireless access should cover the entire campus. Some people stated this quite strongly, and even questioned why only the library would go wireless.

One comment revealed a gap in the survey that need not have been there. Of those who commented, 11 (26 percent) inquired about Apple computer compatibility. The idea that some would question the compatibility of wi-fi with any OS (operating system) besides Windows had not been considered. (In point of fact, there can be no conflict: wireless technology is not OS specific.)

A total of seven commentators (17 percent) were opposed to having a WNC program in the library, claiming that it was not necessary and that the money could be better spent. (A number of ways of spending that money were suggested, including more desktop PCs.) The most curious thing about the argument that "going wireless" is not fiscally responsible is that it stands in stark contrast to one of the major reasons given in all types of literature and on the internet for doing just that, namely that, in most circumstances, a WLAN costs much less to install than wiring for an equal number of LAN-networked PCs.

An impressive 20 people (48 percent) favored the idea but worried about issues such as cost and security. Of the latter, respondents mentioned both physical security, such as the danger of theft or damage, and data security, that is, safety from computer viruses and hackers. At least two commentators made clear their understanding that wireless networks cannot be made as secure as wired networks. These reservations suggested that, at the very least, library staff should be prepared to answer questions about security and take measures, in coordination with the Campus Technology Department and Administration, to ensure it.

Concluding discussion
One conclusion virtually begs to be drawn from this short survey: the students of Monmouth University were, at the time of the survey, "ready and waiting" for wireless internet access. Though some expressed reservations, which the library should address, most were neither surprised nor too excited about the prospect. Overwhelmingly, the respondents believed that this program should be for the benefit of the students, and they assumed that it would be a helpful addition to the technologies now available to them. With this degree of cognitive preparedness, along with the measurable amount of "computer savvy" evident, perhaps the greatest concern the library should have is in meeting the expectations and demand for these WNCs.

The authors strongly recommended that the library – and the University – be prepared in the event that the program is a great success. Lack of preparedness for success, so to speak, could result in great disappointment in the library and damage to its image. It is also anticipated that, as the WNCs on hand become frequently and continuously occupied, the demand will grow for access via the students' own notebook computers. And students will most certainly continue to want access from anywhere on campus. The potential here is for the library to realize its perennial goal of becoming ubiquitous. Slightly short of that, wireless access, complementing off-campus access, will make the library more accessible to, and, the authors believe, far more often accessed by, students, faculty, and all members of the University's community.

References
Dugan, R.E. (2001), "Managing laptops and the wireless network at the Mildred F. Sawyer Library", The Journal of Academic Librarianship, Vol. 27 No. 4, pp. 295-8.
Green, K.C. (2003), The 2002 National Survey of Information Technology in US Higher Education: Campus Portals Made Progress; Technology Budgets Suffer Significant Cuts, available at: www.campuscomputing.net/pdf/2002-CCP.pdf (accessed June 14, 2003). Note that this is not the 2002 Campus Computing Report itself, but an extended précis.
Pitkin, P. (2001), "Wireless technology in the library: the RIT experience: overview of the project", Bulletin of the Society for Information Science and Technology, Vol. 21 No. 3, pp. 10-16.

Further reading
The Journal (2000), "Colorado State University expedites student research with library-wide wireless network", The Journal, September, available at: www.thejournal.com/magazine/vault/A3017.cfm?kw=842
LaGeese, D. (2003), "A PC choice: dorm or quad?", US News & World Report, Vol. 134 No. 15, p. 64.

Appendix. Terminology used
Wi-fi: short for wireless fidelity. The term Wi-Fi was created by the Wi-Fi Alliance, an organization that oversees tests that certify product interoperability.
Wireless fidelity (WiFi): a term for certain types of local area network that transmit and receive over certain radio frequencies. For a good, moderately technical explanation, with links to greater detail, see http://searchmobilecomputing.techtarget.com/sDefinition/0,sid40_gci838865,00.html
WLAN (Wireless Local Area Network): radio signals are used instead of co-axial, twisted-strand, or other cable to connect servers, client computers, and peripherals in an area that may be as small or as large as what a traditional LAN may cover. Depending on the version of the IEEE 802.11 standard that is used, the speed of a WLAN may be as slow as one-tenth that of a standard LAN, or faster.

OTHER ARTICLE

A statewide metasearch service using OAI-PMH and Z39.50

Received 12 February 2005. Revised 27 March 2005. Accepted 27 March 2005.


Joanne Kaczmarek University Archives, University of Illinois Library at Urbana-Champaign, Urbana, Illinois, USA, and

Chew Chiat Naun
Serials Cataloguing, University of Illinois Library at Urbana-Champaign, Urbana, Illinois, USA

Abstract
Purpose – The purpose of this paper is to describe the Illinois LSTA grant-funded project, "Yellow Brick Roads: Building a Digital Shortcut to Statewide Information". The project investigated the feasibility of unified searching across library holdings, digitization projects, and online state government information through use of the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) in tandem with the Z39.50 protocol, through application of the Z39.50/OAI Gateway Profile.
Design/methodology/approach – The project proceeded through the construction of a metasearch service model based on the Z39.50/OAI Gateway Profile. Technical obstacles encountered during the construction of this demonstration service were noted, as were potential solutions. The evaluation of the Z39.50/OAI harvesting component of the Gateway Profile was conducted by means of a questionnaire of vendors providing Z39.50 servers to the Illinois Regional Library Systems.
Findings – The established technology platform provided by the University of Illinois Open Archives Initiative (OAI) Metadata Harvesting Project proved to be adequate to data sets of this size and character. However, the project concluded that the Z39.50/OAI Gateway Profile could not be deployed because of limitations in the functionality of typical Z-servers.
Research limitations/implications – The project concentrated on the technical aspects of building such a service model rather than on the usability of the interface or on questions of interoperability at the metadata level, such as to what extent the vocabularies used by the different metadata communities were compatible.
Originality/value – The project's findings indicate that more labor-intensive, or less timely, processes of aggregating records than that envisaged by the Z39.50/OAI approach will continue to be necessary. However, further investigation of hybrid approaches holds promise.
Keywords Z39.50, Information retrieval, Government
Paper type Research paper

Library Hi Tech Vol. 23 No. 4, 2005, pp. 576-586 © Emerald Group Publishing Limited 0737-8831 DOI 10.1108/07378830510636355

Introduction
State and local libraries continue to seek ways to provide convenient, reliable, web-based search interfaces to library resources alongside reliable state government information. These efforts are aimed at supporting the public's general information needs. To provide this type of service, solutions must be developed that do not overburden library support staff. In an effort to test feasible solutions, the Library of the University of Illinois at Urbana-Champaign was awarded an Illinois Library Services and Technology Act (LSTA) grant entitled "Yellow Brick Roads: Building a Digital Shortcut to Statewide Information"[1].

The aim of the project was to build an easy-to-use search service model using automatic record aggregating and indexing technologies to provide simultaneous access to statewide library holdings records, state-funded digitization project files, and state government web site content. Work on the project proceeded over a seven-month period beginning in December 2002[2]. In this initial phase, the project concentrated on the technical aspects of building such a service model rather than on the design of the interface or on questions of interoperability at the metadata level, such as to what extent the vocabularies used by the different metadata communities were compatible.

This project was inspired by two separate factors. The first was an interest on the part of statewide librarians in providing a reliable and convenient method for anyone to search online statewide government information along with library holdings. The second was the promising outcome of the University of Illinois at Urbana-Champaign Open Archives Initiative (OAI) Metadata Harvesting Project, one of the first projects to explore the extensibility of the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH)[3]. Extending the technologies used in the OAI Metadata Harvesting Project[4] could provide support for developing easy simultaneous access methods to multiple library catalogues and statewide government information.

The primary goal of the Yellow Brick Roads project was to test the extensibility of the established technology platform provided by the OAI Metadata Harvesting Project, indexing over 11 million records representing statewide library holdings, state-funded digitization project files, and state government website content. This goal was achieved. A secondary goal was to develop software tools such as ZMARCO (http://zmarco.sourceforge.net/README.html) and to determine the feasibility of using it to automatically gather machine-readable cataloguing (MARC) records from the 12 Illinois Regional Library Systems (RLS). ZMARCO acts as a data provider, allowing MARC records available through a Z39.50 server to be made available via the OAI-PMH. The results of a survey of the Z39.50 vendors supporting the Illinois Regional Library Systems indicated there were too many variables within the Z39.50 implementations to allow for seamless automatic harvesting of MARC records using the OAI-PMH/Z39.50 model[5]. Current common technological practices do not render automatic collection aggregation of MARC records, library digitization project files, and statewide government web site content feasible. Even if developments in technology make these automatic processes possible in the future, the usefulness of such an aggregation has not been clearly established. Possible "hybrid" harvesting approaches should be explored further, and extensive user studies should be conducted, prior to deploying search interfaces to heterogeneous library collections.

Background
Libraries aspire to promote freedom of access, supporting access to a broad extent of resources for every imaginable information-seeking pursuit. With the introduction of Online Public Access Catalogues (OPACs), Inter-Library Loan services (ILL), and broadcast searching and open linking technologies, relatively easy access to library resources across geographic and disciplinary boundaries has become an expected norm for library users. The introduction of the internet into the information landscape expanded library user expectations.
Many users now expect the full text of any type of publicly available information resource to be readily available via the web through a Google-style single-box search interface.


The recognition of the importance of digital technologies and information access in the context of libraries led to the 1994 announcement of $24.4 million in US federal funds in support of digital libraries initiative (DLI-1) research (www.dli2.nsf.gov/dlione/). This work spurred further research and demonstration projects throughout the ensuing decade, with more focus on the practical applications of digital technologies. Projects like the Colorado Digitization Program (CDP) (www.cdpheritage.org/) tested available technologies for digitization and developed methodologies for providing online access to digital information resources. Other projects, like the Mellon-funded Open Archives Initiative for Metadata Harvesting Protocol projects, explored the development and extensibility of interoperability standards. As one of the seven Mellon-funded projects, the University of Illinois at Urbana-Champaign (UIUC) developed the Open Archives Initiative Metadata Harvesting Repository (http://nergal.grainger.uiuc.edu/cgi/b/bib/bib-idx). The UIUC repository was designed to provide access to a broad range of heterogeneous records representing cultural heritage collections from across the world.

Methodology

Record procurement and processing

The "Yellow Brick Roads" project initially obtained static batches of data from three discrete record sets without regard for what harvesting mechanism would be used to update the records at a later time. These three sets included: Illinois State Library MARC records, representing the holdings of most member libraries of the twelve Illinois Regional Library Systems; the Illinois Digital Archive (IDA), representing documents, pictures, and artifacts from state-funded digitization projects completed by various libraries and historical societies throughout the state; and a variety of content from Illinois State Government web site crawls completed by the Preserving Electronic Publications (PEP) (www.isrl.uiuc.edu/pep/) research project at the University of Illinois' Graduate School of Library and Information Science (GSLIS). All records underwent post-harvesting processing to convert them into a standard format prior to indexing with XPAT, a search engine for structured text from the University of Michigan, in use at the University of Illinois for various projects including the original OAI Metadata Harvesting Project.

The initial MARC records were obtained through ftp transfer. They were provided to the project team by a commercial vendor contracted by the Illinois State Library to consolidate quarterly MARC updates from the regional libraries and process them for submission to the Online Computer Library Center (OCLC). These records were provided as contiguous files that we converted to XML files in the Dublin Core format (http://dublincore.org/) in preparation for indexing by XPAT (see Figure 1). We extracted 10,290,463 records directly from 47 binary files. We removed approximately 15,000 records that were determined either to be probable duplicates, based on ISBN and ISSN matches, or to contain illegal XML characters in the 900 field. Three basic steps were taken to process the MARC records prior to indexing them. We created a program to extract records from the binary files and construct MARCXML files for each record; this step excluded the duplicate records. We then ran a program to transform the MARCXML records into the Dublin Core format; this step excluded the records whose 900 field contained illegal XML characters.
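The project's own extraction and transformation programs are not reproduced in the article. The following is only a minimal sketch of the first two steps, using the pymarc library purely as a stand-in for the project's custom binary-MARC extraction code, a deliberately simplified Dublin Core mapping, and hypothetical field choices:

```python
# Hypothetical sketch of steps 1 and 2 (extraction and Dublin Core conversion);
# pymarc stands in for the project's own programs, which are not published.
import re
import xml.etree.ElementTree as ET
from pymarc import MARCReader

ILLEGAL_XML = re.compile('[\x00-\x08\x0b\x0c\x0e-\x1f]')  # control characters

def first_subfield(record, tag, code):
    """Return the first occurrence of a subfield, or None."""
    for field in record.get_fields(tag):
        values = field.get_subfields(code)
        if values:
            return values[0]
    return None

def to_simple_dc(record):
    """Map a few MARC fields onto a simplified Dublin Core-style record."""
    dc = ET.Element('record')
    mapping = {'title': ('245', 'a'), 'creator': ('100', 'a'),
               'publisher': ('260', 'b'), 'identifier': ('020', 'a')}
    for dc_name, (tag, code) in mapping.items():
        value = first_subfield(record, tag, code)
        if value:
            ET.SubElement(dc, dc_name).text = value
    return ET.tostring(dc, encoding='unicode')

def convert_file(path, seen_ids):
    """Yield simplified DC XML, skipping probable duplicates and bad 900 fields."""
    with open(path, 'rb') as handle:
        for record in MARCReader(handle):
            if record is None:               # unparsable record
                continue
            key = first_subfield(record, '020', 'a') or first_subfield(record, '022', 'a')
            if key and key in seen_ids:      # probable duplicate (ISBN/ISSN match)
                continue
            if key:
                seen_ids.add(key)
            nine_hundred = ' '.join(f.format_field() for f in record.get_fields('900'))
            if ILLEGAL_XML.search(nine_hundred):
                continue                     # drop records with illegal XML characters
            yield to_simple_dc(record)
```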
We then created one index for each regional library system, with several of these indexes needing to be broken up even further due to an efficiency limitation of XPAT that allowed no more than approximately 2 gigabytes per index.


Figure 1. MARC records processing

The project programmer took approximately 50 hours to process and index the MARC records.

The IDA records were obtained through ftp transfer. They were provided to the project team by the technical support manager of the Illinois Digital Archive and represent the smallest of the three datasets at 1,438 records. These records required very little post-harvest processing time prior to indexing, approximately three hours, because they were already in the OAI-PMH-compliant Dublin Core XML format[6].

Illinois state agency homepage web sites and links from those pages were used to represent statewide government information. The 47,103 URLs and the content of the <meta> tags from state government web sites were provided to the project team by the PEP Project Director, Larry Jackson. As expected, the information from the <meta> tags was too sparse to be considered adequate in representing the content displayed on the associated web pages and could not be used alone as a record of the statewide government information. For example, we found that in some cases the same generic metadata would be used in every page on a web site, probably as a result of the use of a template. In such cases the lack of granularity in the metadata would make it difficult to retrieve pages containing specific information. To enhance these records, the project team decided that other significant content could be found by pulling content from other tags. We created our own web crawler and deployed it on the state agency URLs. Our web crawler captured the content of the <title> tag and the <a href> links for each page. The resulting, more robust, records were transformed into OAI-compliant Dublin Core XML format. In the interest of exploring other possible enhancements to the state government information records, we also deployed a web crawler provided by the Visualization and Virtual Environments Group (VIAS) at the National Center for Supercomputing Applications (NCSA) (http://vias.ncsa.uiuc.edu/). The results of the VIAS web crawl provided a topical summary of the web pages based on an automated analysis of the full text and pulled out personal names and names of organizations.
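The project's crawler itself is not reproduced in the article; the following standard-library sketch only illustrates the general idea of capturing the <title> text and outgoing links for one page, from which richer records could then be built. Names and the fetching strategy are assumptions for illustration.

```python
# Illustrative sketch only: capture <title> text and link targets for a page
# using Python's standard library; the project's real crawler is not shown.
from html.parser import HTMLParser
from urllib.request import urlopen

class TitleLinkParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ''
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == 'title':
            self.in_title = True
        elif tag == 'a':
            href = dict(attrs).get('href')
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == 'title':
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data.strip()

def crawl_page(url):
    """Return a minimal record (url, title, links) for one state agency page."""
    parser = TitleLinkParser()
    with urlopen(url) as response:
        parser.feed(response.read().decode('utf-8', errors='replace'))
    return {'url': url, 'title': parser.title, 'links': parser.links}
```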


Figure 2. State government web site records processing

We then indexed the web site records three different ways:
(1) with only the content from the project team web crawl;
(2) with only the content from the VIAS web crawl;
(3) with content merged from both web crawls.
We chose to use the content merged from both web crawls and used the original records provided by GSLIS solely as an authority file of URLs (see Figure 2).

OAI-PMH/Z39.50 gateway profile

The Z39.50 protocol is an ANSI/NISO standard for communication protocols defining how bibliographic databases will communicate with one another for the purposes of enabling information retrieval. The protocol is most commonly used as a means of querying library catalogues, either for public access purposes or for retrieval of cataloguing data by library staff. It is also used for querying article databases, often for the purpose of downloading citation data into bibliographic management software such as EndNote. The protocol is perhaps most strongly associated with its use in federated searching across multiple bibliographic databases, both in its current form and in the Search and Retrieve Web Service and Search and Retrieve URL Service (SRW/SRU) (www.loc.gov/z3950/agency/zing/srw/) protocols which have more recently been derived from it.

Z39.50 is prima facie attractive for harvesting applications because it is widely supported by library database management systems and because in most implementations it returns complete MARC records. However, the use of this standard to achieve interoperability can present a challenge because the range of transactions that is definable within the standard is far larger than what is normally possible in any given implementation of it. Different implementations often have functionality that only partially overlaps. Examples of varying functionality include whether or not the retrieval system will support searching on specific fields, or sorting result sets. The way these constraints have been addressed is to create Z39.50 profiles in which the environment of the Z39.50 server and the underlying information retrieval system are configured so as to produce the desired outcome of an information retrieval query. The OAI-PMH/Z39.50 Gateway Profile (http://frasier.library.uiuc.edu/research.htm) was created by the project team member, Tom Habing, for the purpose of creating an appropriate response to OAI-PMH requests layered over the Z39.50 server. The details of this profile were included in questions asked of vendors supporting Z39.50

library systems to establish the feasibility of using OAI-PMH to automatically harvest library bibliographic records as part of the "Yellow Brick Roads" project.

ILMS vendor questionnaire

A questionnaire aimed at integrated library management system (ILMS) vendors was created by the project consultant, William E. Moen, in consultation with UIUC project staff. The questionnaire was designed to determine the feasibility of interfacing the OAI-PMH with the Z39.50 standard and running OAI data provider scripts on top of the twelve Illinois Regional Library Systems' Z39.50 servers. Out of the 12 Regional Library Systems, six separate vendors were represented, of which only four responded to the questionnaire.


Software and hardware configuration

The "Yellow Brick Roads" project technical infrastructure followed that of the University of Illinois OAI Metadata Harvesting Project. The technologies used were the OAI-PMH for harvesting and the University of Michigan's DLXS and XPAT tools for indexing and searching. Initial project plans were to recompile the XPAT software for use on a network of 64-bit processor machines to simulate a powerful distributed environment. We were unable to recompile XPAT and chose instead to test XPAT on a single 32-bit Dell PowerEdge 4600 Intel Xeon server with 12 gigabytes of RAM and two 2.6 GHz processors. This configuration provided very satisfactory performance: search results were returned in 3 to 8 seconds with four simultaneous searchers using the same search string. During the course of the project the search portal was available for one month for an informal user-testing period, resulting in slight modifications to the interface.

Project findings

Feasibility of harvesting records using OAI-PMH and Z39.50

The Z39.50 OAI Server Profile was developed to support harvesting of records using a simple OAI-PMH gateway. This profile describes how a Z39.50 server, along with its associated bibliographic database, could be turned into an OAI-PMH data provider by putting a gateway on top of the Z39.50 server that implements OAI-PMH (see Figure 3).

Figure 3. Z39.50 to OAI gateway


The gateway was designed to act simultaneously as a Z39.50 client and an OAI repository, translating OAI requests into Z39.50 requests and packaging the Z39.50 responses into OAI responses. This would require certain characteristics to be present in the underlying data structures and search mechanisms of the Z39.50 server implementations. In particular, it would require a unique identifier for each record, a way to provide a date stamp, and the means to retrieve records according to criteria specified in terms of these data.

To assess the feasibility of using OAI-PMH with Z39.50 servers to harvest MARC records from the Illinois Regional Library Systems, we distributed a questionnaire to the Illinois Regional Library Systems' ILMS vendors. The respondents indicated that the current capabilities of the ILMS vendors, as well as their Z39.50 server implementations, do not support the parameters specified in the Z39.50 OAI Gateway Profile document. In the questionnaire report summary, Moen (unpub.) notes:

The vendors responding indicated the current functionality of the online catalogues in general does not support the types of search and retrieval needed to harvest records according to the parameters specified by the Z39.50 OAI Gateway Profile. Similarly, the Z-server products currently deployed do not, in general, support the specific Z39.50 requirements in the Z39.50 OAI Gateway Profile document. In the near-term, it seems unlikely that the Z39.50 protocol can be an effective approach for harvesting large quantities of records from the bibliographic databases. Several of the responding vendors suggested that the Z39.50 protocol is not an efficient tool for such bulk harvesting.
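To make the gateway idea concrete, the sketch below shows the kind of translation it implies: the from/until arguments of an OAI-PMH ListRecords request become a Z39.50 search on Bib-1 use attribute 1012 (date/time of last modification), written here in PQF (prefix query) notation. This is an illustration of the idea, not the profile's reference implementation, and whether any given server accepts these attribute combinations is exactly the problem reported below.

```python
# Hedged illustration: map OAI-PMH ListRecords date arguments onto a Z39.50
# query in PQF notation against Bib-1 use attribute 1012 (date of last
# modification). Relation attribute 4 = "greater than or equal",
# relation attribute 2 = "less than or equal".

def listrecords_to_pqf(from_date=None, until_date=None):
    """Translate OAI-PMH from/until arguments into a PQF query string."""
    clauses = []
    if from_date:
        clauses.append('@attr 1=1012 @attr 2=4 "%s"' % from_date)
    if until_date:
        clauses.append('@attr 1=1012 @attr 2=2 "%s"' % until_date)
    if not clauses:
        # No date range: the profile effectively asks for the whole database,
        # approximated here by "modified since the beginning of time".
        return '@attr 1=1012 @attr 2=4 "00000000"'
    if len(clauses) == 1:
        return clauses[0]
    return '@and %s %s' % (clauses[0], clauses[1])

# Example: harvest records changed during the first quarter of 2003.
print(listrecords_to_pqf('20030101', '20030331'))
# @and @attr 1=1012 @attr 2=4 "20030101" @attr 1=1012 @attr 2=2 "20030331"
```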

The results of the questionnaire found that the ILMS systems fell short of supporting OAI-PMH capability via Z39.50 in two ways. First, the required search and retrieval functionality was not wholly present in any of the systems we surveyed. In hindsight, this finding should have come as no surprise. The capabilities of a Z39.50 server are typically built on the underlying search functionality of the target system. In the case of an ILMS this is essentially the search functionality of the OPAC, which supports traditional library catalogue searches such as title, author and subject, and tends to limit result sets to a size that a patron might be expected to be able to deal with. The large-scale database operations performed behind the scenes, for such purposes as reporting and global database maintenance, are typically not exposed via either the OPAC search interface or Z39.50. The types of retrieval required by the Z39.50 OAI Gateway Profile are mainly of the second kind. They include the ability to export all of the records in the database, the ability to sort by record identifiers and system transaction dates, and the ability to filter results by a variety of date criteria. Although attribute values are defined in the Z39.50 standard for these processes, it does not follow that any given system will support them, and in particular not the part of the system designed for library patron use. Therefore, it would require a major development effort for the vendors we surveyed to enhance their systems to support the goals of OAI.

To point out these practical limitations of existing Z39.50 implementations is a criticism neither of the vendors' design decisions nor of the Z39.50 standard itself. It is merely to emphasize that a realistic view of Z39.50 must recognize its role as a communication protocol rather than as a framework for interoperability in any more substantive sense (Lynch, 1997).

The second and more surprising finding of our questionnaire was that the interpretation of MARC control fields was not consistent from one system to another, or indeed necessarily from one implementation of a given system to another.

In order to harvest the records automatically using OAI-PMH it is necessary to have a way to identify each record persistently within a database and to identify all the records that have been added or changed within a given range of dates. Here again, Z39.50 defines use attributes, 12 and 1012 respectively, for searches on those data. But there turns out to be no consistent mapping between these use attributes and MARC tags. That is surprising because MARC defines fields that, on the face of it, carry precisely the required data. MARC 001 is the local database record identifier; 008/00-05 is the date the record was created, and 005 carries the date of latest transaction. But our survey found that some systems retain the 001 or 005 data from the originating database (usually a bibliographic utility such as OCLC) while recording the corresponding data for the local system elsewhere, if at all. It is worth emphasizing that a system that indexed and exposed these data via Z39.50 would still be compliant with the Z39.50 standard, since the protocol says nothing about the mapping of data elements to MARC tags. But the existence of varying implementations of MARC added an unexpected and unwelcome layer of complexity to the project. Moreover, it seems likely that the four vendors represented in our survey do not exhaust the variations that exist in the wider library automation community.

There may be other potentially viable approaches to harvesting besides the one envisaged by the Z39.50 OAI Gateway Profile. A hybrid approach that the project team discussed would involve taking an initial load of MARC records from each system by means of its export function and then harvesting each system periodically using Z39.50 with ZMARCO. This incremental approach would avoid the single heaviest requirement that the Z39.50 OAI Gateway Profile places on complying systems, which is the ability to return the entire contents of the database on demand. But for the reasons we have just mentioned, even this compromise approach is not currently achievable. An alternative model that may be more feasible in the short term, and in certain contexts, is to create a mechanism to extract regionally scoped subsets of records at regular intervals from a central source like OCLC. However, this approach would be philosophically quite different from the distributed, more generalized approach of OAI-PMH and was beyond the focus of this project.

Deduplicating records

An ongoing challenge when merging records from various sources is how to remove the visual clutter created by duplicate records representing the holdings of multiple libraries. This is a familiar problem faced by any library trying to manage its own database, but the problem gets out of hand very quickly when one tries to represent the accumulated holdings of a dozen or more libraries. If a patron is searching an aggregated database for a copy of Alice in Wonderland, she would be best served by viewing just one instance of the record representing the book – or at least each distinct edition of the book – followed by a link to the holdings of each of the libraries that has a copy of the book (Tillett, 2001). Ideally, the link would be live, enabling the patron to see the current availability of each copy. This process is called deduplication and requires finding matching points within the MARC records for each instance of the resource.
Unfortunately, it is difficult to find reliable match points in MARC records. The ISBN generally works well when present, but many records do not have ISBNs.


Moreover, its use can lead to unintended matches in some cases, as when the same book is catalogued both as a separate title and as part of a set. A more reliable match point is the OCLC number, but our experiences in this study showed that even this identifier can be problematical. The most obvious difficulty is that not all catalogue records are OCLC-derived, even in the catalogues of libraries that use OCLC as their primary source of cataloguing. But even records that are OCLC-derived cannot always be deduplicated against each other. Depending on the ILMS and the workflow in use, the OCLC number may not always be stored in the local copy of the MARC record, and it may not always be in the same field or even the same format. For example, we saw some records that had an alphabetical prefix in front of the OCLC number and others that lacked it. In the record sets we examined the major problem was the absence of OCLC numbers from a substantial percentage of the records, whether because they were not sourced from OCLC or because the numbers were not retained. Had we attempted to perform deduplication, we could probably have achieved reasonably good results with OCLC number and ISBN matching, but there would still have been quite a large number of duplicates in this data set.

In the end, we did not attempt to deduplicate, but chose instead to sidestep the problem by creating separate databases for each of the Regional Library Systems and allowing them to be searched simultaneously. We also considered, at least in principle, abandoning the attempt to construct databases for the MARC-derived records and providing a Z39.50 client pointing to the Regional Library Systems instead. Whichever of these approaches is taken, the problem of duplicates arises only when searching more than one of the databases at the same time, and the problem is manageable as long as only a few databases are being searched. But clearly it would still be best if it were possible to combine duplicate records in some manner. However, because the date stamp associated with a record may not be consistently coded from one database to another, it may not be possible to determine which of several copies of the same record is the best or latest. Deduplicating records also implies being able to merge certain types of data, such as holdings, but it is often difficult to tell when data is redundant and when it is not.

The problem of how to deduplicate records is closely related to the problem of how to harvest them. The key to both tasks is to be able to match records and to identify the most current version of each one. The difficulties that we encountered in the course of this project, taken together with the results of our questionnaire of ILMS vendors, lead us to conclude that the implementation of MARC control data such as record identifiers and system dates is not sufficiently standardized across the various ILMS platforms and their individual implementations to support automatic harvesting through the OAI-PMH. That remains so whether the harvesting is attempted using a Z39.50 connection or other means. Because of this lack of consistency of implementation, the matching, deduplication and updating of MARC records continue to work best within the confines of a single database or consortial network of databases, where uniform standards and procedures can be enforced, and very specific matching and merging algorithms can be developed.
The database in question can be that of a single library all the way up to that of a massive centralized utility like OCLC.
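Deduplication was not implemented in this project; the sketch below only illustrates the kind of match-key normalization the preceding discussion implies. The assumed input (a dict with already extracted 'oclc' and 'isbn' values) and the prefix conventions are hypothetical, and the variability of those conventions across systems is precisely the obstacle described above.

```python
# Hypothetical match-key logic: prefer a normalized OCLC number, fall back to
# a normalized ISBN, and give up (leave the record unmatched) otherwise.
import re

def normalize_oclc(value):
    """Strip alphabetic prefixes such as 'ocm' or '(OCoLC)' and leading zeros."""
    if not value:
        return None
    digits = re.sub(r'\D', '', value)
    return digits.lstrip('0') or None

def normalize_isbn(value):
    """Keep digits and a possible trailing X; ignore qualifiers like '(pbk.)'."""
    if not value:
        return None
    cleaned = re.sub(r'[^0-9Xx]', '', value.split('(')[0]).upper()
    return cleaned or None

def match_key(record):
    """Return ('oclc', n) or ('isbn', n), or None if no usable match point.

    `record` is assumed to be a dict of already-extracted values, e.g.
    {'oclc': '(OCoLC)ocm00012345', 'isbn': '0451526562 (pbk.)'}.
    """
    oclc = normalize_oclc(record.get('oclc'))
    if oclc:
        return ('oclc', oclc)
    isbn = normalize_isbn(record.get('isbn'))
    if isbn:
        return ('isbn', isbn)
    return None
```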

Summary and further research

While the technique of harvesting directly from Z39.50 servers using OAI-PMH to obtain MARC records seemed to be an elegant solution in principle, developing relationships with the caretakers of these records and arranging for static harvests of records via FTP tools proved a more practical approach to procuring the records. We believe that arranging regular batch processing of all data sources is a reasonable, if less than optimal, process for getting data updates. In the case of bibliographic records, we would recommend that the records be linked to each library's holdings database so that a patron may conveniently check the availability of desired items.

Our project encountered a number of commonly known problems associated with aggregations of heterogeneous records. These include problems of duplication, sparse metadata, and markup inconsistencies. Such problems continue to limit the usefulness of aggregations of heterogeneous records to the average users of library online resources. These limitations are compounded by search interfaces that are created during the deployment of test projects to serve as technical utilities, without consideration of the search experience of the users for whom these systems are supposedly being built. Running a robust and thorough user test on the search interface was beyond the scope of this project but will require consideration as future research continues to explore better ways of representing heterogeneous collections of data. Further research should also explore a hybrid approach to heterogeneous repositories that include MARC records. Sending a search query simultaneously as a request to an aggregator, such as an OAI aggregator, and as a federated query to Z39.50 servers may prove to be a sensible alternative approach to universal searching.

The key to success for libraries intending to provide their users with easy-to-use, seamless access to a broad range of materials requires two foci. The first is the thoughtful application of available technologies based on an informed understanding of existing data and available standards. These efforts need to be carefully documented. The second is a focus on the interface of returned results, based upon the community for whom the system is being developed and the information needs and preferences they express. This project began to address the first component of this "formula for success", but we cannot overstate the need to give substantial effort and attention to the second component.

Notes
1. LSTA grant # LSTA-03-0202-1082. Details of the grant program are available at: www.imls.gov/grants/library/lib_gsla.asp
2. In addition to the authors, the project team included co-leader Tom Habing and programmers Yuping Tseng, Qibo Zhu and Qiang Zhao. Larry Jackson and project consultant William E. Moen (2001) also made significant contributions to the project.
3. Open Archives Initiative for Metadata Harvesting Protocol: http://openarchives.org
4. University of Illinois OAI Metadata Harvesting Project: www.oai.grainger.uiuc.edu
5. Z39.50 document: http://lcweb.loc.gov/z3950/agency/Z39-50-2003.pdf
6. Implementation Guidelines for the Open Archives Initiative Protocol for Metadata Harvesting: www.openarchives.org/OAI/2.0/guidelines.htm


References
Lynch, C. (1997), "The Z39.50 Information Retrieval Standard, Part I: a strategic view of its past, present and future", D-Lib Magazine, April, available at: www.dlib.org/dlib/april97/04lynch.html
Moen, W.E. (2001), "Interoperability and Z39.50 profiles: the Bath and US national profiles for library applications", ALCTS Newsletter, Vol. 12 No. 4.
Tillett, B. (2001), "Bibliographic relationships", in Bean, C.A. and Green, R. (Eds), Relationships in the Organization of Knowledge, Kluwer, Dordrecht, pp. 19-35.

Further reading
Shreeves, S.L., Kaczmarek, J.S. and Cole, T.W. (2003), "Harvesting cultural heritage metadata using the OAI protocol", Library Hi Tech, Vol. 21 No. 2, pp. 159-69.


OTHER ARTICLE

Similar interest clustering and partial back-propagation-based recommendation in digital library
Kai Gao, Yong-Cheng Wang and Zhi-Qi Wang
Department of Computer Science and Engineering, Shanghai Jiao Tong University, Shanghai, China


Received 13 July 2004
Revised 28 March 2005
Accepted 27 April 2005

Abstract
Purpose – The purpose of this paper is to propose a recommendation approach for information retrieval.
Design/methodology/approach – Relevant results are presented on the basis of a novel data structure named FPT-tree, which is used to get common interests. Then, the data are trained by using a partial back-propagation neural network. The learning is guided by users' click behaviors.
Findings – Experimental results have shown the effectiveness of the approach.
Originality/value – The approach attempts to integrate a metric of interests (e.g. click behavior, ranking) into the strategy of the recommendation system. Relevant results are first presented on the basis of a novel data structure named FPT-tree, and then those results are trained through a partial back-propagation neural network. The learning is guided by users' click behaviors.
Keywords Information retrieval, Knowledge mining, Digital libraries
Paper type Research paper

Introduction

As a result of the boom in the use of the web and its exponential growth, obtaining desired information can be challenging. According to some reports in 2005, 86.6 percent of people on the Internet use search engines (CNNIC, 2005). At the same time, users still find it hard to satisfy their needs. Although many strategies have been tried according to the literature (e.g., Duchamp, 1999; Nanopoulos, 2003), users still seem to be unsatisfied with the performance, and the satisfaction ratio for search engines is only 28.4 percent (CNNIC, 2005). The main reason is that many results are irrelevant because these systems cannot take individual interests into account. This triggers an ongoing need for efficient retrieval strategies.

Recommender systems offer a strategy for taking individual interests into account. A recommender system is defined as one that adapts knowledge gained from a user's navigational behavior and returns relevant results. Nowadays, systems that can provide personal recommendations have gained a lot of interest. Some online stores, e.g. Amazon.com, use approaches based on history and customer ratings to recommend relevant results to customers.

We think both query statements and click behaviors partially reflect the interests of users. On one hand, two queries containing some similar terms may denote similar needs, but users with diverse interests want to retrieve different information in

Library Hi Tech, Vol. 23 No. 4, 2005, pp. 587-597. © Emerald Group Publishing Limited, 0737-8831. DOI 10.1108/07378830510636364


response to the same query terms. For example, "bikini" can mean a "swimming suit", a "beauty queen" or even an "atomic bomb experiment". This is particularly the case for short queries. On the other hand, click behavior is noisy because users may sometimes click on irrelevant results. The goal is "4-R Service": the Right information serves the Right person at the Right time in the Right way. Some information retrieval systems add the ability to customize results by means of recommender systems. In this paper, we propose a two-step approach:
(1) In order to analyze common interests in a reasonable way, a novel data structure named FPT-tree is proposed to identify common interests among users.
(2) The results are trained by means of a neural network that takes into account the web pages that users have accessed and their common interests. The connecting-weights are adjusted by means of a learning algorithm.
Through analyzing the experimental results, we can conclude that the system performs well. Although the experimental platform is a stock data set, this algorithm has broader digital library implications: not all digital library information is text-based, and stock prices represent one realistic form of digital library information that is composed of both complex text and numeric elements.

The remaining sections are organized as follows. In the next section, related work in this field is briefly described. The section after that focuses on how to obtain common interests among users. This is followed by a discussion of how to train the neural network to make recommendations. The penultimate section presents experimental results and analysis. Finally, conclusions are drawn in the final section.

Related work

Recommendations have been used in many fields, such as recommending products or identifying web pages that will be of interest. Recommender systems advise users on relevant products and information by predicting a user's interest in a product, based on various types of information such as users' past purchases and product features (Huang et al., 2002). Schafer et al. (2001) list six categories covering most current recommender systems: raw retrieval, manual selection, statistical summarization, attribute-based, item-to-item correlation, and user-to-user correlation.

Related work has been presented in the literature. In Wen et al. (2002), the authors describe a new query clustering method that makes use of user logs. By implicitly borrowing strength from related queries, the authors in Hansen and Shiver (2001) propose a mixture model and allow users to identify more highly relevant URLs for each query cluster. In Chuang et al. (2002), the authors organize query terms into a hierarchical structure and construct a query taxonomy by means of a clustering algorithm to group similar queries. In Xu et al. (1996), the authors show pseudo-relevance feedback reformulating queries by using top-ranked documents. Besides, some pattern-growth algorithms, i.e. PrefixSpan (Han et al., 2004) and gSpan (Wang et al., 2004), can be used to mine frequent patterns efficiently. Unlike traditional methods based on the anti-monotone property, Chiu et al. (2004) proposed a new strategy called DISC, which prunes the non-frequent sequences according to the other sequences with the same length. Huang et al. (2002) used a graph-based recommender

system that naturally combines the content-based and collaborative approaches. Here a Hopfield algorithm is used to exploit high-degree "book-book", "user-user" and "book-user" associations in a digital library. Dong et al. (2003) propose a competition-based neural classification algorithm, which combines the advantages of adaptive resonance theory and the competitive neural network. In contrast to these previous works, this paper presents a novel data structure, which combines the benefits of the FP-tree and users' interests to identify common interests among users. A partial back-propagation training algorithm is presented, and learning is guided by users' click behaviors.

How to obtain common interests?

In this section, we propose an approach for obtaining common interests. The goal is to get the frequent patterns (i.e. popular pages) within a group. Those frequent patterns can be used as common interests for recommendation. We use a novel data structure named Frequent-Pattern-Time tree (FPT-tree) to do so. This data structure, derived from the Frequent-Pattern tree structure used in frequent pattern mining, can find popular pages effectively. An FPT-tree is a connected graph, which can be denoted as a triplet T(R, N, L), where R is the root node and N is a set of frequently browsed documents, each of which can be denoted as a quaternion (document label, frequentness, linger-time, score). L is the linking for graph traversal.

Three steps are needed. In the first step, the data are scanned to find frequent pages by comparing them with a given min_threshold. Then, the data can be arranged in descending order according to individuals' click behaviors. The benefit is that, as with the Frequent-Pattern tree, "more frequently occurring pages are more likely to be shared and thus they are arranged closer to the top of the tree". As a result, they can be shared more often. Last, the ordered frequent pages are inserted into the FPT-tree one by one, in multiple layers. Suppose we have sorted the click streams by individuals in descending order. Each node in the structure consists of five fields: item-name, frequentness, linger-time, score, and node-link. In detail, item-name registers a document and frequentness means the number of documents represented by the portion of the path reaching this node. As for linger-time, it is a factor to measure the user's interest. The reason is that if the user pays more attention to a web page, it may take him or her more time to browse. So linger-time can partly reflect a page's degree of importance to the user. But users could also linger due to complexity, poor writing, or boredom leading to a coffee break. How do we solve this problem and determine the linger-time for a "perfect" page? On the basis of the experimental platform shown later, we find that over half of users have durations of less than one minute, and two-thirds have durations of less than three minutes. So, if the linger-time is longer than 20 minutes, the corresponding page is dropped. Score is a node's value of interest, which can be calculated according to formula (1):

\mathrm{Score} = \frac{1}{\text{layer number}} \times \left( \sqrt{\text{frequentness}} + \sqrt{\text{linger-time}} \right) \qquad (1)

In order to compare the difference between the FP-tree and the FPT-tree, we use the same example shown in Pei (2002) as an illustration. Table I lists every user's click stream. By using the min_threshold, a descending-order frequent pattern can be obtained by omitting the non-frequent items.


In Figure 1, each node's number after the first "," indicates the frequentness and each node's number after the second "," indicates the total linger-time on this site. If two users click a document in the same layer, the linger-time is the sum. The number beside each node is its score according to formula (1). For every user, the ordered frequent items can be inserted into one branch of the FPT-tree. The detail is shown in the pseudo-code (see Figure 1). Given the threshold min_score (an experimental value), it is easy to get the common interests. For example, if the min_score is 2, the result set is {F, A, B, C, P} (see Figure 2).

Recommendation

We suppose that users within a group (i.e. domain) have some similar interests. In the stock prototype, we organize the data into three groups (i.e. finance news, annual report, and stock comments). Recommendation can be done by means of training. The training is based on a neural network, which takes in the user's personal interests and the common interests. Of course, an effective training approach is needed here.

A three-layer neural network is used for recommendation. It consists of an input layer, a hidden layer (documents layer) and an output layer (evaluations layer) (see Figure 3). The nodes of the neural network take the input and use the weights together with a Sigmoid function to compute the output values. Input layer elements connect to the documents (hidden layer), which are the unions of the web pages a user has accessed and the common interests within a group obtained from the FPT-tree. On one hand, the actual output takes input from the hidden nodes and combines the previous layer's node values into a single one. On the other hand, the target output represents the user's real evaluation, i.e. clicking behavior.

Learning is accomplished by modifying the weights on the basis of users' feedback, which can be thought of as a supervisor.

Figure 1. FPT-tree generation

Table I. Content groups

User ID   Click stream        Ordered frequent items
1         F,A,C,D,G,I,M,P     F,C,A,M,P
2         A,B,C,F,L,M,O       F,C,A,B,M
3         B,F,H,J,O           F,B
4         B,C,K,S,P           C,B,P
5         A,F,C,E,L,P,M,N     F,C,A,M,P
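The paper's own pseudo-code for FPT-tree generation is given in Figure 1 and is not reproduced here; the following is only one possible interpretation of the node structure, the insertion of the ordered frequent items from Table I, and the score of formula (1). Details such as how per-node linger-times are supplied are assumptions for illustration.

```python
# A minimal, interpretive sketch of the FPT-tree: node fields (item-name,
# frequentness, linger-time, score, node-link), branch insertion, and the
# score of formula (1). Linger-time handling is an assumption.
import math

class FPTNode:
    def __init__(self, item, layer):
        self.item = item            # item-name (document label)
        self.frequentness = 0       # how many click streams share this node
        self.linger_time = 0        # accumulated linger-time
        self.layer = layer          # depth below the root, used in formula (1)
        self.children = {}          # node-links

    def score(self):
        # Formula (1): (sqrt(frequentness) + sqrt(linger-time)) / layer number
        return (math.sqrt(self.frequentness) + math.sqrt(self.linger_time)) / self.layer

def insert_stream(root, ordered_items, linger_times):
    """Insert one user's ordered frequent items into one branch of the tree."""
    node = root
    for item, linger in zip(ordered_items, linger_times):
        child = node.children.get(item)
        if child is None:
            child = FPTNode(item, node.layer + 1)
            node.children[item] = child
        child.frequentness += 1
        child.linger_time += linger
        node = child

def common_interests(root, min_score):
    """Collect item names whose node score reaches the min_score threshold."""
    found, stack = set(), list(root.children.values())
    while stack:
        node = stack.pop()
        if node.score() >= min_score:
            found.add(node.item)
        stack.extend(node.children.values())
    return found

root = FPTNode(item=None, layer=0)   # the root node R
```

With the ordered frequent items of Table I inserted for each user, `common_interests(root, min_score=2)` plays the role of the {F, A, B, C, P} example above; the actual values returned depend on the linger-times shown in Figure 1, which are not reproduced here.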


Figure 2. FPT-tree structure

Figure 3. The architecture of the neural network

The connecting-weights need to be adjusted by comparing the actual output with the target output. The next question is how to determine the connecting-weights. The traditional back-propagation approach works by modifying weight values starting at the output layer and then moving backward through the hidden layers. In this paper, the algorithm is modified because we only need to adjust some particular weights, not all of them. If a user's evaluation is far away from the recommended result, i.e. the error between the target output and the actual output is larger than the given threshold, it is necessary to adjust the connecting-weights of those documents (in the hidden layer) that have connections from the user. So we name this algorithm partial back-propagation. That is to say, we need not adjust all weights from the input layer to the hidden layer.
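The paper's "algorithm 2" is not reproduced in the text, so the following is only a hedged sketch of the partial idea: with a single sigmoid output node, only the weights attached to the documents connected to the current user are adjusted, using the delta rule whose variables (η, T, O_k, f′(x_k), W_jk) are described later in the text. Function and parameter names are illustrative.

```python
# Sketch of the "partial" idea: untouched document->output weights are left
# alone; only weights for the current user's documents are adjusted via the
# delta rule with a sigmoid output node.
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def partial_update(weights, hidden_outputs, active_docs, target, eta=0.1):
    """Adjust only the document->output weights for documents the user touched.

    weights:        dict mapping document id -> weight to the output node (W_jk)
    hidden_outputs: dict mapping document id -> that hidden node's output O_j
    active_docs:    documents connected to this user (accessed or common interest)
    target:         the user's real evaluation T, scaled into [0, 1]
    """
    x_k = sum(weights[d] * hidden_outputs[d] for d in weights)
    o_k = sigmoid(x_k)                 # computed output O_k
    error = target - o_k               # (T - O_k)
    f_prime = o_k * (1.0 - o_k)        # derivative of the sigmoid at x_k
    for d in active_docs:              # partial: only these weights change
        weights[d] += eta * error * f_prime * hidden_outputs[d]
    return o_k
```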


How to measure those results? We think that the majority of users pay most attention to the "top-10" results (the reason is shown in the following section, Table II). If a user has browsed a page at position x (x ≤ 10), we can weight the connecting-weights by using (10 − x). On the basis of the experiments, we fetch the "top-10" retrieved results to calculate users' interests, because most users usually pay attention mainly to those "top-10" results. The function shown in formula (2) values a document: the document in the mth session is visited at the xth position, and n is the number of sessions over which the sum is calculated.

\text{Interest-value}(\text{document}) = \sum_{m=1}^{n} \mathit{rank}_m(x) \qquad (2)

If a user browses a web page at the xth position, that visit contributes (10 − x) to rank_m(x). For example, if session m's data in the interval t ∈ [T_m, T_{m+1}] are {doc1, doc3, doc2, doc1, doc5, doc3, doc1, doc2, doc1, doc5}, then rank_m(x)'s value for document1 = (10 − 1) + (10 − 4) + (10 − 7) + (10 − 9) = 19. If session m+1's data in the interval t ∈ [T_{m+1}, T_{m+2}] are {doc2, doc3, doc2, doc1, doc5, doc1, doc2, doc6, doc7, doc3}, then rank_{m+1}(x)'s value for document1 = (10 − 4) + (10 − 6) = 10. So value(document1) = 29. Because the target must fall in the interval [0, 1] during training, the method shown in formula (3) is used to transform the data into this range (Roiger, 2003). For example, if the set of the documents' evaluations is {29, 16, 32, 17, 28, 50, 30, 18, 26, 10}, then the target values are {0.48, 0.15, 0.55, 0.18, 0.45, 1, 0.5, 0.2, 0.4, 0} respectively.

\text{new value} = \frac{\text{original value} - \text{minimum value}}{\text{maximum value} - \text{minimum value}} \qquad (3)
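As a check on the arithmetic, the short sketch below transcribes formulas (2) and (3) directly and reproduces the document1 example (19 + 10 = 29) and the scaled target values; only the function names are ours.

```python
# Direct transcription of formulas (2) and (3), checked against the worked
# document1 example in the text.

def session_rank(clicks, doc):
    """rank_m(x): each occurrence of doc at 1-based position x in the top-10 adds (10 - x)."""
    return sum(10 - position
               for position, d in enumerate(clicks[:10], start=1) if d == doc)

def interest_value(sessions, doc):
    """Formula (2): sum of rank_m(x) over the n sessions considered."""
    return sum(session_rank(clicks, doc) for clicks in sessions)

def min_max_scale(values):
    """Formula (3): scale the evaluations into [0, 1] before training."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

session_m = ['doc1', 'doc3', 'doc2', 'doc1', 'doc5',
             'doc3', 'doc1', 'doc2', 'doc1', 'doc5']
session_m1 = ['doc2', 'doc3', 'doc2', 'doc1', 'doc5',
              'doc1', 'doc2', 'doc6', 'doc7', 'doc3']
print(interest_value([session_m, session_m1], 'doc1'))  # 29, as in the text
print(min_max_scale([29, 16, 32, 17, 28, 50, 30, 18, 26, 10]))
# [0.475, 0.15, 0.55, 0.175, 0.45, 1.0, 0.5, 0.2, 0.4, 0.0]; the paper rounds 0.475 and 0.175
```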

Besides, the Delta rule is used here to adjust the weights (Roiger, 2003). The goal is to minimize the sum of errors between the computed and target output. In Figure 4, η is the learning rate parameter and T is the target output. O_k is the computed output at node k and (T − O_k) is the actual output error. f′(x_k) is the first-order derivative of the Sigmoid function and W_jk is the weight associated with the link between node j and output node k. More details are shown in algorithm 2.

Experimental results and analysis

Experimental platform

In this section we present a prototype to test performance. Real stock information, rather than a synthetic dataset, is used as the data set. Using multi-threading, relevant stock data can be crawled from relevant sites. The prototype is implemented in VC++ on Microsoft Windows 2000 Server with a 1 GHz CPU and 256 MB of main memory for information crawling, and in Java for indexing and searching. The interface is implemented on top of the BEA WebLogic web service platform. An agent is used to log users' navigation behavior.

Table II. The top number of document views per query

Number    Proportion (%)
≤1        About 15
2–15      About 80
≥16       About 5


Figure 4. Partial back-propagation

The log records the number of relevant results that a user chooses to view for each query. Users' activities include query statements and documents viewed. The collections cover authoritative stock web sites across China. Here, we categorize information into three main categories: finance news, annual report, and stock comments. On every trading day, approximately 100, 1,000 and 600 pieces of non-replicated information can be obtained from finance news, annual reports and stock comments respectively. In order to get the group information, a field is used in the log file to identify the category a page belongs to. Figure 5 shows the interface for the "Stock comments" retrieved results.

Based on the experimental platform, three main experiments are shown below. The first is to analyze user query statements; this can give us some cues to determine the experimental values of some thresholds. The second concerns the performance of the partial back-propagation algorithm and the third concerns the comparative performance.

Analyzing users' query statements

In adjusting the connecting-weights of the neural network, we fetch the "top-10" retrieved results to calculate the interests according to formula (2), because most users usually pay more attention to those "top-10" results. Increasing the number does not lead to significant improvements. More details are shown in Table II.

Performance of the algorithm

A comparison between the traditional and partial back-propagation is shown in Figures 6 and 7, where the X-axis represents the training time. From the comparison we can see that the traditional algorithm needs more time, while partial back-propagation's convergence speed is faster, because the training time in Figure 7


Figure 5. The interface of the prototype

Figure 6. Traditional BP

(X-axis) is less than that in Figure 6. The reason is that we only need to adjust part of the connecting-weights. The F1 metric balances both recall and precision. In order to measure the "top-N" recommendation, we show the performance in Figure 8. It can be seen that a reasonable recommendation scope can enhance the performance; "top-10" is a better choice in our prototype.
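For reference, the paper does not spell out its F1 definition; the metric is conventionally the harmonic mean of precision and recall:

F_1 = \frac{2 \times \text{precision} \times \text{recall}}{\text{precision} + \text{recall}}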


Figure 7. Partial BP

Figure 8. Effectiveness

Table III. Comparative results

Approach                  Precision (%)   Recall (%)
Content-based             18.3            38.1
Collaborative approach    6               20.7
Hybrid approach           7               34.4
Partial BP approach       16.46           37.93


Comparative performance

To compare the performance, we conducted an experiment to compare our system with the method shown in Huang et al. (2002), where three types of links in a graph model were traversed to find books that had a strong association with the customer in a digital library. Here, we randomly select 10 users' behaviors from the log file within the finance news group. History data, including a list of web pages and the corresponding linger-times, are recorded in the log file. For each user, the recent data are used as the target recommendation scope while the earlier data are used to generate the recommendation. The results are listed in Table III.

The authors in Huang et al. (2002) used a Hopfield algorithm to exploit high-degree associations between books and customers. A content-based approach was used for "book-to-book" associations, a collaborative approach for "customer-to-customer" associations, and a hybrid approach used association weights and purchase history in the digital library. We use precision and recall as measures of effectiveness. The comparative results are shown in Table III. From the table one can conclude that the performance is not very high. One reason is that there is a gap between actual and target output. But the approach we propose here can narrow the gap based on a user's actual clicks. Besides, its convergence speed is faster.

Conclusions

In this paper, we propose a recommendation approach based on common interests and a neural network. The approach attempts to integrate a metric of interests (e.g., click behavior, ranking) into the strategy of recommendation. Relevant results are first presented on the basis of a novel data structure named FPT-tree, and then those relevant results are trained through a neural network. The learning is guided by users' click behaviors. Through analyzing the experimental results, we can conclude that this approach, on the basis of common interests and the neural network, can recommend relevant results effectively. Although the experimental platform is a stock data set, we think this algorithm can also be useful for numerical and text-based digital libraries.

References
Chiu, D., Wu, Y. and Chen, A.L.P. (2004), "An efficient algorithm for mining frequent sequences by a new strategy without support counting", Proceedings of the 20th International Conference on Data Engineering (ICDE2004), Boston, MA, 30 March-2 April.
CNNIC (2005), available at: www.cnnic.net.cn/index/0E/00/11/index.htm
Dong, Y. and Zhuang, Y. (2003), "Web log mining on a novel competitive neural network", Journal of Computer Research and Development, Vol. 40 No. 5.
Duchamp, D. (1999), "Prefetching hyperlinks", Proceedings of Usenix Symposium on Internet Technologies and Systems, Usenix 1999, Monterey, CA, June 6-11, pp. 127-38.
Han, J., Pei, J. and Yan, X. (2004), "From sequential pattern mining to structured pattern mining: a pattern-growth approach", Journal of Computer Science and Technology, Vol. 19 No. 3, pp. 257-79.
Hansen, M. and Shiver, E. (2001), "Using navigation data to improve IR functions in the context of web search", Proceedings of the ACM International Conference on Information and Knowledge Management, Atlanta, GA, November 5-10.

Huang, Z., Chung, W., Ong, T. and Chen, H. (2002), "A graph-based recommender system for digital library", Proceedings of ACM/IEEE Joint Conference on Digital Libraries, Portland, OR, June 14-18.
Nanopoulos, A., Katsaros, D. and Manolopoulos, Y. (2003), "A data mining algorithm for generalized web prefetching", IEEE Transactions on Knowledge and Data Engineering, Vol. 15 No. 3, pp. 1155-69.
Pei, J. (2002), "Pattern-growth methods for frequent pattern mining", PhD thesis, Simon Fraser University, Burnaby.
Roiger, R.J. and Geatz, M.W. (2003), Data Mining: A Tutorial-based Primer, Pearson Education Asia Limited and Tsinghua University Press, Hong Kong, pp. 246-64.
Schafer, B., Konstan, J.A. and Riedl, J. (2001), "E-commerce recommendation application", Data Mining and Knowledge Discovery, Vol. 5 No. 1-2, pp. 115-53.
Wang, C., Hong, M., Pei, J., Zhou, H., Wang, W. and Shi, B. (2004), "Efficient pattern-growth methods for frequent tree pattern mining", Proceedings of the Eighth Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD2004), Sydney, 26-28 May.
Wen, J., Nie, J. and Zhang, H. (2002), "Query clustering using user logs", ACM Transactions on Internet Systems, Vol. 20 No. 1, pp. 59-81.
Xu, J. and Bruce Croft, W. (1996), "Query expansion using local and global document analysis", Proceedings of 19th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Zurich, August 18-22, pp. 4-11.

Further reading
Chuang, S. and Chien, L. (2002), "Towards automatic generation of query taxonomy: a hierarchical query clustering approach", Proceedings of International Conference on Data Mining, Arlington, VA, April 11-13.
Duda, R., Hart, P. and Stork, D. (2003), Pattern Classification, 2nd ed., John Wiley & Sons, New York, NY, pp. 230-69.
Eirinaki, M. and Vazirgiannis, M. (2003), "Web mining for web personalization", ACM Transactions on Internet Technology, Vol. 3 No. 1, pp. 1-27.
Jones, S., Cunningham, S.J., McNab, R. and Boddie, S. (2000), "A transaction log analysis of a digital library", International Journal on Digital Libraries, Vol. 3 No. 1, pp. 152-69.
Sarwar, B.M., Karypis, G., Konstan, J.A. and Riedl, J.T. (2000), "Application of dimensionality reduction in recommender system - a case study", Proceedings of WEBKDD Workshop at the ACM SIGKDD, Boston, MA, August 20.
Sarwar, B., Konstan, J., Borchers, A., Herlocker, J., Miller, B. and Riedl, J. (1998), "Using filtering agents to improve prediction quality in the GroupLens Research collaborative filtering system", Proceedings of the ACM Conference on Computer Supported Cooperative Work, Seattle, WA, November 14-18.



OTHER ARTICLE

Lessons learned from analyzing library database usage data


Karen A. Coombs
University of Houston Libraries, Houston, Texas, USA

Received 4 December 2004 Revised 8 April 2005 Accepted 23 May 2005

Abstract
Purpose – The purpose of this paper is to describe a project undertaken at SUNY Cortland to develop a system that would collect electronic resource usage data in a consistent manner and allow SUNY Cortland to assess this data over several years.
Design/methodology/approach – The project used data gathered from EZProxy server log files to examine usage of the library's electronic resources.
Findings – Through examining the usage data the library discovered that users were utilizing particular types of resources, from specific physical locations, and accessing those resources from specific pages in the library's web site.
Originality/value – By examining usage data for electronic resources, libraries can learn more than which resources are being used. Usage data can give libraries insight into where, when, how, and possibly why their users are accessing electronic resources.
Keywords Online databases, Library users, Collections management
Paper type Case study

Library Hi Tech, Vol. 23 No. 4, 2005, pp. 598-609. © Emerald Group Publishing Limited, 0737-8831. DOI 10.1108/07378830510636373

Year after year, the subscription costs of electronic resources increase, while library acquisition budgets experience only a small increase or remain static. Thus, libraries face difficult decisions and must develop methodologies to select and deselect electronic resources each year. Traditionally, libraries evaluate resources based on two different methods: collection-centered and use-centered (Evans and Zarnowsky, 2000). Usage data play an important role in both of these methods. Since most database vendors provide monthly database usage statistics for libraries, many libraries rely on this information to make collection development decisions regarding electronic resources.

There are pitfalls with this methodology. Usually, vendor usage statistics only show a count of how many searches were done in the database in any given month. Further exacerbating the differences in data reporting, libraries utilize a variety of different vendors to provide them with access to databases. Memorial Library at SUNY Cortland subscribes to 70 databases via 15 database vendors, creating a number of problems.

First, each vendor sends the library its own set of usage statistics. To perform cross-database usage comparisons, the library must merge these statistics into a single form each month. For example, Memorial Library receives statistics from more than six different sources, including FirstSearch, GaleGroup, Ebsco, Lexis-Nexis, ABC-CLIO, and CABI Publishing. While merging these statistics together is possible, it is often a difficult and time-consuming process. There are several reasons for this, including the fact that each vendor formats its data differently. For example, GaleGroup offers libraries data that can be emailed in comma-delimited format; other vendors such as FirstSearch

provide merely a web-based report that is not downloadable, forcing libraries to cut and paste data for comparisons.

Second, libraries can have multiple accounts with the same vendor, creating confusion about which data should be used. This is the case for Memorial Library, which has three accounts with FirstSearch. Each account is slightly different from the others. However, some databases are duplicated in these accounts and as a result the library chooses to turn off access to certain databases in some accounts. For example, the ERIC database is available in two of our FirstSearch accounts. However, in one account the database is present as part of a package but is pay-per-search, meaning each individual search costs the library money. Because this database is one of the five most frequently used, the library pays, in another account, for an annual subscription that allows unlimited searching. This duplication of databases in FirstSearch accounts creates confusion when collecting usage statistics. In fact, some usage data can be completely overlooked if the librarian analyzing the data is unaware that multiple accounts with a vendor exist.

Lastly, libraries can have access to resources via a consortium and have no direct access to the vendor usage statistics. As part of the SUNYConnect project (for further information see: www.sunyconnect.suny.edu/), Memorial Library has access to Elsevier's Science Direct database. While Memorial Library receives monthly usage statistics for Science Direct as part of SUNYConnect, the library is unable to generate individualized reports reflecting the local usage of the database. Moreover, consortial data rarely separate use by individual institution, providing only aggregate sums of use.

An additional issue is that the individual library cannot know with certainty that each database vendor is counting usage in the same way. Examining the variety of usage statistics offered to libraries by the various vendors reveals that there are two ways of counting usage, each with its own problems and implications for analysis. Usage is usually counted by either sessions or searches. This makes cross-database comparison an especially complex process. In a typical search, a search string is sent to the database, which tells it exactly what to search for and how to search for it. For example, a search might be keyword: cats and keyword: Egypt; in this search we are looking for records containing the word cats and the word Egypt. There is a problem with using search data to make cross-database comparisons. Depending on the skill level of the user and the usability of the database, the number of searches it takes the user to find the information he or she needs can vary. Because full-text databases search the full text of articles, both the likelihood that the user will retrieve results and the number of those results are increased. In contrast, specialized databases such as ERIC use a controlled vocabulary that increases the difficulty of searching. Users need to employ this controlled vocabulary to find information matching their needs. This phenomenon can be seen in the variation in the ratio of queries per session for individual databases reported by Blecic, Fiscella and Wiberley in their 2001 College and Research Libraries article (Blecic et al., 2001). This means that all the search usage numbers are potentially inflated and, more importantly, some databases' search numbers can be more inflated than others.
Because of this potential inflation of usage data, it is inaccurate to compare database usage using the number of searches. A more effective way to measure usage is to count sessions. On the most basic level, a session is an interaction between a user and a database at a given time to fulfill an information need. There can be several searches in a given session; the searches within a session all relate to the same information need, but each search uses a different query to obtain the needed information.


Counting sessions eliminates the problem of inflated usage statistics. However, there is some variation in the method of counting the number of sessions. One commonly used definition of a session is activity generated from a given Internet Protocol (IP) address. Activity from that IP address is considered part of the same session until there is at least a 20-minute break; return visits from the same IP address after the 20-minute gap are assigned to a new session. One problem with this strategy occurs when visitors arrive at the library gateway through a proxy server: during the course of one browsing session, a visitor may arrive from multiple proxy server IP addresses, depending on which server was assigned to manage each request. Another definition of a session is the connection of a single instance of a browser. This type of information may be stored in a server's web logs, but usually requires the use of cookies to mark the communications from each browser instance. Since users must authenticate to use the electronic databases, some vendors may use cookies, while others rely only on IP addresses. Therefore, merging the different vendors' data into a single file may not provide an accurate comparison.

The issues of data comparability and consistency might be solved if all database vendors agreed to count usage in the same manner and stored usage reports in a central location. However, the problems with vendor usage data are even more pervasive. The vendor usage data are completely detached from all other information about the databases and the environment in which they are being used. Vendor usage statistics provide no information on the following questions: whether the databases are being used on or off campus; how long a database is used by a particular user; what page the user utilized to access the database; how the database is acquired (consortium fee or individual subscription); the perceived value of the database to specific subject areas; the database cost; and the type of database (full-text, abstract, citation). By examining all of these issues, it becomes clear that usage data provided through the database vendors are not robust enough to meet a library's needs.

Consequently, during the summer of 2002, SUNY Cortland undertook a project to implement a system to track the usage of the library's databases. The goals of the project were to gather comparable statistics on how often, when, and where the databases were being used; to create a way to maintain several years of comparable database usage data; to examine database coverage in relationship to SUNY Cortland's academic subjects; and to learn more about our users' searching behavior (how, when, where, and what they search). After some discussion and thought, the Information Technology Librarian designed a plan that was a unique blend of technologies and circumstances. Like most libraries, SUNY Cortland needs to proxy off-campus users' access to library databases. Cortland uses a SUNY-wide licensed product called EZ-Proxy to accomplish this task. In performing the proxy task, EZ-Proxy generates a log file that includes the user's IP address, the date/time, the URL being accessed, and the referring page. Sample log entry:

24.169.73.151 - - [10/Apr/2003:06:39:13 -0500] "GET http://libproxy.cortland.edu:80/login?url =

By using the information in these log files, individual database usage can be counted.
Duy and Vaughn (2003) highlight many of these problems. They describe North Carolina State University's homegrown approach to gathering basic usage data for their electronic resources. Based on data obtained from this method, Duy and Vaughn determined that "for most vendors, the library's internally gathered data show similar patterns to vendor collected usage data" (Duy and Vaughn, 2003, p. 19). They also point out that the strength of this method is that data is collected consistently for all products and is therefore comparable.
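To make the log-based counting concrete, the following is a minimal Python sketch of the kind of parsing step such a project requires. It is illustrative only and does not reproduce the library's actual processing program: the combined-log-format layout, the field names, and the completed sample request line (including the placeholder vendor URL) are assumptions, since the article shows only a truncated log entry.

import re
from datetime import datetime

# Assumed combined-log-format layout; the article shows only a truncated entry.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ '                              # client IP, ident, user
    r'\[(?P<time>[^\]]+)\] '                             # e.g. 10/Apr/2003:06:39:13 -0500
    r'"(?P<method>\S+) (?P<url>\S+)[^"]*" '              # request line
    r'(?P<status>\d{3}) (?P<bytes>\S+)'                  # status code and size
    r'(?: "(?P<referrer>[^"]*)" "(?P<agent>[^"]*)")?'    # referring page and user agent
)

def parse_line(line):
    # Return the client IP, timestamp, requested URL and referring page,
    # or None if the line does not match the assumed format.
    m = LOG_PATTERN.match(line)
    if not m:
        return None
    return {
        "ip": m.group("ip"),
        "timestamp": datetime.strptime(m.group("time"), "%d/%b/%Y:%H:%M:%S %z"),
        "url": m.group("url"),
        "referrer": m.group("referrer") or "-",
    }

# Hypothetical complete entry modeled on the truncated sample quoted above;
# the vendor URL is a placeholder, not a real Cortland database address.
sample = ('24.169.73.151 - - [10/Apr/2003:06:39:13 -0500] '
          '"GET http://libproxy.cortland.edu:80/login?url=http://vendor.example.com/db HTTP/1.0" '
          '200 1234 "http://library.cortland.edu/full-text_db.asp" "Mozilla/4.0"')
print(parse_line(sample))

Each parsed record supplies exactly the four pieces of information the project relies on: who (IP), when (timestamp), which database (the url parameter passed to the proxy), and from where on the site the user came (the referrer).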


Methodology
While this article was helpful, there are several differences between the North Carolina State project and the project undertaken at SUNY Cortland. First, the North Carolina State project utilized web server logs to count database usage. Duy and Vaughn note that this created an issue when users visited individual e-journals that were part of larger packages: these visits were not counted as part of the usage data because users utilized an e-journal list or the library's catalog to visit these resources, bypassing NCSU's counting system completely. At Cortland, all requests for e-journals or any other type of electronic resource, regardless of where the user comes from (catalog, web site, or SFX), are passed through the library's proxy server. This leaves one way in which users can enter a database without being counted: by using a bookmark for the database or e-journal. Duy and Vaughn (2003) also note this problem. However, this is a very small group of users. At SUNY Cortland, most users are passed through the library's proxy server, which keeps a record of every database requested. Users from campus IP addresses are automatically authenticated, while users accessing resources from other IP addresses are required to log on to connect to the databases. Off-campus users are assigned a session ID that is passed, much like a cookie, back and forth between the browser and the proxy server, allowing us to track them even if their IP address changes during their session. For both on- and off-campus use, the information is reduced to a connection event and the databases requested during that connection. The problem with outside proxy servers is resolved by comparing the session ID, which is unique for each instance of a browser and is not specifically tied to an IP address. Using this counting methodology assures that sessions are being counted in the same manner across the databases whose usage is being compared.

Second, the North Carolina project was undertaken to see if comparable usage data could be obtained by the library itself and to determine whether that data was reliable and in line with the data obtained from the database vendors. The SUNY Cortland project not only wanted to collect comparable quantitative usage data, but also sought to examine this data in the much broader context of where and how resources were being used. In addition to the basic usage data, other usage information was further analyzed. SUNY Cortland has a switched local area network, which is divided into subnets. Creating a subnet divides a network into smaller pieces that are individually addressed. As a result, only a particular range of IP addresses comes from a given subnet. For example, the library subnet is 137.123.152.0/22. Typically, on Cortland's campus each building has its own subnet. This means that the log files can be examined to see where on campus (what building) users are accessing the databases. For our purposes, we have chosen to divide the campus buildings into three categories: library, academic, and residence halls. Based on the IP address, we are able to determine if the usage is taking place in the library, an academic building, or a residence hall.
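A short sketch of how these two counting ideas might be operationalized is given below: hits are grouped into sessions using a 20-minute inactivity gap per visitor key (an IP address on campus, a proxy session ID off campus), and each session is assigned a campus location from the client IP. Only the library subnet (137.123.152.0/22) is taken from the text; the academic and residence-hall ranges, and all function names, are illustrative assumptions rather than Cortland's actual configuration.

from datetime import timedelta
from ipaddress import ip_address, ip_network

# Only the library subnet comes from the article; the other ranges are placeholders.
LOCATIONS = [
    ("library", ip_network("137.123.152.0/22")),
    ("academic", ip_network("137.123.0.0/18")),          # placeholder range
    ("residence hall", ip_network("137.123.192.0/18")),  # placeholder range
]

def location_for(visitor_key):
    # On-campus hits are keyed by IP address; off-campus hits by a proxy session ID.
    try:
        addr = ip_address(visitor_key)
    except ValueError:
        return "off-campus"
    for name, net in LOCATIONS:
        if addr in net:
            return name
    return "off-campus"

def count_sessions(hits, gap=timedelta(minutes=20)):
    # hits: iterable of (visitor_key, timestamp, database) tuples.
    # A new session for a visitor/database pair starts after a 20-minute gap.
    # Returns {(database, location): number_of_sessions}.
    last_seen = {}
    counts = {}
    for key, ts, db in sorted(hits, key=lambda h: h[1]):
        previous = last_seen.get((key, db))
        if previous is None or ts - previous > gap:
            loc = location_for(key)
            counts[(db, loc)] = counts.get((db, loc), 0) + 1
        last_seen[(key, db)] = ts
    return counts

Because the same gap rule is applied to every database, the resulting session counts are directly comparable across vendors, which is exactly the property the vendor-supplied statistics lack.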


Another piece of the log file that is extremely useful to Cortland is the referrer. The referrer tells you what page the user was on when they chose to go to the URL that is logged. In the log entry below, the user was on the page at the address http://library.cortland.edu/full-text_db.asp (SUNY Cortland's list of full-text databases) when they chose to go to the Expanded Academic database:

http://libproxy.cortland.edu/login?url=http://infotrac.galegroup.com/itweb/sunycort_main?db=EAIM

This data is extremely useful because it gives us information about how users think about accessing the databases: by title, by subject, or by the fact that the database contains full-text. More importantly, when our users utilize a subject list of databases to access a database, we are able to tell which subject page they were using. From this we can infer what subject they believed corresponded to their information need. This inference is possible because all of the SUNY Cortland library web pages that contain links to databases are database-driven. This means that the referrer contains a subject ID for the subject page the user was on when they selected the database. Based on the subject ID, we are able to infer the subject. This is very helpful for analyzing and comparing full-text database usage. Most of our full-text databases have general subject coverage. By examining the full-text database sessions that originate from subject pages, it can be determined how much each subject area uses a particular full-text database and how many subject areas find a particular full-text database useful.

Currently, the pages that contain links to and information about the library's databases are database-driven. As a result, we have a variety of information on our databases stored in a Microsoft SQL Server database, including the database name, description, coverage, how often it is updated, whether it contains full-text, the subjects associated with it, and the vendor associated with it. This information can then be coordinated with the usage statistics, resulting in the ability to analyze usage statistics in relation to the database subject and to the database content (full-text or not). This was extremely helpful in creating the different types of reports needed to understand database usage.

One of the most important portions of this project was to create a data archive for all the usage statistics. The EZProcess database is not meant to be used as a data archive; it is merely a processing program for the log files that creates text reports. Therefore, a more permanent data archive was needed. There are two tables in the EZProcess database that store all of the usage data: tblSubjectUse and tblDatabaseActivity. To archive this data, once the log files are processed, data from the EZProcess database is transferred into the SQL Server database that houses the rest of the database information. By placing the usage statistics in the same database as the electronic resources information, a variety of reports can be run. In addition, data from multiple years can be compared.
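The referrer-based analysis described above can be illustrated with a small sketch like the following. The full-text list URL is the one quoted in the text; the subject-page path, its subject-ID query parameter, and the title-list path are hypothetical stand-ins, since the article does not name the real page addresses or parameter names.

from urllib.parse import urlparse, parse_qs

def classify_referrer(referrer):
    # Returns (access_path, subject_id); subject_id is set only for subject pages.
    if not referrer or referrer == "-":
        return ("direct or bookmark", None)
    parsed = urlparse(referrer)
    path = parsed.path.lower()
    query = parse_qs(parsed.query)
    if path.endswith("full-text_db.asp"):       # the full-text list quoted in the text
        return ("full-text list", None)
    if "subject" in path:                       # hypothetical subject-page path
        subject_id = query.get("subjectID", [None])[0]   # hypothetical parameter name
        return ("subject list", subject_id)
    if "database" in path or "a-z" in path:     # hypothetical title (A-Z) list path
        return ("title list", None)
    return ("other", None)

print(classify_referrer("http://library.cortland.edu/full-text_db.asp"))
# -> ('full-text list', None)

Counts of these categories, joined against the local SQL Server table of database descriptions, are what make it possible to report not just how much a database was used but how users chose it.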
Implications and further study
This project has been collecting information on database usage for approximately two years. Over the course of those two years we have learned a number of interesting things from the data collected. The types of things our library has learned from the database usage statistics can be divided into three topics: what databases users are accessing; how users are accessing databases; and where users are accessing databases from. In terms of what users are accessing, the library received several surprises when examining the data. First, 46 percent of the library's total annual database usage came from five databases: WilsonSelect Plus, Expanded Academic, ERIC, Lexis-Nexis Academic Universe, and Health Reference Center.

All of these databases, with the exception of ERIC, are full-text databases. It was surprising to the Memorial Library faculty to realize that such a small percentage of the 73 online databases accounted for so much of the use. Second, full-text databases have higher overall usage than databases with no full-text. Of the library's total annual database usage (95,390), 50 percent was full-text database use. However, upon further examination, it was determined that there was a dramatic difference in the usage of the various full-text databases. Gale Group's Expanded Academic database was used extensively, while other full-text databases of equal quality, such as Ebsco's MasterFile Select and Elsevier's Science Direct, were used much less. This raised the question of why some full-text databases are preferred over others.

An examination of the relevant literature reveals several possible factors in why users might choose one database over another. First, the information architecture of the library's site may contribute to why users choose one database over another. A 2002 study by Dietrich, Gordon, and Wexler suggests that there is a relationship between the number of links on a web page and the ability of users to search efficiently (Dietrich et al., 2002). Furthermore, several usability studies of library web sites indicate that navigation menus and labeling can affect library users' ability to locate information and the links that users choose to click on. In particular, McGillis and Toms (2001) noted that users "had trouble choosing from the list of menu options and differentiating from the possible choices". Perhaps the library's menus lead users to some full-text databases more often than others. However, further usability testing is needed to determine the effect of the site's information architecture on which databases users choose.

Second, the usability of the database itself might also explain why some full-text databases are used more than others. However, users need to have past experience with the database for this factor to come into play. Furthermore, if the usability of the database were a significant factor, one would expect SUNY Cortland's data to show that the most frequently used databases share the same interface. Yet the two most used full-text databases at SUNY Cortland have different interfaces, suggesting that interface alone does not determine why users choose a given database.

A third possible factor in why users choose a particular full-text database is familiarity. In a 2004 study, Holliday and Li noted that students "often stuck with the familiar and what had been recommended by a teacher or peer or demonstrated by a librarian" (Holliday and Li, 2004, p. 364). This seems to be the case at SUNY Cortland as well. The library teaches basic information literacy classes to first-year students during their second-semester Academic Writing classes. As part of these classes, students are introduced to a select group of library databases. In particular, two full-text databases are highlighted for students: Expanded Academic and WilsonSelect Plus. These two databases are the most frequently used full-text databases. This would suggest a correlation between the usage of databases and the databases being taught as part of the library's information literacy sessions.

Based on the data gathered, the library could also determine how users accessed databases. There are a limited number of paths a user can take to enter a database via the library's web site.
Therefore, when users access a database they select it based on the database's name, its subject, the journals and magazines it contains, or the fact that it contains full-text. Using the referrer, we determined how users were selecting databases. Initially, we expected that students would access the databases by the subject of the material the databases contained. Interestingly, upon examining the data this turned out not to be the case.


The data collected revealed that students were accessing the databases primarily by database name. In light of the findings of Holliday and Li, this would suggest that students select databases by name because they have been introduced to particular databases through information literacy sessions, reference encounters, or their professors. However, it would be advisable to conduct a similar study at SUNY Cortland in order to confirm these findings.

The question of how users are accessing the library's databases became even more important when the library implemented the SFX system in January 2004. SFX (for further information see: www.exlibrisgroup.com/sfx.htm) is a link server that uses OpenURL (for further information see: www.dlib.org/dlib/july01/vandesompel/07vandesompel.html) to connect a citation for an article in one database to the full text of the article in another database. This makes SFX database use different from traditional database use in that the user is not going into the database to search for a particular topic, but rather to obtain the full text of an article. Thus far, the library has not sought to subdivide this SFX type of database usage. However, this is something the library will try to implement for the 2004-2005 year. This information will help the library determine how often users are reaching a given database because of SFX. It is expected that the implementation of SFX will dramatically change the way in which the library's electronic resources are used, since the use of full-text databases has increased. Perhaps full-text databases that were not being used prior to the implementation of SFX will experience more usage due to its implementation. Preliminary data gathered during the spring of 2004 suggest that this is the case: use of both the MasterFile Select and ScienceDirect databases increased dramatically from spring 2003 to spring 2004 (see Appendices 1 and 2, Figures A1 and A2).

Upon examining the statistics regarding where databases were used, the librarians were pleasantly surprised to discover how heavily users accessed the databases within the library. Initially, based on statistics from the reference area, it was thought that database usage inside the library would be extremely low. The library averages approximately 148 questions at the reference desk a month. However, when the usage data were examined, it was determined that 43 percent of database usage took place in the library. This indicates an average of 2,829 database sessions per month in the library, a number significantly larger than our reference encounters. This information created a host of questions about student library research behavior. How could students be using databases but not asking reference questions or using computers in the library's reference area? One possible explanation is the fact that there are three computer labs available to students in the library. Students accessing databases from the computers in these labs, rather than from computers in the library's reference area, could account for the higher than expected in-library database usage. However, this led to the question of why students use the computers in the labs instead of those in the reference area. During the time when most of this data was collected there were two major differences between the lab computers and those in the reference area. First, the lab machines were newer and faster than those in the library's reference area.
Second, the lab machines had productivity software (Microsoft Office) installed. It seemed that students wanted to use the fastest computers so they could research and write at the same time. As a result, the lab computers were more attractive to students than the reference computers. Reference staff only noticed students using the reference computers during high-traffic times of the year, when the other computer labs were full.

Based on the data gathered and this conclusion, the library administration decided to fund the replacement of more than half the computers in the reference area with newer machines. Since these computers were upgraded in January 2004, the library staff have noted an increase in the overall traffic in the reference area, with the average number of directional questions asked at the reference desk increasing in the spring of 2004. However, the average number of reference questions continues to decline, creating more questions than answers for our library and suggesting that more in-depth user studies need to be conducted.

Conclusion
There are many aspects of user behavior that a library can learn about by gathering and analyzing database usage statistics. Counting database usage is a simplistic answer to what is ultimately a complex question: how do users access the library's resources? Viewing the data in different ways can provide librarians and library administrators with richer knowledge of how electronic resources are currently being used. The SUNY Cortland project has demonstrated that there are a vast number of benefits to individual libraries gathering database usage statistics. First and foremost, the project has proved that libraries can gather comparable statistics about database usage rather than relying on vendor data. However, locally gathered usage data have many other advantages. Libraries gain the capacity to better link usage data with local information about each database. Second, locally gathered usage data allow libraries to segment usage by location: off or on campus, and then potentially by building type. Lastly, this methodology gives libraries the ability to examine the path users take to reach databases. As the project at SUNY Cortland has shown, the data gathered from this process can provide libraries with essential decision-making and planning tools. Ultimately, this information can be used to enhance the experience for library users.

References
Blecic, D., Fiscella, J. and Wiberley, S. Jr (2001), "The measurement of use of web-based information resources: an early look at vendor-supplied data", College & Research Libraries, Vol. 62 No. 5, pp. 434-53.
Dietrich, J., Gordon, K. and Wexler, M. (2002), Effects of Link Arrangement on Search Efficiency, available at: www.otal.umd.edu/SHORE/bs09/index.html (accessed May 22, 2005).
Duy, J. and Vaughn, L. (2003), "Usage data for electronic resources: a comparison between locally collected and vendor-provided statistics", Journal of Academic Librarianship, Vol. 29 No. 1, pp. 16-22.
Evans, G. and Zarnowsky, M. (2000), "Evaluation", Chapter 15, in Evans, G. (Ed.), Developing Library and Information Center Collections, Libraries Unlimited, Englewood, CO, pp. 429-53.
Holliday, W. and Li, Q. (2004), "Understanding the millennials: updating our knowledge about students", Reference Services Review, Vol. 32 No. 4, pp. 356-66.
McGillis, L. and Toms, E. (2001), "Usability of the academic library web site: implications for design", College & Research Libraries, Vol. 62 No. 4, pp. 355-67.

Appendix 1.
Figure A1. SUNY Cortland Library 2002-2003 database usage statistics

Appendix 2.
Figure A2. SUNY Cortland Library 2003-2004 database usage statistics

OTHER ARTICLE

Received 19 April 2005; revised 26 May 2005; accepted 26 May 2005

Using screen capture software for web site usability and redesign buy-in
Susan Goodwin
Texas A&M University Libraries, Sterling C. Evans Library, Instructional Services Department, College Station, Texas, USA

Abstract
Purpose – The purpose of this article is two-fold: to provide guidance on how to present persuasive web site redesign presentations to library stakeholders; and to introduce screen capture software as an effective and persuasive tool for usability studies, both to record data and to promote redesign recommendations to library stakeholders.
Design/methodology/approach – Includes a review of the literature summarizing effective techniques used to create persuasive presentations and recounts how Camtasia Studio, TechSmith's screen capture software, was employed by the Usability Committee at Texas A&M University Libraries to record usability tests and present the committee's redesign recommendations to library administration and staff.
Findings – Screen capture software in conjunction with effective presentations can have a positive impact on library-wide buy-in for web site redesign initiatives.
Originality/value – Will be of interest to usability committees who want to streamline data recording and distribution techniques, as well as provide colleagues with a more compelling "data-rich" option for the presentation of findings.
Keywords User interfaces, Design, Presentations, Video
Paper type Case study

Introduction
Part one outlines issues usability committees (hereafter, UC) must consider when executing their assigned tasks. Emphasis is placed on achieving organizational buy-in as a key to successful redesign. Buy-in requires vigilance as the committee moves from the preparation and implementation of the web site usability tests to the formulation and presentation of redesign recommendations to the web site committee (hereafter, WC), administration, and/or the organization as a whole. Part two discusses the benefits of using screen capture software as a tool to assist the UC. It outlines how Camtasia aids in subject testing and documentation (including data collection) and expands on how Camtasia can be used to increase the overall effectiveness of the UC's communications regarding its recommendations.

Library Hi Tech, Vol. 23 No. 4, 2005, pp. 610-621, © Emerald Group Publishing Limited, 0737-8831, DOI 10.1108/07378830510636382

Part one
Buy-in: the UC's prime consideration

Much has been written on how to run effective UCs, and much of the focus in the literature revolves around the process of implementing good studies with effective testing measures. In addition to such "process" considerations, committees should execute their assigned tasks in such a way as to ensure the maximum possible buy-in from the organization. Buy-in, within the context of web site redesign initiatives as defined here, refers to the level of understanding and support given by individuals within the organization for the recommended changes to the web site. Without both broad support and understanding, the UC's redesign efforts may be delayed or undermined. However, garnering staff buy-in does not mean the UC must cater to the whims of any or all interested parties, especially since these may simply reflect idiosyncratic needs and biases. Instead, the goal ought to be to conduct usability tests with end-users in mind, to ensure that redesign reflects "objective" end-user data and is in accordance with the organization's broader goals and mission, i.e. objectives towards which everyone in the organization is working.

Internal communication responsibilities: the key to successful redesign initiatives
When communicating with staff and colleagues about redesign initiatives, do not overlook the need to state the obvious: that user-centered redesign initiatives begin and end with the end-user in mind. While this may seem like a given, it can sometimes get lost in the emotion of the moment as people debate the finer details of change x or y on the web site. Take the time to educate the organization and the UC (even if it is a refresher) on the basic tenets of usability. Dray (1994) reminds us that "technology designed by and with the user has a higher probability of success and of meeting the actual business objective". With a user-centered design it will be easier to get buy-in from staff because both the committee and staff can rally behind the shared goal of providing relevant customer service. As Thomson (1997) indicates, "change focused on customer satisfaction aims to empower employees and build a shared commitment to success".

Also, communicate as early and as often as is realistically possible. Share your activities with staff and seek their feedback. Do this as you move through the process: from defining your objectives, to formulating your tasks for the study, and then on through the presentation of your findings and recommendations for change. Be sure to communicate in such a way that you reach a wide audience. This may involve the need for multiple meetings with repetitive messages and multiple methods of communication (i.e. e-mail updates, committee documentation on the library's server or intranet, brown-bag meetings, etc.). It is also strategically wise to anticipate the possibility that every decision made by the UC will at some time or another be contested. "If it doesn't go to trial with you first, it will definitely be on trial when you take it to the rest of the organization" (Smith, 2003). Be prepared to document all of the UC's activities from start to finish, especially usability test data. Your usability test data is precious. Be sure to devise a way to capture as much of it as you can and in as rich a way as possible. Later, this data will be key to justifying the committee's recommendations. Without valid data it will be impossible to give an accurate assessment of the current web site and reliable, unbiased recommendations for change. Consequently, the absence of valid data will threaten organizational buy-in.
Taking the time to gather the evidence and present it in a comprehensible fashion demonstrates to staff that the committee is doing a thorough job and is not relying simply on the UC’s own design preferences.


Before presenting your findings and redesign recommendations to colleagues, engage in a role-playing exercise to challenge your assumptions. Those not directly involved in the usability testing may express their concerns bluntly. The UC should be prepared to hear comments such as: "Oh yeah? Show me. Prove it" (Smith, 2003). Design recommendations need to be firmly supported by evidence. If the UC has painstakingly based its recommendations on objective data, then it must be sure to make this clear to colleagues. Don't gloss over things in a rush to present your recommendations. And again, be sure to demonstrate the rationale for each recommendation by providing a summary of the evidence so colleagues feel confident that the UC's decisions were informed by the data collected and not the UC's own design preferences.

Effective presentations matter
The UC is charged with two basic tasks: conducting accurate, meaningful usability tests and formulating well-justified recommendations. Equal diligence, however, should also be used in the presentation of findings. The importance of this step may even exceed that of the usability test itself. Why? Because web site redesign ultimately affects the entire organization, and therefore all successful redesign initiatives necessitate organizational support and understanding of why the changes are being made. If you cannot effectively state your case for change, then organizational buy-in will not follow. It is of prime importance to communicate the committee's activities and seek feedback during the design of the usability test itself, and to continue to communicate with staff about the committee's activities once usability testing commences. Bringing people along throughout the whole process ensures that by the time the committee is ready to make redesign recommendations there are no major surprises. Be sure to present these final recommendations formally to all the stakeholders within the library to ensure everyone has a chance to ask questions and provide feedback.

Simple strategies for effective presentations:
(1) Begin by reviewing the methods used in the usability study. Be sure to address the objectives of the test. As mentioned, the ability to anticipate responses before they arise prepares the UC for the job of managing expectations. Laying out the groundwork for the study prior to the presentation of findings better informs your colleagues, who are then in a position to express concerns in a grounded and productive way. A complete description of the study's aims might also help "ease the blow of seeing an unflattering evaluation report" (Marine, 2002).
(2) Develop ways to build a sense of unity during the presentation. According to Thomson (1997), for example, "change that is focused on customer satisfaction aims to empower employees and build a shared commitment to success". Hence, by beginning your presentation with a brief reminder of the organization's mission (e.g., meeting clients' needs) the UC generates a shared sense of purpose and, ultimately, responsibility. In turn, establishing an environment of consensus, however preliminary, will assist the UC as it seeks to present findings and ideas some may perceive as controversial or threatening.
(3) Make sure to include the "why" behind your recommendations. Buy-in requires more than simply communicating information about the changes that you propose. Colleagues also need to understand the rationale on which the changes are based if they are to commit fully to them. An effective presentation of findings and recommendations requires the UC to anticipate (and appreciate) the type and degree of impact redesign will have on colleagues' day-to-day work lives. For example, front-line staff (and their managers) should feel confident about explaining the purposes of web site changes to patrons. Colleagues who do not understand the rationale for changes cannot be asked to promote them. Equally, uncertainty among your colleagues as to how or why to "sell" to patrons web site changes that they themselves did not initiate will threaten the much-needed staff buy-in.
(4) Finally, one of the key ingredients of effective presentations involves knowing your audience. The UC should always take time to consider the "frame of reference" through which colleagues will perceive the data and changes presented. Factors that shape the perception of information include beliefs (e.g., about the adequacy or inadequacy of the previous web site), customs (e.g., practices that reflect "our" vs. "their" way of doing things), personal attitudes, and circumstances (Caricato, 2000). Clearly, different departments within the library will have different "frames of reference". These differences will yield varying concerns about redesign recommendations. Given this, the UC might consider "individualizing" its presentations for different internal groups. Individualized presentations benefit both parties (i.e. the UC and staff groups). The UC can use this method to develop and then promote what might be thought of as a "rationale of best fit", i.e. an explanation that takes into account how particular redesign initiatives impact particular work functions within the organization. By proactively addressing the needs and concerns of function-specific groups within the library, the UC can tailor its presentation to address the customs, attitudes, circumstances, etc., of those whose blessings it seeks.

Things to consider when presenting usability test data:
• Time constraints will limit the quantity of data that can be presented. As the UC works to interpret and summarize data, a premium should be placed upon data which most accurately and effectively illustrates or justifies the UC's recommendations. Less significant, or less contentious, data should be de-emphasized.
• All data is to some degree theory- (i.e. interpretation) dependent. At the same time, a well-designed and executed test yields data that should "speak for itself". The UC should be sensitive to this double imperative. Use simple statements throughout the presentation to show how you preserved the initial transparency (i.e. objectivity) of your data. These statements can also serve to demonstrate a clear conceptual link between your findings and your proposed changes. For example, "We recommend change x because we found that . . ." The UC should beware: the greater the degree of interpretation of data, the greater the chance of allegations of bias. No UC is infallible, and depending on the communication "style" of the committee an effective presentation of data may begin with this frank admission.


• Use visual aids to help you in presenting usability data. Here, the primary aim is to improve audience understanding and recall of the issues discussed. Research indicates that we tend to recall visual information with greater ease than discursive information (Gehring, 1988). Caricato (2000) notes that "optimum [80 percent of] communication occurs via visual presentation", with the remainder of meaning(s) absorbed through speech. The UC should take advantage of the natural preference for visual and auditory forms as it seeks to persuade colleagues of the propriety of its recommendations. Visual aids may include graphs or charts that illustrate basic data (e.g., the average length of time required to conduct each task on the web site), video clips (e.g., of subjects performing specific tasks on the web site), and screen-shots (e.g., the original web site vs. mock-ups that reflect the proposed changes).
• Whenever possible, record usability sessions to show subjects conducting specific tasks on the web site. Studies show that presentations that include video as a means to convey messages (vs. non-animated pictorial slides and/or audio-track clips) may result in retention rates significantly higher than those achieved with other communication methods (Gehring, 1988). Further, according to Murphy (2003), video can also "enhance positive receiver involvement and positive source credibility". In other words, presenting usability data in a more compelling mode via video increases the UC's credibility and might, for this reason alone, achieve greater support.
• Always make available to staff copies of data collection transcripts. Here, the primary aim is to provide your colleagues with the opportunity to revisit the data following the formal presentation. Information processing is a time-dependent function. The ability to review data in a relaxed atmosphere (e.g., at home, on a bus, or with colleagues over coffee) can increase both the degree and durability of buy-in.

Organizational impacts of web site redesign
Redesign efforts will impact various departments within the library. This is especially true of front-line public service staff (and their managers), who use the web site daily to assist patrons with services and resources. Design changes will also impact the workload of staff in computer systems departments (and potentially technical services staff). Successful redesign, therefore, must account for these impacts, as changes affect departmental workflow patterns and patron education (at service desks and in classrooms) and result in additional workloads for those charged with implementing the changes. Planning without regard for organization-wide impact can lead to unnecessary difficulties during the implementation phase of redesign efforts (Dray, 1994).

Things to consider to keep the committee from stumbling:
• Never assume that technological changes will be the "hard part" of redesign initiatives (Dray, 1994). It is more likely that people will represent the greatest challenges. Focus on keeping staff informed of the UC's activities throughout the usability process – from testing to the redesign recommendations. "Frequent, clear communication helps increase commitment to the project and helps to dispel users' natural fears and anxiety" (Dray, 1994).
• Always keep technical services staff informed as you proceed from testing to the redesign recommendations. The UC's recommendations should never come as a surprise. Frequent communication allows technical staff time to anticipate possible problems before they arise and to plan in advance for the acquisition of new resources (if needed).
• Whenever possible, strive to "de-emotionalize" the usability and redesign process. A commitment should be made to developing ways to move staff away from "emotional attachment" to the web site "toward an unbiased analysis of the fit between the design and the tasks and objectives" (Marine, 2002). The goal here is certainly not to devalue or denigrate emotional attachments. Rather, by emphasizing the relationship between the design of the web site and the concrete tasks and objectives that users seek to accomplish with it, the UC can justify changes in accordance with the mission of the organization as a whole; i.e. above all, meeting users' needs. In other words, if the UC does have a bias, then it is a bias in favor of the library's users alone.

Part two
Screen capture software for usability studies
As discussed in part one, it is essential to obtain accurate usability data, present unbiased findings, and articulate well-justified recommendations for web site changes to ensure successful redesign initiatives. This section provides an introduction to Camtasia – the screen capture software package used by the TAMU Libraries UC to record usability sessions. It also discusses how "captured" first-hand testimonials are used to garner buy-in for redesign initiatives.

What is screen capture software and why use it for usability tests?
Much like a VCR, screen capture software records the actions, sounds, and movements that take place on electronic monitors; in this case, computer screens. With the assistance of a microphone, the software also allows for the recording of voice in conjunction with the action taking place on the screen (i.e. cursor movements). The resulting files are then saved, compressed, and played back as needed using media players. In some cases the files can be converted to Flash for easy replay in a web browser. The tests conducted by the UC at TAMU Libraries required test subjects to complete a number of tasks on the library's web site. The question of how best to record these sessions was debated. The following outlines some of the issues and questions members of the committee wrestled with before deciding on a software solution.

Capturing the data – benefits of Camtasia Recorder
What is the best method for acquiring usability data while minimizing data loss and recorder bias? Audio tape recorders capture the subject's voice. They do not, however, capture movements on the web site as the subject completes specific tasks. Ideally, human recorders capture both the subject's voice and movements on the web site. In practice, however, human recorders tend merely to reintroduce the limitations encountered with tape recorders. And in cases where more than one recorder documents the session (e.g., one focusing on collecting audio data, the other on visual data), conflicts arise at the interpretive stage as to the relative weight to be assigned to the different "forms" of data (audio vs. visual). Also, multiple human recorders may mean an erosion of consistency between sessions, threatening the internal coherence of the study as a whole. And clearly, human recording renders the entire project much more labor-intensive. Human recording also raises the specter of intentional and/or unintentional filtering of data.


As previously reported in Xiao (2004), Camtasia, TechSmith's combined recording and editing suite, offers the ability to record 100 percent of audio and visual data in real time. Camtasia screen capture software therefore resolves many of the problems associated with previous data collection methods. All on-screen actions and comments are saved to one complete file. Once saved to file, the entirety of the test data becomes available for review by the UC and/or library staff. The data can also be effortlessly transferred (via shared folders on the staff server, the intranet, e-mail distribution, CDs, etc.) between committee members, thus aiding in data analysis. The saved files are, in effect, "raw" videos of each recorded session and can be replayed ad infinitum (e.g., to create printed transcripts). The sessions are also easily manipulated; i.e. they can be copied, edited, watermarked, and/or incorporated into publications and presentations that recount the study. Also, Camtasia eliminates the need for transcription during the usability session. This frees the researcher to focus solely on the subject and allows the work of transcript preparation to be assigned to one or more parties after the fact.

Presenting the data – benefits of Camtasia Producer
Achieving organizational buy-in for web site redesign initiatives is a must. But what is the best method for presenting data that illustrates the "why" behind the proposed changes? What is the best way to gain the trust of an audience that may be suspicious of change? While first-hand accounts are more compelling than second-hand accounts, not everyone in the library can participate in the UC or attend each usability session. Camtasia gives everyone within the organization a "bird's eye view" of each test subject's session. The Camtasia software suite comprises a screen capture tool (Camtasia Recorder) and a video editing tool (Camtasia Producer). Using Camtasia Producer, the UC was able to create summary videos of key usability issues. These were used during the presentation of findings to staff. Edited videos can be saved in multiple formats, including .AVI, Windows Media, QuickTime, and RealMedia, and can later be embedded into other software programs such as MS PowerPoint. The UC used these summary videos to supplement the presentation of statistical data and direct verbal reports. As previously discussed, video capitalizes on the primacy of visual formats in human learning styles and aids in information retention. The video format also enlivens the presentation, making the audience feel a part of the actual testing process. Finally, videos can be made available for review after the meetings (for those who were unable to attend or for those who wish to reconsider the presentation of findings).

Presentation and buy-in
Camtasia Recorder was used to record the voices of the test subjects and the interviewer and the actions performed by the participants on the computer screen (i.e. their mouse and cursor movements). A total of 23 interviews were conducted and recorded among three main user groups: undergraduates, graduates, and faculty. Once the testing was completed, student workers were enlisted to review the recordings and prepare transcripts. Committee members then reviewed the data (i.e. the unedited usability sessions and accompanying transcripts) and prepared a list of the major design problems encountered by the test subjects.

Recommendations for redesign were formulated and presented to the library's main Web Committee. The recommendations were then presented a second time to the entire library staff (Figure 1). The UC quickly realized how powerful its case for change could be if it were able to show library staff the actual testimonials captured on video during the presentation. Each video was an average of 45 minutes in length; therefore, time constraints prevented the showing of each of the 23 interviews in full. The UC prepared edited "clips" from individual sessions and pieced them together to make short "marketing" videos. These condensed videos centered on specific themes relating to the major design problems encountered by subjects. The videos were then shown to library staff and administration. Camtasia Producer, the editing portion of the Camtasia software suite, was used to create these thematic videos (Figure 2). With Camtasia Producer, clips can be cut, trimmed, and joined to other clips. Additional special effects, such as watermarks and transitions inserted between clips, can be added to enhance videos using a storyboard model (Xiao, 2004). The edited videos illustrated ways in which specific tasks were or were not accomplished by test subjects. In some cases, a video contained clips from several different test subjects' sessions, thus illustrating a single, recurring design problem. Each video was then inserted as a hyperlink into a PowerPoint presentation. This allowed the UC to walk the audience through each design issue, followed by a video illustration and a recommendation for change based on the data previously illustrated on screen (Figure 3).


Figure 1. Preparing to record usability session


Figure 2. Editing video clips

Figure 3. Embedded video clip in PowerPoint presentation

Because each recommendation for change was supported visually by a corresponding video clip, library staff could "see" and "hear" test subjects using the web site and commenting on its relative effectiveness. The raw videos (i.e. the original recordings of each test subject), the edited marketing videos, the PowerPoint summary of findings, the usability session transcripts, and the UC's recommendations were all saved to a central folder on the library's network, and an organization-wide invitation was extended to all parties to review the data.


Surveying for buy-in
As indicated, research suggests that incorporating visual information into presentations improves audience understanding and, in some cases, support for the messages being delivered. The UC wanted to survey the staff to see what effect, if any, the availability of the marketing videos, unedited session videos, and transcripts had on their understanding of, and support for, the proposed changes. The Usability Committee hypothesized that making it possible for all staff to access the "as it happened" videos would result in a more willing acceptance of the proposed changes and a better overall grasp of the issues underlying the redesign initiatives. To explore this possibility, the UC developed and distributed a survey to all library staff who attended the presentation of findings and/or later reviewed the data from a remote location. In addition to being presented with the committee's recommendations for redesign, this group also watched short videos illustrating some of the problems associated with terminology across the web site (one of the more contentious issues). The survey asked respondents to rate their level of agreement with statements as outlined in the graph below. Agreement was measured along a continuum from 0 to 9, with 9 marking "Strongly Agree". Questions included: Did watching the videos improve your understanding of the situation? Since watching the video clips, do you feel more confident that you can explain the rationale behind the UC's proposed changes to the library's patrons? Are you more, or less, in favor of the proposed changes after watching the videos? (Figure 4).


Figure 4. Survey responses for terminology video clips


The survey indicated that support for changing some of the terminology on the web site was high (an average score of 8.11). Respondents also indicated (average score 6.83) that they were confident they could explain to patrons why changes to terminology were needed. One could speculate that this was due in part to having watched the videos, because respondents also reported (average score 6.50) that seeing video clips from the web site usability study would improve their understanding of usability issues and help them to better explain web site changes to patrons. Similarly, they reported that reading the usability transcripts was also beneficial in improving their understanding (average score 6.39).

After following the steps outlined in parts one and two of this paper, the UC encountered little significant resistance to the proposed changes. The UC found staff to be strongly engaged throughout the formal presentation of findings. Questions did arise about how and where staff could access the data, but little skepticism was voiced about the actual rationale for the proposed changes (since this was apparent from the videos). It should be stated that the UC did not conduct its usability testing research – including the survey of staff reaction – on the basis of any rigorous design or protocol (e.g., no attempt was made to isolate an independent variable or to control the conditions of the testing). Hence, any attempt to ascribe a direct "causal" relation between the method(s) used and the high degree of buy-in experienced would exceed the bounds of the study. Nonetheless, the UC believes that by presenting its findings as it did, and by making available the unedited usability sessions, thematic videos, and printed transcripts, a significant level of staff buy-in was observed in comparison to previous redesign efforts. Possible reasons for this are as follows:
• Access to the "real-time" recordings made staff feel involved, since they were able to view the first-hand accounts of the test subjects from the unedited sessions. Anecdotal comments revealed that staff felt almost as if they had been in the room with the interviewer and test subject for the sessions.
• The raw videos eliminated the need for a mediator (i.e. the UC) between the test subjects' experiences with the web site and the library staff. The UC believes this made the data more compelling by minimizing reporter bias. The low level of bias, both in the collection and interpretation of data, allowed the UC to portray its proposed changes as "objective", i.e. as meeting users' needs.
• Videos and printed transcripts aided in increasing organization-wide understanding of key web site issues by appealing to a variety of learning styles (visual/spatial, auditory, verbal/linguistic). To repeat: verbal reports from staff indicated that staff developed an appreciation of the reasons behind the proposed changes because they were able to see and hear the users' reasons "first-hand".

Conclusion
The learning curve for effective use of TechSmith's Camtasia software suite is relatively flat. The recording process is easy and editing is fairly straightforward. In general, the TAMU Libraries UC found Camtasia to be a very effective tool for recording web site usability data. Camtasia helped to reduce the workload of the usability team members, minimized data loss, and provided easy post-test data access across the organization.

The recorded usability videos also have marketing potential beyond one's internal organization. For example, edited videos can be made available on the web site or presented to departments and committees on campus to illustrate the rationale behind the changes that most impact them. In addition to usability functions, Camtasia can be used for a variety of instructional and marketing applications. These include the production of videos to teach database searching techniques (to supplement in-house classes and support distance education), promotional videos to illustrate online resources and services (e.g., how to renew books in the online catalog), in-house software training, and the addition of voice to PowerPoint presentations. Overall, the use of Camtasia at TAMU Libraries proved to be an effective method to garner organizational buy-in for web site redesign. The software allowed the UC the opportunity to involve staff more intimately in the testing process, and as a result staff reported being more comfortable with the proposed changes. In sum, they had a better understanding of the "why" behind each recommended change. This greater understanding ultimately led to overall support and confidence in the changes made. Current usability testing efforts underway are also being conducted using Camtasia.

References
Caricato, J.A. (2000), "Visuals for speaking presentations: an analysis of the presenter's perspective of audience as a partner in visual design", Technical Communication, Vol. 47 No. 54, pp. 496-515.
Dray, S.M. (1994), Minimizing Organizational Risks of Technological Change, Association for Computing Machinery Conference Companion, ACM, Boston, MA, pp. 385-6.
Gehring, R.E. and Toglia, M.P. (1988), "Relative retention of verbal and audiovisual information in a national training programme", Applied Cognitive Psychology, Vol. 2 No. 2, pp. 213-21.
Marine, L. (2002), "Pardon me but your baby is ugly. . .", Interactions, September/October, pp. 35-9.
Murphy, P.K., Long, J.F., Holleran, T.A. and Esterly, E. (2003), "Persuasion online or on paper: a new take on an old issue", Learning and Instruction, Vol. 13 No. 3, pp. 511-32.
Smith, S. and Marcum, D. (2003), "New-School thinking: three old school principles – ego, speed, and solutions – are hurting organizations and must be expelled!", Training and Development, Vol. 57, October, pp. 36-44.
Thomson, K. (1997), "Market for employee buy-in (organizational communication of change programs)", Communication World, Vol. 14 No. 5, pp. 14-16.
Xiao, D.Y., Pietraszewski, B.A. and Goodwin, S.P. (2004), "Full stream ahead: database instruction through online videos", Library Hi Tech, Vol. 22 No. 4, pp. 366-74.

Further reading
Horowitz, S. (1996), "Powerful presentations: achieve your goals when making presentations by following these guidelines", Thrust for Educational Leadership, Vol. 26 No. 3, pp. 24-8.


Literati Club

Awards for Excellence
Roy Tennant
California Digital Library, Oakland, California, USA

is the recipient of the journal’s Outstanding Paper Award for Excellence for his paper

"A bibliographic metadata infrastructure for the twenty-first century"

which appeared in Library Hi Tech, Vol. 22 No. 2, 2004.

Roy Tennant is User Services Architect for the California Digital Library. He is the owner of the Web4Lib and XML4Lib electronic discussions, and the creator and editor of Current Cites, a current awareness newsletter published every month since 1990. He has also been instrumental in the development of the Simple Web Indexing Software for Humans – Enhanced (SWISH-E) software, and the internet resource guides Librarians' Index to the internet and KidsClick! His books include Managing the Digital Library (2004), XML in Libraries (2002), Practical HTML: A Self-Paced Tutorial (1996), and Crossing the Internet Threshold: An Instructional Handbook (1993). Roy has written a monthly column on digital libraries for Library Journal since 1997 and has published numerous articles in other professional journals. In 2003, he received the American Library Association's LITA/Library Hi Tech Award for Excellence in Communication for Continuing Education.

Note from the publisher

Outstanding Doctoral Research Awards
As part of Emerald Group Publishing's commitment to supporting excellence in research, we are pleased to announce that the 1st Annual Outstanding Doctoral Research Awards have been decided. Details about the winners are shown below. 2005 was the first year in which the awards were presented and, due to the success of the initiative, the programme is to be continued in future years.
The idea for the awards, which are jointly sponsored by Emerald Group Publishing and the European Foundation for Management Development (EFMD), came about through exploring how we can encourage, celebrate and reward excellence in international management research. Each winner has received €1,500 and a number have had the opportunity to meet and discuss their research with a relevant journal editor. Increased knowledge-sharing opportunities and the exchange and development of ideas that extend beyond the peer review of the journals have resulted from this process. The awards have specifically encouraged research and publication by new academics: evidence of how their research has impacted upon future study or practice was taken into account when making the award selections, and we feel confident that the winners will go on to have further success in their research work.
The winners for 2005 are as follows:
. Category: Business-to-Business Marketing Management. Winner: Victoria Little, University of Auckland, New Zealand, "Understanding customer value: an action research-based study of contemporary marketing practice".
. Category: Enterprise Applications of Internet Technology. Winner: Mamata Jenamani, Indian Institute of Technology, "Design benchmarking, user behaviour analysis and link-structure personalization in commercial web sites".
. Category: Human Resource Management. Winner: Leanne Cutcher, University of Sydney, Australia, "Banking on the customer: customer relations, employment relations and worker identity in the Australian retail banking industry".
. Category: Information Science. Winner: Theresa Anderson, University of Technology, Sydney, Australia, "Understandings of relevance and topic as they evolve in the scholarly research process".
. Category: Interdisciplinary Accounting Research. Winner: Christian Nielsen, Copenhagen Business School, Denmark, "Essays on business reporting: production and consumption of strategic information in the market for information".
. Category: International Service Management. Winner: Tracey Dagger, University of Western Australia, "Perceived service quality: proximal antecedents and outcomes in the context of a high involvement, high contact, ongoing service".

. Category: Leadership and Organizational Development. Winner: Richard Adams, Cranfield University, UK, "Perceptions of innovations: exploring and developing innovation classification".
. Category: Management and Governance. Winner: Anna Dempster, Judge Institute of Management, University of Cambridge, UK, "Strategic use of announcement options".
. Category: Operations and Supply Chain Management. Winner: Bin Jiang, DePaul University, USA, "Empirical evidence of outsourcing effects on firm's performance and value in the short term".
. Category: Organizational Change and Development. Winner: Sally Riad, Victoria University of Wellington, New Zealand, "Managing merger integration: a social constructionist perspective".
. Category: Public Sector Management. Winner: John Mullins, National University of Ireland, Cork, "Perceptions of leadership in the public library: a transnational study".

Submissions for the 2nd Annual Emerald/EFMD Outstanding Doctoral Research Awards are now being received, and we would encourage you to recommend the awards to doctoral candidates whom you believe to have undertaken excellent research. The deadline for all applications is 1 March 2006. For further details about the subject categories, eligibility and submission requirements, please visit the web site: www.emeraldinsight.com/info/researchers/funding/doctoralawards/2006awards.html