Designs for a global plant species information system 0198577605

English Pages [376] Year 1993

OXFORD SCIENCE PUBLICATIONS

THE SYSTEMATICS ASSOCIATION SPECIAL VOLUME NO. 48

Digitized by the Internet Archive In 2022 with funding from Kahle/Austin Foundation

https://archive.org/details/designsforglobalo00Ounse

Designs for a Global Plant Species Information System

Proceedings of an International Symposium held at Delphi in Greece, October 1990

Edited by

F.A. BISBY
Biology Department, University of Southampton, Southampton, UK

G.F. RUSSELL
Department of Botany, Smithsonian Institution, Washington DC, USA

and

R.J. PANKHURST
Royal Botanic Garden, Edinburgh, Scotland, UK

Published for the SYSTEMATICS ASSOCIATION by CLARENDON PRESS - OXFORD 1993

Oxford University Press, Walton Street, Oxford OX2 6DP

Oxford New York Toronto Delhi Bombay Calcutta Madras Karachi Kuala Lumpur Singapore Hong Kong Tokyo Nairobi Dar es Salaam Cape Town Melbourne Auckland Madrid
and associated companies in Berlin Ibadan

Oxford is a trade mark of Oxford University Press

Published in the United States by Oxford University Press Inc., New York

© The Systematics Association, 1993

All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, without the prior permission in writing of Oxford University Press. Within the UK, exceptions are allowed in respect of any fair dealing for the purpose of research or private study, or criticism or review, as permitted under the Copyright, Designs and Patents Act, 1988, or in the case of reprographic reproduction in accordance with the terms of licences issued by the Copyright Licensing Agency. Enquiries concerning reproduction outside those terms and in other countries should be sent to the Rights Department, Oxford University Press, at the address above.

This book is sold subject to the condition that it shall not, by way of trade or otherwise, be lent, re-sold, hired out, or otherwise circulated without the publisher’s prior consent in any form of binding or cover other than that in which it is published and without a similar condition including this condition being imposed on the subsequent purchaser.

A catalogue record for this book is available from the British Library

Library of Congress Cataloging in Publication Data
Designs for a global plant species information system / edited by F.A. Bisby, G.F. Russell, and R.J. Pankhurst.
p. cm. — (The Systematics Association special volume; no. 48)
Proceedings of an international symposium held at Delphi, Greece, October 1990.
Includes bibliographical references and index.
1. Botany—Data bases—Congresses. 2. Botany—Classification—Data bases—Congresses. 3. Botany—Nomenclature—Data bases—Congresses. 4. Information storage and retrieval—Botany—Congresses.
I. Bisby, F.A. II. Russell, G.F. (George F.) III. Pankhurst, R.J. IV. Systematics Association. V. Series.
QK46.5.E4D47 1993 581’.0285—dc20
ISBN 0 19 857760 5

Typeset by the Electronic Book Factory, Fife, Scotland
Printed in Great Britain by Bookcraft (Bath) Ltd, Midsomer Norton, Avon

Preface

This volume contains the papers associated with the symposium ‘Designs for a Global Plant Species Information System’ held in Delphi, Greece, in October 1990 and with subsequent project developments up to 1992. The meeting was organized by the International Working Group for Taxonomic Databases in Plant Sciences (TDWG) in association with the Committee on Data for Science and Technology (CODATA), the European Cultural Centre at Delphi (ECCD), International Union of Biological Sciences (IUBS), the Linnean Society, and the Systematics Association. We were privileged to use the conference centre at the European Cultural Centre at Delphi, on the southern slope of Mount Parnassus in Greece. The purpose of the symposium was to stimulate debate concerning different designs for creating and operating a global plant species information system—a data system that would provide international access to data accumulated on all of the world’s plants. Design issues were discussed at five levels, corresponding broadly to the five days of the symposium and to the five sections of this book. We succeeded in stimulating debate in such large measure that planned discussions were eventually overtaken by a groundswell of opinion from participants wanting to start such a worldwide project immediately. Consequently the final round-table discussion was given over to appointing a ‘GPSIS Action Group’ charged with establishing such a project. The meeting was remarkable both for the range of disciplines brought together—from the areas of biology that use systematic species diversity information, from plant systematics, from computer science, from the information industry—and for the geographical spread of participants from almost every continent. The theme that emerged as most urgent for the biodiversity community was production of a computer-based species checklist of the world’s plants. 
This was seen both as an urgent necessity for a wide range of disciplines whose data were organized by species, and as the basis for development of a more extensive information system containing botanical and associated data. The debate developed further in the months following the symposium. Which is the top priority: to create a checklist rapidly; to create a more measured synoptic treatment; to create a computer network for

participants and users? And equally important, who should organize the project worldwide and how should it be managed?

In 1991 the GPSIS (Global Plant Species Information System) project, proposed by the GPSIS Action Group from the symposium at Delphi, was linked with the SPP (Species Plantarum Project), described by Brummitt in Chapter 12. A single, newly formed, internationally supervised organization, the International Organisation for Plant Information (IOPI), has now taken on the tasks, with the two top priorities being: creation of a computer-based vascular plant checklist, and creation of a Common Directory of existing databases which, along with some access to the databases themselves, will run within the Internet network. These latest developments are described in the last chapters added to the book in 1992.

Because of the time that has elapsed since the original proposals made at Delphi, a number of the papers have been revised in minor factual points to bring them up to date (Chapters 10, 11, 12, and 17). However, the main spectrum of ideas is given here as presented at Delphi. Two chapters record elements of the discussion at Delphi: the presentation made by Bisby in response to debate on day 2 (Chapter 14), and a summary of themes raised in the round-table discussion on day 5 (Russell, Chapter 29). Finally Chapter 30 (The GPSIS Action Group: a call to action) reflects the initiative developed in 1990/1 after the symposium, and Chapter 31 (IOPI: the genesis of GPSIS, Burnett) brings this up to date with developments to November 1992.

We are much indebted to our colleagues on the organizing committee: Professor V.H. Heywood (IUCN, Richmond, UK), Dr R. Hnatiuk (Australian National Botanical Gardens, Canberra), Dr C. Guitakos (European Cultural Centre, Delphi, Greece), Dr R.W. Kiger (Hunt Institute, Pittsburgh, USA), Dr N.R. Morin (Missouri Botanical Garden, St Louis, USA) and Ms C. Zellweger (Conservatoire et Jardin Botaniques, Geneva). Their guidance in the planning stages, and subsequently their participation in the meeting itself as conveners and organizers, was invaluable.

It is a pleasure too to acknowledge substantial financial support from the European Community (DG XII) and the US National Science Foundation, plus a loan from the International Union of Biological Sciences. Further support and the sponsorship of individual items were welcomed from: BIOSIS, Philadelphia and York (folders), CAB International, Wallingford (reception), CODATA, Paris (travel), Chadwyck-Healey, Cambridge (programme), Chapman & Hall, London (abstracts), Greek Ministry of Culture, Athens (The ECCD Conference Centre), IBM Trust, Hursley (travel), Linnean Society of London (travel), Missouri Botanical Garden, St. Louis (poster), Smithsonian Institution, Washington (secretarial assistance), Systematics Association (travel and symposium volume).

Lastly we thank the conference assistants, Ms Ann Kawasaki (Smithsonian Institution), Mrs Vicki Segouni (ECCD, Delphi) and Ms Debby Smith (Southampton University), who worked tirelessly to ensure that things ran smoothly on the day.

F.A. Bisby G.F. Russell R.J. Pankhurst

Contents

Contributors and Conveners

Introduction

1. A Global Plant Species Information System (GPSIS): ‘blue skies design’ or tomorrow’s workplan?
   F.A. Bisby

PART 1
Demand for a global plant species information system

2. The need for a worldwide botanical reference system
   G.Ll. Lucas

3. US interagency botanical data applications, needs, and the PLANTS database
   J.S. Peterson

4. NAPRALERT: problems and achievements in the field of natural products
   C. Gyllenhaal, M.L. Quinn, D.D. Soejarto, and N.R. Farnsworth

5. Prolegomena on a species information system for the flora of the Rocky Mountains
   S.A. Morain

6. The need for information on genetic resources
   M. Chauvet

7. Plant breeding and resource information
   J.I. Cubero

8. Standard and alternative taxonomic data in the multi-institutional Natural Heritage Data Center Network
   L.E. Morse

PART 2
Botanical decision-making and data-collection strategies

9. A view of the future for floristic research
   A. Gomez-Pompa and O.E. Plummer

10. Instability in biological nomenclature: problems and solutions
    J. McNeill

11. Lists of names in current use and their possible role in a global plant species information system
    W. Greuter

12. The proposed ‘Species Plantarum Project’ (SPP)
    R.K. Brummitt

13. The ILDIS Project on the world’s legume species diversity
    J.L. Zarucchi, P.J. Winfield, R.M. Polhill, S. Hollis, F.A. Bisby, and R. Allkin

14. Botanical strategies for compiling a global plant checklist
    F.A. Bisby

15. The role of individual botanists and small organizations in the development and maintenance of a global plant information system (GPIS)
    C.H. Stirton

16. Botanical decision-making and data-collection strategies: the role of small institutions
    U. Eggli

17. The role of large institutions in a global plant species information system
    N.R. Morin

PART 3
System design

18. Designing a world service: a BIOSIS viewpoint
    M.N. Dadd and M.C. Kelly

19. Centralized, distributed, and replicated databases: the pros and cons
    M.W. Freeston

20. A global plant taxonomy database: design considerations
    M. Everard

21. Linking related databases: a microbiological approach
    B. Kirsop

22. Adopting a transaction processing model for a global plant species information system
    D. Jinks

23. Design aspects of an enterprise computing environment for systematics
    J.H. Beach

24. Networks and communications: the Internet
    K.C. Roubicek

PART 4
Data structures and software

25. Alternative models for taxonomic data
    C. Zellweger and R. Allkin

26. Practical links between specimen and taxon databases
    R.E. Magill

27. A strategy for the evolution of database designs
    R.J. White and R. Allkin

28. Software development strategies for global plant information systems
    R. Allkin and P.J. Winfield

PART 5
Practical steps to establish a system

29. Management models for a global plant species information system
    G.F. Russell

30. The GPSIS Action Group: a call to action
    C. Gyllenhaal, G.F. Russell, J.S. Peterson, N.R. Morin, and F.A. Bisby on behalf of the GPSIS Action Group

31. IOPI: genesis of GPSIS?
    J. Burnett

Index

List of Systematics Association Publications
Contributors and Conveners

ROBERT ALLKIN
Royal Botanic Gardens, Kew, Richmond, Surrey TW9 3AB, UK.

JAMES H. BEACH
Department of Botany and Plant Pathology, Michigan State University, East Lansing, MI 48824-1312, USA. (Present address: Gray Herbarium, Harvard University, 22 Divinity Avenue, Cambridge, MA 02138, USA.)

FRANK A. BISBY
Biology Department, University of Southampton, Southampton SO9 3TU, UK.

RICHARD K. BRUMMITT
Royal Botanic Gardens, Kew, Richmond, Surrey TW9 3AB, UK.

JOHN BURNETT
Co-ordinating Commission for Biological Recording, 13 Field House Drive, Oxford OX2 7NT, UK.

MICHEL CHAUVET
Bureau des Ressources Génétiques, 57 Rue Cuvier, F-75231 PARIS Cedex 05, France.

JOSE I. CUBERO
Departamento de Genetica, ETSIA, Universidad de Córdoba, Apartado 3048, Cordoba 14080, Spain.

MICHAEL N. DADD
BIOSIS UK, 54 Micklegate, York YO1 1LF, UK.

URS EGGLI
c/o Municipal Succulent Collection, Mythenquai 88, CH-8002, Zürich, Switzerland.

MARK EVERARD
IBM UK Laboratories Ltd, Hursley Park, Winchester, Hants SO21 2JN, UK. (Present address: National Rivers Authority, Rivers House, Waterside Drive, Aztec West, Almondsbury, Bristol BS12 7UD, UK.)

NORMAN R. FARNSWORTH
PCRPS, College of Pharmacy, University of Illinois, Box 6998, Chicago, IL 60680, USA.

MICHAEL W. FREESTON
European Computer-Industry Research Centre (ECRC), Arabellastrasse 17, D-8000 Munich 81, Germany.

ARTURO GOMEZ-POMPA
Dept. of Botany, University of California at Riverside, Riverside, CA 92521, USA. (Present address: Schiller No 417, PB, Polanco, México D.F. 11560, Mexico)

WERNER GREUTER
Botanischer Garten und Botanisches Museum Berlin-Dahlem, Königin-Luise-Str. 6-8, D-14191 Berlin, Germany.

CHARLOTTE GYLLENHAAL
PCRPS, College of Pharmacy, University of Illinois, Box 6998, Chicago, IL 60680, USA.

SUSAN HOLLIS
ILDIS Co-ordinating Centre, Biology Department, University of Southampton, Southampton SO9 3TU, UK.

DEIDRIE JINKS
School of Computing Sciences, University of Technology, Sydney PO Box 123, Broadway, NSW 2007, Australia.

MAUREEN C. KELLY
BIOSIS, 2100 Arch Street, Philadelphia, PA 19103-1399, USA.

BARBARA KIRSOP
Microbial Strain Data Network, Institute of Biotechnology, Cambridge University, 307 Huntingdon Road, Cambridge CB3 0JX, UK. (Present address: Biostrategy Associates, Stainfield House, Stainfield, Bourne, Lincs PE10 0RS, UK.)

GRENVILLE Ll. LUCAS
Royal Botanic Gardens, Kew, Richmond, Surrey, TW9 3AB, UK.

JOHN McNEILL
Royal Ontario Museum, 100 Queen’s Park, Toronto, Ontario, Canada, M5S 2C6.

ROBERT E. MAGILL
Missouri Botanical Garden, P.O. Box 299, St Louis, MO 63166, USA.

STANLEY A. MORAIN
Technology Application Center, University of New Mexico, Albuquerque, NM 87131, USA.

NANCY R. MORIN
Missouri Botanical Garden, P.O. Box 299, St Louis, MO 63166, USA.

LARRY E. MORSE
The Nature Conservancy, 1815 N. Lynn Street, Arlington, VA 22209, USA.

RICHARD J. PANKHURST
The Natural History Museum, Cromwell Road, London SW7 5BD, UK. (Present address: Royal Botanic Garden, Inverleith Row, Edinburgh EH3 5LR, UK.)

J. SCOTT PETERSON
National Plant Materials Center, USDA Soil Conservation Service, Building 509, BARC-East, Beltsville, MD 20705, USA.

ROGER M. POLHILL
Royal Botanic Gardens, Kew, Richmond, Surrey TW9 3AB, UK.

ORLAY E. PLUMMER
Department of Botany, University of California at Riverside, Riverside, CA 92521, USA.

KAREN C. ROUBICEK
NSF Network Services Center, Bolt Beranek & Newman Inc., 10 Moulton Street, Cambridge, MA 02138, USA. (Present address: Faxon Company Inc., 15 Southwest Park, Westwood, MA 02090, USA)

GEORGE F. RUSSELL
United States National Herbarium, Department of Botany, NHB 166, Smithsonian Institution, Washington, DC 20560, USA.

MARY LOU QUINN
PCRPS, College of Pharmacy, University of Illinois, Chicago, IL 60680, USA.

DJAJA D. SOEJARTO
PCRPS, College of Pharmacy, University of Illinois, Chicago, IL 60680, USA.

CHARLES H. STIRTON
Royal Botanic Gardens, Kew, Richmond, Surrey TW9 3AB, UK.

RICHARD J. WHITE
Biology Department, University of Southampton, Southampton SO9 3TU, UK.

PETER J. WINFIELD
Scottish Agricultural Science Agency, East Craigs, Edinburgh EH12 8NJ, UK.

JAMES L. ZARUCCHI
Missouri Botanical Garden, P.O. Box 299, St. Louis, MO 63166, USA.

CATHERINE ZELLWEGER
Conservatoire et Jardin Botaniques, Genève, CH-1292 Chambésy, Switzerland.

1. A Global Plant Species Information System (GPSIS): ‘blue skies design’ or tomorrow’s workplan?

FRANK A. BISBY
Biology Department, University of Southampton, Southampton SO9 3TU, UK

Abstract

The idea of planning a global diversity information system (GPSIS) was minuted at the first TDWG meeting in 1985. It was dubbed ‘the blue skies design’ and thought of as an abstract issue not practical in the near term. Our progress five to eight years on, both in technology and know-how, is such that a global system clearly is a real possibility. If there are difficulties in creating one now they arise from professional uncertainties—exactly how would the knowledge be best agreed and organized—and scientific policy issues—how will members of the community, both individual and institutional, co-operate in the venture. The design issues can be structured as a series of questions at five levels: user demand, botanical sources, system configuration, logical design, and project management. When these design questions are sorted out there remains one central challenge: will the global system remain a blue skies design or will it be turned to reality in tomorrow’s workplan? The questions in this paper provided the framework for the Delphi Symposium. Nearly all the issues are dealt with in the chapters that follow, and the ensuing project developments are outlined in Chapters 30 and 31.

Blue skies design or tomorrow’s workplan?

The minutes of the very first meeting of TDWG record that in 1985 we discussed what was then thought to be a futuristic view: maybe one day many projects might combine to create a unified, ideally organized, reference system for all plants. It seemed far from practicality. Minute 3.1 continues: ‘The present discussions implicitly recognised that the plant taxonomic profession is involved in producing such a system and (rightly) concentrated on very practical moves towards co-ordination. But it would be of interest to consider the theoretical “top down” design that would be generated by designing the system afresh using information technology. Would the design generated be totally impracticable for reasons of institutional tradition, traditional responsibilities etc. or could parts of it be implemented in radical moves towards a 21st century system? The suggestion was dubbed the “blue skies design” by participants who thought there was a case for devoting a separate well-prepared meeting to this topic.’

© Systematics Association, Special Volume No. 48, ‘Designs for a Global Plant Species Information System’, edited by F.A. Bisby, G.F. Russell, and R.J. Pankhurst, 1993, pp. 1-6. Oxford University Press, Oxford.

The present symposium at Delphi is the occasion planned then, and now labelled ‘Designs for a Global Plant Species Information System’. Three of the 12 participants in 1985 are here now to participate in the debate: Dr Brummitt, Professor Heywood, and myself. But one thing has changed since 1985. The technology clearly now is available to make such a system a reality. Relational data structures, tailored software, international networks, remote data entry, online access: these are everyday actions in commerce and science. What we lack is the international organization to design, implement and maintain such a system, and that is what this symposium is about.

We need to design such a system at five different levels, and these provide the themes for the five days of the symposium (now the five sections of this book):

1. Demand for a global plant species information system.
2. Botanical decision-making and data-collection strategies.
3. System design.
4. Data structures and software.
5. Practical steps to establish a system.

The purpose of this introductory chapter is to focus attention on the issues, and to frame what I believe are the important questions at each level.

Demand for a global plant species information system

Before delving into issues of design and management we must focus clearly on what is needed. What do the potential users of a global plant species information system want? Or indeed do they want it at all? We have invited experts from a range of disciplines to respond by explaining the interactions between their work and plant species diversity information. They come from conservation, agriculture and range management, natural products and pharmacognosy, environmental survey, genetic resources, plant breeding, and biotechnology.

Two key design questions are:

(1) is it the taxonomic backbone which is needed, or is it the ‘value-added’ descriptive and diversity data; and

(2) is the demand primarily for a single preferred taxonomic view, or do the users want to see a range of taxonomic views?

The taxonomic backbone referred to is of course the synonymized species checklist plus taxonomic hierarchy, the core reference system on which species diversity information is hung. At a logical level most of the taxonomic community would argue that the backbone must anyway be organized first and descriptive data added afterwards, a sequence equivalent to Phase 1 and Phase 2 of ILDIS (Zarucchi et al., Chapter 13). But this logic has not been so obvious to other botanists, many of whom have embarked on species databases by simply recording species names as they record their data. So we can rephrase the first question in a couple of ways. Which is more important, to sort out the species checklist data properly (names, synonymy, hierarchy); or to go ahead, even in a rough and ready system, with providing descriptive data about the species? Would a well-organized species checklist have value to the biological community in its own right, or does it only become useful with associated data?

The second question, on single or alternative taxonomies, also has a number of angles to it. Taxonomists themselves are divided, some feeling that a practical responsibility we have is to provide a stable reference system for others to use, and for this we must strive for a single preferred system at any one time. Others think the debates on alternative taxonomies are both interesting and factually useful to users so that a useful system may present several taxonomies for the same organisms. Taxonomic database and software designers have for the most part preferred the relative simplicity of linking everything to a single taxonomy. Neutral presentation of alternative taxonomies is clearly possible, but it will be interesting to see whether such complexity can be made sufficiently user-friendly for casual use by non-taxonomists. But the impetus here lies with the users. Would a system with a single preferred taxonomy suit 50, 90, or even 99 per cent of users? For what percentage of users would such a system actually be better than a system with alternative taxonomies?

Botanical decision-making and data-collection strategies

There are three interlinked questions here. First, how can we acquire the basic data (perhaps taxon names, geographical distribution, and some descriptive data) in electronic form, either from existing databases, or by data-entry using books and herbaria as sources? Second, how and where can the important taxonomic decisions be taken? Who has the final say

as to the system adopted for each genus? Third, how can data-collection and taxonomic decision-making be organized in the existing community of institutions and projects? We should surely build on what is already available or in progress, both in terms of involving existing Flora and Family checklist organizations, and in terms of small, medium, and large institutions. Key questions on botanical decision-making depend on the attitudes towards preferred or consensus taxonomy. If a single preferred taxonomy is given then much organization of taxonomic specialists must go into creating or choosing this.

System configuration—machines and communications

We do not have just the botanical and logical complexities which would apply if a single database were to be created at one place by one botanist. We have the additional complexity of differing contributions from multiple sites in the creation of a global system. Indeed it has been suggested that the acronym GPSIS (Global Plant Species Information System) should be altered to G2PSIS: global twice over, both in the sense of including all the world’s plants, and in the sense of servicing input and output at many locations round the world. At implementation, major system questions must be answered. Are we talking of one unified database with network communications or a communications network linking several databases? Would the database itself be homogeneous, heterogeneous, distributed, or replicated? What type of network communications would be suitable either amongst participants creating the database, or as a way of providing access? What frequency of updating is needed by users?

Data structures and logical designs

A core issue here is to sort out the logical links between taxa, names, and accessions. What data structures satisfy the differing needs of a taxon checklist with accepted names and synonyms, a nomenclatural list of names themselves, and an accession list of specimens with single or multiple identifications? This is not an easy issue and many of the incompatibilities and inadequacies of existing systems arise from it.

Secondly there are many questions related to information exchange and co-operation in creating a single system or network. At what levels should standards be required? How can we organize the exchange of meaning, data, format, structure, etc?

And lastly, how should software development and maintenance be organized? Are we seeking individual institutional and project software, a range of generic software, or standard software? Should we set up a separate or parallel project for software development? Are we devoting sufficient attention and resources to this vital area?

Funding and management

Here the first question must be what kind of international organization is needed to provide the management, co-operation, and funding. What sponsorship and funding models are appropriate? What management structures and mechanisms for project decision-making are needed? There will inevitably be political, commercial, and intellectual property rights issues to be sorted out. Will centres in developing countries be able to make substantial contributions? Will the service provided by the system be free or paying? Will the organization be not-for-profit? What will be the principles of ownership, participation, and academic credit?

The big questions

I hope that the questions given above are sufficient to illustrate the sense in which technical design issues must be debated at various levels. But no amount of examining technical and academic models will actually produce the result. To go ahead with the task, the big community questions are:

1. Are we actually going to produce such a system? In my view the time is right. The technologies and experience with prototype projects are ready. I also think that, with the present public interest in biodiversity and conservation, a world system is needed desperately and can be funded. The challenge is to the people and institutions, most of whom are represented at this symposium, to form an effective organization that will go ahead and do it.

2. Shall we co-operate in producing one system? The traditions of taxonomists participating in wide-ranging international co-operative activities are so deeply ingrained that all of us naturally think of establishing one co-operative network worldwide. But we must be aware that differing factions, competing institutions, and regional funding initiatives may all put pressure on us to go ahead unilaterally or in competition. I believe the resources of taxonomists’ knowledge and the project funding available are so slender that we must surely make every effort to agree a single organization if at all possible.

3. Do we have the staying power to introduce the necessary level of international organization? It was Hugh Synge (personal communication) who commented that as there are only approximately a quarter of a million flowering plants (about the number of entries in an average telephone book), it is surely not a major task to list them all. In one sense he was wrong. It is actually not easy to list them, and some of the problems come from the lack of higher-level taxonomic and international organization in our profession. Putting some such higher-level organization in place may be difficult but it is necessary if the accumulated core work of a substantial profession working since the time of Linnaeus is to be collated and delivered as a common service.

So the challenge underlying all of the symposium is this: will the global system remain a blue skies design or will it be turned to reality in tomorrow’s workplan?

Postscript

This chapter is printed as given at the symposium in 1990. Nearly all of the issues are dealt with in the chapters that follow. Some are answered decisively, such as the value and urgent need for a computerized species checklist (see Gyllenhaal et al., Chapter 4): others are debated to and fro, such as the interplay of preferred and alternative taxonomies (see McNeill, Chapter 10, and Zarucchi et al., Chapter 13) and different management scenarios (see Russell, Chapter 29). In the event the ‘big questions’ listed here resulted in the challenge being accepted by many in the community: Chapter 30 is a ‘call to action’ that followed the symposium, and Chapter 31 describes the IOPI Global Plant Checklist and Common Directory, projects that are going ahead.
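The question raised under ‘Data structures and logical designs’, how taxa, names, and accessions relate, can be made concrete with a small sketch. The Python below is an illustration only, not a design proposed at the symposium: the class names (`Name`, `Taxon`, `Accession`, `Checklist`), the helper `current_labels`, and the plant names are all hypothetical, invented for this example. It assumes that a name is a record independent of any classification, that a checklist is one taxonomic view assigning each name to a taxon as accepted name or synonym, and that a specimen may carry several identifications.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Name:
    """A nomenclatural record: the name itself, independent of any taxonomy."""
    name_id: int
    text: str


@dataclass
class Taxon:
    """A taxon within one taxonomic view: an accepted name plus its synonyms."""
    accepted: Name
    synonyms: List[Name] = field(default_factory=list)


@dataclass
class Accession:
    """A specimen record that may carry single or multiple identifications."""
    accession_id: str
    identifications: List[Name] = field(default_factory=list)


class Checklist:
    """A synonymized checklist representing a single taxonomic view."""

    def __init__(self, view: str, taxa: List[Taxon]) -> None:
        self.view = view
        # Index every name, accepted or synonym, to the taxon that holds it.
        self._taxon_by_name_id: Dict[int, Taxon] = {}
        for taxon in taxa:
            for name in [taxon.accepted, *taxon.synonyms]:
                self._taxon_by_name_id[name.name_id] = taxon

    def accepted_name(self, name: Name) -> Optional[Name]:
        """Return the accepted name for `name` in this view, or None if unplaced."""
        taxon = self._taxon_by_name_id.get(name.name_id)
        return taxon.accepted if taxon is not None else None


def current_labels(accession: Accession, checklist: Checklist) -> List[str]:
    """Translate each identification on a specimen into the view's accepted name."""
    labels = []
    for name in accession.identifications:
        accepted = checklist.accepted_name(name)
        if accepted is not None:
            labels.append(accepted.text)
    return labels


# Hypothetical names, invented purely for illustration.
aus_bus = Name(1, "Aus bus")
aus_cus = Name(2, "Aus cus")
xus_bus = Name(3, "Xus bus")

# View A sinks "Xus bus" into "Aus bus"; view B accepts it as a separate species.
view_a = Checklist("A", [Taxon(aus_bus, [xus_bus]), Taxon(aus_cus)])
view_b = Checklist("B", [Taxon(aus_bus), Taxon(aus_cus), Taxon(xus_bus)])

# One specimen, identified twice under different names.
specimen = Accession("ACC-0001", [xus_bus, aus_bus])

print(current_labels(specimen, view_a))  # ['Aus bus', 'Aus bus']
print(current_labels(specimen, view_b))  # ['Xus bus', 'Aus bus']
```

Because names are stored once and each checklist only points at them, the same specimen can be reported under a single preferred taxonomy or under alternative taxonomies without duplicating the name or accession records. That is one possible reading of the design choice debated above, not a recommendation of either approach.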

PART 1

Demand for a global plant species information system

2. The need for a worldwide botanical reference system

G. Ll. LUCAS
Royal Botanic Gardens, Kew, Richmond, Surrey TW9 3AB, UK

Abstract

The taxonomists of the world have been struggling to identify, name and classify the world’s flora in a systematic way for nearly 250 years now. We have been very conscientious; there are nearly one million names in the Index Kewensis to prove it. However, we all know there are probably only about 250 000 species covered by those names and of course there are still many species to be discovered, named, and classified. What must we do, therefore, to help the user community to have full access to our combined knowledge to help them with their specific needs? Simply, they need a basic system that identifies the entities, gives them a name, and shows how they relate to one another, and what their distributions are. How simple it sounds, but it doesn’t exist. Because this system doesn’t exist, others (lawyers and legislators) are taking the decisions and in doing so are putting taxonomists to one side. We have to take the high ground again by providing such a system: a reliable and permanent system. But what is essential is that the taxonomists design and run the system. Hence one element of this programme is the Species Plantarum Project. It must be an international collaborative venture to show that we, the taxonomic community, can provide what the world needs. More importantly we need to start now. It is a big task and the sooner it is started the sooner it will be finished.

© Systematics Association, Special Volume No. 48, ‘Designs for a Global Plant Species Information System’, edited by F.A. Bisby, G.F. Russell, and R.J. Pankhurst, 1993, pp. 9-12. Oxford University Press, Oxford.

As Keeper of the Herbarium at Kew I receive a large number of letters seeking help with the identification of plant collections essential to people’s research. The staff of the Herbarium also receive many personal requests from colleagues around the world to name ‘just a few specimens’. Kew often also receives large unsolicited parcels of plants, colour slides, and pictures with a note saying ‘Please can you name these; we will be visiting Kew next month for the names’, or (increasingly) ‘Please fax the names by return’. This large and increasing number of requests reflects a rapidly rising level of interest in all things environmental. All botanical researchers need names for their work, either to provide a baseline against which future changes can be judged, or as a key to the existing literature. The most sophisticated database cannot help if there is no name to use as a starting point. All too often it is assumed that names can be provided easily and quickly, and at no cost. How many projects include a budget head ‘for naming’? There is, of course, a broader community of pharmacologists, plant breeders, ethnobotanists, foresters, wood technologists, and administrators of conservation law, to name a few, who need good-quality botanical data and unambiguous names, and are often ready and willing to pay for this information.

But often we, as taxonomists, cannot deliver the

information needed. Different names may be in use for a single taxon in different areas, names may change with time, and large groups may not have been studied for years, so that we may still be dependent on taxonomies developed a century ago on the basis of inadequate material. Such users, not surprisingly, tend to lose patience with us; they are looking for unambiguous names which change as little as possible, and a clear system of linking older synonyms to currently accepted names. These apparently simple requirements must be seen against the reality that taxonomists are, in many countries, a dying breed, despite the fact that the demand for their work is growing. Taxonomy is fundamental to the whole of botanical science, but some biology students hardly see a whole plant in their entire courses, and it is, in Britain, an unfortunate paradox

that in a time of maximal environmental awareness, the number of people who can name plants (and, for that matter, other organisms) is probably declining. On a world scale, although we have been describing new species since Linnaeus in 1753, the Index Kewensis team at Kew make some 6000

entries every year, and there is no evidence that this number is declining. Many workers, unaware of the many inherent defects of the Index Kewensis (particularly its earlier parts), regard it as holy writ. However, we suspect that the 950 000 names in the Index Kewensis may well represent only about 250 000 flowering plant species. This emphasizes the magnitude and the urgency both of a cleaning-up operation on the Index Kewensis and of the need to develop a satisfactory worldwide botanical reference system. However, the most important point is that we have to make our taxonomic knowledge far more easily available to all users. Who are these users, and what do they want? I see a user as anyone who needs a name onto which to hang any other information. How often have you had a horticulturalist come up to you, when they know you are a taxonomist,


and say ‘Why do you keep changing the names? Is it to keep yourselves in a job?’ This kind of question shows that we have to work hard to justify the changes we make, not only in terms understandable to the two or three other specialists in ‘our’ group, but to the wider botanical community. How often are we asked for the distribution of species ‘x’ in the world, and an assessment of its rarity? Often such questions cannot be answered satisfactorily, as when taxa which on the world scale are clearly part of wide-ranging and variable species are maintained as good species in one or more countries for nationalistic reasons. Because we cannot always answer such questions, because we cannot at present provide a basic system that identifies the entities, gives them a name (one name), shows how they relate to one another, and shows what their distributions are, there is a danger that these decisions will be taken out of the hands of taxonomists and placed with lawyers and legislators. The Council of Europe list of threatened plants was based on Flora Europaea, a great work of international scholarship and collaboration, which attempted to provide an overview of the plant species of Europe. What better document could there be for European legislators to use? In retrospect somewhat naively, the Council of Europe sent their preliminary list, based on Flora Europaea, to the governments of Europe as well as to the botanists. The botanists improved our status records, and added and deleted names by using their knowledge of the species on the ground. But then the annotated lists started to arrive from the governments. Why were species 1-10 not on the list? We replied confidently that they were now synonyms of x, y, and z, and so were covered. But this was not good enough for the lawyers. The species protected by legislation were 1-10, not x, y, and z so, therefore, please, to fulfil these users’ needs, the former

names had to be added to the list before there could be any agreement to its publication. There are many more examples in relation to CITES (the Convention on International Trade in Endangered Species). We, the taxonomists, need to make sure that we retain the initiative in

these matters. We must provide a system that is reliable and preferably permanent, but without attempting to impose anything which will stifle further taxonomic research. We must design a world system, and run it for our colleagues and for the non-taxonomic users. We have the knowledge and, thanks to recent and continuing improvements in computer hardware and software, we can draw the threads together. The Index Kewensis gives a good example of what is possible. This project was launched nearly one hundred years ago with money provided by Charles Darwin. Dedicated staff, rarely more than two at a time, have combed the literature quietly and calmly, and one sees the resulting volumes in most botanical institutions. Working through the 20 volumes is a problem, and to celebrate the centenary, all that work will become available on a small CD-ROM disk,


which will immediately make the entire list available to anyone with the necessary equipment to read it. The Index Kewensis already opens many doors to our knowledge; an annually updated CD-ROM disk will make this knowledge even more accessible. That is why we should now begin on one major aid to the needs of our colleagues: the Species Plantarum Project. Its initial stage, a checklist of the world’s flora, must be prepared as an international collaborative venture to show that we, the taxonomic community, can provide what the world needs. We need to start on this NOW. It is a big task, and the sooner it is started, the sooner it will be finished.

3.

US interagency botanical data applications, needs, and the PLANTS database

J. SCOTT PETERSON
National Plant Materials Center, USDA, Soil Conservation Service, Building 509, BARC-East Beltsville, MD 20705, USA

Abstract

The broad array of federal agencies within the United States of America require botanical data of various kinds. These data are used for tasks as varied as land management, plant materials development, germplasm resource management, wetland boundary determination, plant inspection services, and plant species management. Of principal importance are nomenclatural data regarding taxa of North America and the world floras. Additional attributes of importance include distributional, taxonomic, ecological, and growth-form data. At the present time, nine agencies are cooperating with the Soil Conservation Service, US Department of Agriculture to develop the Plant List of Attributes, Nomenclature, Taxonomy, and Symbols (PLANTS), which functions as a conduit for data from the botanical community, part of the INFOSHARE initiative to provide easy access to USDA data.

Interagency survey

With the information age well implemented and a personal computer on each desk, the many US Federal agencies require standardized botanical data. This permits the unhindered exchange of information, and in this time of tight budgets, reduces the duplication of effort. Activities of various US agencies that require botanical data include the following.

© Systematics Association, Special Volume No. 48, ‘Designs for a Global Plant Species Information System’, edited by F.A. Bisby, G.F. Russell, and R.J. Pankhurst, 1993, pp. 13-19. Oxford University Press, Oxford.


1. Soil Conservation Service

Within the Department of Agriculture, the Soil Conservation Service (SCS) works with public and private land users to place soil and water conservation practices on the land. SCS is involved in the development of new cultivars, some from germplasm from outside of the US, the installation of conservation practices, and designing resource management plans. The SCS administers many of their responsibilities at the state level with field offices in each county or county equivalent. Botanical data are utilized in these activities, plus communicating the plans and practices to the land users.

2. Forest Service

Though a national agency, the Forest Service accomplishes most of their responsibilities at the regional level and below. One of their difficulties in the past has been the proliferation of regional lists of botanical nomenclature and symbols that aren’t easily translatable across regions or to other agencies. Some of the activities in which the Forest Service is involved which require botanical data include timber production, wildlife habitat management, forest management, endangered plant species program, watershed management, and recreation development. The Forest Service is a driving force in the upgrading of botanical data for federal agencies.

3. Agricultural Research Service

The Agricultural Research Service oversees many activities that utilize botanical data, including germplasm nomenclatural work for the Germplasm Resources Information Network (GRIN) (USDA, Agricultural Research Service 1990), germplasm maintenance at the National Seed Storage Laboratory in Fort Collins, Colorado, and botanical exploration for new germplasm to be utilized in cultivar development.

4. Animal and Plant Health Inspection Service

Sixteen inspection stations are maintained by the Animal and Plant Health Inspection Service (APHIS) at points of entry into the United States. These inspectors need the latest nomenclature and keys to identify imported plants and plant products. APHIS is very interested in the future of automated keys and the exchange of botanical data with Europe, Latin America, and the Pacific Rim nations. APHIS also regulates agricultural commodities within the United States that move outside of a pest-infested area. Common names are used in dealing with laypersons on the regulation of these plant materials. APHIS also is the agency that promulgates regulations on agricultural commodities.


5. National Park Service

Within the Department of the Interior, the National Park Service (NPS) administers national parks and monuments and is responsible for the preservation of biological diversity native to the parks. The NPS supports research on the lands it administers in order to understand ecosystem functions and identify indicator species to monitor global change and air pollution. Programs to document and monitor the impact on native plants caused by visitor use, grazing, and park management practices are large users of botanical data.

6. Bureau of Land Management

The nation’s public lands are administered by the Bureau of Land Management (BLM). These lands exist largely in the western United States. The BLM is also responsible for large areas of subsurface mineral rights in the eastern United States. Additionally, the BLM manages such diverse activities as those conducted under the Wild Horse and Burro Act. Current botanical nomenclature and plant symbols are required for range trend surveys for use in evaluating the animal carrying capacities of the land. The BLM is also responsible for collecting and evaluating environmental data associated with permitting oil-shale mining and development on public lands.

7. Fish and Wildlife Service

The Fish and Wildlife Service (FWS) is responsible for wildlife habitat development and refuge management as well as endangered species protection. Nomenclatural data are especially critical in this effort, since the recognition of a taxon can determine whether it will be listed and afforded protection. The Fish and Wildlife Service also has joint projects, such as one with the states of Alabama and Mississippi and the Army Corps of Engineers on wetland habitat reconstruction.

8. Army Corps of Engineers

Within the Department of Defense, the Army Corps of Engineers is responsible for maintaining navigable waterways. Also, it is the initial agency to which applications are sent for permission to develop property in ways that impact wetlands. The delineation of wetlands and the review of permit applications require data on current botanical nomenclature, access to specialized keys, and other identification tools.

9. Environmental Protection Agency

The Environmental Protection Agency (EPA) is responsible for promulgating and enforcing laws that address and resolve sensitive environmental issues. These laws encompass such topics as solid waste management, water quality, wetlands protection, toxic waste disposal, pesticide regulation, marine protection, environmental assessment, and air quality. All of these activities require a wide array of information on the members of the plant kingdom. The EPA also performs a considerable role in reviewing the wetland activities resulting from the permits issued by the Army Corps of Engineers. A principal user of taxonomic data is the National Wetlands Inventory (NWI) project, a cooperative development among the Soil Conservation Service, the Army Corps of Engineers, and the Environmental Protection Agency. The NWI project has developed a list of plant taxa known to occur in wetlands in the United States and is currently in the process of automating descriptive data for computer identification.

10. Smithsonian Institution

The staff of the Department of Botany, Smithsonian Institution function as both producers and users of plant taxonomic data. Their research focuses on plant systematics in the broadest sense: taxonomy, nomenclature, investigations on comparative anatomy and morphology, cytology, palynology, phytogeography, ecology, evolutionary theory, and economic botany. Numerous floristic studies are under way, while others are aimed at elucidation of evolutionary development, phylogeny, and the broad questions of classifications. In addition, they maintain the Index Nominum Genericorum, a database of all validly published plant generic names.

United States federal agencies are just beginning to realize the magnitude of their requirements for botanical information. Generalized categories include the following: nomenclature encompassing the world’s botanical literature; vernacular names for communicating the botanical world to the non-scientist program users; plant symbols that are utilized for collecting field data and computer input; generalized ecological data, such as broad habitat categories; biological data for general use again include broad categories such as duration; and geographical data that include political entities such as nation, state, and county of occurrence as well as geographical point data to permit mapping of specific locations. Species descriptive data are also being compiled for entry into software programs, such as MEKA and DELTA, for the implementation of automated keying. At the present time, this information is being assembled by the Fish and Wildlife Service, Soil Conservation Service, Corps of Engineers, and the Environmental Protection Agency to produce automated keys for wetland plants.
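As a rough illustration of how the categories listed above might sit together in one record, the following Python sketch groups them in a single structure. The class name `TaxonRecord`, its field names, and all sample values (including the made-up taxon name and symbol) are assumptions for illustration only; they are not the layout used by any agency or by PLANTS.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class TaxonRecord:
    """Hypothetical record holding the generalized data categories for one taxon."""
    scientific_name: str                       # nomenclature
    symbol: str                                # plant symbol for field data and computer input
    vernacular_names: List[str] = field(default_factory=list)
    habitat_category: Optional[str] = None     # generalized ecological data
    duration: Optional[str] = None             # biological data, e.g. "annual", "perennial"
    # Political entities of occurrence as (nation, state, county).
    occurrences: List[Tuple[str, str, str]] = field(default_factory=list)
    # Geographical point data as (latitude, longitude), to permit mapping.
    points: List[Tuple[float, float]] = field(default_factory=list)

    def states(self) -> List[str]:
        """Return the sorted, de-duplicated states in which the taxon is recorded."""
        return sorted({state for _, state, _ in self.occurrences})


# Made-up sample data; "Examplea demonstrata" is not a real taxon.
record = TaxonRecord(
    scientific_name="Examplea demonstrata",
    symbol="EXDE",
    vernacular_names=["example weed"],
    habitat_category="wetland",
    duration="perennial",
    occurrences=[
        ("USA", "Maryland", "Prince George's"),
        ("USA", "Alabama", "Mobile"),
        ("USA", "Maryland", "Howard"),
    ],
)
print(record.states())
```

Keeping the political entities and the point data in separate fields mirrors the distinction the text draws between county-level occurrence and mappable locations.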

The PLANTS database

The National List of Scientific Plant Names (USDA, Soil Conservation Service 1982) was the first list that became a de facto federal standard. It began its official existence in 1971 and developed into a two-volume SCS publication produced by the Smithsonian Institution. That product has long since been out of date and was not easy to use. Due to the growth of databases requiring correct botanical data with which to link land management data, the Soil Conservation Service initiated a permanent project to revise and upgrade the list and maintain it over the long term. This project, entitled the Plant List of Attributes, Nomenclature, Taxonomy, and Symbols (PLANTS) is being developed in cooperation with other agencies and private/public organizations. PLANTS has been envisioned as a conduit for botanical data from the botanical community to the federal agencies (Peterson et al. 1993). In the broad sense, the botanical community consists of botanical specialists and researchers, the extensive literature, and hundreds of herbaria. Federal users range in scope from customs inspectors at international airports and border crossings to botanists conducting range studies on lichens on the Alaska North Slope. The goal was to have a dial-up database service established in 18 months that contained the nomenclature and plant symbols for the vascular taxa within the United States. This was accomplished as a result of major funding from the agencies noted below by an asterisk. Currently, numerous United States federal agencies are co-operating to establish common goals, accomplish the long-needed task, and reduce the duplication of effort: developing a database that will serve as their conduit for botanical nomenclature, plant symbols, and taxon attributes.

Agricultural Research Service
Animal and Plant Health Inspection Service*
Army Corps of Engineers
Bureau of Land Management*
Environmental Protection Agency
Fish and Wildlife Service
Forest Service*
National Park Service
Smithsonian Institution
Soil Conservation Service*
USDA, National Computer Center*

A number of organizations from outside of the federal government are cooperating in this project. The Biota of North America Project (BONAP) has its headquarters at the University of North Carolina, North Carolina Botanical Garden and is serving as the foundation data-source for the nomenclature utilized in PLANTS. The data furnished by BONAP were derived from the literature with review by over 600 specialists knowledgeable about the vascular flora of North America. A Memorandum of Understanding has been developed between the SCS and the Flora of North America (FNA) project located at the Missouri Botanical Garden. This memorandum will underlie co-operation between the FNA and the agencies well into the next century. This effort will ensure that the products of the FNA are integrated into the activities of a major user and provide an avenue for co-operative federal agency/FNA projects. Additional agreements have been developed with organizations such as the Rocky Mountain Herbarium at the University of Wyoming, the Southeastern Regional Flora Information System at the University of Alabama, the Intermountain Herbarium at Utah State University, and the University of Nebraska. Many other agreements follow these pilot efforts. The pilot projects have been initiated to address the distributional data at the county level and relate each entry to one or more specimens in recognized herbaria. The PLANTS project also contains another important facet of the plant kingdom: non-vascular plants. The agencies have been slow to recognize their importance, but that is changing. For example, in Alaska, lichens are being given the recognition they deserve by land managers and even utilized as plant materials.

Conclusion

Most US federal agencies utilize botanical data for land management, regulation, monitoring, or related data accumulation. Considering the importance of botanical information to most US federal agencies, a global plant species information system is of urgent necessity for decision-makers to provide sound direction to today’s activities. A global view is even more important when one considers the impacts of our actions today that transcend political boundaries. This is exemplified by the view from the space shuttle that looks down upon all of us closely associated and interdependent neighbours.

References

Peterson, J.S., Dodd, M.L., and Rieux, M. (1993). PLANTS database user’s guide, version 2.0. USDA, Soil Conservation Service, Beltsville, Maryland.

USDA, Agricultural Research Service (1990). Data dictionary: Germplasm resources information network (GRIN). Prepared by Database Management Unit, Germplasm Services Laboratory, Plant Sciences Institute, Beltsville, Maryland.

USDA, Soil Conservation Service (1982). National list of scientific plant names. Volume 1: List of plant names. Volume 2: Synonymy, SCS-TP159. US Government Printing Office, Washington, DC.

4.

NAPRALERT: problems and achievements in the field of natural products

CHARLOTTE GYLLENHAAL, MARY LOU QUINN, DJAJA D. SOEJARTO, and NORMAN R. FARNSWORTH
Program for Collaborative Research in the Pharmaceutical Sciences, University of Illinois at Chicago, Box 6998, Chicago, IL 60680, USA

Abstract

NAPRALERT (Natural Products ALERT) is a specialized database which contains information derived from the literature on medicinal folklore, biological activity, and the chemistry of organisms of potential or actual pharmaceutical interest: plants make up about 75 per cent of the organisms included. The database is relational in design and is managed under the INGRES database management system. Taxonomic information on the organisms covered is stored in a fairly detailed manner. All names are checked against standard taxonomic references before being entered in the database, in an attempt to resolve, as far as possible, the problems inherent in taking names from the natural products literature. In the instance of synonymy all names involved are edited into a single accepted name; the synonyms are stored in an auxiliary table. Users of the database are mostly in the areas of medicinal plant research or drug development, in the herbal medicine industry or in the flavours and fragrance industry. Several standard search requests are used, the most popular being the ‘three-part profile’ which includes all available ethnobotanical, biological activity, and chemistry information on one organism. The system is now on-line but still offers the off-line service. Scientists in developing countries are provided NAPRALERT services free of charge through an arrangement with WHO (World Health Organization). While many difficulties are encountered in collecting the taxonomic and nomenclatural data for species listed, we do have a well-functioning database that meets the needs of our users. The authority of a globally inclusive plant species information system would be of major assistance in rationalizing and updating our taxonomic schemes, and in increasing the efficiency and accuracy of our work in selecting the names for organisms cited in our database.

© Systematics Association, Special Volume No. 48, ‘Designs for a Global Plant Species Information System’, edited by F.A. Bisby, G.F. Russell, and R.J. Pankhurst, 1993, pp. 20-7. Oxford University Press, Oxford.

Introduction

NAPRALERT is a database containing information from the literature on natural products: it covers the chemistry, medicinal folklore, and biological activities of plants and animals from a drug discovery perspective. The database was started in the mid-1970s by Dr Norman Farnsworth at the University of Illinois at Chicago, who has overseen its growth and development since that time (Farnsworth 1983a, b, 1985; Farnsworth and Loub 1983; Farnsworth et al. 1981; Loub et al. 1985; Soejarto et al. 1989). The structure, data acquisition, and data verification procedures employed in the database are all oriented towards the needs of its end-users, whose chief concerns are the development of new drugs from natural sources, the chemistry of natural products, and the validation of traditional medicinal uses of plants. The acronym NAPRALERT stands for NAtural PRoducts ALERT and

summarizes the aim of the database: to alert the user to as wide a range of information as possible in the natural products literature so that he or she may possess some basic guidance in approaching this burgeoning field. The vast extent of the literature in this field makes a completely comprehensive database impractical: it is beyond our resources, for example, to cover every clinical study of established drugs from natural sources, such as quinine and vincaleukoblastine.

We aim, rather, at a broad coverage of the chemistry and pharmacologically interesting properties of plant and animal products. NAPRALERT is not a literature abstracting service. It is a relational database containing actual data derived from the literature we survey, in the form of tables. At this time, we enter information from approximately 600 journal articles or other reference types per month. Initially, some 700 journals are perused by Ph.D. scientists trained in chemistry, pharmacognosy, and pharmacology. By going through the tables of contents of each journal issue and by scanning other abstracting services (including, for instance, Chemical Abstracts and Medicinal and Aromatic Plants Abstracts), appropriate titles are found, the papers requested from their authors, and put into our system. In cases in which articles prove to be difficult to obtain, essential data may be temporarily entered


from an abstract and the information flagged to indicate its source. Following literature selection, the reference goes through a fairly complex abstracting process, referred to as ‘coding’, since much of the information is stored in the database in coded form. Specialists in medicinal folklore, pharmacology, chemistry, and taxonomy manually transfer relevant data from the reference to data sheets and perform data verification procedures. The completed and verified data sheets are then entered into the computer by a team of data-entry specialists. Over 90000 articles or books are now referenced in NAPRALERT. Some 87 000 pure chemical compounds are listed. About 38 000 scientific names (Latin binomials) of plants and animals are in the database; of these about 75 per cent are higher plants. We also maintain a file of synonyms, currently containing over 4600 records. There are approximately 426000 records associating plants or compounds with biological activities such as antimicrobial activity, sedative effect, and enzyme inhibition or stimulation.

NAPRALERT now resides on a Vax Station 3600 computer and is managed under INGRES (Release 5.0, currently converting to Release 6), an SQL-based relational database management system.

Data structure

The conceptual model underlying the NAPRALERT database is based on the premise that articles or books in the natural products literature contain information about organisms from which chemical compounds may have been isolated. The organisms, or compounds isolated from them, may have biological activities. We enter information into the database on the articles, the organisms, the compounds, and the biological activities. The general structure of the database tables can be summarized as shown in Fig. 4.1. BIB is the bibliographic information. Each reference entered in the database is assigned a unique alphanumeric code referred to as the citation number. This citation number serves as the primary key in the database and is used to relate the various data tables to each other. ORG in Fig. 4.1 is the information on the organism or organisms found in a particular literature reference, and is cross-referenced to BIB by means of the citation number. Each organism in a reference is sequentially assigned a number referred to as the ‘onum’ or organism number, which serves as a secondary key to correlate the organism with biological activities or compounds in the same article. ACT is the information on biological activity pertaining to the organism and CPD is the information on compounds. In this table appear both compounds whose isolation from a particular organism is discussed in the article, and compounds known to be derived from a particular natural product which are discussed in an article simply as pure compounds, without reference to the organism from which they were isolated. ORG, ACT, and CPD are linked with each other by the worktypes, WTY, which indicate the type of experiment (for example, isolation of a compound from a plant, biological activity of a plant, biological activity of a pure compound, medicinal folklore, biosynthetic analysis) or compound identification methods (HPLC, IR, UV, NMR, etc.). In order to save disk space, much, but not all, of the information is entered and stored in the database tables in coded form.

Fig. 4.1. NAPRALERT data structure: interrelationship of various data types.

The actual table structure of the database is, of course, more complicated than what is shown in Fig. 4.1. There are 14 tables in the main database, plus several accessory tables including tables of taxonomic and chemical synonyms. The database tables are accessed only by individuals responsible for managing the database. Users either perform standard ‘canned’ queries or access views of the database in which the table structure is simplified. Views are extracts or combinations of tables which perform as database tables for retrieval purposes but do not permit direct user access to stored data. A discussion of the views serves to summarize in a fairly comprehensible manner exactly what types of information are included in NAPRALERT without entering into the details of how the database is managed.

1. Contents of view and tables

The bibliographic view contains the citation number of the reference, the name of the journal or book referenced, the volume, issue, first and last pages, year of publication, language, article type (research article, scientific review paper, letter to the editor, etc.), abstract information if the data are taken from an abstract, the article title, the author or authors of the reference, and the date of entry in the computer.
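The relationships described in this section (a citation number tying BIB to the other tables, an organism number tying ORG to ACT and CPD, a worktype on each activity or compound record, an accessory synonym table, and a read-only view over the joined tables) can be sketched in a few lines. The sketch below uses Python with the standard `sqlite3` module rather than INGRES, and every table layout, column name, code value, and sample row in it is a hypothetical simplification; the real database has 14 main tables whose structure is not given here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical, much-reduced versions of BIB, ORG, ACT, CPD and a synonym table.
conn.executescript("""
CREATE TABLE bib (citation TEXT PRIMARY KEY, journal TEXT, year INTEGER);
CREATE TABLE org (citation TEXT, onum INTEGER, name TEXT,
                  PRIMARY KEY (citation, onum));
CREATE TABLE act (citation TEXT, onum INTEGER, wty TEXT, activity TEXT);
CREATE TABLE cpd (citation TEXT, onum INTEGER, wty TEXT, compound TEXT);
CREATE TABLE synonym (synonym TEXT PRIMARY KEY, accepted TEXT);
CREATE VIEW activity_view AS
    SELECT b.citation, b.year, o.name, a.activity
    FROM bib AS b
    JOIN org AS o ON o.citation = b.citation
    JOIN act AS a ON a.citation = o.citation AND a.onum = o.onum;
""")

# Made-up sample rows; the names are not real taxa or compounds.
conn.execute("INSERT INTO bib VALUES (?, ?, ?)",
             ("X00001", "Hypothetical Journal", 1989))
conn.executemany("INSERT INTO org VALUES (?, ?, ?)",
                 [("X00001", 1, "Examplea alba"),
                  ("X00001", 2, "Examplea nigra")])
conn.executemany("INSERT INTO act VALUES (?, ?, ?, ?)",
                 [("X00001", 1, "plant activity", "antimicrobial activity"),
                  ("X00001", 2, "folklore", "sedative effect")])
# A compound discussed as a pure compound carries no organism number.
conn.execute("INSERT INTO cpd VALUES (?, ?, ?, ?)",
             ("X00001", None, "pure compound", "examplin"))
conn.execute("INSERT INTO synonym VALUES (?, ?)",
             ("Examplea candida", "Examplea alba"))


def accepted_name(name):
    """Map a synonym to its accepted name; other names pass through unchanged."""
    row = conn.execute(
        "SELECT accepted FROM synonym WHERE synonym = ?", (name,)
    ).fetchone()
    return row[0] if row else name


# Users query the view, not the underlying tables.
rows = conn.execute(
    "SELECT name, activity FROM activity_view WHERE citation = ? ORDER BY name",
    ("X00001",),
).fetchall()
print(rows)
print(accepted_name("Examplea candida"))
```

The join on both the citation number and the organism number is what keeps an activity attached to the right organism when one article reports on several.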


The pharmacological/biological activity view contains three key fields: the citation number, the organism number (which relates the activity to the particular organism in the reference from which the data were taken), and the pharmacology record number (a sequentially assigned identifier given to each of the activities pertaining to the organism). These three fields are linked to the ‘mpa/spa’ code or major pharmacological activity/specific pharmacological activity code. This code designates first the overall type of activity, such as nervous system effect, fertility regulation effect, or chemotherapeutic effect, followed by the specific activity, for example, thymidine uptake inhibition, histamine release inhibition, or galactogogue effect. There are over 2000 such codes used in the NAPRALERT system, allowing users a great deal of flexibility and specificity in tailoring the types of searches they perform. Data connected to the ‘mpa/spa’ code are the extract type, such as the mode of administration, the animal dosed in the experiment, its sex, the frequency of dosing, the dosage type (concentration, LD

[Figure: screen for the IDENTIFY COMMAND with entry fields ORDER, FAMILY, SUBFAMILY, GENUS, SPECIES, DISTRIBUTION, etc., and ROOT TYPE, SHOOT HABIT, SHOOT HEIGHT, LEAF SIZE, PETIOLES, CUTICLE, etc.; example entries shown are potamogetonaceae and perfoliate.]

Fig. 20.3. A possible screen design for the IDENTIFY application program.

M. Everard

be distributed along with selected application programs through a variety of media (floppy disk, CD-ROM, tape, downloading through a network) to enable remote customers to access data from a computer system at their own location. If the customer were able to download through a network, the principal disadvantage of keeping the local copy up to date would be circumvented. Were the database selected to be of a type with versions available on a variety of operating systems, customers would be relieved of the necessity of purchasing a specific type of computer, provided that the system that they currently owned could run compatible database software. Application programs designed to make specific queries of the database would also be portable. SQL supports this requirement.

(c) Postal or telephone enquiries

Queries by post or telephone could be handled by selected members of the database administration team, granted either on-line access or access to a local copy of the database. The database administration team member(s) assigned to answer these queries would select information from the database to meet requestors’ needs. A precedent for this type of database use has been established by the Monitoring and Analysis Research Centre (MARC), who answer queries routed from the Nairobi node of GEMS.

(d) Publications and reports

Publications and reports may be produced in a professional manner using spreadsheet or publishing software to process database output. These data presentation facilities could be connected to the suggested data presentation interface. When the system is fully operational, database administrators may, for example, be able to present a response to a postal query in the form of a species list for a given geographical location, as a pie-chart, or as artwork, or as a cartographic representation.
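As an illustration of an application program that makes one specific query, the species list for a given geographical location might be produced by a small canned query. The sketch below uses Python's sqlite3 module; the two tables, their columns, and the region labels are hypothetical, and the distribution rows are invented purely to exercise the query.

```python
import sqlite3

def species_list(conn, region):
    """Canned query: names of species recorded for one region, in name order."""
    sql = """
        SELECT s.name
        FROM species AS s
        JOIN distribution AS d ON d.species_id = s.species_id
        WHERE d.region = ?
        ORDER BY s.name
    """
    return [row[0] for row in conn.execute(sql, (region,))]

# Hypothetical schema and sample rows, for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE species (species_id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE distribution (species_id INTEGER, region TEXT);
""")
conn.executemany(
    "INSERT INTO species VALUES (?, ?)",
    [(1, "Potamogeton perfoliatus"), (2, "Potamogeton natans"), (3, "Quercus robur")],
)
conn.executemany(
    "INSERT INTO distribution VALUES (?, ?)",
    [(1, "Region A"), (2, "Region A"), (3, "Region B")],
)

region_a = species_list(conn, "Region A")
print(region_a)
```

Because the query text is ordinary SQL, the same function could in principle be pointed at a local copy or at the host database, which is the portability the text has in mind.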

3. Database administration and editing

The role of a database administrator (DBA) is to manage security, updating, interconnections, and all services pertaining to the database complex. DBAs would comprise a core team of taxonomic experts and administrative specialists sharing the administrative role according to their specialities. Principal editors share a central role with DBAs. Indeed they may very well also be DBAs. They comprise a selected group of taxonomists concerned with developing the technical content of the database. Principal editors would also co-ordinate work carried out by guest editors. Guest editors fall into two categories: those editing the host database and those editing an off-line copy of selected data (either in hard- or softcopy form). Some form of security system would clearly be a necessity to

Design of global plant taxonomy database

grant guest editors working on the host database temporary access to restricted database tables. Revised files or marked-up drafts returned by guest editors performing off-line reviews can be incorporated into the master database by principal editors or DBAs. The same principle will also apply for review comments submitted by electronic mail, phone, word of mouth, the postal system, or other means. While we should expect DBAs and principal editors to have database programming skills, some form of interface technology such as an expert system could be developed to assist guest editors to enter data into the database without having to learn the intricacies of database or data structures. DBAs and principal editors would need permanent read/write access to the entire database, and should ideally be literate in the selected database control software. However, the emphasis should principally be on taxonomic skills, and the simplicity of operation of SQL implementations recommends them for this purpose. Many SQL implementations also allow administrators to add new tables, add new columns, and implement new application programs at any time without reorganizing data or recompiling existing application programs. Some SQL implementations also have automatic data back-up and recoverability functions, and enforce their own referential constraints: deletion, insertion, or updating must be allowed only when related data in the database are also updated, either manually or automatically. The automation of potentially complex administrative functions relieves the DBA team from too great a dependency upon programming skills. A security system could be used in conjunction to allow DBAs and principal editors to make private annotations in the database visible only to the DBA/editorial community, for example when considering levels of confidence in data. Many SQL implementations, most operating systems, and external security program products offer provision for security control. If the editorial team reaches a level of activity where multiple concurrent updating of data presents a risk to data integrity, a transaction processing system (TPS) may be useful. There is a discussion of the role of TPS technology later in this chapter.

4. Communication with other databases

Intercommunications with other databases are likely to play an important role in the global plant species information system for the following reasons: the proposed information system may itself be a distributed network of database systems, communication with related database systems is desirable, and coexistence with existing databases may allow access to the data they contain before migration is possible.

(a) Communication with related databases

Given the generic design of this global plant database, it is possible to envisage a family of related databases containing information about, for example, mosses, mites, or mammals. Pellew and Harrison (1988) recognize also that, for cartographic data, a central database might request data from a biodiversity database at a national level, if that national database contains data at a finer degree of resolution and if that finer resolution is required. Two-way intercommunication between the global plant species information system and, for example, the GEMS database may also be mutually advantageous. Indeed, many current and planned databases could eventually become the components of a much larger global biota information system.

(b) Coexistence

In the interests of rapid productivity, the global plant species database may become operational before it is fully implemented. For this reason, the core database may require a dialogue with existing plant species databases from which source data will eventually be migrated.

5. Migration of data

If a new database is set up, it is probable that much of its contents can be migrated from existing databases. One of the requirements of the database implementation chosen is that migration of data from a variety of sources should be easy. Some SQL implementations and related software tools allow migration of data from a variety of different formats. It is incumbent upon the design team to identify where data should be migrated into the host database, and where communications should be established with an existing database implementation.

6. Business considerations

Funding is one of the major hurdles that must be overcome for a global plant species information system to become a reality. A close relationship with an international organization may have advantages in attracting funding and source data. Project ownership by a national government or by a commercial corporation might lead to concerns amongst potential financial or information donors about the political interests of a host nation, or the commercial interests of a host company. Pellew and Harrison (1988) estimate that at least 50 per cent of the funding for an international database project must be free from controls, as sponsors often hold unrealistic expectations of the time taken for such projects to show material returns, and the data themselves are a financial liability as they are dynamic and require constant updating. At the same time, a quick payback in terms of database operation is an incentive to customers to use the database, to attract further financial support, and to stimulate the donation of data. Planning for financial support is beyond the scope of this paper. Nevertheless, provision must be made in any planned database system for charging selected categories of customers, as retrospective support for such features could prove costly to design and implement. It is also assumed that selected customers (for example schools, governments, or international agencies such as the United Nations) would be able to make free use of the database system. Therefore, there is a requirement to provide a graded charging system. The following considerations ensure that system design includes contingency planning for possible charging strategies.

(a) On-line access

Charging for online access to data could be operated on a license or an auditing basis. Upon payment of the appropriate license fee, the DBA would give the licensee authorization to read the whole database, or selected views (depending on license type). Read access may be granted for only a limited period, renewable on receipt of a subscription fee. The auditing strategy requires that the information system runs an auditing facility to record all data queries instituted by the customer. A periodic bill for services could then be sent to the account holder. Many operating systems, transaction processing systems, and SQL implementations contain auditing facilities. Theoretically, it is also possible to implement a comprehensive charging policy for each of the co-operating databases that the customer’s request demands access to. The details of this are clearly for consideration later in the development cycle of the proposed information system.

(b) Access to local copies of the database

Access to local copies of the database could be controlled by subscription. Upon receipt of the customer’s subscription, a copy of the current level of database (or an appropriate subset of it) could be sent to the subscriber.
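The auditing strategy for on-line access, combined with a graded charging system, can be sketched in a few lines of Python. This is only an illustration of the principle: the customer categories follow the text, but the rates, account names, and function names are hypothetical, and a real system would rely on the auditing facility of the operating system, transaction processing system, or SQL implementation. Each account is assumed to belong to a single category.

```python
from collections import Counter

# Hypothetical tariff: charge per audited query for each customer category.
# The rates are placeholders, not figures from the chapter.
RATE_PER_QUERY = {"school": 0.0, "government": 0.0, "commercial": 0.50}

audit_log = []  # one (account, category) record per query made

def record_query(account, category):
    """Audit facility: note that this account has instituted a query."""
    audit_log.append((account, category))

def periodic_bills(log):
    """Turn the audit log into one bill per account for the billing period."""
    counts = Counter(log)
    return {
        account: n * RATE_PER_QUERY[category]
        for (account, category), n in counts.items()
    }

for _ in range(3):
    record_query("example-company", "commercial")
record_query("example-school", "school")

bills = periodic_bills(audit_log)
print(bills)
```

Categories entitled to free use simply carry a zero rate, so the same audit trail serves both charged and uncharged customers.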

(c) Postal and telephone queries

Postal and phoned enquiries could be billed to customers holding accounts with the database administration.

(d) Printed output

Printed output is easy to charge for. Books or periodical articles produced on the system could be sold commercially, while customized publications built to order could be billed in the same manner as postal queries.

(e) Data origin

When planning to charge database customers, careful consideration must be given to data ownership. If data are migrated from existing databases into the proposed global plant species information system, it would not be ethical for the controlling organization of the new information system to charge other customers for the data themselves. Charging for the data per se may in itself be illegal in some countries. One has instead to charge database customers for data retrieval and presentation services. A precedent for billing for search and retrieval of information, while not charging for the information itself, is set by the Hydrographic Office of the United Kingdom’s Ministry of Defence (MOD). It is the policy of the Hydrographic Office to charge a ‘search fee’, covering the cost of staff time in looking for, and printing, the information required (plus postage and tax). The International Union for Conservation of Nature and Natural Resources (IUCN) also maintains a policy of not selling data (Pellew and Harrison 1988), although it does market an information service based on the analysis and interpretation of the data. Extending this policy, a sliding scale of charges could be levied on global plant species information system customers depending upon the complexity of output required, time spent by database staff in preparing output, computer processing time taken, interpretations of data, and so forth. The database design team should also consider provision of a facility for giving recognition to the scientist or body responsible for data collection.

7. The role of transaction processing systems

An increasing community of both customers and editorial staff will be expected to make use of the global plant species information system. Therefore, as development continues, data integrity will be increasingly at risk and users will increasingly compete for computing resources. Transaction processing systems offer a solution to these problems. They form what may be considered a second layer to the operating system, scheduling and prioritizing transactions, and protecting data integrity during multiple concurrent data handling operations. Advanced transaction processing systems like IBM’s Customer Information Control System (CICS) also have a variety of built-in features that may significantly improve system integrity and function; for example, CICS handles communication with a variety of types of connected devices and protocols (including two-way ASCII to EBCDIC data conversion for data communication between workstations and host systems). The CICS application programming interface (API) is also an industry standard: simple to use, available across many different operating system platforms, and with a wide skill base.

8. Glossary

To allow editorial staff and customers to make the most efficient use of the information system, the design team may wish to implement a glossary. If the glossary were invoked when the user hits the HELP key, and contained both term definitions and a list of permitted variables for data fields, then system usability would be significantly enhanced.
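A glossary of this kind could be held as two simple look-up structures, one for term definitions and one for the entries permitted in each data field. The following Python sketch is an assumption about how such a HELP facility might behave, not a description of any existing system; the field name, its permitted entries, and the function names are hypothetical.

```python
# Hypothetical glossary content; real entries would be supplied by the editors.
TERM_DEFINITIONS = {
    "perfoliate": (
        "Of a leaf: with the base surrounding the stem, "
        "so that the stem appears to pass through it."
    ),
}
PERMITTED_VALUES = {
    "petioles": ["present", "absent"],
}

def help_text(field=None, term=None):
    """Return what a HELP key press might display for a term or a data field."""
    if term is not None:
        return TERM_DEFINITIONS.get(term.lower(), "No definition recorded.")
    if field is not None:
        values = PERMITTED_VALUES.get(field.lower())
        if values is None:
            return "No permitted values recorded."
        return "Permitted values: " + ", ".join(values)
    return "Select a field or a term."

def is_permitted(field, value):
    """Check an entry against the glossary; fields without a list accept anything."""
    values = PERMITTED_VALUES.get(field.lower())
    return values is None or value.lower() in values

print(help_text(field="PETIOLES"))
print(is_permitted("petioles", "Absent"))
```

Keeping the permitted entries in the glossary means the same list can both inform the user and validate what is typed.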

9. Database internals

Detailed comment on the internal design of the database is beyond the scope of this chapter. Bisby (1984b) and Pankhurst (1988) have already summarized many of the difficulties relating both to data structure and database structure in taxonomic systems. However, I would emphasize the value of assembling a multidisciplinary working group to address design. Taxonomists obviously must be represented to address purely taxonomic problems, for example how to add a grex to the database, given that the species does not occur in nature, belongs to a man-made genus, and has a different naming syntax to naturally occurring species. The provision of a backbone list of plant species is obviously a prime requirement from taxonomists, as species names are the key by which most data customers will gain access to further information. Database administrators should also participate in the working group to help define administrative requirements, to decide upon where to centralize data and when to distribute a query to a connected database, adopting the ‘one-stop shopping’ approach to addressing multiple databases as implemented in the Microbial Strain Data Network (Kirsop, Chapter 21 this volume). Representatives from the administrative teams of other databases with which the proposed global plant species information system may co-operate would also be able to make valuable contributions to the working group on this matter. Potential customers should also participate in the design phase to advise upon customer requirements both of the data and of presentation facilities. I would also strongly recommend that members from the computer industry be invited to participate in the working group to advise on interface design, performance optimization, usability, and other areas of software design expertise, as discussed in the introduction.
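One way to picture the backbone list acting as the key to further information is as a parent table to which other tables refer, with the database itself enforcing the referential constraints mentioned earlier under database administration. The sketch below uses Python's sqlite3 module; the schema and sample rows are hypothetical and the choice of SQLite is an assumption made only to keep the example self-contained.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite enforces foreign keys only when this setting is switched on.
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
    -- Hypothetical backbone list: one row per species name.
    CREATE TABLE species (
        species_id INTEGER PRIMARY KEY,
        name       TEXT NOT NULL UNIQUE
    );
    -- Further information hangs off the backbone through species_id.
    CREATE TABLE distribution (
        species_id INTEGER NOT NULL REFERENCES species(species_id),
        region     TEXT NOT NULL
    );
""")
conn.execute("INSERT INTO species VALUES (1, 'Potamogeton perfoliatus')")
conn.execute("INSERT INTO distribution VALUES (1, 'Region A')")

def try_sql(sql):
    """Run one statement and report whether the database accepted it."""
    try:
        conn.execute(sql)
        return "accepted"
    except sqlite3.IntegrityError:
        return "rejected"

# A record for a species missing from the backbone list is refused.
orphan_insert = try_sql("INSERT INTO distribution VALUES (99, 'Region B')")
# A backbone entry cannot be deleted while other data still refer to it.
parent_delete = try_sql("DELETE FROM species WHERE species_id = 1")

print(orphan_insert, parent_delete)
```

How a grex or other name with unusual syntax would fit into such a backbone table is exactly the kind of question the working group would need to settle; the sketch does not attempt it.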
I will provide three examples where consultation with the computer industry may be beneficial during system design: national language support, geographical information, and standardized environmental data architecture. Retrospective support for these types of features is expensive and complex to implement. Although provision of these or other features is unlikely during design of the first iteration of the global plant species information system, the cost of later implementation is minimized by provisions in the initial design.

(a) National language support (NLS)

NLS enablement allows customers to communicate with the information system in their native language. Software designers within the computer industry will be aware of customer expectations for this type of feature, and have the skills to advise on the best ways to implement them with minimum impact upon system performance.

(b) Geographical information

Should geographical distribution be plotted against relatively ephemeral country borders? If so, how meaningful are the data? A trained geographical information system (GIS) programmer would not only have the knowledge to work with taxonomists to determine the best format for geographical data, but would also be able to ensure that it is compatible with industry-standard GIS display technology. They may also be aware of related geographical projects (such as the current Royal Geographical Society ground-truthing working party) and have recommendations for compatibility of data structure between these projects.

(c) Standardized environmental data architecture

With the computer industry entering the growing market for environmental databases, a member of the working group from within the computer industry would be able to advise upon proposed industry-standard environmental architectures, thus ensuring compatibility.

10. Designing for system evolution

The database system must have the flexibility to adapt, not only to the growing needs of its customer set but also to revisions in taxonomy. The need to plan for system evolution is inherent in the multiplicity of data types (text, biogeography, video, etc.), a growing customer community, and the probable need to add new data fields to the database during operation (for example, adding a new table to the database listing the cultivars of many, but not all, of the plant species). It is highly probable that the database will be in use long before the complete database complex has been implemented, or even fully designed. Therefore, flexibility is essential to allow progressive enhancements to a core operational information system. Flexibility and usability can best be provided for by taking a modular approach to design, as modularity allows for a logical stepwise progression in information system development, each module adding extra function to a working core. The database can then grow easily with customer needs (for example, we may wish to add tables to support the needs of herbalists), to exploit developments in computer technology (such as multimedia), and as funding permits further development. The modular approach also assists in project control by allowing planning towards a series of cumulative ‘stepping stone’ objectives. Consequent advantages are the attraction of units of funding for each well defined objective, a quick payback of results to maintain the interest of the funding body and to stimulate customers to use the system (this may also stimulate donation of data), and harnessing the energies of small but currently disparate projects towards the overall goal of an integrated information system. I would make a strong recommendation that the proposed design working group maintain their activities beyond the initial design phase. They should meet in the role of a design steering committee to ensure continued co-ordination and technical direction.

A summary of requirements and recommendations

In this section, I intend to summarize all the requirements identified whilst considering the needs of the database customer and editorial communities, and to make recommendations.

1. The database

An SQL relational database implementation appears to be best suited to the needs identified as it is a mature technology with a wide skill base. There are many implementations of SQL which span a variety of operating systems, allowing portability of data and applications. They handle a variety of data types (numeric, character, date, etc.), and many implementations already have recoverability, restart, and referential constraint features built-in. Given that SQL is an industry standard, it will not become redundant but will evolve with new technology. SQL is also a relatively simple language, not requiring the attentions of a highly skilled programmer. Some SQL implementations contain auditing and security facilities, necessary for billing customers and controlling database access.

2. The host computing environment

The host computing environment must support a diversity of data presentation facilities. High-quality graphics display terminals may be essential for many customers, and a range of compatible data presentation software such as GIS, spreadsheet, graphical presentation facilities, and a publishing system would add value to the database. A rich communications environment would assist data accessibility and portability; transaction processing software may provide a convenient means of supplying this. Whilst some SQL implementations supply an auditing facility (identified as a possible method of charging online customers), operating system or transaction processing system auditing facilities could equally be used. If the project starts on a low budget, a system based on a personal computer (PC) would be the cheapest option. However, although computing power and storage in small systems increases continually, a PC solution would currently be less than ideal. The main reason for this is system performance, although data communication and storage constraints could also pose problems. If necessary, additional PCs could be connected through a local area network (LAN) to act as servers for storage-intensive data such as video or graphics. In the event of more substantial funding, the PC solution may still remain tenable as a starting point for building a prototype system, and for maintaining an application program test system. Modern PCs such as IBM’s PS/2 range running the OS/2 operating system contain all the features identified as prerequisites. This system will also run an implementation of the CICS transaction processing system, CICS OS/2. As stated previously, a mid-range computer system running a suitable operating system, for example IBM’s VM or one of the ‘open’ operating systems, would be a better though more expensive starting point. It would offer a greater storage capacity (particularly if optical storage devices are employed), a markedly improved system performance, and a richer communications environment. Security, already identified as important in restricting customer and guest editor access to all or part of the database, is also significantly enhanced in mid-range systems. IBM’s SQL/DS implementation of SQL on the VM operating system supports all the prerequisites identified in this chapter, and also has a sophisticated data migration facility which could facilitate migration of source data held in various formats in different databases. Security can be handled either by database software provided with security features, by the operating system security, or by an external security manager such as IBM’s RACF program. It should be borne in mind that there is no necessity for a dedicated computer system. Sharing a computer with other projects would certainly lower the start-up cost threshold considerably, allowing cheaper access to a more sophisticated computer. It would also ensure that trained computer staff are at the host location, and also provide facilities to run a duplicate system for database back-up, application testing and development, prototyping, and keeping private editorial notes on.

3. Usability features

Throughout this chapter I have made several references to the desirability of a high degree of system usability. Many usability enhancement features are commonplace in commercial software but may be too complex or expensive to implement on early iterations of the proposed global plant species information system. However, they should be addressed at the initial design phase to allow for later enablement as and when funding and time permit. Initial interface design for subsequent enablement is crucial for future project management and cost, as the technical and financial implications of retrospective unplanned functional upgrades are substantial. I have also indicated that, provided editorial and data presentation interfaces are clearly defined, it would be possible to utilize expert system technology to insulate customers and editors still further from the complexities of the database. Expert systems technology is still emerging; however, the possible advantages that it can offer to biology are significant.

4. Further steps

I strongly recommend that a multidisciplined working group consisting of taxonomists, database administrators, selected customers, and the computing industry, be constituted to undertake the detailed design phase of the information system. Data structure, database structure, and interfaces must be clearly defined to meet the needs of all parties, but also to cater for subsequent evolution. A modular design is recommended. This would also allow a phased series of objectives to be set as development goals, to act as nuclei for the attraction of units of funding, and to aid differentiation into discrete projects that may be undertaken by taxonomic teams throughout the world. The multidisciplined working group should continue to operate as a steering committee as the global plant species information system continues to evolve.

Conclusions

The design proposal in this chapter represents a generic database structure, as applicable to flowering plants as to bryophytes or mammals. It supports present requirements while providing the flexibility for system evolution. It caters for connectivity and co-existence with current databases to provide maximum productivity during development. Placing the requirements of customers first, the proposed core SQL database will run in a feature-rich operating system environment. This caters for wide-ranging methods of data retrieval and diverse means of data presentation, with contingency for charging data customers. Further planning should be by a multidisciplined team of taxonomists, selected customers, and members of the computing industry. A modular approach is recommended.

Acknowledgements

I would like to thank Dave Simmons, Mark Powell, and Dr Clare Jackson, all three from IBM, for their valuable technical advice and assistance in preparing this chapter.

References

Benning, A. L. (1990). Information science and the scientist. Freshwater Biological Association Annual Report, 58, 55-67.

Bisby, F. A. (1984a). Information services in taxonomy. In Databases in systematics, Systematics Association Special Volume No. 26, (ed. R. Allkin and F. A. Bisby), pp. 17-33. Academic, London.

Bisby, F. A. (1984b). Automated taxonomic information systems. In Current concepts in plant taxonomy, Systematics Association Special Volume No. 25, (ed. V. H. Heywood and D. M. Moore), pp. 301-22. Academic, London.

Pankhurst, R. J. (1988). Database design for monographs and floras. Taxon, 37, 733-46.

Pellew, R. A. and Harrison, J. D. (1988). A global database on the status of biological diversity: the I.U.C.N. perspective. In Building databases for global science, (ed. H. Mounsey and R. F. Tomlinson), pp. 330-9. Taylor and Francis, London.

Skov, F. (1989). Hypertaxonomy—a new computer tool for revisional work. Taxon, 38, 582-90.

21. Linking related databases: a microbiological approach

B. KIRSOP

Microbial Strain Data Network, Institute of Biotechnology, Cambridge University, 307 Huntingdon Road, Cambridge CB3 0JX, UK

Abstract

The Microbial Strain Data Network (MSDN) is an information and communications network set up primarily, but not exclusively, for microbiologists and biotechnologists. It is sponsored by the International Council of Scientific Unions. It is an independent organization with a Secretariat in Cambridge. The network uses the telecommunications services of Telecom Gold, BT North America and Internet and currently has issued nearly 400 mailboxes. The information relates in the main to microorganisms and cultured cells, and is mostly in the form of catalogues, strain databases and directories. These and other bibliographic, nomenclatural, regulatory and commercial databases are linked to the network. Some databases are stored on the MSDN computers and use the same software; others are accessed through electronic gateways, the data remaining on a remote computer. All may be accessed from the MSDN database menu, so providing a one-stop shopping service to a diverse collection of data resources. Database providers use the MSDN facilities on exactly the same basis as the MSDN itself; the MSDN makes no additional charge. Data providers may add a surcharge to the use of their database or not, depending on their financial policy. The advantage of the MSDN approach is that the scientific community feels no competitive threat from collaborating in developing this international resource. With the communications facilities (electronic mail, fax, telex, bulletin board), the UNEP-sponsored training courses and software distribution, the MSDN is providing a comprehensive service that appears to be a widely acceptable model.

© Systematics Association, Special Volume No. 48, ‘Designs for a Global Plant Species Information System’, edited by F.A. Bisby, G.F. Russell, and R.J. Pankhurst, 1993, pp. 219-25. Oxford University Press, Oxford.

Introduction

The technical developments in biotechnology and the international concern about environmental issues have accelerated the need for a comprehensive information network for microbiologists and biotechnologists. Scientists throughout the world need to find organisms with specific properties. Sometimes only the information is required, sometimes a culture for research, teaching, or industrial purposes. To encourage worldwide collaboration in such an initiative, an internationally accessible and economically acceptable communications mechanism is an essential adjunct. Furthermore, training must form a major part of the structure if the scientific community is to realize its full potential. The needs and specifications for such a network were drawn up in 1985 at a workshop sponsored by the United Nations Environment Programme (UNEP) and the Commission of the European Communities (CEC) (Hill and Krichevsky 1985). Working groups were set up to begin to implement the specifications and funds were sought. In 1986 sufficient support had been obtained to employ and train an information officer and in 1987 a Secretariat was established in Cambridge and an Executive Director appointed.

Structure

The network obtained sponsorship from three of the International Council of Scientific Unions’ Committees (the International Union of Microbiological Societies, the Committee on Data for Science and Technology, and the World Federation for Culture Collections). Recently the Committee for Biotechnology has also agreed to sponsor the network. Although the MSDN felt that a trust structure was appropriate to its aims, the difficulties to be overcome in obtaining the required charitable status were likely to be too great. The MSDN was therefore set up as a company limited by guarantee. The Articles of Association state that the proceeds of the MSDN must be directed solely to the development of the network, that the ‘Directors’ (Committee of Management) must receive no reimbursement and that should the MSDN cease to function the services that have been built must be passed to an international scientific organization with like aims. The Committee of Management (‘Directors’) has been set up with international membership, the members being drawn from the Membership (‘share holders’) of the MSDN itself. The MSDN has sought financial support and has obtained grants from the CEC (under the BAP and BRIDGE programmes), the National Science Foundation, Environment Protection Agency and the National Institute of Dental Research in the USA, UNEP and UNESCO. Substantial support in kind has been provided by the parent organizations of the many scientists throughout the world working to establish the MSDN as an international resource. The Secretariat of the MSDN is based within the University of Cambridge’s Institute of Biotechnology, but remains an independent body. Three paid employees administer the network.

Technical needs

Any telecommunications system adopted by the MSDN had to meet the following needs:

(1) accessible worldwide;
(2) accessible to scientists from academic, industrial, and administrative organizations;
(3) communications and database services available;
(4) economically acceptable to users from developing nations;
(5) reliable and providing technical support.

The BT Tymnet/Telecom Gold services were adopted since they appeared to meet these requirements, and experience has shown that the choice was appropriate. Small surcharges on usage are added both by MSDN and some (but not all) of the database providers. Income from this helps defray some of the MSDN’s administrative costs.

Present services

In the first three years after the Secretariat was established nearly 400 mail boxes (representing an unknown number of users) were issued to scientists and administrators from 30 countries. The following services are now in place:

(1) access to MSDN Central Directory (leading searchers to centres holding information on specific microbial or cultured cell properties);
(2) access to other related databases (see below);
(3) on-line culture ordering service from major culture collections;
(4) electronic mail (including telex and fax);*
(5) bulletin board;*
(6) computer conferencing;*
(7) directory of 400 microbiologists worldwide;
(8) MICROIS software (database management system for microbiologists);
(9) access to specialized microbiological services;
(10) training, courses or individual training;
(11) access to Telecom Gold BT North America/Internet services;*
(12) access to other Telecom Gold/BT North America systems.*

Asterisks refer to the Editor’s footnote at the end of this chapter.

A number of database producers have asked to become linked to the network so that international access can be provided easily. Most are microbiological; others relate to linked disciplines and may even request a separate ‘identity’ within the network. The databases are stored on remote computers and are linked by electronic gateways. Databases stored on the network computers incur monthly storage costs and share common software. Remote databases need only pay a single charge for setting up the gateway and will continue to use the software already in operation; clearly, no storage costs are involved.

Databases currently accessible through the MSDN network are:

1. MSDN Central Directory of laboratories or data centres with information on properties of microorganisms or cultured cells.
2. Hybridoma data bank on commercially available cloned cell lines and their immuno products.
3. Information Centre for European Culture Collections (MiCIS database with primary data on strains held in UK service culture collections; Culture Collection of Algae and Protozoa database; contact information on other European culture collections).
4. National Collection of Yeast Cultures and National Collection of Food Bacteria on-line services.
5. Netherlands Culture Collections’ databases (CBS/NCC).
6. Deutsche Sammlung von Mikroorganismen databases, including Approved List of Bacterial Names.
7. World Data Center for Collections of Microorganisms (RIKEN, Japan), including worldwide culture collection information and species list of holdings, the Hybridoma Data Bank, the World Directory of Algae, bibliographic information on plant tissue-culture research.
8. DATA-STAR databases, including bibliographic information such as Chemical Abstracts or SCISEARCH (for Current Contents), medical and commercial databases. A discount is given to MSDN users.
9. French databases on lactic bacteria and filamentous fungi.
10. Tropical databases (Brazil; collections, catalogue, etc.).
11. BioIndustry Association databases (regulatory issues, export and contact information, diary of events, etc.).
12. IRRO (Information Resource on Release of Organisms) including BIOTRACK (OECD database on environmental release of GMOs), and BIOCAT (database on insect control).
13. Biotech Knowledge Sources database.
14. Catalogues: ATCC collection of bacteria; ATCC collection of algae and protozoa; ATCC collection of animal cell lines; ATCC recombinant clones and libraries; CAB International Mycological Institute; three Czech catalogues of microorganisms.

All databases are accessible through a simple database menu. All are described in the instruction manual available from the Secretariat. The manual includes descriptions, instructions for searching, and an example of each database; a demonstration disk is also available.

The MSDN thus provides a one-stop shopping service to the microbiology and biotechnology community. It is possible to search the Directory for sources of information, access one of the databases, search the bibliographic database for any research data in the scientific literature, leave electronic mail messages or order cultures, all within the system. This removes the need to find out how to access different systems and obtain independent logon procedures. Database providers use the system on exactly the same basis as does the MSDN itself. The MSDN imposes no additional costs on collaborating scientists, only benefiting from the increased usage of the system and the ensuing royalties. If they wish, database providers may add surcharges on the use of their databases to help defray storage costs.

The MSDN Directory

The Directory that leads users to laboratories and information resources holding the required data plays a key role in the MSDN system. Although it is a referral system, it is important to note that this does not imply superficiality. On the contrary, the Directory is scientifically very exact. It locates, for example, not only laboratories that hold information on yeasts that use arabinose, but also on yeasts that ‘use’, ‘oxidize’, ‘reduce’, ‘hydrolyse’, ‘produce acid from’, or ‘produce gas from’ either the D- or L-isomer of arabinose. Other features are equally precisely defined.

The Directory uses a numerical system to record the standard definitions of microbiological features and uses the codes as search terms to locate the laboratories holding them. The system is known as the RKC coding system (Rogosa et al. 1986), which is becoming widely accepted as an international standard for the exchange of microbiological data. The Directory only records the kinds of features measured in a laboratory; it does not record the results obtained from the measurements.

Collaborating scientists only need to send the MSDN a list of tests performed in their laboratory and collaboration is thus very simple. These tests will be coded and, after checking that they have been correctly interpreted and the correct codes assigned, the data are put into the computer. The coded features, together with the contact information on the laboratory, form a record that may be searched by individuals looking for specific information. Collaborating scientists are under no obligation to provide cultures, but only information. There is similarly no requirement that data are computerized.

Usage

Usage of the system is increasing steadily as more databases and services become linked to the system and the scientific community becomes aware of the advantages of the services provided by the MSDN. The greatest usage in terms of connect time is still from the developed world where on-line systems are used more routinely. However, the MSDN is very aware that to the smaller numbers of users from developing nations the system offers a direct route to the international scientific community and to easily obtained scientific data. In Brazil, for example, the link to MSDN has stimulated a number of activities in the area of culture collections and environmental microbiology and biotechnology, and has enabled collaborative projects to become a reality. To countries with unreliable communications, electronic mail is not a ‘toy’ but a life-line.

The use of the databases is variable, both between databases and seasonally. Many of the databases contain data that are needed only occasionally (catalogues, MiCIS, HDB, etc.); others are more frequently searched (Data-Star, BioIndustry databases). The data obtained from the former databases are, nevertheless, of high value and are often the key to a new biotechnological development.

The training courses and workshops held by the MSDN around the world (Russia, USA, Czech Republic, Brazil, Guatemala, Egypt, UK, Spain, Germany) have undoubtedly played a major role in stimulating use of the system. Courses leave behind a body of trained and aware scientists, willing to work in developing the network further.

Future

The MSDN plans now to establish regional nodes that can act as local support offices, giving training, encouraging the development of new databases, and providing local language support. Agreement has been reached to set up such centres to support the North, South, and Central American countries, Russia and east and central European countries, and also nations that are islands with very specialized data and communications needs. Funds are currently being sought for these initiatives or, in the case of Russia, have been allocated for initial developments.

There is growing awareness that as the number of linked databases increases, the difficulties in using them efficiently will also increase. Although at present the network only requires knowledge of four software programs, this number will certainly grow in parallel with the network. The development of front ends and intelligent interfaces will become a priority, aiding users in overcoming different searching procedures.

It seems that the MSDN has been welcomed as an appropriate system for an international information and communications resource. It is scientifically, economically, and politically acceptable, allowing scientists to design, build, and control their databases and services independently, yet making them internationally available through the MSDN menu. It is anticipated that other scientific groups will wish to take advantage of the service now established, thus contributing to the development of an increasingly valuable resource.

References

Hill, L.R. and Krichevsky, M.I. (ed.) (1985). Needs and specifications for an International Microbial Strain Data Network: proceedings of a workshop held in Brussels, Belgium, 15-17 November 1983. United Nations Environment Programme, Nairobi.

Rogosa, M., Krichevsky, M.I., and Colwell, R.R. (1986). Coding microbiological data for computers. Springer, New York.

*[Editor’s footnote]: There have been a number of changes to the MSDN since this chapter was written. In particular the services are now largely integrated within the Internet (see Roubicek, Chapter 24, this volume).

22. Adopting a transaction processing model for a global plant species information system

DEIDRIE JINKS

School of Computing Sciences, University of Technology, Sydney, P.O. Box 123, Broadway, NSW, 2007, Australia

Abstract

Current transaction processing systems typically are large commercial or government applications, allowing multiple users, who may be geographically remote, to access data simultaneously. Such systems are found, for example, at the heart of large banking or telecommunications enterprises. Transaction processing interacts with the network, the database, and the physical machine on behalf of the application, but is technically within the field of communications, residing within the application layer of the OSI reference model. System requirements are examined and transaction processing is defined. Implementation of transaction processing is discussed within the context of current international standards and available technology, together with the special problems inherent in applying a transaction processing model to a global plant species information system.

Introduction

Transaction processing (TP) unifies a distributed information system, providing an open, standard user interface and systems management mechanism. Heterogeneous databases, networks, operating systems, and physical machines are the components integrated to support reliably an application implemented as a distributed information system. Application owners may specify system characteristics, for example access authority, but the underlying technical mechanism is implemented and managed by the transaction-processing software. A user protocol independent of the underlying technology (an application layer interface, in OSI* terms), and rigorous attention to reliability, security, integrity, and ease of authorized use are the strategic characteristics of TP that make it suitable for consideration as the underlying system architecture for a global plant species information system (GPSIS).

© Systematics Association, Special Volume No. 48, ‘Designs for a Global Plant Species Information System’, edited by F.A. Bisby, G.F. Russell, and R.J. Pankhurst, 1993, pp. 226-39. Oxford University Press, Oxford

GPSIS requirements

Taxonomy is a modern science, drawing data from a wide range of sources including the fields of biochemistry, molecular biology, chemosystematics, cytology, electron microscopy, genetics, statistics, and taxometrics to complement traditional data from morphology and anatomy. Controversy will undoubtedly continue over how to interpret this increasing variety of information and over the relative importance to be attached to the different kinds of data. Despite these uncertainties, however, taxonomy and nomenclature will continue to be fundamental to the language of dialogue regarding botanical and environmental issues.

The motivation for GPSIS is the need expressed by users of botanical data for access to data. Two key requirements (a mechanism for reconciling differing nomenclature and classification systems, and access to the most up-to-date information, globally) imply a fairly sophisticated computer system accessible through a network. A catalogue of the world’s plants, as unambiguous as possible and of the highest possible taxonomic integrity, through which other relevant data could be accessed, would form the backbone of such a system. Relatively small numbers of data†, say about half a million entries, would suffice for this basic catalogue (the taxonomic backbone); thus it could be made available in a variety of forms. However, the form of the taxonomic backbone will determine the feasibility of appending other relevant data, and the mechanisms whereby this would be possible.

* Open systems interconnection (OSI) reference model, which describes TP as a communications protocol, is discussed in ‘Current standards’, below. For a technical introduction to OSI see Stamper (1986) or Tanenbaum (1988).

† In commercial systems terms, a small public utility servicing a city the size of, say, Adelaide, would have similar data storage requirements.

Regardless of the computer architecture chosen for the implementation of the taxonomic backbone, subsets of the information will be required in a variety of formats, depending upon specific usage: for example, brief publications containing a summary of taxa or species within a specific classification system for a specific geographical area; periodic release (or updates) of the full taxonomic backbone as publications on paper or CD-ROM; on-line access for enquiry between periodic releases; on-line access for data connected to the backbone; or on-line access for update to the taxonomic backbone or related data. Quite complex queries can be imagined: for example, a request for a list of species of plants which produce a particular chemical, along with images of appropriate specimens, and distribution changes over time, mapped and correlated with the equivalent remote-sensing dataset.

Potential users of the system vary enormously in terms of the spatial and temporal scales of the data in which they have an interest. Most definitely, different databases (or different areas of the same database, or a set of separate distributed databases) will be required to hold information at different levels of detail, ranging from an image of a particular specimen, to the distribution of a particular species, to periodic remote sensing of global vegetation cover. Whilst the taxonomic backbone could not contain all data at every level of detail, it must provide a mechanism to locate and access such data. It is therefore of enormous importance that the system architecture chosen should accommodate such differing styles (text, image, graphics), quantities (single description versus remote sensing) and scales (location of single specimen versus changes in species distribution over time) of data, both for data storage, and for data transfer. As the system matures, additional users of botanical data will surely be identified, so the system architecture must not preclude expansion.

GPSIS must make sensible use of existing resources. As well as various published sources (floras, monographs, national and local censuses) there is an immense wealth of information waiting to be exhumed from herbarium collections. There are also many independent computer systems, representing a broad range of hardware and software platforms, and custom-written applications, such as computer-aided specimen identification, numerical taxonomy, and herbarium label systems. All of these represent a substantial investment by the community, and it would be unforgivably wasteful not to make this vast quantity of botanical data accessible to analysis by the many researchers working on urgent environmental problems. The GPSIS architecture must, in so far as it is possible, incorporate existing assets. Immediate widespread evaluation and adoption of available standards, such as those published by the Taxonomic Databases Working Group, will assist this process. Technology is available to assist in conversion of data to electronic form; optical character recognition scanning (OCR) can be used for paper-based textual data, whilst imaging technology can capture the visual aspects of physical specimens.

The system should be cost-effective, available globally without placing an undue burden on contributors or users of data. Basic access should require nothing more complex than an inexpensive personal computer, and a telephone connection. The underlying technology should be as transparent as possible, and a mechanism must be available to provide sensible, cost-effective advice and assistance where required.

Ownership of, responsibility for, and priority and validity of data must be clearly specified, and given that users of the system will be from both the academic and commercial worlds, data sharing, especially across national borders, must be negotiated between owners of the data. The mechanisms whereby data are shared should be a securely managed aspect of the system. GPSIS must be patently non-exploitative. Even before GPSIS is available, all parties (users and contributors of data and services) must make equitable agreements on the ownership of commercially valuable developments made as a result of access to the data.

Resources required to realize such a system would be of the order of $Aust3-5 million per year, no more than the data-processing budget of any one of the many medium size commercial enterprises that are currently adopting TP for their corporate systems. This is really a minuscule amount of money when compared to the benefits that would accrue not only from huge savings in direct cost of research, but also from the concomitant results of more rational management of botanical biodiversity. Although not-for-profit, the system would be cost-recovering to some extent, given the value of such data to commercial, industrial, and government enterprises.

Urgency expressed by users of botanical data for an unambiguous botanical reference must be heeded. Developers of GPSIS should have a first release of the system available within the realistic time frame of three to five years. It should be possible to have a pilot version available much sooner. GPSIS must be viewed as a permanent system, and therefore rigorously ‘software engineered’, so that over time the underlying technology can be updated without disruption to the system as a whole.
In summary, management of the system must ensure that GPSIS is reliable (predictably available), secure (protecting ownership and data-sharing agreements, and protecting also against data loss), of the highest possible integrity, and available to users with a wide range of technological sophistication.

The TP model

A transaction processing model can be applied to a global plant species information system most effectively, providing a secure, accountable gateway into the network to users and owners of databases, whether those databases are held at a single central site, or distributed across many sites. The model provides for a range of network protocols (for text, image, cadastral, or mapping data), and deals with a range of database formats.

TP manages the underlying technical components of a distributed information system. An important concept is client-server. Clients request services, such as information storage, retrieval, or processing, from servers. Local TP nodes can process transactions if local resources are adequate; if further resources are required, a transaction can be assembled and sent to a server node to request the information (or processing) required. Conversely, complex transactions can be disassembled and the parts sent to multiple servers. A TP node can act as either client or server, depending upon the specific task in progress.

The unit of work is the transaction, a bound* task defined by the application, which exhibits the ‘ACID’ properties of:

1. Atomicity: either the entire transaction is processed, or no part of it at all.
2. Consistency: the transaction is processed once and only once, in a manner which is accurate, correct and valid in terms of the application semantics; that is, the system is transformed from one consistent state into another consistent state in an entirely predictable and definable manner.
3. Isolation: intermediate results which occur during the processing of a transaction are not available to other transactions.
4. Durability: state changes made by a completed transaction are permanent, notwithstanding failures of any component within the system.

* Bound data are defined in the International Standard ISO/DIS 9804, and refer to a specific subset of the database upon which operations may occur during the processing of a specific transaction type. A bound task is one which operates only on bound data.

Any person who has used an automatic teller machine to withdraw money has initiated a transaction to update (or change the state of) the bank’s database. A botanical system may have transactions available, for example, to add a new species, or to record details of a newly arrived specimen. Transactions request the performance of discrete units of work. They emerge asynchronously from and may be processed at any node, and may change the state of the database held at any node within the system. Transactions can be conceptualized as messages flowing between system components, under the management of the TP software.

The TP software can be adapted to the application requirements by being centralized at one location, distributed between co-operating nodes, or organized at some point between these two extremes. Transactions can be processed serially, if requiring the same set of permanent data on which to operate, or in parallel, if operating on different data. In practice, this decision depends upon the granularity of the locking mechanism* of the database concerned, and is transparent to the application process invoked.

* Fine-grained locking systems can reserve small parts of the database (record, or parts of a record) for exclusive use of the current transaction. Coarser locking may be at the file (or even entire database) level.

TP is most often used for mission-critical systems (those systems on which the survival of the enterprise depends) with requirements for accountable management and reliability. One key advantage is that the structure is infinitely scalable, that is, the system can be expanded in a modular fashion to cater for an increase in transaction rate, or to cater for additional user requirements, without losing the basic structure and manageability of the system.

TP components

1. User interface

TP systems usually interact with end users through a ‘forms’ interface. That is, a picture of a form (which may be identical to the equivalent paper-based form) is presented to the user who then follows the instructions (prompts) from the computer to complete the information required. Different forms are used for navigation through a menu structure, for requesting information (enquiries) or for adding information to the system (updates). Often forms are processed locally, if local resources permit. Features such as a ‘mouse’, or ‘windows’ interface are available if required by the application. The ‘user friendliness’ of the TP interface is determined by the application designers, who may choose to make a different interface available to casual, as opposed to experienced, users. In current systems, on-line help and tutorial facilities would be expected.

2. Application programs

Application programs do the work that interests us as botanists; TP isolates the user’s view of the application from the technical implementation. The application programs which process user transactions are independent self-contained modules enveloped within a framework of specific TP operations required to guarantee maintenance of the ACID properties. These operations include commitment (make changes permanent), abort-rollback (remove changes upon error), dequeuing (finalizing) the transaction, and often security, authorization, error recovery, and logging procedures. The programming language in which the application is written is of no consequence to TP, and it is common for a system to use a variety of languages, selected for the characteristics of the problem at hand and the skills of the developers.
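The envelope of commitment and abort-rollback around an application module, as described above, can be illustrated with a minimal sketch. It uses Python's standard `sqlite3` module as a stand-in for a single local DBMS, not a distributed TP monitor. The schema (`species`, `audit_log`), the functions `open_catalogue`, `add_species`, and `count`, and the sample data are all hypothetical and are not part of any GPSIS design. The 'add a new species' transaction makes two state changes, and the wrapper ensures that either both are made permanent or neither is.

```python
import sqlite3


def open_catalogue() -> sqlite3.Connection:
    """Create an in-memory database standing in for one node (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE species (name TEXT PRIMARY KEY, family TEXT NOT NULL)")
    conn.execute("CREATE TABLE audit_log (entry TEXT NOT NULL)")
    conn.commit()
    return conn


def add_species(conn: sqlite3.Connection, name: str, family: str) -> bool:
    """Run the 'add a new species' transaction; return True if it was committed."""
    try:
        # Application work: two state changes that must succeed or fail together.
        conn.execute("INSERT INTO audit_log (entry) VALUES (?)", (f"add {name}",))
        conn.execute("INSERT INTO species (name, family) VALUES (?, ?)", (name, family))
        conn.commit()    # commitment: make both changes permanent
        return True
    except sqlite3.Error:
        conn.rollback()  # abort-rollback: remove the partial changes
        return False


def count(conn: sqlite3.Connection, table: str) -> int:
    """Count the rows in one of the two sketch tables."""
    return conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]


conn = open_catalogue()
first = add_species(conn, "Acacia pycnantha", "Fabaceae")
# The second attempt violates the PRIMARY KEY on species.name, so the whole
# transaction, including its audit_log entry, is rolled back.
duplicate = add_species(conn, "Acacia pycnantha", "Fabaceae")
```

In this sketch the failed second attempt leaves no trace in either table, which is the atomicity property in miniature. A real TP system would add the queuing, logging, authorization, and recovery steps that the text lists, and would coordinate more than one database.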


3. Database

Commonly, the term database is used to refer to an interrelated (often complex) set of files organized and maintained by a specific database management system (DBMS), but here the term is used in its widest sense to include any DBMS. Traditional text and numerical databases may be relational, network, hierarchical, or flat file systems; more recent database systems contain multimedia data (image, digitized sound, and so on). Graphic styles of data are also accommodated in databases supporting geographical information system (GIS), or computer-aided design (CAD) applications.

In the literature, transaction processing is often discussed in terms of database systems, and there is a view that TP does not, in fact, exist independently. It is true that some individual database management systems support the concept of transactions and employ similar mechanisms to support database integrity; however, this view of TP is incomplete as it does not acknowledge the possibility of multiple database styles, nor the pre-eminence of data communications and overall systems management in a distributed systems environment. In a straightforward application, the transaction support provided by a single DBMS may be adequate but in a situation involving different types of DBMS, on disparate computer systems, with different owners, a separate TP is required for integration.

TP interacts with each database on behalf of the application and performs queuing and restart/recovery operations co-operatively, the exact mechanism depending upon the DBMS involved. Integrity control is also negotiated between TP and the DBMS. For example, a DBMS which utilizes the two-phase commit protocol (see Leu and Bhargava 1988, for a technical explanation) for distributed updates can be implemented as a component of TP which treats each sequential portion of this protocol as a single-phase transaction and which provides transaction logging facilities to allow the database to recover should the two-phase commit fail. Alternatively, the DBMS can rely entirely on TP distributed recovery mechanisms.

TP also has its own internal database system to store the information required for system management (users’ access permissions, varying security levels) and integration (network addresses, device characteristics). Accounting and auditing data are also accumulated.

4. Communications network

A network can consist of any number of nodes (devices ranging from PCs to supercomputers) along with the links between them, which may use a variety of physical media: satellite, optical fibre, co-axial cable, microwave, twisted copper wire pairs, or even mains electricity wiring. The data flowing through the network can include text (descriptive and indexing information), image (a picture of a flower or a DNA fingerprint), graphics data (GIS or CAD), in fact anything that the available databases can store. The sets of rules which govern the exchange of data (protocols) are transparent to the user. TP can manage its own private data network, or can utilize the telephone network, ISDN, the public packet switched network, or academic and research networks. If one is prepared to sacrifice response time, transactions can flow across many different networks on their journey from client to server in a distributed TP system.

5. Operating system

In any computer system, the operating system ‘owns’ the hardware: it allocates and controls physical resources (memory, central processing unit, file storage), schedules and controls execution of jobs, and provides a wide variety of utilities and services (such as job accounting and statistics, and file system management). Although TP has a close relationship with the operating system, it may be only one of a number of jobs (which may or may not be other TP systems) sharing the physical machine at any one time, under the supervision of the operating system. TP particularly depends upon a stable mechanism for non-volatile storage of transactions in progress; this may be implemented within the TP system itself or it may rely on services provided by the underlying operating system.

Modern operating systems are increasingly oriented towards TP, and usually provide a degree of queue management. System integration is essentially a queuing problem and a successful TP system relies absolutely on implementation of a mathematically rigorous queuing and logging mechanism. For example, queues must be maintained for the database, the network, application processes and physical devices and also for users who have different access authorities and processing priorities.

Current technology

Whilst TP systems have been very common in industry for many years, there has been little treatment of this style of computing in the academic literature. TP is really the integration of many areas into practical working systems, and most work in this area has either been very practical, addressing itself to specific applied problems, or not published for reasons of government security or commercial confidentiality. Within the literature the terms transaction processing (TP), teleprocessing, on-line transaction processing (OLTP), transaction processing monitors, distributed transaction processing (DTP), or even database/data communications (DB/DC), are basically all synonyms. The term real-time is sometimes used, but often


refers to similar software in process control or military command and control or guidance systems. Current transaction processing systems typically are large commercial or government applications; they are found, for instance, at the heart of banking, telecommunications, and surveillance systems—in fact any place where large mission-critical applications must be accessible to multiple users of data, simultaneously. These applications have stringent requirements for security, reliability, and accountability. Traditionally, transaction-based systems have been the preserve of the very large sophisticated computer users, with a correspondingly large data processing budget, specifically ‘transaction-heavy markets . . . travel, retail merchandising, banking and insurance’ (Sivula 1990). Systems of over 100 000 user terminals supporting thousands of transactions per second, are not unknown (Bernstein 1990). As TP systems become available on less expensive machines, as networks become more accessible, and as the demand for global interchange of electronic data has increased, transaction processing has become more widely used, capitalizing on TP’s main strength—that information can be continually updated as events occur, with the results immediately available across the network. TP currently represents 25% of the computer systems market (Bernstein 1990) and is the largest growing segment within the computing industry, with a current compounded annual growth rate of 13 per cent (Davis 1990). Within the United States, the TP market in 1988 was worth $US34.8 billion, projected to reach $US67.3 billion by 1994 (Davis 1990). Transaction processing is a stable, mature, accessible technology, available from a wide range of software suppliers. In the late 1960s and early 1970s, users requiring TP wrote their own software, usually as an additional layer of operating system. 
Today, a ‘shell’ is provided by any one of a number of software suppliers, which is then customized to the needs of the particular application to be supported. Customization is often by way of data structures linking transactions to specific application processes, identifying users and their security and authority levels, and describing the physical characteristics of the network, database, and physical devices with which TP will interact. In the worst case, customization entails writing code to interface with components required by the system but not provided in the purchased shell. Customizing a TP monitor is not a trivial task and the effort required varies enormously, depending upon the specific product purchased, as does the level of support for various interfaces. Some systems allow limited changes to the customization on-line, so the characteristics of the system can be changed without a pause in operation. Currently there are many efforts to embed the inferencing capabilities of today’s expert system shells within mainframe transaction processing


(Popolizio and Cappelli 1989). This is of particular relevance to GPSIS (see GPSIS design issues below).

TP is database and communications intensive: most transactions normally require relatively little calculation. Primitive TP systems concentrated all of these functions within a large mainframe. These evolved into more sophisticated hardware platforms consisting of mainframes with associated front-end network processors and back-end database machines. More recently, a wider range of architectures has become available, and minicomputers and even the more sophisticated personal computers can function as TP nodes. Current TP systems deal with many different devices, and most of the available high-tech ‘toys’ can be accommodated: CD-ROM; optical laser WORM (write once, read many) disks; bar code, magnetic stripe or smart card readers; automatic teller machines; electronic sensors and so on. Because TP usually supports commercial applications, there has not been a requirement to interface to laboratory devices such as scanning electron microscopes or electronic biochemical assaying devices; however, this would theoretically be quite straightforward, especially given that many of these devices already interface to PCs. When determining the system architecture for an application with the potential for communications, it is important that developers aim to meet at the same supporting technical interface (TP), so that internetworking becomes no more complicated than making a phone call. Developing an application under TP should be supported by CASE (computer-aided software engineering) and other development tools which would facilitate and promote standard interfaces to other TP components.

Current standards

Communications and computer networks are usually discussed in terms of the International Standards Organization (ISO) open systems interconnection model (the ISO/OSI model, or the OSI reference model), which, since its establishment at a meeting in Australia in 1978 (Lions 1990), has undergone continual refinement. ‘Open systems’ are seen as desirable to allow communications between disparate computer systems (proprietary systems produced by different manufacturers, and applications developed by diverse organizations). The model describes communications between ‘peers’, that is, no control or hierarchy is implied (but can be imposed if appropriate to the application). Transaction processing, or distributed transaction processing, is defined as an application layer protocol within the OSI reference model and referred to as OSI-TP. The technical details of the OSI model, and the interactions between the various layers, are of no interest to the


applications designer or developer, whose attention should focus primarily on the external user interface, the supporting technical interface being predetermined by the overall system architecture. Botanists are not in the business of defining data communications standards; the ISO has expended much effort in this area within the international communications community, and it seems sensible to adopt appropriate international standards, especially given that both corporations and governments are moving from proprietary to open systems. Performance is a key issue with many commercial TP systems, and a Transaction Processing Performance Council has recently been formed by a group of computer hardware manufacturers to develop standard ways of benchmarking, or testing the performance of OLTP systems (Francis 1988), so that software from various suppliers can be compared. Mature TP systems interfacing to OSI compliant network software are available from suppliers with a strong commitment to open systems.*

GPSIS design issues

* Bull, Control Data, DEC, IBM, Stratus, Tandem, and many others supply such software.

Individual components of the GPSIS system can utilize mature technology. One ambitious aspect of this project is to integrate the available technology into a stable, reliable system accessible to people without a sophisticated computing and communications infrastructure or extensive computing experience. For example, systems currently interact through networks: private networks, the telephone networks, academic and research networks and so on; users within these different networks will have varying requirements in terms of quality and quantity of data flow to be supported. The challenge is for one system to interact with any set of these networks, transparently. Research being conducted in the intelligent network field will have a contribution to make here, facilitating navigation within the network and conversions between different data formats.

Database design will present another challenge (for an introduction to database design see McFadden and Hoffer 1985). Botanical data consist of facts (observations, specimens) and opinions (interpretation of fact available to date). The name of a plant is derived from its taxonomic classification, which expresses an opinion on the facts available regarding this, and other related plants. As further information becomes available, classification, hence name, may be revised. The botanical naming system is an information retrieval system in its own right, a situation not common in the commercial world. Within GPSIS, a number of competing classification schemes will need to be accommodated and facilities made available for the

hypothesis testing for new classifications against current data. Because of the nomenclatural instability, historical data (changes of name and classification) are required, together with allowances for homonyms, synonyms, accession, authority, type, and so on. Successful implementation of GPSIS will rely on usable mechanisms for navigation through the catalogue to arrive at the particular species, or meaningful group of species, of interest. Ignoring this complex problem by basing the system on a single current classification system would be short-sighted, especially given that entries can be indexed simply by using an account number, accession number or equivalent. Artificial Intelligence (AI), with its adaptive strategies and allowance for uncertainty, could be applied to the resolution of differing nomenclature and classification, for example by using botanical keys. AI can also be invaluable as a rule-based documentation tool for taxonomic classification systems. This strategy could be incorporated into a TP architecture, perhaps by locating an AI ‘shell’ at the local TP node, with more powerful inferencing capability at the server nodes. Research into natural language processing could well assist this endeavour if application semantics could be defined appropriately.

Special problems

‘Information systems are not technical systems . . . (but) social systems which rely to an increasing extent on information technology’ (Land and Hirschheim 1983). It is therefore important to examine the socio-political context in which GPSIS will operate, to determine the specific requirements that result. The very name of GPSIS denotes that the system is global in nature, both in the sense of its botanical scope (a list of all the world’s plants) and in the sense of accessibility (the data should be available to all, at any location, at a reasonable cost). Any substantial computer system confronts ethical issues.
TP is the classic ‘big brother’ software; it seems possible to support any fantastic application, given the resources. The surveillance and financial industries have long had the resources to take advantage of this characteristic, and there are some very manipulative and exploitative social systems supported by this technology. Exploitation of biodiversity is of fundamental importance, and GPSIS must consider national sovereignty over genetic resources, adopting a structure that is demonstrably non-exploitative, especially given that the areas of the world with the richest botanical biodiversity, that can potentially make the largest contribution to GPSIS in terms of data, are also those areas with fewest resources to allocate to their formal taxonomic study. These issues are compounded by the fact that many botanic data are controlled by countries remote from the source of collection. GPSIS will not be successful if it blindly reinforces current global inequities.


Policy direction in the above areas will influence purely ‘technical’ decisions, such as choice of terminals to be supported. For example, a user with a ‘dumb’ terminal connected through the academic networks is obviously at an enormous disadvantage compared to a user with a sophisticated high-resolution colour graphics multitasking workstation, with an expert system shell for the resolution of nomenclature, a local copy of the taxonomic backbone summary on CD ROM, multiple optical laser WORM disks containing images of specimens, the workstation also operating as a local TP node, connected to GPSIS through ISDN to obtain recent information not stored locally. Inequities can be institutionalized by computer systems.

Once GPSIS is operational, there is scope for data contained within it to be used for a wide range of political and economic decision-making; for example, policy formulation, land management, treaty verification, conservation boundary (priority) determination, germplasm resource and species management, biodiversity, and pollution monitoring. These topics concern every level of government and many sectors of industry, and as more countries enact environmental protection legislation the demand from commercial and industrial enterprises will increase. Data contained within GPSIS must therefore be objectively validated, with strict security over updates to sensitive areas. Adopting a computer architecture modelled on that commonly used within commerce will, in many ways, legitimize calculations of environmental costs, and in the long term systems of this nature may supply organizations with environmental data to complement the merely monetary data used for business decisions today.

What to do next

At this stage it is premature to propose a detailed technical design for a global plant species information system. Firstly, a framework must be established for the project, which would define the overall aims and objectives, ownership and funding of the system. A more detailed set of system objectives could then be compiled, and a set of users identified. It would then be possible to ‘identify and prioritize a portfolio of applications’ (Davis 1987). Requirements should be examined for completeness, relevance, importance, coherence, and feasibility (Valusek and Fryback 1987). Only at this point would it be reasonable to determine the overall system architecture. Following this, a strategy should be developed for the implementation of the system, which would specify the methodology to be used for database, network and application requirements determination, design, and testing. A survey of existing assets which may be incorporated into the system would be undertaken. It would be appropriate to consider


the detailed technical design for the taxonomic backbone first, so that clear, well-defined boundaries and interfaces can be specified for related applications. A plan should then be formulated to allow orderly development of the system, at the same time making provision for the logistical problems of data conversion and system operations. Management of the completed system would be a substantial task, and consideration must be given to the establishment of a co-ordinating body, or secretariat, which would be responsible for operational issues such as determination of suitable computing hardware, provision of technical assistance to users, and assistance with the negotiation of data sharing agreements.

Conclusions

Transaction processing has much in common with taxonomy, both being fields of synthesis which rely heavily on other areas for their information. TP could deliver an invaluable solution to the problem of global integration of access to botanic information. We will have to overcome the entrenched notion that biodiversity, generally, is a free, renewable resource not worthy of investment. TP has a number of advantages which make it suitable as an overall system architecture for GPSIS. An ‘open’ distributed TP system will allow co-operation between disparate computer systems, taking advantage of the richness implicit in their diversity. Transaction processing allows rigorous attention to be paid to security, integrity and reliability, facilitates respect for national sovereignty and commercial confidence, utilizes existing systems, provides data sharing, and a framework for data sharing agreements, allows varying levels of technical sophistication depending upon user requirements, and provides access to the system through publicly accessible networks at reasonable cost. GPSIS does not need to own all the available data or applications, but will co-ordinate access between privately owned systems. In this way, GPSIS will provide access to a taxonomic backbone to which other relevant data can be associated. In essence, it will act as a reliable, objective switchboard so that botanical data can be accessed through a global electronic taxonomic phone book. TP provides the only viable tool for the integration and management of such a complex system.

References

Bernstein, P. A. (1990). Transaction processing monitors. Communications of the ACM, 33, (11), 75-86.

Davis, G. (1987). Strategies for information requirements determination. In Information analysis, (ed. R. Galliers), pp. 237-65. Addison-Wesley, Sydney.

Davis, L. (1990). On-line applications grow up. Datamation, 36, 61-3.

Francis, T. (1988). OLTP delivers instant information. Bank Administration, 34, 37-40.

Land, F. and Hirschheim, R. (1983). Participative systems design: rationale, tools and techniques. Journal of Applied Systems Analysis, 10, 91-108.

Leu, T. and Bhargava, B. (1988). Clarification of 2-phase locking in concurrent transaction processing. IEEE Transactions on Software Engineering, 14, 122-5.

Lions, R. (1990). Open systems interconnection - are standards helpful? Australian Computing Conference, Gold Coast, Australia, September 1990, pp. 51-7.

McFadden, F. R. and Hoffer, J. A. (1985). Data base management. Benjamin/Cummings, Menlo Park, CA.

Popolizio, J. and Cappelli, W. (1989). New shells for old iron. Datamation, 35, 41-8.

Sivula, C. (1990). OLTP - duking it out on the high end. Datamation, 36, 22-5.

Stamper, D. (1986). Business data communications. Benjamin/Cummings, Menlo Park, CA.

Tanenbaum, A. (1988). Computer networks, 2nd edn. Prentice-Hall, Englewood Cliffs, New Jersey.

Valusek, J. R. and Fryback, D. G. (1987). Information requirements determination: obstacles within, among and between participants. In Information analysis, (ed. R. Galliers), pp. 139-51. Addison-Wesley, Sydney.

23. Design aspects of an enterprise computing environment for systematics

JAMES H. BEACH

Department of Botany and Plant Pathology, Michigan State University, East Lansing, MI 48824-1312, USA

Abstract

Networking provides the metaphor for the late twentieth century culture: it speaks of interactivity, decentralization, the layering of ideas from a multiplicity of sources. Networking is the provenance of far-reaching connectivity and, mediated, accelerated, and intensified by the computer, it leads to the amplification of thought, enrichment of the imagination, both broader and deeper memory, and the extension of our human senses. Computer networking means the linking of person to person, mind to mind, memory to memory regardless of their dispersal in space and their dislocation in time (Ascott 1990).

The promise of the research networks

© Systematics Association, Special Volume No. 48, ‘Designs for a Global Plant Species Information System’, edited by F.A. Bisby, G.F. Russell, and R.J. Pankhurst, 1993, pp. 241-54. Oxford University Press, Oxford.

The research internet will reorganize modes of communication and methods of collaboration within organismic botany. High-capacity, inexpensive communication channels will create a demand for immediate access to botanical data, information, and knowledge. Will systematics institutions be able and willing to provide them? Researchers who work with network-connected computers have the capability to communicate with individuals and information sources on thousands of computers worldwide. Electronic mail is routinely passed around the globe within minutes, and for many network users it is a simple procedure to copy computer files across a continent or across an ocean within seconds. These services can greatly facilitate and accelerate research tasks. As an example, much of the reference material for this chapter


was identified through electronic mail correspondence and obtained from network information sources. More than 100 university library catalogues and campus information systems in seven countries are searchable at no cost by any network user (St George and Larsen 1991). Without leaving one’s desk, a researcher can rapidly determine which research libraries hold specific editions or particular volumes of botanical publications. From a desktop computer in our herbarium in Michigan, we are able to access freely a database of digitized colour images of Peruvian Nolanaceae that resides in a computer in the Jepson Herbarium in Berkeley, California. In real time, we can browse, display, and edit the plant images and then copy the results within seconds to our herbarium desktop machines for future reference. These communication services are ubiquitous at American colleges and universities, are becoming more common in other countries (LaQuey 1990; Frey and Adams 1990; Quarterman 1990), and represent the initial stages of network functionality. With the speed and capacity of the nets rising at a remarkable rate, multitudinous possibilities for communicating botanical information stored as data, text, and as visual imagery are on the near horizon (Arms 1990; IBM 1990). With an international communications infrastructure being laid to the doorstep of the discipline, systematics will have an excellent opportunity to improve research data communications among research centres and universities, as well as to develop new and broader audiences by delivering information about plants to elementary and secondary schools and to the public.

A global systematics computing environment

Networked computing is a co-operative endeavour. Developing a global systematics computing environment, integrated by network communications, will necessitate active, technical collaboration between institutions.

Systematics will need to invest considerably more in information processing expertise, and we will require a much greater collective awareness of communication standards, network capabilities, and database network-interface options. There will be innumerable technical hurdles to the development of an integrated systematics computing environment, including the need for additional data standards (see below) and perplexing limitations of commercial technology for systematics data. Bisby (1984) discusses several of these constraints within the context of the capabilities needed by different classes of users. The requirement for more flexible representational structures to support temporal, spatial, image, sequence, graph, and other richly structured data is noted by French et al. (1990). Beach et al. (1993)


point out the difficulty of handling hierarchical taxon and nomenclatural information with current data models, while Pankhurst (1988) addresses related limitations for processing taxonomic character data. Of the various requirements for the development of an enterprise computing system, the following objectives are of critical importance:

1. Repositories of botanical information must be on machines with network connections. Single-user computers and isolated institutional systems can be effective for research data processing, but they cannot function as information servers for remote users.

2. Information systems must be designed for open access. Botanical databases which are not easily accessible over research networks are effectively private, proprietary holdings. Sequestered knowledge cannot contribute to the development and enrichment of the systematics enterprise, nor to its extension into broader education, research and conservation domains. In this context, ‘open access’ includes on-demand, 24-hour network access to data, with minimal administrative regulation or security challenges to remote users. Open-access designs do not preclude non-intrusive security mechanisms or use accounting.

3. Systematics projects need to be able to easily import and export botanical information in electronic form. Autonomous institutions will continue to pursue independent hardware and software solutions, based on local needs. It seems unlikely that the systematics community will be able (or motivated) to standardize, for example, database server software, database design, application development languages, or user interfaces within the foreseeable future. Designs for a global systematics computing environment will need to accommodate an eclectic diversity of institutional information processing strategies. Standardization of information exchange will be the key.
It is difficult to overestimate the value of standard data exchange formats and exchange protocols for the development and stabilization of the enterprise computing environment. Standards specifying the way information is packaged, sent, and received on exchange will permit independent systems with customized internal designs to evolve and at the same time permit network access to them through a stable logical interface. The activities of the IUBS Commission for Plant Taxonomic Databases (TDWG) and other standards development organizations are of considerable importance for this reason.

Client/server systems

Access to network information can be effected in various ways. GenBank distributes nucleic acid sequence updates as electronic mail (in the USENET group bionet.molbio.genbank.updates; see LaQuey 1990, p. 386). Library


catalogues are primarily accessed using TELNET network terminal emulation software that allows users to log onto remote collection database systems (St George and Larsen 1991). Anonymous logins using the ‘FTP’ protocol for direct file transfer are permitted on hundreds of Internet computers (Granrose 1990). BITNET has a widely used mass mailing list system ‘LISTSERV’ which additionally includes a mechanism that copies archived text or data files to users in response to mail queries (LaQuey 1990). The traditional and still prevalent form of access to large databases is one where users establish a terminal session and operate application programs on a multiuser computer. A person might log onto a machine that is nearby over a direct connection or to a computer at a remote location over a wide-area-net. However connected, the user processes data in exactly the same way, by interacting with menus, edit/entry, and report screens. Remote access to the database is exclusively through the host’s application programs. This is an example of ‘host/terminal’ database architecture. A technical characteristic of host/terminal systems is the logical cohesion between the database software that stores and manages data files, and the application programs that interact with it. Client/server database architecture, in contrast, uncouples the application programs from the database server software, and intercalates an additional logical layer to handle communication between the applications and the server, which is effected in part by a go-between, standard query language (Fig. 23.1). The importance of client/server architecture, in the context of network access to information, is that it allows the application programs and the database server software to reside on different machines. Because they communicate through structured messages, the conversation between application and server can be carried out between machines connected across the room or across the globe. 
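The separation described above can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: SQLite stands in for the database server, JSON text stands in for the structured messages, the function names `make_server` and `client_query` are hypothetical, and the two specimen rows are invented sample data. The client never touches the database; it only builds a query message and reads the reply, so the `send` callable could equally be a network transport.

```python
import json
import sqlite3


def make_server(rows):
    """Build a toy database server and return its message handler."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE specimen (accession TEXT, taxon TEXT, country TEXT)")
    db.executemany("INSERT INTO specimen VALUES (?, ?, ?)", rows)

    def handle(message: str) -> str:
        # The server sees only a structured message carrying a query.
        request = json.loads(message)
        cursor = db.execute(request["sql"], request.get("params", []))
        columns = [col[0] for col in cursor.description]
        records = [dict(zip(columns, row)) for row in cursor.fetchall()]
        return json.dumps({"records": records})

    return handle


def client_query(send, sql, params=()):
    """Client side: wrap the query in a message, send it, unpack the records."""
    reply = send(json.dumps({"sql": sql, "params": list(params)}))
    return json.loads(reply)["records"]


# Invented sample rows, for illustration only.
server = make_server([
    ("X-0001", "Taxon one", "Peru"),
    ("X-0002", "Taxon two", "USA"),
])
records = client_query(
    server,
    "SELECT accession, taxon FROM specimen WHERE country = ?",
    ["Peru"],
)
print(records)
```

Because the reply is a list of data records rather than a screen image, the client is free to store the records locally once the exchange is over.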
A functional difference between the client/server and host/terminal database architectures has far-reaching consequences for access to botanical information and for the design of the systematics computing environment. In the host/terminal model, a remote user running a session from a terminal or from a computer emulating a terminal receives only screen images. Information is visually presented but there is no mechanism to capture formatted data records for local use. As a consequence, access to the remote information is essentially limited to the duration of the terminal session. A client/server database, in contrast, copies data records in a standard exchange format to the remote user’s system. The records are then available indefinitely for local project use.

1. A network mail-based client/server system

Fig. 23.1. Database system architecture. (Two panels, host/terminal and client/server; the components shown include the user interface, application programs, database manager, and network software.)

In early 1990, we developed at Michigan State University (MSU) a prototype client/server specimen database system using the Ingres relational database management system and network mail (Beach 1992). A Digital Equipment Corporation VAX minicomputer located at the MSU Kellogg Biological Station (KBS) in Hickory Corners, Michigan functioned as the client. The database server machine, a Sun Microsystems workstation, was located about 100 km away on the main MSU campus in East Lansing. An Ingres application at KBS recorded a query specification based on user selections from a query-by-form screen and then mailed the request to the database server which held 18 000 plant specimen records. The mail query was routed over a DECNET link to another DEC machine which then passed it to a segment of the TCP/IP Internet which carried it to the workstation. The Sun workstation periodically checked the contents of a mail directory; when it found an incoming message, an AWK script parsed its contents and then applied the query against the Ingres database. The resulting dataset was formatted and directed into a file which was mailed over the network back to the sender. When the answer dataset arrived, a process on the client KBS VAX imported the records into local database tables and notified the user. Accession records returned to the client generally within 10 minutes; the elapsed time was largely constrained by the timing of intervening network mail processes.
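The server side of such a store-and-forward design can be sketched in a few lines of Python. This is not the MSU prototype: the original used an AWK script and Ingres, whereas this sketch assumes SQLite, plain files in a spool directory in place of network mail, and a hypothetical `field: value` message layout. The function names, field names, and specimen rows are all invented for illustration.

```python
import sqlite3
import tempfile
from pathlib import Path

# Only these (hypothetical) fields may be used as query criteria.
ALLOWED_FIELDS = {"family", "county"}


def parse_query(text):
    """Parse 'field: value' lines from a message body into a dict."""
    spec = {}
    for line in text.splitlines():
        if ":" in line:
            key, value = line.split(":", 1)
            spec[key.strip().lower()] = value.strip()
    return spec


def process_spool(spool, outbox, db):
    """Answer each pending query file once; return how many were handled."""
    handled = 0
    for request in sorted(spool.glob("*.query")):
        spec = parse_query(request.read_text())
        reply_to = spec.pop("reply-to", "unknown")
        filters = {k: v for k, v in spec.items() if k in ALLOWED_FIELDS}
        # Column names come from the allow-list; values are bound parameters.
        where = " AND ".join(f"{k} = ?" for k in filters) or "1 = 1"
        rows = db.execute(
            f"SELECT accession, family, county FROM specimen "
            f"WHERE {where} ORDER BY accession",
            list(filters.values()),
        ).fetchall()
        body = "\n".join("|".join(row) for row in rows)
        (outbox / f"{request.stem}.reply").write_text(f"To: {reply_to}\n\n{body}\n")
        request.unlink()  # remove the request so it is not answered twice
        handled += 1
    return handled


# Invented sample data and a single pending request.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE specimen (accession TEXT, family TEXT, county TEXT)")
db.executemany("INSERT INTO specimen VALUES (?, ?, ?)", [
    ("A1", "Fagaceae", "Ingham"),
    ("A2", "Rosaceae", "Ingham"),
    ("A3", "Fagaceae", "Kalamazoo"),
])
workdir = Path(tempfile.mkdtemp())
spool, outbox = workdir / "spool", workdir / "outbox"
spool.mkdir()
outbox.mkdir()
(spool / "q1.query").write_text("Reply-To: user@client.example\nFamily: Fagaceae\n")

handled = process_spool(spool, outbox, db)
reply_text = (outbox / "q1.reply").read_text()
print(reply_text)
```

In a periodic design of this kind the polling interval, together with mail delivery time, sets the response delay, which is consistent with the elapsed times of minutes reported above.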


We used electronic mail as a query vehicle but there are more sophisticated network approaches such as ‘connection oriented’ designs wherein client and server processes enter into a real-time network dialogue. For an example see NISO (1988). In this case, a precisely defined protocol specifies a predictable sequence of back-and-forth control and data messages which are used while a client application requests or receives information from a database server. This type of client/server system has several technical advantages, among them immediate responses from the server, additional user control for refining search specifications during a session and more effective error handling.

2. Client/server systems in other disciplines

A movement toward client/server database systems is occurring within academia. Within the libraries, the largest client/server initiative is the ‘Linked Systems Project’ (LSP), a system used by the US Library of Congress to copy authority and bibliographic records to library consortia (Fenly and Wiggins 1988). Several American universities are implementing client/server database systems in the form of campus-wide information systems (CWISs). CWISs contain the usual types of campus information: course schedules, calendars of campus events, faculty office hours, local weather forecasts, etc. Among the first of these systems are the Massachusetts Institute of Technology’s ‘TechInfo’ (T. McGovern, [email protected], personal communication), Dartmouth College’s ‘Dartmouth College Information System’ (Klemperer and Levine 1990; Brentrup 1990), the University of California, Berkeley ‘Network Information Server’ (Kunze 1990), and ‘Library Information System II’, at Carnegie Mellon University (Troll 1990).

Standards

Standards are the key to ‘interoperability’ between institutional systems and for communication between database servers and client applications. Probably the only way for a co-operative computing environment to emerge in systematics will be through the standardization of the ‘exchange environment’. That would include data definitions and encoding, field and record formats, and exchange protocols. Despite the slow drudgery of standards development and the cost of developing compliant systems, standards are a necessary condition for the implementation of interacting information systems. The nice thing about standards is that there are so many to choose from. There are also multiple layers of standardization within information technology, from low-level communication protocols, to user-interface and other implementation standards, to conceptual issues of data representation. Cargill (1989) surveyed national and international standards activities for information systems. Three standards are of particular relevance for networked systems in systematic biology.

1. ISO ASN.1

The International Organization for Standardization (ISO) Abstract Syntax Notation One, or ‘ASN.1’ (ISO 1987), is used to describe, in the abstract, data types and messages for interprocess communications on ISO open systems interconnection (OSI) networks. The data objects defined by an ASN.1 syntax can be encoded and instantiated in various ways, for example in binary or ASCII, by using different encoding rule conventions (Gaudette 1989; Karp 1990). ASN.1 is an important standard for the design of biological data structures and client/server processes which would operate in an international (OSI) network computing environment. Ostell (1990) describes a project in molecular biology which utilizes ASN.1.

2. NISO Z39.50

NISO Z39.50 is an American National Standard and a draft ISO standard for a connection-oriented network information retrieval service (NISO 1988). It specifies a protocol for sending queries to remote databases and for a client/server interaction between two machines. Z39.50 is likely to be popular for academic client/server systems. It is used by the Linked Systems Project (Fenly and Wiggins 1988), the UC Berkeley Network Information Server (Kunze 1990), Library Information System II at Carnegie Mellon University (Troll 1990), and other campus systems being developed (Lynch 1991).
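A connection-oriented retrieval dialogue of this general kind can be sketched as a small state machine. The sketch below is a toy, written in Python under stated assumptions: the message names loosely follow the initialize, search, and present steps of a retrieval session, but the dictionary message layout, the `ToyServer` class, and the substring matching are all hypothetical and bear no relation to the actual Z39.50 wire format.

```python
from enum import Enum, auto


class State(Enum):
    CLOSED = auto()
    OPEN = auto()


class ProtocolError(Exception):
    """Raised when a message arrives outside the agreed sequence."""


class ToyServer:
    """Hypothetical server that enforces an init -> search -> present dialogue."""

    def __init__(self, records):
        self.records = list(records)
        self.state = State.CLOSED
        self.result_set = []

    def handle(self, message: dict) -> dict:
        kind = message.get("type")
        if kind == "init":
            self.state = State.OPEN
            return {"type": "init-response", "accepted": True}
        if self.state is not State.OPEN:
            raise ProtocolError("a session must start with init")
        if kind == "search":
            # The server keeps the result set so the client can refine or page.
            term = message["term"].lower()
            self.result_set = [r for r in self.records if term in r.lower()]
            return {"type": "search-response", "count": len(self.result_set)}
        if kind == "present":
            # Records are numbered from 1 within the current result set.
            first = message["start"] - 1
            chunk = self.result_set[first:first + message["number"]]
            return {"type": "present-response", "records": chunk}
        if kind == "close":
            self.state = State.CLOSED
            self.result_set = []
            return {"type": "close-response"}
        raise ProtocolError(f"unknown message type: {kind!r}")
```

Because the server holds session state between messages, a client can inspect the hit count before asking for any records, which is the kind of in-session control that a mail-based query cannot offer.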

3. MARC

Libraries have pioneered the development of large, open-access, multiuser databases. Although information about textual materials has been the historical focus of library automation, other classes of objects are now catalogued into library systems, including maps, sound recordings, computer data files, musical scores, works of art, motion pictures, microforms, and items the librarians refer to as ‘Realia’ (USMARC formats for bibliographic data 1988). The stability of information management in the library community is attributable to highly developed data standards and cataloguing rules, and an international infrastructure of standards committees. Libraries also have a remarkable attitude of common purpose which is manifest in the design of their national and international shared-cataloguing database systems.

MARC (MAchine-Readable Cataloguing) is at the heart of library automation. MARC is a set of standards for identifying, storing, and communicating accession catalogue information (USMARC formats for bibliographic data 1988; Crawford 1984). Included in MARC are standard vocabularies and reference works for controlling descriptive terminology; data dictionary definitions for logically delimiting data entities; standard ‘tags’ for identifying fields within exchange data records; and an exchange record format specification for communicating data between systems. The content of MARC data fields is defined by related standards, most prominently the Anglo-American cataloging rules (1988). The exchange record format specification complies with ISO Standard 2709 (ISO 1981) and it defines a structured, self-describing, variable-length record. MARC records are used for importing and exporting records but not for storage; library systems might use any database model and file structure internally.

Most large libraries maintain a database of accessions in an on-line public access catalogue or ‘OPAC’. In addition to creating records for new acquisitions in-house, libraries also obtain catalogue records by ordering them from a ‘record utility’ that functions as a data-record broker. The largest of these is the Online Computer Library Center (OCLC), with 11 337 member libraries in 38 countries (OCLC 1990). OCLC member libraries catalogue their holdings into a master database over a private OCLC network. If a record already exists in the OCLC database for an object being catalogued, the library obtains a copy of the record in MARC form, customizes it slightly, and then merges it into the local OPAC.

By rigorously maintaining MARC standards which permit records to be imported and exported between systems, individual libraries are able to migrate their data to better software systems with relative ease. MARC-compatible systems are marketed for all sizes of machines. Libraries can pick and choose among these commercial products on the basis of functionality and performance.
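What ‘self-describing, variable-length record’ means in practice can be shown with a simplified sketch of the ISO 2709 layout: a fixed 24-character leader, a directory of 12-character entries (tag, field length, starting position), and the variable fields themselves. The Python below is an illustration only. It assumes ASCII text, omits indicators and subfields, fills unused leader positions with blanks, and the choice of tags for a specimen in the usage note is hypothetical.

```python
FIELD_END = "\x1e"    # field terminator
RECORD_END = "\x1d"   # record terminator
LEADER_LEN = 24
ENTRY_LEN = 12        # tag (3) + field length (4) + starting position (5)


def build_record(fields):
    """Serialize (tag, text) pairs as leader + directory + variable fields."""
    directory, data, offset = "", "", 0
    for tag, text in fields:
        if len(tag) != 3:
            raise ValueError("tags must be exactly three characters")
        body = text + FIELD_END
        directory += f"{tag}{len(body):04d}{offset:05d}"
        data += body
        offset += len(body)
    directory += FIELD_END
    base = LEADER_LEN + len(directory)          # where the field data begins
    total = base + len(data) + len(RECORD_END)  # full record length
    # Leader: record length (0-4), base address of data (12-16), entry map (20-23).
    leader = f"{total:05d}" + " " * 5 + "22" + f"{base:05d}" + " " * 3 + "4500"
    return leader + directory + data + RECORD_END


def parse_record(record):
    """Recover (tag, text) pairs using only the leader and the directory."""
    base = int(record[12:17])
    directory = record[LEADER_LEN:base - 1]     # drop the directory terminator
    fields = []
    for i in range(0, len(directory), ENTRY_LEN):
        entry = directory[i:i + ENTRY_LEN]
        tag, length, start = entry[:3], int(entry[3:7]), int(entry[7:12])
        begin = base + start
        fields.append((tag, record[begin:begin + length - 1]))
    return fields
```

As a usage sketch, `build_record([("245", "Magnolia grandiflora L."), ("500", "Herbarium specimen")])` yields a single string that `parse_record` can take apart again without any outside knowledge of field widths, which is why a receiving system need not share the sender's internal database design.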
The independence of MARC data encoding rules and exchange record structures from software and hardware platforms, internal system design, and from conceptual data models is a valuable lesson for nascent community systems in systematic biology.

MARC for natural history specimens

Since its inception, the application of MARC has been expanding as libraries and museums have looked to manage new classes of accessions data. Recently, Stam and Palmquist (1989) studied the feasibility of using MARC for a university art collection. A recent museum association initiative, the Computer Interchange of Museum Information Committee (CIMI), is championing ISO 2709 for several types of museum accession data (Anonymous 1989). Bierbaum (1990) examined the suitability of MARC for museum materials with an emphasis on object registration.

In 1990, in collaboration with the MSU Library, we undertook a project to determine whether library cataloguing techniques and MARC format specifications could be applied to information from herbarium specimens. We used the MARC visual media (VM) format, which was designed in part to handle three-dimensional man-made or natural objects. Our first objective was to test the suitability of the MARC encoding specification for natural history accessions. For example, could we catalogue plant specimens with meaningful descriptors for those MARC fields which have vocabularies controlled by the US Library of Congress? Did required fields such as ‘Title Statement’ have conceptual reality for a biological specimen? Could the geographical hierarchy and detailed locality notes be adequately represented within existing fields? Could multiple taxon determinations be accommodated?

We also wanted to determine the accessibility of plant specimen records in a university library OPAC and the OCLC Online Union Catalog. Would the databases index specimen records on fields of importance to a systematist? Would the controlled subject fields be useful access points for retrieval? What kinds of authority records would be needed to retrieve specimen records by queries on various levels of the taxonomic hierarchy?

We will discuss our findings on these encoding and system questions elsewhere, but we were able to catalogue plant specimens satisfactorily into the OCLC Online Union Catalog using MARC VM according to AACR2 rules for graphic materials (Anglo-American cataloging rules 1988). The records were downloaded to tape (‘produced’ in OCLC parlance) and then imported into our university OPAC.

A MARC approach for natural history data might be fruitful at one of several levels of integration with library systems. Firstly, there is no question that standard encoding, definition, and exchange formats which would provide the type of data independence that MARC does would be valuable. Secondly, systematics might profitably adopt the basic cataloguing rules and encoding techniques of the library and use a ‘MARC-like’ exchange record format that could be designed specifically for systematics. Such a format could accommodate several classes of data that currently cannot be handled by MARC or ISO 2709-compliant records, such as binary images, hierarchical taxon information, spatial objects, and graph-structured data. Thirdly, and at a higher level of integration, we might co-ordinate our enterprise system design with the libraries’ and create a MARC format for natural history accessions which would allow us to move biological accession data in and out of their information systems. The international library computing infrastructure dwarfs what exists for systematics collections, and an entire guild of professional library cataloguers represents a rich source of


conceptual and practical expertise for accessions cataloguing in biology. The emergence of university information systems, in which library catalogues and library standards have an integral part, would also seem to weigh in favour of this approach.

We demonstrated that botanical specimen data could be encoded and accessed in MARC-compatible systems. Without any software development, further standards development, or any funds spent on hardware or networks, we put natural history specimen data into an international database (OCLC) from which the records could be downloaded by any member institution, and into a university library OPAC which is accessible over the Internet to anyone at no charge.† There are significant institutional costs associated with OCLC membership and downloading records from OCLC, so OCLC need not be the prime repository for natural history MARC records. Library OPACs are an option; many larger OPACs are already accessible over the research internet. Botanical institutions might deploy their own network-accessible, MARC-compatible systems, or manage MARC-based records on personal computers. It is precisely the utility of the MARC standard, data independence from specific hardware or software platforms, that permits this flexibility.

† The Michigan State University Library OPAC, ‘MAGIC’, is reachable at 35.8.2.99 using TN3270. TELNET (VT100) users can use 35.8.2.56, and at the ‘Which Host?’ prompt should specify ‘MAGIC’. The first two specimens in the database are of Magnolia grandiflora L. and Cerastium tolucense Good. In OCLC they are record numbers 22477424 and 22477438.

Summary

Systematists collect, analyse, summarize, communicate, and publish information within a collaborative enterprise of autonomous institutions that are widely distributed in space. The networks collapse the distances caused by space and time and provide the opportunity for immediate delivery of botanical information to the desktops of researchers, educators, and students. Data networks represent a key technical element for the future of systematics. An enterprise computing environment for the discipline must be pursued in technical directions that will allow direct and immediate access to botanical information for the entire systematics community and the public.

Client/server architectures can provide useful, open access to biological data. Because servers return information to client databases in a usable form, they enable local projects to ‘feed’ directly off community information resources. Starting a project database could be as simple and quick as mailing off a few network queries. As information providers, institutions could transform existing projects into network-server resources. Some institutions would be sources of taxon information, others might have catalogues of type specimens. Bibliographic information might be a third network specialty; images of plants, another.

We must develop the systematics computing environment within the framework of existing international network initiatives. Of particular relevance are national and ISO standards for data representation and client/server processes. The possibility of using MARC for accessions and for other types of taxonomic information is an intriguing direction. If we could engineer a ‘Natural History MARC’ standard, then systematics could exploit the libraries’ comprehensive computing and cataloguing systems. That would be a decisive start for networked computing and it would ensure compatibility with evolving academic information systems in the future.

Acknowledgements

J. Gorentz and S. Ozminski of the MSU W.K. Kellogg Biological Station designed and implemented the Ingres specimen data mail server. L. Jizba and her staff, Department of Technical Services, MSU Library, led the MARC specimen cataloguing effort. K. Roubicek of the NSF Network Service Center, K. Klemperer of the Dartmouth College Library, and D. Troll of Carnegie Mellon University provided reference material. Several participants on the BITNET Campus-Wide Information Systems (CWIS-L@WUVMD) and Public-Access Computer Systems Forum (LIBPACS@UHUPVM1) Listserv lists pointed to relevant information. The work was additionally supported by US National Science Foundation Grant BSR 8822696 and an MSU Research Initiation Grant to J. Beaman and J. Beach.

References

Anglo-American cataloging rules (2nd edn) (1988). American Library Association, Ottawa.

Anonymous (1989). MCN [Museum Computer Network] initiates project on computerized interchange of museum information. Spectra, 16, (2), 9.

Arms, C. (1990). The context for the future. In Campus strategies for libraries and electronic information, (ed. C. Arms), pp. 332-56. Digital Press, Bedford, Massachusetts.

Ascott, R. (1990). What is a network? Connectivity, transformation and transcendence. Spectra, 17, (3), 9-11.

Beach, J. H. (1992). Client/server database architecture, networks, and biological databases. In Data management at biological field stations and coastal marine laboratories, (ed. J. B. Gorentz), pp. 48-51. Michigan State University Press, E. Lansing, MI.

Beach, J. H., Pramanik, S., and Beaman, J. H. (1993). Hierarchic structures for taxonomic database systems. In Advances in computer methods for systematic biology: artificial intelligence, databases, computer vision, (ed. H. Fortuner), pp. 241-56. Johns Hopkins University Press, Baltimore, MD.

Bierbaum, E. G. (1990). MARC in museums: applicability of the revised visual materials format. Information Technology and Libraries, 9, 291-9.

Bisby, F. A. (1984). Information services in taxonomy. In Databases in systematics, Systematics Association Special Volume No. 26, (ed. R. Allkin and F. A. Bisby), pp. 17-33. Academic Press, London.

Brentrup, R. J. (1990). Dartmouth College Information System: technical overview. Brochure, obtainable from User Communications Department, Computing Services, Dartmouth College, Hanover, New Hampshire 03755, USA.

Cargill, C. F. (1989). Information technology standardization: theory, process, and organizations. Digital Press, Bedford, Massachusetts.

Crawford, W. (1984). MARC for library use: understanding USMARC formats. Knowledge Industry Publications, White Plains, New York.

Fenly, J. G. and Wiggins, B. (ed.) (1988). The Linked Systems Project, OCLC Library, Information and Computer Science Series No. 6. Online Computer Library Center, Dublin, Ohio.

French, J. C., Jones, A. K., and Pfaltz, J. L. (1990). Summary of the final report of the NSF workshop on scientific database management. SIGMOD Record, 19, 32-40.

Frey, D. and Adams, R. (1990). !%@:: a directory of electronic mail addressing and networks. O’Reilly and Associates, Sebastopol, California.

Gaudette, P. (1989). A tutorial on ASN.1, Automated Protocol Methods Program, Systems and Network Architecture Division, National Computer Systems Laboratory, technical report NCSL/SNA 89/12. National Institute of Standards and Technology, US Department of Commerce, Gaithersburg, Maryland, USA.

Granrose, J. (1990). Anonymous FTP list. Obtainable via anonymous FTP from 128.6.7.38.

IBM (1990). NSFNET: The National Science Foundation Computer Network for Research and Education. Brochure, obtainable from International Business Machines Corporation, Academic Information Systems, 472 Wheelers Farms Road, Milford, Connecticut 06460, USA.

ISO (1981). Documentation - format for bibliographic information interchange on magnetic tape, standard 2709, (2nd edn). Obtainable from Global Engineering Documents, 2805 McGraw Avenue, Irvine, California 92714, USA.

ISO (1987). Information processing systems - open systems interconnection - specification of abstract syntax notation one (ASN.1), standard No. 8824. Obtainable from Global Engineering Documents, 2805 McGraw Avenue, Irvine, California 92714, USA.

Karp, P. D. (1990). The ASN.1 printfile parser and path-manipulation package. National Center for Biotechnology Information, technical report No. 5. US National Library of Medicine, Bethesda. Obtainable via anonymous FTP from ncbi.nlm.nih.gov (130.14.20.1).

Klemperer, K. and Levine, L. M. (1990). Dartmouth College Information System: general overview. Brochure, obtainable from User Communications Department, Computing Services, Dartmouth College, Hanover, New Hampshire 03755, USA.

Kunze, J. A. (1990). UCB Network Information Server: project overview. Unpublished manuscript based on presentation at the Association for Computing Machinery, Special Interest Group on University and College Computing Services, User Services Conference XVIII, Cincinnati, Ohio.

LaQuey, T. L. (1990). The user’s directory of computer networks. Digital Press, Bedford, Massachusetts.

Lynch, C. A. (1991). The Z39.50 information retrieval protocol: an overview and status report. Computer Communication Review, 21, (1), 58-70.

NISO (1988). Information retrieval service definition and protocol specifications for library applications, standard No. Z39.50-1988. Obtainable from Transaction Publishers, Rutgers, The State University of New Jersey, New Brunswick, New Jersey 08903, USA.

OCLC (1990). OCLC annual report 1989-1990. Online Computer Library Center, Dublin, Ohio.

Ostell, J. (1990). Geninfo backbone database overview, National Center for Biotechnology Information, technical report No. 2, version 1.0. US National Library of Medicine, Bethesda. Obtainable via anonymous FTP from ncbi.nlm.nih.gov (130.14.20.1).

Pankhurst, R. J. (1988). Database design for monographs and floras. Taxon, 37, 733-46.

Quarterman, J. S. (1990). The matrix. Digital Press, Bedford, Massachusetts.

St George, A. and Larsen, R. (1991). Internet-accessible library catalogs and databases. Obtainable via anonymous FTP from ariel.unm.edu (129.24.8.1).

Stam, D. C. and Palmquist, R. (1989). SUART: a MARC-based information structure and data dictionary for the Syracuse University art collection, project report. Obtainable from The Museum Computer Network, School of Information Studies, Syracuse University, Syracuse, New York 13244, USA.

Troll, D. A. (1990). Library Information System II: progress report and technical plan. Public-access Computer Systems Review, 1, 4-29. Obtainable from Listserv at Bitnet node UHUPVM1.

USMARC formats for bibliographic data, including guidelines for content designation (1988). Library of Congress, Network Development and MARC Standards Office, Washington, USA.

24. Networks and communications: the Internet

KAREN C. ROUBICEK
NSF Network Service Center, Bolt Beranek and Newman Inc., Cambridge, MA 02138, USA

Abstract

A design for a global plant species information system must provide for mechanisms that facilitate collaboration and communications between geographically separated research groups working on common problems and that provide remote access to unique resources. Computer networks reach users at thousands of institutions in countries throughout the world. Wide-area computer networks, once the exclusive domain of computer scientists, have become an indispensable tool to members of the scientific community. Researchers in biology, physics, chemistry, and other disciplines routinely use supercomputers, databases, and on-line bibliographic information regardless of their locations. As the academic, government, and industrial sectors within and outside the United States commit more resources to communications infrastructures, the use of applications such as videoteleconferencing and high-resolution image transfer will become more common. New developments in very-high-speed networking and networked digital library systems will play an important role in the way that scientists do their work and in the kinds of research that they can pursue.

Introduction

Computer networking, both local and wide area, is a tool that allows network users to make productive use of the research data that they have collected. Scientists in disparate locations who are working to solve common problems use networks to discuss their ideas and discover solutions. In recent years there has been an explosion in the development of network technologies and in the creation of new communication paths. Researchers have many options to choose from, and those with greater resources can take advantage of multiple options. In his book The Matrix, John Quarterman (1990) describes a worldwide metanetwork of hundreds of connected networks that form the Matrix. Protocols, the rules or conventions that two or more computers must follow in order to communicate, vary across networks.

© Systematics Association, Special Volume No. 48, ‘Designs for a Global Plant Species Information System’, edited by F.A. Bisby, G.F. Russell, and R.J. Pankhurst, 1993, pp. 255-61. Oxford University Press, Oxford.

The Internet

The Internet is a logical network comprising a global set of interconnected national and regional networks that use the TCP/IP communications protocol. Because increasing numbers of researchers around the world are either directly connected to the Internet or have access to it via gateways, they can take advantage of an already existing communications infrastructure. The Internet is important to the community of botany researchers for four reasons: it allows the rapid transfer of data for efficient interactions, and resource and information sharing between collaborating research groups; it reduces communication costs by permitting the sharing of common bandwidth; it provides easy access to the international scientific community; and it provides reliable wide-area connectivity.

The current Internet has its origins in the ARPANET, the network established by the Advanced Research Projects Agency of the US Department of Defense in the late 1960s. The Internet connects over 2000 networks at university campuses, government laboratories, research and development divisions of commercial organizations, and military installations. These sites are located in North America, Latin America, Europe, and the Pacific Rim.

The international Internet

1. NSFNET

In the United States, the National Science Foundation’s NSFNET has expanded exponentially since its inception in 1986; it currently interconnects hundreds of institutions via regional and state networks across the country (see Fig. 24.1). Networks established by other US government agencies, such as the Department of Energy and the National Aeronautics and Space Administration (NASA), are connected to the NSFNET by gateways which pass traffic from one network to another. Commercial services, such as Compuserve and MCI Mail, are also accessible via gateways to the Internet. Gateways that connect the NSFNET to networks in countries outside the


PART 4. Data structures and software

25. Alternative models for taxonomic data

CATHERINE ZELLWEGER* and ROBERT ALLKIN†
*Conservatoire et Jardin Botaniques, Genève, CH-1292 Chambésy, Switzerland
†Royal Botanic Gardens, Kew, Richmond, Surrey TW9 3AB, UK

Abstract

A sequential series of data analysis and design stages should precede implementation of any information system. This chapter describes the sequential nature of design and then concentrates on the earliest, more conceptual stages. We emphasize that important botanical decisions are necessary at the beginning of any project and explore the repercussions of such design decisions. The first design step is to generate a conceptual model for a particular research area (for example, systematics) that identifies the objects under study and their properties. This model should also define the relationships among those objects. There are alternative modelling methods, and any conceptual model built will reflect the constraints implicit in the data model used to build it. The conceptual model is, however, independent of implementation constraints and serves as a formal tool to interpret and understand what data analysts term the ‘real world’. This term gives the misguided, if reassuring, impression that there is a single reality: a unified view of objects and their interrelations. Unfortunately, in practice, even the conceptual model for an information system will reflect the particular purpose of those building the system. Different projects, or even individuals within a project, will have different purposes and priorities. Depending on one’s purpose and priorities one model or another may be more ‘elegant’: none is uniquely correct. It is essential that a clear statement of intent and detailed descriptions of the information ‘products’ required of any project are established and agreed before further design work can proceed. We use simple entity-relationship diagrams to explore the differences between alternative conceptual models for information systems for projects that are primarily taxon-based or primarily name-based. The literature currently reflects confusion over the significance of this important difference.

© Systematics Association, Special Volume No. 48, ‘Designs for a Global Plant Species Information System’, edited by F.A. Bisby, G.F. Russell, and R.J. Pankhurst, 1993, pp. 265-74. Oxford University Press, Oxford.

Introduction

Within the last 10 years there has been increasing activity within botanical database management. A great diversity of database projects has been initiated, varying in numerous respects: purpose, size, the scope of the information contained, the class of user, the mechanisms and degree of access allowed to end users, and the data processing methods, software, or hardware used. The diversity among these systems reflects the diversity of purpose for which they were built, although all form part of an existing global plant information system (GPIS). Given the shortage of skilled botanical manpower, existing institutional commitments, and the manner in which our community works, any GPIS will continue to involve diverse projects with their own agenda, databases, and technology (Leonard, personal communication).

The diversity of systems also reflects the fact that, even given apparently similar objectives, different design choices result in very different systems. Trapped between the complexity of the information that we handle (Allkin 1984), the limitations of the available analytical or conceptual techniques, and above all of available software (Allkin 1988), projects have simplified data management. How they have chosen to simplify their database design reflects different perceived or implicit priorities.

During these last 10 years, there have been significant advances in both database technology and computer science. Possibly most significantly, new techniques and models have become available to botanists. Botanists are also beginning to become more critical in their understanding of the nature of their various tasks. These advances will continue in both fields, possibly at an accelerated pace. Designs for a GPIS today must therefore be forward-looking, since we will not create the ultimate system overnight.

Techniques for modelling reality

During the 1970s there was a perceived need for a radically new approach to developing information management systems. This resulted in new methodologies for modelling information systems and in particular for database construction. In 1975 an ANSI/SPARC report (Yormark 1977) described a three-level architecture for modelling information systems that forms the basis of the modelling methodologies that we inherit today. We cannot describe theoretical aspects of this model here but present a brief account of the three sequential design levels recognized (Fig. 25.1).

1. The conceptual model

This level most closely approximates to some perception of the real world. It attempts a formal description of the domain of interest and is essentially independent of implementation constraints. It should, therefore, provide a relatively stable representation of the domain over the longer term. A conceptual model of a botanical information system will inevitably change, however, as our understanding of that domain improves. How well we understand that domain is to some extent dependent on the modelling techniques that we use.

2. The logical model

The logical model (sometimes called the external schema) represents both the data themselves and the treatment of those data within the context of a particular application. This, therefore, is a dynamic representation of the data which inherits definitions of the elements from the conceptual model and reflects the expressed needs and constraints of the particular application.

3. The physical model

The physical model (also known as the internal schema) completely describes a working database system. It reflects the practical solutions chosen and constitutes a detailed implementation plan for a working system derived, nevertheless, from the logical model.

[Diagram: REAL WORLD, then CONCEPTUAL MODEL, then LOGICAL MODEL, then PHYSICAL IMPLEMENTATION, each derived from the one before.]

Fig. 25.1. A three-level architecture for information system design.
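The three levels can be made concrete with a small sketch, here in Python with the standard `sqlite3` module. Everything in it is an assumption chosen for illustration: the two entities, the checklist application, and the table layout are hypothetical, and a real design would involve far more than two entities. The sketch is meant only to show that the same conceptual definitions survive unchanged while the application view and the storage decisions vary.

```python
import sqlite3
from dataclasses import dataclass


# Conceptual level: what exists in the domain, with no storage decisions.
@dataclass(frozen=True)
class Taxon:
    name: str


@dataclass(frozen=True)
class Specimen:
    accession: str
    taxon: Taxon


# Logical level: the data as one particular application needs them
# (a hypothetical checklist reporting specimen counts per taxon).
def checklist_rows(specimens):
    counts = {}
    for s in specimens:
        counts[s.taxon.name] = counts.get(s.taxon.name, 0) + 1
    return sorted(counts.items())


# Physical level: one concrete implementation choice (SQLite tables and keys).
SCHEMA = """
CREATE TABLE taxon (id INTEGER PRIMARY KEY, name TEXT UNIQUE NOT NULL);
CREATE TABLE specimen (
    accession TEXT PRIMARY KEY,
    taxon_id INTEGER NOT NULL REFERENCES taxon(id)
);
"""


def store(specimens):
    """Load conceptual objects into an in-memory database and return it."""
    db = sqlite3.connect(":memory:")
    db.executescript(SCHEMA)
    for s in specimens:
        db.execute("INSERT OR IGNORE INTO taxon (name) VALUES (?)", (s.taxon.name,))
        (taxon_id,) = db.execute(
            "SELECT id FROM taxon WHERE name = ?", (s.taxon.name,)
        ).fetchone()
        db.execute("INSERT INTO specimen VALUES (?, ?)", (s.accession, taxon_id))
    return db
```

Replacing SQLite with another storage system would change only `SCHEMA` and `store`; a second application with different reporting needs would add another function alongside `checklist_rows`; neither change touches the `Taxon` and `Specimen` definitions.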

Thus development of a database system from conception to final realization conventionally comprises a series of design activities that cascade one from the other. If users have supplied the designer with the correct ingredients at the outset of the design exercise, along with a sufficiently precise, unambiguous set of constraints, then these techniques are designed to ensure that the resulting system meets the users’ needs. It is recognized that by far the majority of project failures result from poor initial system specification. Most of the cost of resolving errors in working systems (80 per cent) is attributable to design errors; only 20 per cent is attributable to errors made during system implementation (Bell et al. 1987).

Entity-relationship diagrams

Since the conceptual model is independent of choices made during implementation and is closest to our mental image of the real world, it provides an excellent communication tool. It allows discussion about alternative data models which is completely divorced from implementation issues. The entity-relationship model (Chen 1976) is one model which has become widely used. Since the objects of study are represented using a formal mechanism, botanists and information specialists can communicate and the former can actively participate in this design phase: it is essential that their view be expressed and incorporated in the final design (King 1987).

The entity-relationship model uses a representation of the real world based on two classes of object: entities and relationships. One diagrammatic technique used permits explicit representation of the model. Figure 25.2 gives an example of this technique and of the conventions that we follow.

‘Entities’ are object classes that we can identify and for whose members we wish to store information. Diagrammatically, entities are normally represented by boxes. Thus in Fig. 25.2 one box represents the class of objects ‘taxa’ and the other box represents the class of objects ‘specimens’. We may choose to record information about either specimens or taxa.

[Diagram: two boxes, ‘taxa’ and ‘specimens’, joined by a line labelled ‘is a member of’.]

Fig. 25.2. An entity-relationship diagram for ‘taxa’ and ‘specimens’.

‘Relationships’ are the names given to particular associations existing among entities. They are generally represented diagrammatically as diamonds, circles, or simply as a line linking the two associated entities with the name of the association written above. The example in Fig. 25.2 shows a ‘membership’ association between specimens and taxa. The model also permits representation of semantic information such as the nature of each pairwise relationship: is it one-to-one, one-to-many, or many-to-many? This information is represented in Fig. 25.2 by use of a forked branch to indicate a multivalued end to the relationship. One taxon may have many specimens. One specimen can only belong to one taxon. A many-to-many relationship would be represented by a line forked at either end.

Alternative models for taxonomic data

1. Modelling the name-taxon relationship

To illustrate the use of entity-relationship diagrams to model taxonomic data we will explore one of the common and most fundamental elements in taxonomic data management systems: the relationship between taxa and names. There is current, often confused, debate revolving around the merits of ‘taxon-based’ or ‘name-based’ database systems. We will illustrate how entity-relationship models and other analytical tools may be used to resolve such confusion and further add to our understanding of our work as systematists.

We begin by defining our conceptual model. A taxon (such as a variety, species, genus, family) is an entity considered by some taxonomists to be either a biological reality or a useful abstract notion. Either way it represents a collection of populations of individual specimens. A name, any Latin binomial, trinomial, or indeed uninomial (e.g. Vicia L.), is a label used, either as a synonym or as an accepted name, to refer to one of these biological or abstract entities. The species concept is one issue we will not discuss here although it clearly has a bearing on conceptual models for biological information systems!

The second stage is to use entity-relationship diagrams to model this ‘reality’. Individual taxa are related to one another within a hierarchical framework. A series of ‘one-to-many’ relationships (Fig. 25.3(a)) exists between pairs of taxa since one taxon can ‘own’ any number of taxa at the next level down while itself belonging to only one taxon at the next highest level. Names may also have ‘one-to-many’ relationships among themselves: various orthographic variants can be linked to one ‘correct’ spelling (Fig. 25.3(b)), a number of homotypic synonyms can be linked to a basionym, and numerous heterotypic synonyms may be linked to an accepted name. These three example relationships between names are not equivalent in taxonomic parlance. They illustrate a decreasing degree of binding: orthographic variants are linked far more intimately to one another than are heterotypic synonyms. Description of such relationships requires a more exhaustive model than we can attempt here. Note that we also ignore for our current purpose that each name is itself composed of a number of components including the authority.

C. Zellweger and R. Allkin

Fig. 25.3. Entity-relationship diagrams: (a) a taxonomic relationship (‘is a member of’); (b) a name-to-name relationship (‘orthographic variant of’).

The crucial relationship between taxa and names that we intend to examine in more detail here is a ‘many-to-many’ relationship (Fig. 25.4(a)). A taxon may have any number of synonyms or other names associated with it. A single published name may be used to refer to a number of biological entities. To clarify such relationships convention dictates that we first invoke a third, intermediate entity. This ‘name-taxon’ entity might be thought of as a ‘name use’ entity, i.e. individual cases of this entity would be the application of any particular name to one taxon (Fig. 25.4(b)). Instead of a single ‘many-to-many’ relationship we now have two ‘one-to-many’ relationships to consider. One name may be used in a number of different senses and one taxon may have a number of different names applied to it. A fourth ‘reference source’ entity could be added to this diagram, linked directly to the ‘name-taxon’ entity and thus providing a precise mechanism for citations (Fig. 25.4(c)).
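The resolution of the many-to-many relationship into two one-to-many relationships can be sketched in a few lines of Python. This is only an illustration of the conceptual model, not an implementation proposed in the chapter: the class names, identifiers, reference titles, and the second taxon to which Ervum faba L. is applied are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Name:
    name_id: int
    text: str


@dataclass(frozen=True)
class Taxon:
    taxon_id: int
    rank: str
    parent_id: Optional[int]  # each taxon belongs to at most one parent taxon


@dataclass(frozen=True)
class Reference:
    ref_id: int
    citation: str


@dataclass(frozen=True)
class NameUse:
    """Intermediate 'name-taxon' entity: one application of one name to one taxon."""
    name_id: int
    taxon_id: int
    ref_id: int  # optional fourth entity: the source in which the name was so used


# Hypothetical sample data, invented for illustration only.
NAMES = {1: Name(1, "Vicia faba L."), 2: Name(2, "Ervum faba L.")}
TAXA = {
    10: Taxon(10, "genus", None),
    11: Taxon(11, "species", 10),
    12: Taxon(12, "species", 10),
}
REFS = {100: Reference(100, "hypothetical flora"), 101: Reference(101, "hypothetical checklist")}
USES = [NameUse(1, 11, 100), NameUse(2, 11, 100), NameUse(2, 12, 101)]


def names_for_taxon(taxon_id: int) -> List[str]:
    """One taxon may have many names applied to it."""
    return [NAMES[u.name_id].text for u in USES if u.taxon_id == taxon_id]


def taxa_for_name(name_id: int) -> List[int]:
    """One name may be used in several senses, i.e. for several taxa."""
    return [u.taxon_id for u in USES if u.name_id == name_id]


print(names_for_taxon(11))
print(taxa_for_name(2))
```

Each `NameUse` row carries exactly one name, one taxon, and one reference, so neither the name nor the taxon has to hold a list of the other.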

Fig. 25.4. Entity-relationship diagrams: names, taxa, and references: (a) names and taxa; (b) an intermediate entity; (c) precise nomenclatural citation.

2. Recording descriptive data

There will undoubtedly be some information in any taxonomic database that relates specifically and uniquely to names: information about the type, the place and date of publication of that name, and so on. Such information clearly should be linked to the name entity. A more interesting question, however, is to which entity in our model we should attach morphological, ecological, or chromosome data. For example, does the observation ‘Adansonia digitata L. is a tree’ provide us with information about the name or the taxon?

Initially we would probably respond that such an observation relates directly to the biological entity (Fig. 25.5(a)): the name is after all only the label used to refer to the taxon.

An alternative view is that a taxon is an artefact and that its description comprises the sum of the data published under any of the alternative names currently assigned to it. Rather than having published observations such as ‘Ervum faba L. is edible’ attached to the taxon, this model associates the observation directly with the name Ervum faba L. (Fig. 25.5(b)). Thus the link between an observation and the name originally used by its author is retained. Systems implemented on this model can use such information. Should nomenclatural rearrangement occur, for example, and E. faba, currently a synonym of Vicia faba L., become placed in synonymy under another taxon, then the observation ‘is edible’ will cease to be a property of V. faba and simultaneously contribute to the body of knowledge for the new taxon.

A third view recognizes further that names are not necessarily used by different authors to refer to taxa with the same circumscription. Data observations therefore are attached to the ‘usage’ of a particular name to refer to a particular taxon on a particular occasion (Fig. 25.5(c)). Systems built on such a model, in addition to reflecting nomenclatural changes, may in time also reflect multiple nomenclatural views simultaneously.

Fig. 25.5. Alternative views of descriptive data: (a) taxon based; (b) name based; (c) taxon-name based.

Notice that thus far we have given no consideration to implementation. We have, for example, still to consider how we wish to derive taxon descriptions (for example, of a genus) based on a summation of member taxa (for example, descriptions of all species in that genus). Indeed, whether we wish to do that at all may depend upon the particular project. We have certainly avoided considerations of whether one model or another is easier, cheaper, or quicker to implement as a working system.
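The practical consequence of the name-based view (Fig. 25.5(b)) can be shown with a minimal Python sketch. It assumes that observations are stored against the name under which they were published and that a taxon description is assembled on demand from the current synonymy. The function name, the taxon labels, and the placeholder observation for Vicia faba L. are hypothetical; only ‘is edible’ for Ervum faba L. comes from the example in the text.

```python
from typing import Dict, List

# Observations stay attached to the name under which they were published.
observations_by_name: Dict[str, List[str]] = {
    "Ervum faba L.": ["is edible"],
    "Vicia faba L.": ["placeholder observation"],  # invented for illustration
}


def describe_taxon(taxon: str, placement: Dict[str, str],
                   observations: Dict[str, List[str]]) -> List[str]:
    """Sum the observations published under every name currently placed in the taxon."""
    found: List[str] = []
    for name, placed_in in placement.items():
        if placed_in == taxon:
            found.extend(observations.get(name, []))
    return sorted(found)


# Current synonymy: both names refer to the same (hypothetically labelled) taxon.
placement = {"Vicia faba L.": "taxon 1", "Ervum faba L.": "taxon 1"}
before = describe_taxon("taxon 1", placement, observations_by_name)

# Nomenclatural rearrangement: E. faba moves into synonymy under another taxon.
placement["Ervum faba L."] = "taxon 2"
after = describe_taxon("taxon 1", placement, observations_by_name)
moved = describe_taxon("taxon 2", placement, observations_by_name)

print(before, after, moved)
```

No observation record is edited when the synonymy changes; only the placement table is. The third view (Fig. 25.5(c)) could be sketched in the same way by keying observations on a name-use record rather than on the name alone.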

Conclusions

Use of a data model such as entity-relationship diagrams permits the comparison and exploration of different approaches to viewing an information system in a particular domain. Alternative models are cheap to produce, are independent of subsequent implementation constraints, and should be used as communication tools in the design process. An entity-relationship diagram gives a clear, unambiguous view of the data. Techniques such as these offer taxonomists themselves the chance to define their system and make their own choices. They reduce the impact of constraints dictated by the available hardware and software, and taxonomists become more independent of systems analysts who think they know what it is that taxonomists do. Choices made in the conceptual schema allow subsequent precise definition of the requirements to be met by the implemented system. This will eventually qualify the selection of suitable software and hardware.

Through funding and other pressures it is common practice for projects to expect information systems to be built in haste. It is certainly expected that just ‘pieces’ of the system can be got working in a short time. It is now well documented that to implement any system without sufficient care and attention being paid to conceptual analyses early in the design phase, or to seeking out all possible ‘views’ of the data model design, is foolhardy and very costly, particularly in the longer term.

References

Allkin, R. (1984). Handling taxonomic descriptions by computer. In Databases in systematics, Systematics Association Special Volume No. 26, (ed. R. Allkin and F.A. Bisby), pp. 175-88. Academic Press, London.

Allkin, R. (1988). Taxonomically intelligent database programs. In Prospects in systematics, (ed. D.L. Hawksworth), pp. 315-31. Clarendon, Oxford.

Bell, D., Morrey, I., and Pugh, J. (1987). Software engineering. Prentice Hall International, London.

Chen, P. P.-S. (1976). The entity-relationship model: toward a unified view of data. ACM Transactions on Database Systems, 1, 9-36.

King, P.J.H. (1987). The database design process. In Entity-relationship approach, (ed. S. Spaccapietra), pp. 475-88. Elsevier, Amsterdam.

Yormark, B. (1977). ANSI/X3/SPARC/SGDBMS architecture. In The ANSI/SPARC DBMS model, (ed. D.A. Jardine), pp. 1-16. North Holland, Amsterdam.

26. Practical links between specimen and taxon databases

ROBERT E. MAGILL
Missouri Botanical Garden, P.O. Box 299, St Louis, MO 63166, USA

Abstract

The development of a general-purpose indexing scheme capable of providing cross-reference addresses to related, although not necessarily format-compatible, databases will be required for access to the complete range of botanical data incorporated in a ‘global botanical database’. Botanists and horticulturists are no longer content with access to a single data source. They want easy access to a variety of resources with transparent seams between them. An indexing scheme using a modular names approach, similar to that currently in use by several database projects, could provide greater access to data, either by providing properly formatted entry identifiers to multiple databases running on the same platform or by providing access to summarized data capsules supplied by other data sources. A data bank of name elements would be a necessary component: it would provide a library of units available for sending a name to the index and would also serve as a glossary of available, correct, or ‘accepted’ pieces of a modular name. The records in the indexing scheme would point to record identifiers for data stored by names from a variety of sources, for example botanical (nomenclatural and/or taxonomic), horticultural (hybrid, grex, cultivated), ecological, and floristic (flora, checklist), and could provide additional cross-referencing to specimen or collections data associated with a specific name.

The botanical community has only just begun to accumulate information about plants electronically and already the potential is so great and the need so pressing that there is little doubt that a global system will be put into place. We must acquire the means to provide responses to critical questions from the ‘public’, as well as interdisciplinary and broadly based botanical projects. So today we are building marvellous repositories of information about plants in specimen databases, taxon databases, bibliographic, floristic, and medical databases, to name only a few.

© Systematics Association, Special Volume No. 48, ‘Designs for a Global Plant Species Information System’, edited by F.A. Bisby, G.F. Russell, and R.J. Pankhurst, 1993, pp. 275-83. Oxford University Press, Oxford.

Increasingly, questions are being asked that cannot be answered by accessing only one of these sources. At present, however, we are unable to tap the reservoir of information captured in the many small, individual databases or the larger botanical storehouses. Likewise, many databases, horticultural or medical-use for example, are still being built in ways that isolate them from repositories of conservation or botanical data. Thus we continue to drift away from a global information system, citing institutional priorities, hardware, software, and incompatibility as reasons for our reluctance to co-operate.

The problems of interchange between specimen- and taxon-based botanical systems will serve as a model for the indexing technique outlined below. This idea could be applied to any databases where plant names are incorporated. The plant name is the key element running through all botanical projects. Each project dealing with plant data has approached the problem of storing and retrieving names in a slightly different way, but in general each record contains a name or pointers to one. The plant name may be the primary element in a nomenclatural database, a temporary identity attached to a record in a specimen database, or only a keyword in an ecological or bibliographic database, but it provides the key needed to unite all of these sources.

What I would like to propose is a universal cross-indexing technique that could provide the bridges necessary to link previously distinct data types, for example specimen-taxon, horticultural-conservation, nomenclatural-horticultural, or simply data of different formats in separate or isolated databases. I have designed this cross-indexing technique so that it could be incorporated within a user’s current software indexing mechanism or, more likely, especially with smaller systems, run as a separate module used only to connect the user’s system to related exotic data elements. The link between the user’s database and the exotic data elements would be plant names contained somewhere within the user’s database.

The scheme consists of five parts. The first is a user’s database that contains plant names either as text or in a coded form. The second part is a small program or routine that would manipulate the name to provide the name-key to the cross-index. The third part is the cross-index itself, most often a flat file containing some number of records that hold pointers to the exotic database elements. The fourth is the data files holding the exotic data, and the fifth is a program to build and maintain the cross-index.
It is probably not necessary to expand on the first part of the scheme, except to say that the plant name need not be the record identification number or even a major element in the user’s database. Any part of the database that could provide access to a plant name might be used. In most cases this will probably be a single name; but lists of names, for example a list of synonyms, the parents of hybrids, or specimen lists from a plot, could be used to address multiple connections by expanding the complexity of the scheme.

The second part of the cross-indexing scheme is a program that would provide access to the cross-index and the exotic data elements. The program would need to be called from one or more prompts after a record from the user’s database is present. At the minimum level this program would be required to perform five functions: (1) obtain a name from the current record, (2) perform the proper coding of the supplied name, (3) read/display the list of records in the cross-index for that name, (4) display the exotic data elements as requested (although this might be handled by other modules or the main program, depending on the system in use), and (5) return control to the main program. These functions require a few words of explanation:

1. Obtaining a name might be as simple as passing a names variable from the current record to the program, or it might require additional instructions from the user, for example an indication of which field contains the name or names list.

2. Performing the proper coding of the supplied name will require two steps. First the program must receive instructions on the type of coding to be performed. The type of database the name is coming from and going to will determine how the name is coded. This scheme is intended to provide access to a variety of databases that use various ‘types’ of names. For example, a name provided by a nomenclatural database will not reflect intermediate ranks that are present in a taxonomic database. Likewise, a horticultural database may include a great number of name types, for example grexes, hybrids, and even common names. Some of these names may not have a direct connection to another database, or the level of connection may not be parallel. For example, the fictitious cultivar Rosa carolina cv newcolor in a horticultural database might only refer back to Rosa carolina in a nomenclatural database. The second step is the actual coding of the name using the guidelines set out by the coding instructions. A simplified flow chart of one such procedure is provided in Appendix 1.

3. Read/display list of available records in the cross-index for the name. This may require an additional step if there are several exotic databases to choose from.

4. Display selected record(s) from the exotic data file.

5. Return control to the calling program.

This brings us to the third part, and the most critical aspect of the cross-indexing technique, the index itself. The cross-index consists of two files, a file of name parts that could be used to code the names and the index of exotic records. The names file would consist of all name parts (as examples I have only used those of genus and below, although all ranks could be used) that have been properly published and/or are in current use (Table 26.1). This may be an excellent time to set up such a list, with the Current Use projects under way and a good list of genera in ING (see Greuter, Chapter 11 this volume).

The need for the names list is twofold. First, it would provide a standardized list of correctly spelled name parts to be used in the coding of names. This file would also be helpful in cleaning existing databases by alerting users to misspellings and incorrect endings, as well as helping to standardize the names we provide to the ‘public’. Second, although the name itself could be used as the key, internal numeric coding would shorten the length of the cross-index identifier. Any number of routines are available for manipulating the file of name parts. One simple method that could provide quick access to the names or their codes would be to use both as record identifiers in the file, thus eliminating the need for additional indexing.

Table 26.1. Examples of items in name parts file.

GEN                     SPEC                    RANK
rec        data         rec        data         rec          data
1          Aa           1          acba         genus
2          Aachenia     19         alba         subgen       sg
:                       :                       section      se
102        Acacia       435        beta         subsect      st
:                       :                       species
3456       Iris         3284       caspus       subspecies   sp
:                       :                       variety      va
35000      Zea          12365      gerria       subvariety   sv
Aa         1            45023      Helmit       forma        fo
Aachenia   2            78504      newcolor     :
Acacia     102          acba       1            hybrid       X
:                       alba       19           grex         gx
Iris       3456         :
:                       beta       435
Zea        35000
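The idea of using both the names and their codes as record identifiers can be sketched with a single lookup table. This Python fragment is an illustration only: the function names are hypothetical, and an in-memory dictionary stands in for whatever file structure a real system would use. The genus entries are those shown in Table 26.1.

```python
from typing import Dict, Iterable, Tuple


def build_parts_file(entries: Iterable[Tuple[int, str]]) -> Dict[str, str]:
    """Return one table keyed by both the numeric code and the name part."""
    table: Dict[str, str] = {}
    for code, part in entries:
        table[str(code)] = part   # code -> name part
        table[part] = str(code)   # name part -> code
    return table


# Genus entries as given in Table 26.1.
GEN = build_parts_file(
    [(1, "Aa"), (2, "Aachenia"), (102, "Acacia"), (3456, "Iris"), (35000, "Zea")]
)


def is_known_part(word: str, table: Dict[str, str]) -> bool:
    """True if the word is listed; a miss may indicate a misspelling or a new name."""
    return word in table


print(GEN["Acacia"])                 # code for a genus name
print(GEN["102"])                    # genus name for a code
print(is_known_part("Acasia", GEN))  # a misspelled genus is not found
```

Because every entry is stored under both keys, no separate index is needed to go from name to code or from code to name, which is the saving the text describes.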


Table 26.2. Example of coded names.

Name                            Specimen database     Nomenclatural database
Acacia                          102                   102
Acacia karroo                   102-6284              102-6284
Acacia karroo var. alba         102-6284va19          102-6284va19
A. karroo ssp. alba var. beta   102-6284sp19va435     102-6284va435
A. karroo ‘newcolor’            102-6284cl78504       102-6284
A. x rubra                      102X12365             102X12365
A. rubra x alba                 102-12365X19          102-12365 and 102-19
A. grex Helmut                  102gx45023            (no connection)

The coding itself (Table 26.2) would be similar to schemes used in several database projects (ESFEDS, PRECIS, and TROPICOS) in combining numbers associated with name parts in a logical sequence separated by rank delimiters. Some database projects may choose to store only the coded names, for a variety of reasons: for example, consistency of stored data, ease of global corrections, size of stored object, and quicker access to the coded data. The coding could be performed by the cross-indexing program as needed. The coding of names would vary depending on the form of the name in the exotic database, i.e. nomenclatural, taxonomic, or horticultural.

The individual cross-index records would have these codes as record identifiers. Each individual record in the file would hold element identifiers for all of the items in the exotic database that used the name (Table 26.3). The cross-index could be developed in a multivalued format, with each record containing pointers to many other databases, or separate indexes could be maintained for each database. For example, using a multivalued index, the first attribute might contain all of the horticultural record numbers using this name, while attribute 2 holds the record identifiers with the same name from a specimen database. The other option would be to have separate indexes for each database, or perhaps additional codes could be added to the index identifiers to identify the different information sources.

Table 26.3. Examples of cross-index file types.

Multivalued file: INDEX
Record id       Attribute   Values
102             001         U8945]870123]881002...
                002         19]89765]1184]...
102-6284V19     001         U9873]880001]...
                002         104]106...
102gx16010      001
                002         87454]87455...

Flat files
Record id       INDEX.NAME          INDEX.HORT
102             001 U8945           001 19
                002 870123          002 89765
                003 881002          003 1184
102-6284V19     001 U9873           001 104
                002 880001          002 106
102gx16010      (no connection)     001 87454
                                    002 87455

The fourth part of the cross-indexing scheme is the data from the exotic sources. This could be files specifically formatted for the user’s database if the same software were being used by both projects, as is the case with the botanical and horticultural databases at the Missouri Botanical Garden, or it might be as simple as a synopsis text record in ASCII. The data would come from the exotic database with its original record identifier so that queries, corrections, or updates could be returned easily. I see this latter case as the most probable scenario for most exchanges. The synopsis text would consist of a brief summary of the information in the exotic database, probably with the coded name appearing as the first item in each record. The data could also include contact names or addresses as necessary to identify the source database.

The fifth part is a program that will create and maintain the cross-index as data are added from other sources. Since the data from these exotic sources are generally for display only and probably would not be altered, this program would only need to add the record numbers from the exotic file to the cross-index record for the taxon. If the taxon code was not supplied with the record, coding of the name would need to be done first. Using a universal name part numbering system would ensure that the information from various data sources would be associated with the proper taxon. It would also result in fewer problems with spellings and current usage. This is assuming general acceptance of the ‘Names in Current Use’ project; otherwise more information would be required in each key to separate homonyms.

Exchange of data has posed major problems, especially when data are moved to a new environment or if intercalation of an exotic data set is anticipated. These problems have been addressed by the XDF format for these types of exchanges. Addition of supplemental data (value-added files) or reference pointers to small isolated databases is, however, another matter. Connections between isolated databases could provide many benefits now, as we work toward the global system. One such example is the connection between pcTROPICOS and BG-BASE that is currently under way (Fig. 26.1).
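The cross-index records and the maintenance program (the third and fifth parts of the scheme) can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the software in use at the Missouri Botanical Garden: the class and method names are hypothetical, the source labels ‘NAME’ and ‘HORT’ are borrowed from the file names in Table 26.3, and the ‘]’ character is assumed to be the value separator shown in that table.

```python
from collections import defaultdict
from typing import Dict, List


class CrossIndex:
    """Coded name -> source label -> identifiers of records in that exotic source."""

    def __init__(self) -> None:
        self._records: Dict[str, Dict[str, List[str]]] = defaultdict(
            lambda: defaultdict(list)
        )

    def add(self, coded_name: str, source: str, record_id: str) -> None:
        """Maintenance step: append an exotic record id, ignoring repeats."""
        ids = self._records[coded_name][source]
        if record_id not in ids:
            ids.append(record_id)

    def lookup(self, coded_name: str, source: str) -> List[str]:
        """Return the pointers held for one name in one exotic source."""
        if coded_name not in self._records:
            return []
        return list(self._records[coded_name].get(source, []))

    def as_multivalued(self, coded_name: str, source: str) -> str:
        """Render one attribute in the multivalued style of Table 26.3."""
        return "]".join(self.lookup(coded_name, source))


index = CrossIndex()
for rid in ("U8945", "870123", "881002"):
    index.add("102", "NAME", rid)
for rid in ("19", "89765", "1184"):
    index.add("102", "HORT", rid)
index.add("102", "HORT", "19")  # a repeated pointer is not stored twice

print(index.lookup("102", "HORT"))
print(index.as_multivalued("102", "NAME"))
print(index.lookup("102gx16010", "NAME"))  # no connection: empty list
```

Keeping one structure keyed first by coded name and then by source covers both layouts described in the text: reading all sources for a name corresponds to the multivalued record, and reading one source across names corresponds to a separate flat index.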
We are able to use the above scheme to access pcTROPICOS nomenclatural data from a name record within BG-BASE or examine horticultural specimen records from pcTROPICOS. Since both databases use REVELATION, the simplest scenario cited above, i.e. passing the name variable to the opposite program, has been used. Another example, using the multivalued cross-indexing option, is currently used to connect herbarium specimens at the Missouri Botanical Garden to the name record in TROPICOS (or the reverse).

Fig. 26.1. Simplified schematic diagram showing connection of BG-BASE and pcTROPICOS using the cross-indexing technique. [Diagram not reproduced. Recoverable labels: input name; BG-BASE; Cross-Index; full name conversion program; links to nomenclatural and exsiccatae data in pcTROPICOS; link to HORT accession IDs. Recoverable note: to gain access to pcTROPICOS and other specimen databases from BG-BASE, the name would be passed from BG-BASE to the Cross-Index, which interprets the name and passes the results to the index of the database requested.]

I believe that this cross-index scheme could begin to provide the necessary pointers, reference points, and supplementary data that many researchers need to respond adequately to critical conservation and legislative questions. These data elements need not be any more than flags indicating, for example, that the plant exists in a botanical garden or where an illustration may be found; but they would begin to provide the connection that we will need for a global system. The global system will undoubtedly also have constraints on it: size, scope of coverage, format, etc. This type of cross-indexing scheme could provide the same type of pointers, reference points, and supplementary data to the global system, thus expanding it well beyond its initial limits.

Appendix 1: Simplified flow chart of cross-index program operations

1. INPUT NAME
All names entered from keyboard would pass to the index of the parent program or an option menu if more than one system were running on the same computer, e.g. BG-BASE and pcTROPICOS. If the name is being passed from the parent program for access to another database goto 2.

2. Check current (usually first) word against GENUS file.
2a. if present, add code to XNAME, LEVEL=1; goto 3.
2b. is word an acceptable annotation; get next word and goto 2.
2c. error condition in effect. Set NEW.NAME flag or report condition and RETURN to parent program.

3. Check for next word; if not present goto 8.
3a. is word a rank; goto 4
3b. is word an author; goto 5
3c. is word an acceptable annotation; goto 6
3d. is word a name; goto 7

4. Check rank file.
4a. if valid rank, add code to XNAME and
    1. if nomenclatural rank; goto 3.
    2. if hybrid, check placement of ‘X’ and return to 2 or 3 depending on current LEVEL.
    3. if cultivar or grex then only one more name possible; set CULTIVAR FLAG; goto 3.
4b. if word is not a rank; goto 3b.

5. Check author file.
5a. if valid author; goto 3. (Authors may be needed in some situations, thus their codes might be added to XNAME or to a separate author variable as needed.)
5b. if word is not an author; goto 3c.

6. Check annotation file.
6a. if valid annotation or comment; goto 1 or 3 depending on current LEVEL.
6b. if word is not annotation, must be name; goto 1 or 3d depending on current LEVEL.

7. Check for word in NAMES file.
7a. if present, add code to XNAME, LEVEL+1; goto 3.
7b. if not present, error condition in effect. Set NEW.NAME flag or report condition and RETURN to parent program.

8. OUTPUT TO CROSS INDEX
If LEVEL less than 4 then use XNAME as record id to cross index of selected exotic database.
If LEVEL greater than 3 then:
a. use XNAME as record id for all taxonomic data sources, i.e. genus species and all ranks and subspecific names.
b. use first two codes and last two codes from XNAME as record id for all nomenclatural data sources, i.e. genus species and last rank and subspecific name.
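A reduced version of this flow chart can be sketched in Python. The sketch covers only steps 2, 3, 4a(1), 5, 7, and 8; annotations, hybrids, cultivars, and grexes are left out. Several points are assumptions rather than part of the published procedure: the function and exception names are hypothetical, the hyphen before the species code is inferred from Table 26.2, the author file contains a single illustrative entry, and ‘first two codes and last two codes’ is read as genus code, species code, and the last rank code with its name code.

```python
from typing import List, Tuple

# Small illustrative files; codes follow Tables 26.1 and 26.2.
GENUS = {"Acacia": "102"}
NAMES = {"karroo": "6284", "alba": "19", "beta": "435"}
RANKS = {"ssp.": "sp", "var.": "va", "f.": "fo"}
AUTHORS = {"Hayne"}  # illustrative author file

XName = List[Tuple[str, str]]  # (delimiter or rank code, numeric code) pairs


class NewNameError(ValueError):
    """Stands in for the NEW.NAME flag of steps 2c and 7b."""


def code_name(name: str) -> Tuple[XName, int]:
    """Steps 2 to 7: build XNAME and LEVEL from the words of a name."""
    words = name.split()
    if not words or words[0] not in GENUS:        # step 2
        raise NewNameError(name)
    xname: XName = [("", GENUS[words[0]])]
    level = 1
    delimiter = "-"  # assumed delimiter before an epithet with no rank word
    for word in words[1:]:                        # step 3
        if word in RANKS:                         # step 4: remember the rank code
            delimiter = RANKS[word]
        elif word in AUTHORS:                     # step 5: authors are skipped here
            continue
        elif word in NAMES:                       # step 7a
            xname.append((delimiter, NAMES[word]))
            level += 1
            delimiter = "-"
        else:                                     # step 7b
            raise NewNameError(word)
    return xname, level


def record_id(xname: XName, level: int, target: str) -> str:
    """Step 8: choose the cross-index record id for the target data source."""
    parts = xname
    if level > 3 and target == "nomenclatural":
        parts = xname[:2] + xname[-1:]  # genus, species, last rank and name
    return "".join(delim + code for delim, code in parts)


xname, level = code_name("Acacia karroo Hayne ssp. alba var. beta")
print(record_id(xname, level, "taxonomic"))
print(record_id(xname, level, "nomenclatural"))
```

For the name with both a subspecies and a variety, the two calls give the full code for taxonomic sources and the shortened code for nomenclatural sources, matching the fourth row of Table 26.2. A fuller implementation would add the annotation file, the hybrid ‘X’ handling, and the cultivar flag from steps 4 and 6.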

27.