Changing the Center of Gravity: Transforming Classical Studies Through Cyberinfrastructure 9781463219222

The essays in this volume reflect a new generation of classicists hunting for new methods to understand and to dissemina


English Pages 485 [489] Year 2010

Changing the Center of Gravity

Bible in Technology

4 Series Editor Keith H. Reeves

The Bible in Technology (BIT) is a series that explores the intersection between biblical studies and computer technology.

Changing the Center of Gravity

Transforming Classical Studies Through Cyberinfrastructure

Edited by

Melissa Terras
Gregory Crane



 2010

Gorgias Press LLC, 180 Centennial Ave., Piscataway, NJ, 08854, USA www.gorgiaspress.com Copyright © 2010 by Gorgias Press LLC All rights reserved under International and Pan-American Copyright Conventions. No part of this publication may be reproduced, stored in a retrieval system or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, scanning or otherwise without the prior written permission of Gorgias Press LLC. 2010




ISBN 978-1-60724-881-1

ISSN 1943-9369

Library of Congress Cataloging-in-Publication Data
Changing the center of gravity : transforming classical studies through cyberinfrastructure / edited by Melissa Terras and Gregory Crane.
p. cm. -- (Bible in technology, ISSN 1943-9369 ; 4)
Includes bibliographical references and index.
ISBN 978-1-60724-881-1
1. Classical philology--Study and teaching. 2. Classical philology--Electronic information resources. 3. Civilization, Classical--Study and teaching. 4. Civilization, Classical--Electronic information resources.
I. Terras, Melissa M. II. Crane, Gregory, 1957-
PA74.C48 2010
480.071--dc22
2010005084

Printed in the United States of America

For Cathy, Lincoln, Adrian and Russell. In memory of Allen Ross Scaife.

TABLE OF CONTENTS

Table of Contents
Foreword
Preface
Acknowledgements
Ross Scaife (1960-2008)

Cyberinfrastructure for Classical Philology
    Terms and continuities
    Wissenschaft and Philology
    Classics and the Humanities
    Infrastructure
    Classics in 2008
    Digital Incunabula: the Thesaurus Linguae Graecae (1972)
    Machine-actionable knowledge bases: the Perseus Digital Library (1987)
    Digital Communities: Stoa Publishing Consortium (1997)
    Cyberinfrastructure
    Producing new knowledge: ePhilology
    Extending the intellectual reach of humanity: eClassics & eHumanities
    Bibliography

Technology, Collaboration, and Undergraduate Research
    Introduction
    An Audience of More Than One…
    When All the Sources Are Online
    From Each According…
    Shaking the Foundations
    Conclusion
    Bibliography

Tachypaedia Byzantina: The Suda On Line as Collaborative Encyclopedia
    Introduction
    History of the Project
    Technical and Social Interfaces
    SOL and Other Projects
    Conclusion
    Bibliography

Exploring Historical RDF with Heml
    Introduction
    The Heml Data Model
    Chronology
    Similar Schemas
    Visualizations
    Areas For Improvement
    RDF and Heml
    Data-Entry for HemlRDF
    RDF-Based Nested Events
    HemlRDF and the CIDOC-CRM
    Heml's Future
    Projected Work
    Conclusion
    Acknowledgements
    Bibliography

Digitizing Latin Incunabula: Challenges, Methods, and Possibilities
    Introduction
    Methods
    Data Entry Methodology
    Possibilities
    Conclusion
    Bibliography

Citation in Classical Studies
    Overview
    Changing Technologies and the Fate of Homer's Commentators
    Citation as a Heuristic
    Identification: What We Cite
    How We Cite Objects
    Syntax of a CTS URN
    Beyond Citation: Architecture
    Conclusion
    Glossary of Technical Terms and Abbreviations
    Bibliography

Digital Criticism: Editorial Standards for the Homer Multitext
    Digital Criticism: Editorial Standards for the Homer Multitext
    Textual Criticism of an Oral Poem in a Digital Medium
    The Iliad and Odyssey as Oral Poetry
    Variation in the Homeric Corpus: Two Examples
    Representing Multiformity
    Fluidity vs. Rigidity and a Diachronic Approach to Homeric Poetry
    Foundational principles of the Homer Multitext
    Bibliography

Epigraphy in 2017
    1. Background
    1.1 Leiden
    1.2 Digital Epigraphy Projects
    1.3 Epidoc
    2. Digital Leiden
    3. Epigraphical Databases and Digital Publication
    4. The Scholar and Digital Texts
    Bibliography

Digital Geography and Classics
    The View From 2017
    The View, Explained (and what we have left out)
    The Primacy of Location: A Recent Example Drawn from Google
    Prelude to Geographic Search: Web-based Mapping
    Web-mapping the Geographic Content of Texts: Example of the Perseus Atlas
    The Geo-Library, the Web and Geographic Search
    Big Science, Repositories, Neo-geography and Volunteered Geographic Information
    The Electronic Cultural Atlas Initiative
    The Stoa Waypoint Database and the Register of Ancient Geographic Entities
    The Pleiades Project
    Conclusion
    Bibliography

What Your Teacher Told You is True: Latin Verbs Have Four Principal Parts
    Introduction
    Benefits
    A Realizational KATR Theory for Latin
    The Conjugation-1 Verb laudāre "Praise"
    The VerbA Node
    The Verb Node
    Auxiliary Nodes
    The Sandhi Node
    Strategies for Building KATR Theories
    An Implicative KATR Theory for Latin
    The Paradigm Chart
    Deriving the Essence of the Paradigm
    Principal Parts
    Grouping
    Generating a KATR Theory
    Conclusion
    Acknowledgments
    Glossary
    Bibliography

Computational Linguistics and Classical Lexicography
    Where are we now?
    Where do we want to be?
    How do we get there?
    Word Sense Induction
    Word Sense Disambiguation
    Parsing
    Beyond the lexicon
    Searching by word sense
    Searching by selectional preference
    Conclusion
    Bibliography

Classics in the Million Book Library
    Introduction
    From Curated Collections to Dynamic Corpora
    Services for the humanities in very large collections
    Fourth-Generation Collections
    The Classical Apographeme
    Three Technical Challenges
    Conclusion
    Appendix: Sample Page Images
    Primary Sources
    Editions of Fragmentary Authors and Works
    Reference works
    Bibliography

Conclusion: Cyberinfrastructure, the Scaife Digital Library and Classics in a Digital age
    Opportunities: ePhilology and eClassics
    ePhilology and Memographies
    eClassics and Plato’s Challenge
    Classics and Cyberinfrastructure
    Services for eClassics
    Metrical Analysis
    Collections for ePhilology
    Publication for a Cyberinfrastructure
    Archives, Libraries and Intellectual Discourse
    Features of Publication in a Digital World
    The Scaife Digital Library (SDL)
    The Work of Scholarship: New Divisions of Labor in the world of Google and Wikipedia
    Conclusion: Blood for the Shades
    Bibliography

Author Biographies
Index

FOREWORD

This is a book about the future, not about Ross Scaife. That's the way it should be, and that's the way he would have wanted it. For Ross was a scholar and teacher who knew in his bones that the steady, thoughtful consideration of the human past is a tool of unmatched power for informing humankind's ability to imagine and enact futures worthy of the intelligence and dignity of its every member. To honor him best, we should share in that conviction and learn from the resourcefulness and persistence of his practice.

Ross's career flourished in the twin decades of the information revolution. Whatever the globalized society we share may now experience, we live ineluctably in an information society. What we invent now, what we do now will entail living out the implications of a transformation that has already happened. Through the too few years that Ross was given to shape his vision and share it with others, he kept his eye clearly on ways to make sure that the revolution in knowing will serve his profession and through it his society. His convictions were clear and luminous. The best that we can know about the past needs to be preserved and disseminated by the most powerful media available to the widest audience possible. The Stoa consortium that he led gave example repeatedly to the force of those convictions and their power to change for the better the ways we learn and think and teach.

We both learned from, were inspired by, and benefited from Ross's friendship. We feel the ache of his loss deeply, but we are delighted to see in this volume an exactly appropriate response to loss: innovation, optimism, and the commitment of teachers and scholars to receiving, interpreting, and transmitting the heritage of humankind's pasts to its present and future. Ross, we think, would be glad to read this book, and then soon enough impatient to get beyond its insights to the next stage. He reminds us of another Kentuckian of yore, Daniel Boone, who made it to the frontier ahead of the rest, and then kept uprooting and moving further west, always seeing and seizing opportunity, always staying at the leading edge. We owe him the tribute of emulation.

Gregory Nagy, Harvard University
James O'Donnell, Georgetown University
July 2009

PREFACE

This collection of essays for Ross Scaife originally appeared as “Changing the Center of Gravity: Transforming Classical Studies Through Cyberinfrastructure” in Digital Humanities Quarterly Winter 2009: v3 n1. The volume can be viewed online at http://www.digitalhumanities.org/dhq/vol/3/1/index.html.

Melissa Terras, University College London
January 2010

ACKNOWLEDGEMENTS

The essays presented in this collection emerge from a workshop on October 5, 2007, funded by the National Science Foundation and hosted by the University of Kentucky. The official topic of the workshop was Cyberinfrastructure for Classical Studies. We chose that particular time and location, however, because we wanted to show our admiration and our love for Allen Ross Scaife, who, not only by his own decade of work with http://www.stoa.org but by his courage, his vision, and by his example as a human being, had done more than any one person to advance the field of classics in the decade that carried us from the twentieth to the twenty-first century.

We cannot begin to give proper credit to all those who have made possible that work that we have been able to present here. David Packard, a generation ago, designed and built the Ibycus systems that brought the field of Classical Studies into the digital age – he pioneered textual analysis as a daily practice of scholarship and electronic publication as the standard method when these were radically new ideas. Born to resources that many princes would have envied, he developed his own mind as a scholar and as a system designer. He and his colleague William Johnson cleared the land in which classics could take root. The authors of this collection, who have sweated to plough and, a bit, to extend these fields, can none of us fully appreciate what David, William and others accomplished, but we depend each and every day on the lasting results of their earlier labors.

The humanities in general and classics in particular suffered other losses as well in the year since we gathered to share our ideas and to celebrate our friend Ross. We take this opportunity to say farewell to the American historian Roy Rosenzweig and the classicist Theodore Brunner. Each scholar tangibly shaped the ideas presented here.

Roy Rosenzweig worked for decades on developing both content for and understanding of the digital world for the humanities. An American historian, Roy pioneered and described foundational practices for all history in a digital age. His veteran confidence in human intelligence and his worldly optimism for the role of history in society sustained many of us far beyond his immediate field. One of the editors of this collection remembers well a dinner with Roy in the spring of 2005 where Roy first described his most recent work analyzing the results of Wikipedia. The editor fancied himself a progressive thinker and believer in the intellectual life of society as a whole, but he had largely dismissed the notion of a community-driven encyclopedia out of hand. He simply could not in his heart believe that an unrefereed resource, produced by an intellectual laity without benefit of professional academics and with little, if any, central editorial support, could produce useful results. Roy had received similar professional training at the same institution, but Roy examined the assumptions of that education and, as so often, he looked at the evidence. He made it clear that Wikipedia was not only a useful resource but a concrete instance for a new form of intellectual production. The implications of that insight run throughout this collection. Roy always introduced ideas that had never occurred to us. Often these ideas completely upended cherished assumptions and, indeed, might have left us a bit red-faced. But Roy retained even to the end a boyish enthusiasm and lightness of heart that made it impossible to feel anything but pleasure at each keen insight, presented with such a light and joyful touch. Roy always saw how the world could become a better place. He certainly did more than observe, and the world is a better place because he was in it.

The historical record will keep the name of Theodore Brunner green for as long as humanity reads the words of Homer, Sophocles, Plato and a thousand other authors. Ted founded the Thesaurus Linguae Graecae in 1972 – there are full professors in 2008 who were not yet born when Ted began this journey. Only the most powerful historical imagination could now fully imagine the kind of vision that was needed to imagine the Thesaurus Linguae Graecae. By the time Ted retired a quarter century later, he and his colleagues at the University of California at Irvine had converted virtually every classical Greek text into digital form – the TLG E Disk, published shortly after his retirement, is one of the great achievements in two and a half millennia of classical scholarship. Under the leadership of Ted Brunner, the TLG began distributing digital collections of Greek, first on magnetic tapes in the early 1980s and subsequently on the pioneering medium of CD ROM disks (with a then staggering capacity of half a gigabyte). The appearance of these texts began the field of digital classics. Dozens of classicists developed programming skills to work with this corpus. Every achievement reported in this collection depends, directly or indirectly, upon what Theodore Brunner imagined and accomplished. Several of the authors in this collection (Crane, Martin and Smith) fondly remember the hospitality of Ted Brunner and Luci Berkowitz, more than two decades ago. Those sunny days and balmy nights in southern California are long past, but we all remember the professionally produced presentation that Ted would show to his visitors. The voice familiar to American television audiences as Tony the Tiger (as Ted with delight informed his audience) would, quoting Thucydides, intone that the TLG was a possession for all time. So indeed it is and, indeed, even more – Ted created the DNA for classical Greek in a digital world.

The publications in this collection were the production of a workshop, sponsored by the National Science Foundation, on the subject of Cyberinfrastructure in the Classics in September 2007 (National Science Foundation Grant number 0736476, "Changing the Center of Gravity in Classical Studies"). We gratefully acknowledge the support that we received from the University of Kentucky and the Department of Classics. In addition, we are grateful to the editorial team of Digital Humanities Quarterly for their work on this issue prior to it becoming a print publication. We therefore pass on our gratitude to Julia Flanders, Wendell Piez, Melanie Kohnen, John A. Walsh and Michelle Dalmau for work on the online edition of this volume.

Gregory Crane, Tufts University
Brent Seales, University of Kentucky
Melissa Terras, University College London
July 2009

ROSS SCAIFE (1960-2008)

Allen Ross Scaife, 47, Professor of Classics at the University of Kentucky and founding editor of the Stoa Consortium for Electronic Publication in the Humanities, died of cancer on March 15, 2008 at his home in Lexington, Kentucky.

Ross was born in Fredericksburg, VA on March 31, 1960. He graduated from the Tilton School in Tilton, New Hampshire in 1978 and from the College of William and Mary in 1982 with a major in Classics and Philosophy. He earned a PhD in 1990 in Classical Studies at the University of Texas at Austin. In 1988 he participated in the summer program at the American Academy in Rome, and in 1985 was awarded a Fulbright Fellowship for a year of study at the American School of Classical Studies in Athens, Greece. From 1991 to the time of his death, Ross was on the faculty at the University of Kentucky in the Department of Modern and Classical Languages, Literature, and Cultures where he taught courses on women in the ancient world, Greek art, Aristophanes, and the Greek historians, as well as Greek and Latin language courses.

A pioneer in using computer technology to advance scholarship in the humanities, Ross is perhaps best known as the founding editor of the Stoa Consortium for Electronic Publication in the Humanities (http://www.stoa.org/). The Stoa, established in 1997, set the standard for Open Access publication of digital humanities work in the classics, serving as an umbrella project for many diverse projects that provide functionality, and have requirements, not supported by traditional (print) publishers. In addition to providing Open Access publication for the work of other scholars, Ross strived to make his own work (and the raw materials behind that work) available freely to others. He was the co-creator of Diotima: Materials for the Study of Women and Gender in the Ancient World and of the Neo-Latin Colloquia collection, both of which are published on The Stoa.

Figure 1. Ross Scaife, photographed in January 2007

According to his principled belief in Open Access, Ross was always a stern critic of models of scholarship that were needlessly exclusionary in their presentation or implementation. He firmly believed in the potential afforded by technology to bring the highest levels of scholarship to the widest possible audience, and in the obligation of learned societies to make their work freely available to all interested readers.

Ross’s influence is most noticeable in his long-standing belief in the power of collaborative work. With humor, generosity, and a keen editor’s discretion, he worked throughout his career to build working relationships among an international circle of collaborators, for his own projects, as well as for others. As a founding editor of the Suda On Line, a web accessible database for work on Byzantine Greek lexicography, Ross helped to build a framework that allowed a large number of people to work together on a single edition. SOL was founded in 1998 at a time when such large-scale collaborative editing was rare, if not unheard of. The influence of the SOL is still being felt as the next generation of collaborative editing tools are being developed.

Ross had long-term associations with Harvard’s Center for Hellenic Studies, the Perseus Project, and more recently with the Digital Classicist. Those who knew him will remember him for his generosity and willingness to offer advice, and for his ability to see connections and build bridges between projects and people.

Most recently, Ross was instrumental in forging the collaboration that resulted in the high resolution digital imaging of the Venetus A, a 10th century manuscript of the Iliad located at the Biblioteca Marciana in Venice, and was a co-Principal Investigator of project EDUCE, which aims to use non-invasive, volumetric scanning technologies for virtually "unwrapping" and visualizing ancient papyrus scrolls. Since July 2005 Ross had been the director of the Collaboratory for Research in Computing for Humanities, a research unit at the University of Kentucky which provides technical assistance to faculty who wish to undertake humanities computing projects, and which encourages and supports interdisciplinary partnerships between faculty at UKY and researchers around the world.

His many interests included sailing in the Northern Neck of Virginia, hunting, cooking, woodworking, and photography. Ross was the proud father of three sons, Lincoln (16), Adrian (13), and Russell (9). In addition, Ross is survived by his wife, Cathy Edwards Scaife, his parents, William and Sylvia Scaife, and three siblings, Bill Scaife, Susan Duerksen, and John Scaife, as well as their spouses and children.


This biography was originally published at Stoa, http://www.stoa.org/?p=786. Comments in memoriam may be added there.

Dot Porter, Digital Humanities Observatory

CYBERINFRASTRUCTURE FOR CLASSICAL PHILOLOGY

GREGORY CRANE
TUFTS UNIVERSITY

BRENT SEALES
UNIVERSITY OF KENTUCKY

MELISSA TERRAS
UNIVERSITY COLLEGE LONDON

ABSTRACT

No humanists have moved more aggressively in the digital world than students of the Greco-Roman world but the first generation of digital classics has seen relatively superficial methods to address the problems of print culture. We are now beginning to see new intellectual practices for which new terms, eWissenschaft and eClassics, and a new cyberinfrastructure are emerging.

The Athenians grew in power and proved, not in one respect only but in all, that equality is a good thing. Evidence for this is the fact that while they were under tyrannical rulers, the Athenians were no better in war than any of their neighbors, yet once they got rid of their tyrants, they were by far the best of all. This, then, shows that while they were oppressed, they were, as men working for a master, cowardly, but when they were freed, each one was eager to achieve for himself. (Herodotus 5.78, tr. after Godley)

I am no sculptor, to make statues fixed motionless on the same pedestal. Go, sweet song, on every merchant-ship and rowboat that leaves Aegina, and announce that Lampon's powerful son Pytheas [5] won the victory garland for the pancratium at the Nemean games. (Pindar Nemean 5.1-5, tr. after Diane Svarlien)

The first passage above follows a military encounter in which the Athenians show, for the first time, that terrible energy which would (at least according to our Athenian sources) fascinate and unnerve the rest of fifth-century Greece. Students of classical Athens have for millennia contemplated the energy that liberation released — Herodotus’ wonder has echoed ever since and served as one motivation for human fascination with Athens and its achievements1. The early years of the twenty-first century have seen a heroic age for intellectual life. Ideas have poured across the world and new minds have joined the professionalized academics and authors in grappling with the heritage of humanity. Often rough and unpolished, unconcerned with the niceties of convention, a new generation of digital entities has exploded across human society, creating wikis, blogs and millions of electronic resources. Plato’s Socrates scorned writing itself on the grounds that the written word was as powerless to answer our questions as a mute painting (Plat. Phaedrus

1 The publications in this collection were the product of a workshop, sponsored by the National Science Foundation, on the subject of Cyberinfrastructure in the Classics in September 2007 (NSF GRANT INFO). We gratefully acknowledge the support that we received from the University of Kentucky and the Department of Classics.

CRANE, SEALES AND TERRAS


275d). Now each question becomes a challenge as active readers probe relentlessly the sprawling information space beneath their fingers. Wikipedia has demonstrated a new form of intellectual production that challenges the assumptions that many of us internalized in graduate school about how knowledge can be described and ideas shared.2 The scale of projects such as Wikipedia deserves serious reflection: the English Wikipedia has, as of summer 2008, more than 2.4 million entries.3 By one estimate, Wikipedia has absorbed 100 million hours of labor — put another way, Wikipedia has, if measured by the labor invested, become a billion dollar project (Shirky 2008). Changes go beyond traditional academic channels. The 9/11 attacks in 2001 were the last major event owned by the centralized 20th century media. With the Tsunami and the 7/7 London bombings, we had shifted "from the broadcasters owning the story, to the people involved in the events owning their own stories and spreading it to those who they know and care about, using their own communication channels."4 Conventional streams of refereed publications (such as this collection) are necessary but insufficient — this introduction has already cited Wikipedia, a blog and the video for a presentation at a conference. We cannot make the decisions that we need to live in the world around us without constantly evaluating information that has no conventional academic pedigree. Every anxious editorial fretting about undomesticated ideas prowling through an internet jungle underscores the urgent need for that critical thinking that we in the liberal arts claim to instill. The internet may prove to be the

2 For some evaluations of the Wikipedia phenomenon and the challenges it has offered to more conventional forms of intellectual production, see for example, (Lally 2007) and (Rosenzweig 2006).
3 Statistics retrieved from http://www.wikipedia.org, accessed August 2, 2008.
4 http://wealthofnetworks.wordpress.com/, a blog by Margaret Gold that contains summaries of John Dartington’s remarks at a conference entitled, "The Wealth of Networks: Digital Economies and the Next Generation Internet," held in the UK in July 2008.


best thing for humanities education since the rulers of early modern Europe found that classical training provided them with the administrators with whom to build strong nation states. No scholars are poised to benefit more than those of us who study the ancient Greco-Roman world and especially the texts in Greek and Latin to which philologists for more than two thousand years have dedicated their lives. Our predecessors worked in Alexandria, Damascus and Baghdad as well as Berlin, Oxford and Venice. Many lived in states whose names we may never have heard. Most spoke languages like Syriac or the dialects of medieval Europe, which have themselves passed into history. They preserved the battered remains of the past in isolated monasteries and the libraries of aristocrats. They raised capital and set type, then sent Greek and Latin texts coursing through Europe and then the world. They convinced the powerful that the study of Greek and Latin would provide the supple and disciplined minds needed to fashion, maintain and expand the nation states of Europe and their empires across the world. And in the twentieth century, as other disciplines emerged as filters to identify the promising and send them on their way to worldly privilege, classicists carried their field forward, opened up their curricula to those who had not learned Greek and Latin and, from the margins of the intellectual world, continued their researches on the texts that they loved.

The papers in this collection reflect a new generation of classicists, entrepreneurial in their disruptive actions, impatient of convention, hunting for new methods to understand and to disseminate those ancient texts to which they, like dozens of generations before them, have dedicated their lives. It is hard to predict what the future holds for the intellectual practices and products of twentieth-century print-based classical studies.
In the opening of his fifth Nemean Ode Pindar reveled in the speed and reach of the written word: his songs could be copied and race across the known world in the largest ship and the smallest boat, while the grandest statues remain fixed and mute upon their pedestals. The texts of antiquity, freed from the tyrannical limitations of expensive print publication, preserved in multiple servers across the globe, flash instantaneously anywhere that the internet can reach — hundreds of millions of desktops and mobile devices. Homer, Plato, Virgil, Cicero — they all reach more of humanity than ever was conceivable in the millennia since they set down their styli for the last time


and passed into dust. And it is not just physical access — we already can, with simple links between source text and its commentaries, translations, morphological analyses and dictionary entries, provide a better reading environment than was ever conceivable in print culture. We know from the readers of our web sites that texts in Greek and Latin, of many types, now fire the minds to which twenty years ago they had no access. And if this reading environment now supports those proficient in English, we can already design libraries that will, within a reasonable period of time, support readers in the less commonly spoken languages of the European Union such as Croatian and Hungarian and widely spoken languages such as Arabic and Chinese.

TERMS AND CONTINUITIES

WISSENSCHAFT AND PHILOLOGY

As to the speeches which were made either before or during the war, it was hard for me, and for others who reported them to me, to recollect the exact words. I have therefore put into the mouth of each speaker the sentiments proper to the occasion, expressed as I thought he would be likely to express them, while at the same time I endeavoured, as nearly as I could, to give the general purport of what was actually said. [2] Of the events of the war I have not ventured to speak from any chance information, nor according to any notion of my own; I have described nothing but what I either saw myself, or learned from others of whom I made the most careful and particular enquiry. [3] The task was a laborious one, because eyewitnesses of the same occurrences gave different accounts of them, as they remembered or were interested in the actions of one side or the other. [4] And very likely the strictly historical character of my narrative may be disappointing to the ear. But if he who desires to have before his eyes a true picture of the events which have happened, and of the like events which may be expected to happen hereafter in the order of human things, shall pronounce what I have written to be useful, then I shall be satisfied. My history is an everlasting possession, not a prize composition which is heard and forgotten. (Thucydides 1.22, tr. Jowett)


The distinction between science and the humanities reflects particular traditions of the English speaking world. In German, for example, Wissenschaft includes all systematic intellectual work — we need to specify Naturwissenschaft or Geisteswissenschaft if we want to distinguish between the natural sciences and the humanities. The term Altertumswissenschaft describes the systematic analysis of the past, including both the textual and the material record. We thus use the term Wissenschaft to describe the output of the systematic study of antiquity as it appears in material forms such as articles and monographs, plans and maps, images and diagrams, editions and reference works. Whether or not we believe that we can reconstruct aspects of the ancient world as they actually were, we develop our ideas on the basis of primary and secondary sources stored in material form. For the purposes of this introduction, philology describes the production of shared primary and secondary sources about linguistic sources, while classical philology focuses upon classical Greek and Latin, as these languages have been produced from antiquity through the present. The famous passage from Thucydides, quoted above, is relevant for several reasons. First, Thucydides was one of the first to apply systematic methods to represent in textual form, as accurately as he could, the events of the past — his history of the Peloponnesian War has been a model for Wissenschaft. Second, Thucydides used writing as a medium to disseminate his ideas, but he drew upon every source available, including eyewitness interviews, archaeological remains, and the textual record. Third, Thucydides’ words seek to represent an entire world — we cannot fully study Thucydides without engaging as well with the material record. Nor is this material record simply a source with which to illustrate the topics that Thucydides has included. 
We need to develop the fullest possible understanding of the material record in order to develop our own understanding of how Thucydides represents his subject. The terms eWissenschaft and ePhilology, like their counterparts eScience and eResearch, point towards those elements that


distinguish the practices of intellectual life in this emergent digital environment from print-based practices.5 Terms such as eWissenschaft and ePhilology do not define those differences but assert that those differences are qualitative. We cannot simply extrapolate from past practice to anticipate the future.

5 For a discussion of ePhilology and its role in the larger cyberinfrastructure environment, please see (Crane 2007); also, (Dimitriadis 2006).

CLASSICS AND THE HUMANITIES

Socrates: I heard, then, that at Naucratis, in Egypt, was one of the ancient gods of that country, the one whose sacred bird is called the ibis, and the name of the god himself was Theuth. He it was who [274d] invented numbers and arithmetic and geometry and astronomy, also draughts and dice, and, most important of all, letters. Now the king of all Egypt at that time was the god Thamus, who lived in the great city of the upper region, which the Greeks call the Egyptian Thebes, and they call the god himself Ammon. To him came Theuth to show his inventions, saying that they ought to be imparted to the other Egyptians. But Thamus asked what use there was in each, and as Theuth enumerated their uses, expressed praise or blame, according as he approved [274e] or disapproved. The story goes that Thamus said many things to Theuth in praise or blame of the various arts, which it would take too long to repeat; but when they came to the letters, This invention, O king, said Theuth, will make the Egyptians wiser and will improve their memories; for it is an elixir of memory and wisdom that I have discovered. But Thamus replied, Most ingenious Theuth, one man has the ability to beget arts, but the ability to judge of their usefulness or harmfulness to their users belongs to another; [275a] and now you, who are the father of letters, have been led by your affection to ascribe to them a power the opposite of that which they really possess. For this invention will produce forgetfulness in the minds of those who learn to use it, because they will not practice their memory. Their trust in writing, produced by external characters which are no part of themselves, will discourage the use of their own memory within them. You have invented an elixir not of memory, but of reminding; and you offer your pupils the appearance of wisdom, not true wisdom, for they will read many things without instruction and will therefore seem [275b] to know many things, when they are for the most part ignorant and hard to get along with, since they are not wise, but only appear wise. (Plato, Phaedrus 274c-275b, tr.?)

Those of us who grew up hearing that we should read more and that television had damaged our minds may smile when we hear Plato’s Socrates two and a half millennia ago criticizing the written word for damaging our minds. In the early twenty-first century, complaints have emerged about the look-up culture of Google and ubiquitous connectivity.6 Nevertheless, the basic point remains valid, even if the media change. We must augment our biological memories by using material records, whether these are hand-written, printed or digital but external information can only augment internalized knowledge. We can only experience humor, for example, if we understand the joke as it happens. We can work our way through a Greek text, looking up every word in a dictionary and using modern translations to orient ourselves, but we will not understand the text in the same way as we would if we could understand the language fluently. And, even if we understand the Greek words and grammar, we will hear more from those words the more we have thought about Plato, the philosophical concepts that form the subject of his dialogues, and the culture in which he lived. Thucydides set out to express in material, written form a record of the past that would last forever. Plato questions the value

6 For example, Jeffrey Garrett discusses whether some readers are substituting Google and full-text searching for deeper reading and analysis; see (Garrett 2006). A recent report by the British Library and JISC has explored how Google and the internet have influenced the younger generation of searchers; see Information Behavior and the Researcher of the Future. January 11, 2008. Joint Report funded by the BL/JISC.


of any written record except insofar as that record finds full expression in human minds. We already live in a world where the books have begun to talk with each other.7 When data mining systems detect fraudulent activity on our credit cards, they do a better job of finding significant patterns than could human analysts alone, if there were human analysts to sift through trillions of transactions. Financial institutions do not care how they identify fraud because fraud detection is a means to an end. Text mining can detect words and phrases that are unusual in Plato.8 We can even imagine syntactic analyzers that can not only parse every surviving Greek and Latin word but that might at some point be better able to justify their decisions by pointing to other similar patterns in that vast corpus than has ever been possible for any human reader. But such information would only realize its full value if it became knowledge in a living human mind and allowed a reader to see something that would not otherwise have been visible.9

For the purposes of this discussion, we use the terms classics and the humanities to describe that focus upon internalized knowledge and intellectual practices designed to help us perceive new connections and increasingly sophisticated patterns not only in the texts that we read but in the images that we see and the sounds that we hear. Human beings are the measure of all things in the humanities. Philology truly matters insofar as it serves classics and its goal of bringing classical Greek and Latin to life in the minds of human beings.

7 For more on this theme, see (Crane 2005) and also (Kelly 2006).
8 Text mining within the humanities and within classics has received a fair amount of attention in recent years, for example, see (Plaisant 2006); (Don 2007); and (Hyman 2008).
9 Matthew Kirschenbaum has offered a useful overview of how text and data mining are reshaping reading in the digital environment, see (Kirschenbaum 2007).


INFRASTRUCTURE

Tell me now, you Muses that have dwellings on Olympus, [485] for you are goddesses and are at hand and know all things, whereas we hear but a rumor and know nothing, who were the captains of the Danaans and their lords. But the common folk I could not tell nor name, no, not though ten tongues were mine and ten mouths [490] and a voice unwearying, and though the heart within me were of bronze, did not the Muses of Olympus, daughters of Zeus that bears the aegis, call to my mind all who came beneath Ilios. Now will I tell the captains of the ships and the ships. (Homer, Il. 2.484-493, tr. after A. T. Murray)

Infrastructure provides the material instruments whereby we can produce new ideas about the ancient world and enable other human beings to internalize those ideas. Infrastructure includes intellectual categories (e.g., literary genres, linguistic phenomena, and even the canonical book/chapter/verse/line citation schemes whereby we cite chunks of text); material artifacts such as books, maps, and photographs; buildings such as libraries and book stores; organizations such as universities and journals; business models such as subscriptions, memberships, and fee simple purchases; and social practices such as publication and peer review.

Our infrastructure constrains the questions that we ask and our sense of the possible. Thus, the Homeric narrator rules out the idea of representing the names of every hero who participated in the Trojan War. The twenty-first century fan of American baseball can, by contrast, locate not only the name but the basic statistics recorded for every person who ever threw a pitch or swung the bat in a major league game. By the classical period, we begin to find lists of citizens, office-holders, temple dedications, tribute paid and similar categories. Thucydides drew upon textual, archaeological and verbal sources and he could leave behind a written text to which he had attached his own name, but there were no libraries in the modern sense. He could not cite transcripts of public speeches in a congressional record or even a New York Times article. He could not footnote official documents in a classical Greek equivalent to the Official Records of the Union and Confederate Armies (United 1880).10 There were no recordings of those who survived to describe civil war in Corcyra or the Sicilian Expedition. He could not publish pictures or even expect that diagrams would be faithfully reproduced over time. A stream of words was the only medium by which he could represent his chosen subjects.

Infrastructure is so fundamental that it may become invisible to us, but the resulting blindness makes us confuse the limits that we face with our larger goals. In periods where our infrastructure advances incrementally, we may take it for granted. Infrastructure does not simply affect the countless cost/benefit decisions we make every day; it defines the universe of what cost/benefit decisions we can imagine.11 All the tribute of the Athenian empire could not have paid for one color photograph of Pericles. Rarely, if ever, can we predict the full implications of relatively modest technological change. Gutenberg did not think that, in using movable type to print a Latin bible, he was creating a technology to make translations of the bible ubiquitous, enable new forms of Christian worship and facilitate revolutionary change. But even if we cannot foresee the future with perfect clarity, we must constantly reexamine the goals that we choose to pursue today in the light of what is already possible. Before shifting to the digital infrastructure already taking shape and its implications for current practices in classical philology, we should review what has and has not changed for classical philology as the core information infrastructure of human life as a whole has shifted, decisively and irrevocably, from atoms to electrons.

10 Cornell University has published electronic versions of this series on-line as a part of the Making of America Digital Library: http://cdl.library.cornell.edu/moa/browse.monographs/waro.html (last accessed August 12, 2008).
11 Several recent reports have called for expanding our ideas of infrastructure in order to create a larger cyberinfrastructure, see (Arms 2007) and (ACLS 2006).


CLASSICS IN 2008

I shall begin with our ancestors: it is both just and proper that they should have the honor of the first mention on an occasion like the present. They dwelt in the country without break in the succession from generation to generation, and handed it down free to the present time by their valor. [2] And if our more remote ancestors deserve praise, much more do our own fathers, who added to their inheritance the empire which we now possess, and spared no pains to be able to leave their acquisitions to us of the present generation. [3] Lastly, there are few parts of our dominions that have not been augmented by those of us here, who are still more or less in the vigor of life; while the mother country has been furnished by us with everything that can enable her to depend on her own resources whether for war or for peace. (Pericles’ Funeral Oration: Thucydides 2.36.1-3)

Classicists can identify with the Athenian audience of Pericles’ Funeral Oration, or at least of the oration that Thucydides presents to us. Unlike the Athenians, we do not like to say that our ancestors were sprung from the dirt, and our ancestors have not inhabited the same small rocky peninsula since they sprang from the earth: classicists have come from countries and periods far beyond the experience of any classical Greek. Our field has an ancient history, but we have begun to expand, like the Athenians of the fifth century, into a much larger space than we ever could occupy before. The digital world has become our sea, but our empire offers freedom, and the natural borders that will contain our field are nowhere to be seen. Much as we may have achieved, we are still as a field in the incunabular phase of development, more focused upon the problems of the past than the opportunities of the present.12

Classicists were among the first humanists to exploit digital technologies and enjoy a reputation as being arguably the most digitally advanced field. Certainly, classicists were, as a field, early adopters. If one includes the study of any Greek and Latin texts

12 For more discussion on this topic, please see (Crane 2006a).


under Latin, Father Busa’s famous concordance of Thomas Aquinas, produced with the help of IBM in the late 1940s, would constitute the start of digital classics (see (Busa 1974) and (Busa 1980)). If we restrict ourselves to the Greek and Latin authors commonly taught in classics departments of the 20th century, then we must move twenty years forward to the late 1960s. There are full professors of classics today who were born after David Packard, working in the basement of the Harvard Science Center, digitized the text of Livy. There are classics majors who received their undergraduate degrees in the spring of 2008 who were born after the Perseus Digital Library began serious work in the late spring of 1987. Not only are virtually all publications, whether distributed in print or not, produced digitally, but digitized textual corpora, digital versions of printed secondary sources, electronic reviews, bibliographic databases, and web sites are all standard elements of our work.13 Two leading departments of classical philology have even discovered the value of the preprint servers on which some of the most demanding areas of research have depended for more than fifteen years.14

The early use of digital tools in classics may, paradoxically, work against the creative exploration of the digital world now taking shape. Classicists grew accustomed to treating their digital tools as adjuncts to an established print world. Publication, the core practice by which classicists establish their careers and their reputations, remains fundamentally conservative. While we may congratulate ourselves on the innovative content of what we write and while we will always need publications that articulate particular arguments at a particular point in time in a particular voice, the format of our publications is essentially the same as that which Gibbon used in the 18th century.15 While the documents were digital in form, almost none of their content was machine actionable: strings such as "Thuc. 1.38.2" had not been analyzed and converted into machine actionable links to the text of Thucydides, book 1, chapter 38, section 2; a reference to Thucydides did not have associated with it any information whereby an automated system could reliably determine whether this Thucydides was the historian or one of the various other figures by this name; quotations of Greek and Latin authors were not dynamically linked to multiple online editions, nor did they carry with them links to any linguistic apparatus (textual notes, dictionaries, grammars, commentaries, translations) not offered by the author of the articles. While these articles may be online, the main bibliographic resource for classical studies, L’Année Philologique, still relies upon manual summaries to index and disseminate these articles in its digitally disseminated bibliography. Nor can the reader, of course, see what later articles cite earlier publications.

We can add each of the features listed above to existing documents automatically with reasonable accuracy; simple text search provides functionality that is increasingly comparable to the manually produced indices on which we had to rely in print culture.16 Google has already popularized the ability to identify and disambiguate place names and to find quotations embedded in unstructured text: automatically generated maps became a standard feature of Google Books in 2007 and frequently quoted passages

13 For an overview of how many classicists use digital materials as evidenced by citations, see (Dalbello 2006).
14 (Pritchard 2008). The papers for a 1995 workshop by the American Physical Society, online at http://publish.aps.org/EPRINT, include talks from the previous year about preprints. The ArXiv.org server, founded in 1991, contained (as of June 29, 2008) 484,758 e-prints in Physics, Mathematics, Computer Science, Quantitative Biology and Statistics.

15 Classics is not the only field that has been challenged to modernize its publication system. Scholarly communication and the need for major change have been the subject of much discussion, and the issue recently served as the topic for the Winter issue of the Journal of Electronic Publishing.
16 While manually created indices such as back-of-the-book indexes are still considered essential by many, the automatic creation and remodeling of such indices is a growing research area; see (Csomai 2006) and (Chi 2007).


soon followed.17 Particular domains may need to adapt general services to their needs: classicists need Optical Character Recognition (OCR) systems that can not only provide useful results for classical Greek but can also recognize Latin and do not helpfully convert tu-m (a Latin word for "then") into English t-u-r-n.18 Scholarly disciplines need page layout analysis systems that can recognize and parse not only general document formats such as notes at the bottom of the page, and the individual entries of indices, encyclopedias, and lexica, but also specialized document formats such as the commentary and textual notes.19 Scholarly disciplines such as classics need specialized named entity searches: we need to determine not only whether "Th. 1.38" is a citation to a primary source but also, if so, whether it designates Thucydides, book 1, chapter 38, Theocritus, Idyll 1, line 38 or some other text. The production of these services is the most important task for classics and for any scholarly discipline which does not focus solely upon the contemporary English-language, mass market American culture which the Web of 2008 primarily serves. While we may need to support less and less software, we will then only shift our efforts to the production and refinement of the knowledge sources which support general services: we need machine actionable reference works that can help general services run by giants such as Google to distinguish one Antonius or one Alexandria from another.20
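The citation problem described above can be made concrete with a small example. The following Python fragment is a minimal sketch, not a description of any existing service: the abbreviation table, the work title "History", the `Reference` class and the `candidates` function are all hypothetical names introduced here for illustration. It finds strings shaped like "Thuc. 1.38.2" and returns every reading that the table allows, so that an ambiguous string such as "Th. 1.38" yields both Thucydides and Theocritus.

```python
import re
from dataclasses import dataclass

# Hypothetical abbreviation table, for illustration only. A real service
# would draw on a curated, machine actionable reference work.
ABBREVIATIONS = {
    "Thuc.": [("Thucydides", "History", ("book", "chapter", "section"))],
    "Th.": [
        ("Thucydides", "History", ("book", "chapter", "section")),
        ("Theocritus", "Idylls", ("idyll", "line")),
    ],
}

# A capitalized abbreviation ending in a period, then dot-separated numbers.
CITATION = re.compile(r"\b([A-Z][a-z]*\.)\s+(\d+(?:\.\d+)*)")


@dataclass(frozen=True)
class Reference:
    author: str
    work: str
    locator: tuple  # e.g. (("book", 1), ("chapter", 38))


def candidates(text):
    """Return (matched string, possible readings) for each citation-like string."""
    results = []
    for match in CITATION.finditer(text):
        abbrev = match.group(1)
        numbers = [int(n) for n in match.group(2).split(".")]
        readings = []
        for author, work, units in ABBREVIATIONS.get(abbrev, []):
            # A reading is possible only if the citation is no deeper
            # than the citation scheme of the work allows.
            if len(numbers) <= len(units):
                readings.append(Reference(author, work, tuple(zip(units, numbers))))
        if readings:
            results.append((match.group(0), readings))
    return results


if __name__ == "__main__":
    for matched, readings in candidates("Compare Thuc. 1.38.2 with Th. 1.38."):
        print(matched, "->", [(r.author, r.locator) for r in readings])
```

The sketch deliberately returns all candidate readings rather than choosing one. Choosing between them is the harder task, and would presumably draw on context (surrounding vocabulary, the subject of the citing article) that a lookup table alone cannot supply.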

17 For further information on the Google Books system, see (Kolak 2008) and (Schilit 2008).
18 The adaptation of commercial OCR systems for Greek and Latin as well as the development of other text recognition systems have been explored by several research groups; see for example (Gatos 2006) and (Moalla 2006).
19 Specialized document layout analysis systems for historical documents have been an active research field for years; for a recent overview see (Sankar 2006), and for some recent work in this area involving texts digitized by the Open Content Alliance, see (Lu 2008).
20 We have reported on our own work in historical named entity recognition in (Crane 2006c) and (Smith 2001); for several examples of the


Classicists of the 20th century built their work upon a foundation that took shape in the 19th century. In the last decades of the twentieth century, ambitious classicists began to shift their efforts away from infrastructural tools such as editions and commentaries. Instead they turned towards articles and expository monographs on topics often derived from their colleagues in the Modern Language Association. The Pax Stereotypica of the 20th century has, however, collapsed. We live in a digital age in which we need to rethink our most fundamental resources -- we are reinventing the forms and functions of our editions, lexica, encyclopedias, commentaries, grammars, bibliographies and every other textual category that evolved in a print ecosystem. And as we feel our way forward, we need to rebuild our entire infrastructure. In a primarily print world, we can turn to digital tools for documents that contribute at the margins — e.g., digital scholia for a major classical author. In the digital world, we want the scholia but we also need editions of our canonical authors. The Editiones Principes Electronicae for every major author are still waiting to be produced. A new generation of editors spreads across a new and uninhabited world in which they can acquire for themselves the digital kleos aphthiton ("undying fame") that the pioneers of Hellenistic Alexandria and early modern Europe earned for themselves. The greatest barrier that we now face is cultural rather than technological. We have all the tools that we need to rebuild our field, but the professional activities of the field, which evolved in the print world, have only begun to adapt to the needs of the digital world in which we live — hardly surprising, given the speed of change in the past two decades and the conservatism of the academy. 
Perhaps the most important point of continuity — and the greatest reason why publication in classics has adapted so little to the digital world — appears before we even begin reading publications. An informal survey reveals that forty of forty-one classics publications available online from Johns Hopkins University Press

growing research in this area, see (Geleijnse 2007) and (Borin 2007) and (Tobin 2008).

CRANE, SEALES AND TERRAS


(about 97.6%) are products of a single author; the only exception was an archaeological publication in Hesperia, the journal of the American School at Athens.21 While expanding this survey would provide greater statistical certainty, the conclusion would be the same: classicists in 2008 devote most of their energies to individual expressions of particular arguments. An even more problematic issue is that the editions, commentaries, grammars, lexica, and other elements of scholarly infrastructure have not adapted in any significant way to the digital world.22 In the five centuries since the first printed editions of classical texts began to appear, print culture assembled an immense amount of intellectual capital with which to support thinking about Greek and Latin texts. This knowledge must, however, be converted into a machine actionable form.23 Converting this intellectual capital from

21 This informal survey examined the articles in sample issues that Johns Hopkins made publicly available for marketing purposes. Where there was not a public issue, the most recent online issue was examined. Seven single-author articles in http://muse.jhu.edu/demo/american_journal_of_philology/: vol. 126 (1) 2005; five single-author articles in http://muse.jhu.edu/demo/arethusa/: vol. 38 (1) 2005; four single-author articles in http://muse.jhu.edu/demo/classical_world/: vol. 99 (1) 2005; http://muse.jhu.edu/demo/helios/: vol. 34 (1) 2007; nine single-author articles in http://muse.jhu.edu/journals/journal_of_late_antiquity/toc/current.html: vol. 1 (1) 2008; two single-author articles in http://muse.jhu.edu/journals/mouseion_journal_of_the_classical_association_of_canada/toc/mou.7.1.html: vol. 7 (1) 2007; ten single-author papers in http://muse.jhu.edu/demo/transactions_of_the_american_philological_association/: vol. 135 (1) 2005; three single-author papers in http://muse.jhu.edu/demo/hesperia/: vol. 71 (1) 2005. By contrast, there was only a single multi-authored paper in this group: (Kraft 2005).
22 For further discussion of this issue, see (Crane 2006b).
23 How reference works can be made machine actionable has been investigated by (Veltman 1999) and (Buckland 2007). Other interesting work has examined how less traditional reference sources such as Wikipedia can be turned into knowledge bases; see (Ponzetto 2007).


human readable print to machine actionable knowledge is both fundamental and complex: we need to convert statements such as "facio, facere, feci, factum" into something that a morphological analyzer can use to recognize a form such as fecisset as the pluperfect subjunctive of the verb facio; we need to mine from a set of encyclopedia articles the data that will allow us to search primary and secondary sources alike for one among dozens of historical figures named Antigonus; we need grammars and lexica that not only provide a handful of examples but can also locate the phenomena that they describe in any corpus of Greek or Latin; we need editions that can tell us precisely how and how often they differ from one another and which previous editions and/or manuscript witnesses they follow most closely.

More than fifteen years ago the Text Encoding Initiative (http://www.tei-c.org/index.xml) was circulating methods with which to create machine actionable editions that can support advanced services and, more importantly, can be updated and maintained over time (Sperberg 1994).24 The process was an open one that invited participation from scholars in Europe and North America. Any editor developing a capital resource such as a text, designed to serve an intellectual community for decades to come, had an opportunity to learn how to design a digital edition that could be printed in the short term and then maintained, and even updated, over time.25 In the fifteen years that have passed since the TEI documented how to produce digital editions, a new generation of scholars has passed from secondary school to the faculty, but all of the new editions of classical authors still appear as

24 Early versions of these guidelines were circulating at least as early as 1990. For an example of current technology available to manage properly structured textual data, see (van den Branden 2007). 25 A variety of approaches to designing digital editions have been developed over the years, many based on the TEI, for several (but by no means exhaustive) examples, see (Audenaert 2008); (Dekhytar 2006); (Riva 2005).


static print documents, the rights sold to commercial publishers.26 Even if the electronic files were freely available, they would be of limited use because their authors did not follow the guidelines that the TEI published. Classicists have relied for the most part on the Thesaurus Linguae Graecae (TLG) to provide searchable versions of the reconstructed texts that have appeared, without the introductions, textual notes, indices or other scholarly apparatus available in any digital form.

Converting print editions into digital form is a particularly messy task. Editors often do not repeat in the textual note the precise passage to which the textual note applies; they assume that their human readers will be able to make these connections themselves. In a recent study, Federico Boschetti applied a range of techniques with which to associate the notes in a textual apparatus with the appropriate place in the text. He found that these techniques could correctly associate only about 80% of the textual notes with the text to which they referred (Boschetti 2007). This does not even address the task of analyzing the content of the textual notes so that we can then pose queries such as "where do MS P and MS V use the same grammatical form but different dictionary words," "visualize the evolution of the text of Aeschylus, allowing me to see how each edition differs from those which precede it, which editions are most closely related to one another and which editions have been most influential," or "which variants have the biggest apparent impact on the text based on a range of criteria."

The articles in this collection reflect the most recent stage in the evolution of digital classics and point to the future, but to appreciate that future, we need to review major developments on which that future builds.
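As a rough illustration of what it means for editions to report "how and how often they differ," the following minimal sketch aligns two tokenized witnesses with Python's standard `difflib` module and lists the places where they disagree. It is only a sketch under stated assumptions: the names `collate`, `agreement`, `witness_p` and `witness_v` are hypothetical, the variant readings are invented for the example, and real collation of a critical apparatus involves far more than token alignment.

```python
from difflib import SequenceMatcher


def collate(base, other):
    """List every place where two tokenized witnesses disagree.

    Returns (operation, base_tokens, other_tokens) triples, where the
    operation is 'replace', 'delete' or 'insert' relative to the base.
    """
    matcher = SequenceMatcher(a=base, b=other, autojunk=False)
    return [
        (op, base[i1:i2], other[j1:j2])
        for op, i1, i2, j1, j2 in matcher.get_opcodes()
        if op != "equal"
    ]


def agreement(base, other):
    """Similarity ratio between 0.0 (nothing shared) and 1.0 (identical)."""
    return SequenceMatcher(a=base, b=other, autojunk=False).ratio()


# Hypothetical witnesses: the variant readings below are invented.
witness_p = "arma virumque cano troiae qui primus ab oris".split()
witness_v = "arma virumque canam troiae qui primis ab oris".split()

# Print each point of disagreement, then the overall agreement score.
for op, p_reading, v_reading in collate(witness_p, witness_v):
    print(op, p_reading, v_reading)
print(round(agreement(witness_p, witness_v), 2))
```

A list of variants in this form could then be filtered or counted, which is the kind of query that a static print apparatus cannot answer directly.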
These articles point forwards to an emergent Cyberinfrastructure, but this Cyberinfrastructure builds upon three earlier stages of digital classics: incunabular projects, which retain the assumptions of print culture, knowledge bases produced

26 As often in the history of scholarship, New Testament scholars have, by contrast, pioneered the use of information technology, see P. Robinson’s work for example (Robinson 2000); (Robinson 2005).


by small, centralized projects, and digital communities, which allow many contributors to collaborate with minimal technical expertise.

DIGITAL INCUNABULA: THE THESAURUS LINGUAE GRAECAE (1972)

Digital incunabula are forms that replicate the established forms of print. Thus, the TLG was, in the early 1970s, designed as a gigantic, infinitely flexible concordance. Its texts capture the basic page layout and canonical citations of the original editions; a sample search is illustrated in Figure 1.1. The Bryn Mawr Classical Review (http://bmcr.brynmawr.edu/) has been successful because it used forms such as email and then the Web to produce traditional reviews that any classicist could produce and read. The digitized publications in JSTOR (http://www.jstor.org/), Project Muse (http://muse.jhu.edu/), and Google Books (http://books.google.com/) provide new methods by which to search and disseminate knowledge, but the ultimate objects of exchange are facsimiles of print. These projects tend to require either very large or very small capital investments. They focus on producing, as quickly as possible, the same intellectual objects to which their communities are already accustomed. In this stage of work, catalogues may grow far more elaborate: the TLG and JSTOR allow us to search all the words in primary and secondary sources, while Google dynamically generates maps of places and lists of frequently quoted passages automatically extracted from its image books. All of these projects provide, in effect, a new generation of catalogues where the books remain unchanged. The system designers do not want to get bogged down in the specifics of any particular domain, while the domain experts do not want to get bogged down in the technology.


Figure 1.1: A search of the TLG digital library containing 100 million words of classical Greek texts. First begun in 1972, the TLG provides word searches of various types that deliver excerpts of text that mirror print sources; even the hyphens are retained. The most important contributions of the TLG are (1) very accurate transcriptions of the text (without textual notes, introduction, indices etc.) and (2) encoding one canonical citation scheme by which scholars cite these sources.

Incunabular systems have themselves evolved. Storage has grown so much less expensive (by one measure, at least 300,000 times cheaper27) that more recent systems assume page images of the original are available. The representative of one national library asserted that it would not even accept collections of transcribed text without images of the original pages.

27 See the discussion of storage costs in 1982 below. The TLG was founded ten years earlier, in 1972, when disk storage itself had just begun to emerge.

Incunabular systems have been under development for a long time: there are tenured professors of classics who were born after the TLG began work in 1972. Figure 1.2 illustrates the generation of incunabular systems that emerged in the 2000s with a sample text from the Open Content Alliance (OCA, http://www.opencontentalliance.org/), whereas Figure 1.3 illustrates a sample from Google Books. Where the TLG provides a fully transcribed version of source texts, the OCA, Google Books and other projects provide only scanned page images and such text as OCR software can generate. These projects provide noisier (and, in the case of Greek, no) searchable text, but they index all of the text on the page, and their accuracy will increase as OCR software becomes more sophisticated.28 Also, projects such as the OCA provide open-content licenses and encourage third parties to download and repurpose the scanned page images. Thus, the Mellon-funded Cybereditions Project is creating within the OCA an open source library of Greek and Latin critical editions, on which advanced services can be built. The scanned editions, though simple in form, provide a foundation on which more sophisticated digital objects can be built: no license will later pull these image books out of circulation and no license restricts the ways in which they can circulate.

28 Google has sponsored development of OCRopus, an open-source document analysis and OCR system, in order to promote development of more sophisticated OCR technologies, http://code.google.com/p/ocropus/.


Figure 1.2. Twenty-first century incunabular publications such as the books digitized by the OCA are designed not only to provide useful services in the present but to be integrated into more sophisticated services over time. The digitized collection of fragmentary Greek historians above will be joined by a digital edition that builds upon, precisely references and extends the content of the print edition. Such composite editions are part of the fourth-generation collections described in Classics in the Million Book Library (Crane et al., in this collection).

In the incunabular stage, if you retrieve a book in a language that you cannot read or on a topic that you cannot understand, then it is your responsibility to find a translation and any other background information you may need to make sense of what is before you. In the incunabular stage, the center of computation is external to the document, emphasizes general algorithms and depends upon little, if any, domain specific machine actionable knowledge. In incunabular projects, the physical distance between readers and publications dissolves.


Figure 1.3. A commentary on Thucydides as seen in Google Books in July 2008. Note that the general OCR engine has begun to provide output for Greek print that, while still far from perfect, is searchable and comprehensible to an expert reader. Google Books does not, of course, understand the citation scheme by which scholars can cite Thucydides but it has recognized the title page and the index, and it has recognized a page with a map as something of interest.

MACHINE-ACTIONABLE KNOWLEDGE BASES: THE PERSEUS DIGITAL LIBRARY (1987)

These kinds of projects, unlike incunabular projects, set out to create knowledge about a particular domain that machines can manipulate and to begin to move beyond the forms of print. In classics, the Perseus Project provides an example of such systems. Perseus set out, in the middle 1980s, to build an environment where knowledge about the ancient world, including both the material and textual record, could be dynamically recombined to support new forms of inquiry. Figure 1.4 illustrates a sample text as it appears in the Perseus Digital Library. The focus of Perseus was to create resources that were either impractical in print (e.g., producing dozens or hundreds of high resolution color images for thousands of Greek vases) or impossible (e.g., interactive tours of archaeological sites and searching/browsing services based on automated morphological analysis of Greek and Latin).29 Semantic text markup is a characteristic feature of such projects: rather than simply recording that a word is, for example, in italics, these systems try to interpret the content and thus to record whether the italics indicate rhetorical emphasis, the title of a literary work, a word quoted from a foreign language, or some other category.30 As these systems grow more intelligent, they convert an increasing portion of the content inside the books into well-structured information that machines can process. These systems depend upon individuals who understand the evolving relationship between the possibilities of technology and the needs of the discipline.31

29 For a list of publications describing this work, please see http://www.perseus.tufts.edu/hopper/about/publications/.
30 The importance of semantic markup for digital library texts has been discussed for many years, particularly the issue of potential semantic interoperability of such markup or metadata; for two examples see (van 2006) and (Elings 2007).
31 We have previously described this role as that of corpus editors; see (Crane 2000).


Figure 1.4. The figure above illustrates some of the information about the opening lines of the Odyssey available in the Perseus Digital Library. First, documents in this collection have markup illustrating their logical contents: thus, where incunabular systems can only recognize the physical page divisions, the knowledge base allows the digital library system to recognize, for dictionary words in the LSJ Greek-English lexicon, the many separate entries that appear within a single page or that begin on one page and end on another. Second, the primary source citations have been automatically analyzed and encoded. Thus, the system can take a chunk of Greek, recognize what lines it contains, and then locate dictionary entries (or commentaries, encyclopedias, articles etc.) that refer to the lines in the chunk displayed. In a mature digital library, citations from one text to another become bi-directional links, allowing readers not only to follow the documents that a particular work cites, but also to find works that subsequently cite the document that they are viewing. Third, and perhaps most importantly, morphological knowledge has been represented in machine actionable form. Thus, an automated system is able to recognize that the string ἔννεπε is a form associated with the dictionary entry ἐνίσπω.


Reference materials, in particular, are structured to support automatic systems (e.g., the morphological analyzer learns Greek and Latin morphology from a machine actionable grammar) and to be decomposed into small chunks and then recombined to provide dynamic commentaries. If you retrieve a book in a language that you cannot read or on a topic that you cannot understand, the system can find translations where these already exist, machine translation and translation support systems, reference works, and general background information suited to the general background and immediate purposes of the reader. In knowledge bases, the boundaries between books begin to dissolve.
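To make the idea of a machine actionable grammar more concrete, the toy sketch below turns dictionary-style principal parts (such as "facio, facere, feci, factum") into data that a program can use to recognize fecisset as a pluperfect subjunctive of facio. It is a deliberately tiny illustration and not the analyzer used by any existing project: the names `PRINCIPAL_PARTS`, `Analysis` and `analyze` are hypothetical, and only one tense and mood of the active voice is covered.

```python
from typing import NamedTuple


class Analysis(NamedTuple):
    lemma: str
    tense: str
    mood: str
    person: int
    number: str


# Principal parts as a print dictionary lists them (illustrative sample).
PRINCIPAL_PARTS = {
    "facio": ("facio", "facere", "feci", "factum"),
    "amo": ("amo", "amare", "amavi", "amatum"),
}

# Pluperfect active subjunctive endings, added to the perfect stem.
PLUPERFECT_SUBJUNCTIVE = {
    "issem": (1, "singular"),
    "isses": (2, "singular"),
    "isset": (3, "singular"),
    "issemus": (1, "plural"),
    "issetis": (2, "plural"),
    "issent": (3, "plural"),
}


def perfect_stem(parts):
    """Derive the perfect stem from the third principal part (feci -> fec)."""
    return parts[2][:-1]


def analyze(form):
    """Return every analysis of `form` that the toy grammar can justify."""
    results = []
    for lemma, parts in PRINCIPAL_PARTS.items():
        stem = perfect_stem(parts)
        if not form.startswith(stem):
            continue
        ending = form[len(stem):]
        if ending in PLUPERFECT_SUBJUNCTIVE:
            person, number = PLUPERFECT_SUBJUNCTIVE[ending]
            results.append(
                Analysis(lemma, "pluperfect", "subjunctive", person, number)
            )
    return results


print(analyze("fecisset"))
```

The point of the design is that the rules and the lexicon are plain data: adding a verb means adding one dictionary entry, which is roughly what it means for a reference work to become machine actionable.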

DIGITAL COMMUNITIES: STOA PUBLISHING CONSORTIUM (1997)

Knowledge bases such as Perseus were (and to a large extent still are) produced by small teams of experts who bridge the gap between the technology and individual disciplines to make documents and the ideas within them intellectually as well as physically more accessible. Digital communities enable more people to participate in more ways and in on-going, dynamic forms. New forms of publication such as wikis, blogs, and various websites open up new instruments with which individuals and groups can contribute in an on-going, dynamic fashion.32

32 The phenomenon of digital communities and the new ways in which individuals can contribute to them has been extensively explored, for some recent work, see (Cosley 2006); (Krowne 2003).


Figure 1.5 The Suda On Line (described in Anne Mahoney's essay in this collection) illustrates a digital community that emerged in the late 1990s before the rise of Wikipedia. The Suda is a 625,000 word, 30,000 entry Byzantine encyclopedia that offers a great deal of information not otherwise preserved about the classical Greek world. A group of classicists in Europe and North America organized a collaborative project to create the first comprehensive English translations of this resource. Progress has been steady and solid: in April 2000 1,500 entries had been translated, by July 2008 that number had increased by 23,000, with 24,500 entries translated and vetted (see http://www.stoa.org/sol/about.shtml).

The Stoa Publishing Consortium (http://www.stoa.org/), founded in 1997 with a grant from the Fund for the Improvement of Postsecondary Education, has done more than any single effort to foster the rise of digital communities in classics. Stoa.org provided support in a variety of ways to most of the major projects and classicists who emerged over the following decade. One such project, the Suda On Line, is illustrated in Figure 1.5. The papers in this collection provide an imposing, and still partial, account of the impact which the Stoa has had.


If you examine a digital object in a digital community, you can not only find the background information that you need to interpret that object, but you can also make your own contributions by creating annotations directly, producing a blog linked to the object, or in some other fashion. In digital communities, the distinctions between author and reader and between reading and writing begin to dissolve (as the very act of reading becomes a statement of at least initial interest and thus a contribution).33

CYBERINFRASTRUCTURE

From the anvil Hephaestus rose, a huge, panting bulk, halting the while, but beneath him his slender legs moved nimbly. The bellows he set away from the fire, and gathered all the tools with which he was building a silver chest; and with a sponge wiped he his face and his two hands, [415] and his mighty neck and shaggy breast, and put on a tunic, and grasped a stout staff, and went forth halting; but there moved swiftly to support their lord servants wrought of gold in the semblance of living women. They possessed understanding in their hearts, and speech [420] and strength, and they knew cunning handiwork by gift of the immortal gods. These busily moved to support their lord. (Homer, Iliad 18.411-421, tr. after A. T. Murray)

The three classes of digital project outlined above reflect three different sources of energy: the industrialized processes of mass digitization and of general algorithms, the specialized production of domain specific, machine actionable knowledge, and the generalized ability for many different individuals to contribute, in ways large and small. When these three sources of energy begin to interact with one another, the resulting environment is qualitatively different not only from print culture but from any of the three digital

33 For some interesting efforts to create digital reading/writing environments that allow for the creation and sharing of annotations and also support other types of more sophisticated scholarly communication, see (Bradley 2008); (Fitzpatrick 2007); (Schroeter 2007).


environments taken in isolation. Having reviewed some developments in the previous generation, we can now begin to consider the implications for ePhilology (primary and secondary sources relevant to classical Greek and Latin), eClassics (ancient Greek and Latin as they work within human minds), and Cyberinfrastructure (the material systems whereby we exchange the objects of our intellectual labor and ourselves internalize these objects). The following sections describe ePhilology and eClassics. The conclusion to this collection returns to the Cyberinfrastructure towards which the individual articles point.

PRODUCING NEW KNOWLEDGE: EPHILOLOGY

Any one can discourse to you forever about the advantages of a brave defence, which you know already. But instead of listening to him I would have you day by day fix your eyes upon the greatness of Athens, until you become filled with the love of her; and when you are impressed by the spectacle of her glory, reflect that this empire has been acquired by men who knew their duty and had the courage to do it, who in the hour of conflict had the fear of dishonor always present to them, and who, if ever they failed in an enterprise, would not allow their virtues to be lost to their country, but freely gave their lives to her as the fairest offering which they could present at her feast. (Pericles’ Funeral Oration, Thuc. 2.43.1)

If we think only in terms of word searches, the production of camera-ready copy, image management, the ability to generate basic maps, and manually produced formats such as wikis and blogs, increased storage and computational power may seem relatively unimportant. For anyone whose career extends more than a decade, current technologies are astonishingly powerful. In 1982, it cost the Harvard Classics Computing Project $34,000 to purchase a 660 megabyte disk drive to store early versions of the TLG: the disk was the size of a washing machine, arrived in a wooden crate, needed a special disk controller, took two days for the technicians to install and required modifications to the version of the Unix operating system then available. The maintenance contract cost c. $4,000/year and was essential. As this introduction is written, $100 buys a terabyte of storage: more than 1000 times as much storage as its 1982 predecessor for roughly one three-hundredth of the price, a decrease in


cost by a factor of more than 300,000 in one quarter of a century. We can now take for granted storage that was previously unimaginable, collecting huge digital images as well as texts and datasets with little regard for the costs of storage or computation. A generation ago, only a few of the wealthiest departments could raise tens of thousands of dollars to provide the storage to search a few million words of Greek and support the first generation of digital publishing. In 2008, many cell phones have more than enough storage and computational power to do much more. All of us in the academy and in society as a whole, of course, already depend upon general services, such as Google, that require stunning amounts of storage and computational power. Even academics who may proudly dissociate themselves from the web of digital services depend completely upon those services for the paper publications that arrive in the mail and the catalogues by which they find books on the shelf. And, of course, we already depend upon digital infrastructure for the paychecks, medical treatments and other fundamental components of material life. Within classical studies, it is easy to see the need for vast networked storage and high performance computing for the analysis and visualization of quantitative and visual evidence from the material culture.34

Consider the basic problem of reading Greek and Latin. The machine-actionable Liddell-Scott-Jones (LSJ) Greek-English and Lewis and Short Latin-English lexica developed by the Perseus Project contain 422,000 and 303,000 tagged citations to 800 Greek and 80 Latin authors. In LSJ, half of the 422,000 citations are to a half dozen canonical authors. For Lewis and Short, the top dozen authors account for more than two-thirds (215,000) of the citations. Not all lexicographic projects have such narrow focus, but extensive lexicographic coverage is extraordinarily labor intensive.
The Thesaurus Linguae Latinae (TLL) is building a lexicon that covers Latin from earliest times through AD 600 and bases its work on an archive of 10,000,000 slips with information about particular

34 The need for support for grid level computing for digital humanities projects has been discussed by (Gietz 2006); (Blanke 2006).


words. The TLL in 2008 boasts a staff of twenty Latinists, began work in 1894, published its first fascicle, and has been an international project since 1949. Its official website promises that the TLL will during 2009 "reach the end of the letter P, at which point more than two thirds of the complete work will have appeared".35 The ten million or so words of ancient Latin may require more than a century of labor, but they constitute, of course, a relatively small corpus. The TLG had accumulated 99,000,000 words in 2007.36 An individual Latinist, Johann Ramminger, had accumulated a wordlist of later Latin from Petrarch up through 1700 that was based on 200,000,000 words of text already available in digital form. Semi-automated methods involving computerized data but still dependent upon manual analysis of each form may increase productivity by a factor of two or three, but simply enhancing traditional approaches would require centuries to provide us with truly comprehensive lexica of Greek and Latin. Probably no branch of scholarship is older than lexicography, but our traditional methods do not scale up to the challenges of representing textual materials in Greek and Latin. We have no choice but to exploit, as vigorously as we can, automated methods. The essay by Bamman and Crane in this collection describes some of these methods as they exist today. The essay by Finkel and Stump illustrates how automated methods can reconfirm (but place on a profoundly new foundation) ancient analytical instruments such as the reduction of Latin verbs to a four dimensional space defined by the traditional principal parts. Ultimately, automated and manual methods reinforce one another. Decisions embedded in print reference materials such as lexica, indices, and grammars can be, at least in part, extracted and converted into machine actionable data. In effect, human annotators provide the examples and rules from which automated systems learn.
The automated systems present the results of what they learn

35 See http://www.thesaurus.badw.de/english/index.htm/, accessed August 3, 2008.
36 http://www.tlg.uci.edu/, accessed August 3, 2008, lists August 12, 2007 as its last modification date.


when they work with new materials. Human readers then correct and augment the automated results. The automated systems recalculate their statistical models and then recalculate.37 In a mature system, we separate training data from test data so that we can automatically measure the impact that our changes have upon performance. Complex algorithms can be computationally demanding even when we are working with small corpora. In preliminary work on sense detection in 2005, we found that by comparing five different translations with the 150,000 Greek words in Thucydides we can identify words with many senses in Thucydides: e.g., passages where the Greek word archê corresponds to "beginning" or to "empire". It took days of processing power from a single CPU to identify clusters of word senses in five translations of the 150,000 words in Thucydides.38 Even if we shift to these algorithms, analyzing millions of words and thousands of translations in a half dozen languages would require more computational power than any desktop system could readily deploy. The infrastructure of 2008 forces researchers in classics and in the humanities to develop autonomous, largely isolated, resources. We cannot apply any analysis to data that is not accessible. We need, at the least, to be able gather the data that is available today and, second, to ensure that we can retrieve the same data in 2050 or 2110 that we retrieve in 2010.39 We need digital libraries that may be physically distributed in different parts of the world but that act as a single unit: we need to be able to pose queries such as "find all Greek editions and modern language translations of Aes-

For some examples of this process, please see (Ganchev 2007), (Vlachos 2006), and (Culotta 2005). 38 Work, still unpublished, conducted by D. Sculley, a PhD student in Computer Science at Tufts University. This preliminary work led to the subsequent funded research described by Bamman and Crane in this collection. 39 This need for long term data curation of the scholarly record has recently been discussed by (Gold 2007) and (Luce 2008). 37

34

CHANGING THE CENTER OF GRAVITY

chylus, Persians, lines 1-40" and retrieve machine actionable results from a variety of sites.40 There are two components to this problem. First, we need libraries that can preserve collections in the digital world as they have preserved them in the print world. The institutional repository movement is slowly addressing this challenge.41 Thus, the publications in this collection are a part of a long-term institutional repository that can manage static expository prose with very general features such as sections, footnotes, bibliography etc. We need, however, more than digital preprints.

A second component is the need for sophisticated citation and reference linking services. Smith’s paper in this collection, "Citation in Classical Studies", describes the system of canonical text citations by which classicists identify precise chunks of text within the surviving corpus of classical Greek and Latin. The Canonical Text Services (CTS, http://chs75.harvard.edu/projects/diginc/techpub/cts) described in this piece begin where library catalogues end and provide further layers of granularity essential for classical scholarship: the CTS provides a common language whereby we can aggregate information about particular lines in the Iliad or a numbered section from a chapter in Thucydides.42

The TEI has developed a shared language whereby humanists can describe the same phenomena in similar ways so that we can more readily combine documents produced by different groups. The TEI has many different methods, however, and it is possible to represent the same phenomenon in many different TEI-compliant ways. Cayless et al. describe how experts in Greek inscriptions as a community adapted the very general TEI framework to their needs, allowing classicists to create documents that are increasingly interoperable and easy to maintain over time. Robertson documents research in methods to describe historical events in a format that is not only machine actionable but language independent, contributing to the production of multilingual scholarship. Dué and Ebbott describe editorial standards for a new generation of dynamic digital editions. These new editions do not simply provide a single best attempt at reconstructing a single text but can dynamically represent multiple versions of the text as it has appeared over time and provide databases of variants, conjectures, testimonia and other materials. Elliott and Gillies look more generally at how we can then build on these and other services to manage geographic information about the ancient world in new ways.

Wikipedia has provided a famous and famously successful model for distributed authorship, but classicists had already begun pioneering such systems in the 1990s. Mahoney’s article describes the infrastructure for the Suda On Line project, which has produced translations for more than 24,000 entries of a fundamental reference work about the classical Greek world produced in 10th century Byzantium. At the same time, Finkel and Stump illustrate how methods from computer science can manage such fundamental structures as Latin morphology.

And, of course, only a small part of the printed record relevant to classical Greek and Latin has been, or will be, carefully transcribed and edited. If we begin to consider the challenge of extracting and analyzing information about classical Greek and Latin scattered throughout very large collections of books available as scanned page images, the challenges of storage and computation become daunting. The collection of essays thus ends with articles about converting print materials into a form that can support the kinds of services that the previous articles have articulated. Rydberg-Cox describes the issues involved in trying to convert early printed scholarship into a machine actionable form. Later publications lend themselves much more readily to automated analysis. Crane et al. consider the problems and opportunities that emerge for classics as whole research libraries become available in digital form.

40 For further discussion on the need for distributed digital libraries that can be searched seamlessly and the issues involved, please see (Simeoni 2007), (Trnkoczy 2006), and (Lagoze 2006).
41 For more on digital preservation and the need for institutional repositories, see (Marshall 2008), (Cantara 2006) and (Hockx-Yu 2006).
42 For more on the potential of CTS, see (Romanello 2008) and (Porter 2006).
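The idea of aggregating information by canonical citation can be sketched in a few lines of Python. This is only an illustration of the principle and not the CTS protocol itself: the `Citation` class, the `aggregate` function, the site names and the placeholder notes are all hypothetical and introduced here for the example.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Citation:
    """A simplified canonical citation: author, work and a line number."""
    author: str
    work: str
    line: int


def aggregate(sources: Dict[str, List[Tuple[Citation, str]]],
              author: str, work: str,
              first: int, last: int) -> List[Tuple[str, int, str]]:
    """Collect (site, line, note) records from every site for a line range."""
    hits = []
    for site, records in sources.items():
        for cit, note in records:
            if (cit.author == author and cit.work == work
                    and first <= cit.line <= last):
                hits.append((site, cit.line, note))
    # Sort by line so that material from different sites appears in text order.
    return sorted(hits, key=lambda h: (h[1], h[0]))


# Placeholder data standing in for independent sites (hypothetical).
sources = {
    "site_a": [(Citation("Aeschylus", "Persians", 1), "commentary note"),
               (Citation("Aeschylus", "Persians", 65), "commentary note")],
    "site_b": [(Citation("Aeschylus", "Persians", 12), "translation"),
               (Citation("Homer", "Iliad", 1), "translation")],
}

results = aggregate(sources, "Aeschylus", "Persians", 1, 40)
for site, line, note in results:
    print(f"Persians {line}: {note} ({site})")
```

Because every site keys its material to the same citation scheme, the request for lines 1-40 can be answered without any site knowing about the others.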


Figure 1.6. An early element of cyberinfrastructure for philology: In this display, a reader has inquired about the form ἐξίτηλα. The morphological analysis system has, as it has since the 1980s, forged a link between this form and the dictionary entry ἐξίτηλος, but two elements have been added. First, a simple machine learning system has analyzed morphologically unambiguous words in the Greek database to rank the probability for each possible analysis in this context. It has, however, chosen accusative, the wrong alternative in this case, but one of the readers has added a vote for the correct analysis (the adjective is, in fact, nominative). This figure thus includes (1) a simple transcription of a print source, (2) the output of knowledge-driven systems, and (3) feedback from a digital community which will, in turn, affect subsequent automatic analyses.
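The interplay between automatic ranking and reader feedback described in the caption can be sketched as follows. This is a minimal illustration under stated assumptions, not the system shown in the figure: the function `rank_analyses`, the additive weighting of votes and the probabilities are all hypothetical.

```python
from typing import Dict


def rank_analyses(model_probs: Dict[str, float],
                  votes: Dict[str, int],
                  vote_weight: float = 1.0) -> Dict[str, float]:
    """Combine model probabilities with reader votes and renormalize.

    Each vote adds `vote_weight` to the unnormalized score of an analysis.
    The additive scheme is an arbitrary choice made for illustration.
    """
    scores = {analysis: prob + vote_weight * votes.get(analysis, 0)
              for analysis, prob in model_probs.items()}
    total = sum(scores.values())
    return {analysis: score / total for analysis, score in scores.items()}


# Hypothetical numbers: the model prefers accusative, one reader votes nominative.
model_probs = {"accusative": 0.6, "nominative": 0.4}
votes = {"nominative": 1}

ranked = rank_analyses(model_probs, votes)
best = max(ranked, key=ranked.get)
print(best, round(ranked[best], 2))
```

With these made-up figures a single reader vote is enough to overturn the automatic choice; a production system would need a more careful weighting of human and machine evidence.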

Infrastructure includes not only data, services and physical systems but the social practices as well. Figure 1.6 illustrates some of the particular elements of the cyberinfrastructure needed for philology. The papers in this collection illustrate shifts in the practices of classicists as a new cyberinfrastructure develops:

• Expository argumentation: While new forms of scholarship and new intellectual practices are taking shape, we should emphasize that the collection published here reflects the on-going need for expository arguments that articulate particular points of view constructed at a particular time. Nevertheless, even when the form of argumentation remains largely traditional, the substitution of dynamic links for static citations can exercise a major impact upon the content and the audience that publications can reach. Stoa.org was founded in 1997 to support, among other things, new forms of publication that would provide rich links to original sources while bringing classics to a broader audience. Thomas Martin’s Overview of Classical Greek History in the Perseus Digital Library and Ross Scaife’s Diotima, an electronic publication on gender in antiquity, did much to inspire this goal. All of the publications associated with the Stoa illustrate forms of publication that were not feasible a generation ago. Christopher Blackwell’s Demos: Classical Athenian Democracy illustrates how a publication that is traditional in form can exploit online evidence and publication to provide better documentation on a major subject to a wider audience than was feasible in print.

• Collaboration: While the final form of the papers in this collection may be familiar, their production and content reflect a fundamental change in scholarly practice: the majority of the papers published here have multiple authors, while the single-author papers either report on group projects or on general methods whereby classicists can create interoperable data.

• Open access and open source production: All of the scholars who have contributed to this collection depend upon open access and open source production. In contrast, Figure 1.7 illustrates an example of a much more closed form of access. In cases where authors are making particular arguments at a particular point in time, open access allows third parties to locate and automatically analyze what they have produced: search engines such as Google can index and then deliver their arguments to anyone online; more specialized text mining systems could analyze what has been written to search for trends in scholarship or to apply specialized services designed for classics (e.g., the ability to recognize strings such as "Thuc. 1.86" as citations to primary sources).
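A specialized service of the kind just mentioned, one that recognizes strings such as "Thuc. 1.86" as citations, might begin from a pattern like the one below. This is a minimal sketch: the abbreviation table, the pattern and the function `find_citations` are assumptions made for illustration, and a real service would need a far fuller list of abbreviations and citation formats.

```python
import re

# Hypothetical abbreviation table; real citation practice is far more varied.
ABBREVIATIONS = {
    "Thuc.": "Thucydides",
    "Hdt.": "Herodotus",
    "Hom. Il.": "Homer, Iliad",
}

# Try longer abbreviations first so that multi-word forms are not cut short.
_abbr = "|".join(re.escape(a) for a in sorted(ABBREVIATIONS, key=len, reverse=True))

# A known abbreviation followed by dot-separated numbers, e.g. "Thuc. 1.86".
CITATION_RE = re.compile(rf"\b({_abbr})\s+(\d+(?:\.\d+)*)")


def find_citations(text):
    """Return (expanded author or work, [numeric parts]) for each citation."""
    found = []
    for match in CITATION_RE.finditer(text):
        parts = [int(n) for n in match.group(2).split(".")]
        found.append((ABBREVIATIONS[match.group(1)], parts))
    return found


sample = "See Thuc. 1.86 and Hdt. 1.1 for details."
print(find_citations(sample))
```

Once citations are extracted in this structured form, they can be turned into links or counted across a body of open access publications.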


Figure 1.7. Twentieth century infrastructure in the digital world. Business models are a core component of every intellectual infrastructure. When information circulated in physical books through a thin network of research libraries, subscription models evolved to generate revenue. In a digital environment, such subscription models lead to situations such as that pictured above, where a digital copy of a two-page review, produced by a scholar to reach the widest possible audience and distributed by a non-profit organization (JSTOR), would cost US$19. The medium sends a strong message to the general public.

The authors of these papers represent, however, a greater advance than the work that they have produced so far. In part, this reflects the hope that they will produce even more in the future. They also represent a new community, one large enough to foster junior scholars within the field, and in this way they may indirectly spawn far more productive work than all of them could in the aggregate produce during their own careers. But more significant than any output is the sense within this community that the field of classics is being reborn and that limitations with which many of us grew up are no longer relevant. This new digital world not only changes what we can do but who can do what. The collection of essays thus opens with Blackwell and Martin’s article about undergraduate research. Before introducing that discussion, we need to return to the broader topic of classics and the humanities in a digital environment that has begun to increase the intellectual reach of humanity as a whole.

EXTENDING THE INTELLECTUAL REACH OF HUMANITY: ECLASSICS & EHUMANITIES

In short, I say that as a city we are the school of Hellas; while I doubt if the world can produce a man, who where he has only himself to depend upon, is equal to so many emergencies, and graced by so happy a versatility as the Athenian. (Pericles’ Funeral Oration, Thuc. 2.41.1)

We look to a new digital infrastructure not only so that we can increase the body of published information about classical Greek and Latin but so that these languages can play an increased role in the intellectual life of humanity. We can do this in two ways. First, we can create environments that more fully engage those already working with Greek and Latin. We have already begun to address this by creating searchable corpora of Greek and Latin, by making secondary sources available online as PDF files or by adding links between inflected words in a text and their dictionary entries and thus reducing time spent flipping through large dictionaries. These all reduce the time between when we pose a question and when we receive an answer. It would be hard to overstate the degree to which cost-benefit decisions, often unconscious, shape the directions that we take in our intellectual lives. Classicists have for millennia understood the difference between being in a small, poorly organized collection and a large collection in which it is easy to find what we want. Cyberinfrastructure provides new threads that we can follow through the vast body of published information.

The second way to increase the role of classical Greek and Latin is to engage more people in reading and thinking about these languages. Anecdotal evidence suggests that this began to happen as soon as substantial bodies of Greek and Latin became available to the general public. Perseus quickly received letters from students in isolated locations such as rural homes and naval vessels at sea who were using online lexica and texts. Even more interesting, people who had studied Greek and Latin decades before found that the reading support tools available online gave them the support that they needed to begin reading Greek and Latin again.


The first paragraph in the opening "Call to action" of the National Science Foundation’s 2007 "Cyberinfrastructure Vision for 21st Century Discovery" calls for "an individualized health model of every human being for personalized health care delivery" ("Cyberinfrastructure Vision for 21st Century Discovery", March 2008: page 5). Such models would open up new methods whereby doctors and patients could not only determine the best courses with which to treat disease but also identify potential problems and predispositions in advance. Health records that include decades of medical tests and case histories clearly raise daunting issues of confidentiality, but the potential benefits are enormous. Emergent cyberinfrastructure for health care thus includes both methods to represent our particular background in great detail and a major investment in maintaining personal privacy.

Figure 1.8. Customization of Latin Vocabulary.43

The same instruments developed for health care can be adapted for our intellectual backgrounds. We can begin to devise ways for us to keep track of what we have learned so that we can receive background information customized for our particular

43 Reprinted from (Crane 2007).

needs when we confront a new object of study.44 Figure 1.8 illustrates a system that compares an arbitrary text of Latin against a model of the vocabulary that a particular reader has encountered, then calculates which words have been seen before and which are new. Seen words can then be associated with the places where they have been seen in the past, while unseen words can be ranked by their importance according to various criteria (e.g., numerical frequency, relevance to a particular theme etc.). The implementation is conceptually simple but represents the first stage in an open-ended process. As our data sources improve, we can look for more complex linguistic phenomena such as syntax and semantics (e.g., a new sense of a seen word). As our learning models grow more sophisticated, we can begin helping readers identify areas of weakness on which they can focus to enhance their ability to read with fluency.

Even small advances in our ability to work with multiple languages can be important if they open up historical languages to new audiences, whether these audiences are professional researchers using more linguistic sources or members of the public reading Greek poetry that they would not otherwise have experienced. The biggest benefits are likely to come when we open up linguistic materials to audiences with little or no training in the language. None of us has the opportunity to become familiar with more than a handful of languages. None of us can, in print culture, work with un-translated sources in dozens of languages.

Classics can, however, show how knowledge about an ancient culture can be designed to serve the speakers of multiple languages. The traditional method is for communities to choose a lingua franca: Akkadian, Greek, Latin, French, German, and now English have all served as common languages of diplomacy and scholarship.
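The vocabulary comparison illustrated in Figure 1.8 can be sketched in a few lines. This is a minimal illustration and not the system pictured: the function names, the toy passages and the frequency counts are all hypothetical, and the sketch compares inflected forms directly, whereas a working system would first reduce each form to its dictionary entry.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase a text and split it into alphabetic word forms."""
    return re.findall(r"[a-z]+", text.lower())


def build_reader_model(readings):
    """Map each word a reader has met to the passages where it occurred."""
    seen = defaultdict(list)
    for passage_id, text in readings.items():
        for word in set(tokenize(text)):
            seen[word].append(passage_id)
    return seen


def compare(new_text, seen, corpus_freq):
    """Split the words of a new text into seen and unseen.

    Seen words are returned with the passages where they were met;
    unseen words are ranked by corpus frequency, most frequent first.
    """
    words = set(tokenize(new_text))
    known = {w: sorted(seen[w]) for w in words if w in seen}
    unseen = sorted((w for w in words if w not in seen),
                    key=lambda w: (-corpus_freq.get(w, 0), w))
    return known, unseen


# Toy data (hypothetical): one passage already read, one new sentence.
readings = {"passage_1": "Gallia est omnis divisa in partes tres"}
corpus_freq = {"insula": 5, "mari": 9}  # made-up frequency counts

seen = build_reader_model(readings)
known, unseen = compare("Omnis insula est in mari", seen, corpus_freq)
print(known)
print(unseen)
```

Frequency is only one of the ranking criteria mentioned above; relevance to a theme could be substituted by changing the sort key.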
The speakers of an unbounded set of local languages communicate by learning one of these linguae francae; thus, the Chinese businessman in a Damascus hotel will probably carry on his business in English. Classicists are more broad-minded but generally expect scholars to publish materials in English, French, German and Italian. Speakers of Croatian or Modern Greek must learn these languages if they are to gain access to most information about the Greco-Roman world. Classicists can, however, design their cyberinfrastructure from the start to be as portable as possible across multiple languages. There are at least three basic strategies, the third and most important of which is peculiarly suitable to historical fields where primary sources are finite and heavily studied.

First, we need to be able to optimize machine translation for the field of classics.45 We can develop statistical models that capture the idiosyncrasies of documents about Greco-Roman culture. We develop these models by adding markup, using a combination of manual and automated methods, to finite bodies of material as training sets. Machine learning systems then scan these bodies and recognize that Alexandria usually refers to the city in Egypt and almost never to the suburb of Washington, DC, by that name. An ambiguous word such as “case” probably designates a grammatical case in a Greek grammar and a display case in a museum catalogue. These domain specific features, once identified, can help general machine translation systems avoid many of the worst problems they face and improve the quality of their output.

Second, we need to include as much basic information as we can in forms from which it can be converted into multiple languages. Thus, if we represent birth and death dates in a generic form, we can then develop modules to represent that knowledge in multiple languages.46 Some ontologies such as the CIDOC-CRM for museum objects and FRBR for books have been under development for years and can represent a great deal of basic background information.47

Third, canonical literary texts attract very large amounts of labor. We can use that labor to create databases of linguistic annotations that describe syntax (e.g., the subject and object of a verb), co-reference (e.g., which person is the subject of a particular verb), and semantics (e.g., where does oratio correspond to "prayer" rather than "oration" or some other concept). These annotations stored in treebanks and other linguistic databases not only allow us to put our understanding of Greek and Latin on a wholly new, quantifiable foundation but can resolve the ambiguities that bedevil machine translation and can ultimately support higher quality machine translation.48 Such annotations are expensive but are, in effect, the digital successors to print editions. Where print editors labored to resolve ambiguities and problems in the textual tradition, digital editors provide machine actionable annotations that resolve, where possible, ambiguities in the reconstructed texts.

The problem of multilingual knowledge thus breaks down into language independent and language dependent phases. Knowledge bases (e.g., basic propositional statements) and linguistic annotation can be created by speakers of any language. The tag sets of ontologies and annotation schemes are relatively contained and can themselves be translated, allowing authors to work entirely with Greek, Latin and their own primary languages: the birthdate of a given author may be uncertain but that uncertainty can be represented in a general form by the speaker of any language. We may differ in how we construe the syntax of a sentence, but anyone who knows Greek, regardless of their native language, can decide which word depends on which and represent this in a common format.

44 Adaptive systems that customize themselves automatically to what a user has already learned have been in development for a number of years now; for some recent work, see (Heilman 2008).
45 Some cultural heritage projects have conducted research into how machine translation can be customized for more resource-poor languages; see (Jones 2007).
46 Various research has explored the potential of either translating semantic markup into multiple languages or mapping between languages; see (Monroy 2007) and (Bia 2006).
47 For a specific look at how CIDOC-CRM is being used with multilingual texts, see (Genereux 2006).
48 The Perseus Project has recently begun work on a Greek treebank, and work on a Latin treebank has been ongoing for over three years; for more on the Latin treebank, see (Bamman 2007) and (Bamman 2006).
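The second strategy, storing basic facts in a generic form and leaving the wording to language-specific modules, can be sketched as follows. This is a minimal illustration under stated assumptions: the `LifeDates` record, the `DRIVERS` table and the `render` function are hypothetical names, the dates are conventional approximations used only as sample values, and the sketch does not follow CIDOC-CRM or any other actual ontology.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LifeDates:
    """Language-neutral record; negative years are BCE."""
    name: str
    birth: int
    death: int
    approximate: bool = False  # uncertainty stored once, for every language


# Each "driver" supplies only the language-specific vocabulary (hypothetical).
# Proper names would also need localization; that is omitted here.
DRIVERS = {
    "en": {"born": "born", "died": "died", "circa": "c.",
           "bce": "BCE", "ce": "CE"},
    "de": {"born": "geboren", "died": "gestorben", "circa": "um",
           "bce": "v. Chr.", "ce": "n. Chr."},
}


def format_year(year, approximate, words):
    """Render one year with its era label and an optional 'circa' marker."""
    era = words["bce"] if year < 0 else words["ce"]
    prefix = words["circa"] + " " if approximate else ""
    return f"{prefix}{abs(year)} {era}"


def render(record, lang):
    """Express the same record in the vocabulary of the requested language."""
    words = DRIVERS[lang]
    born = format_year(record.birth, record.approximate, words)
    died = format_year(record.death, record.approximate, words)
    return f"{record.name}: {words['born']} {born}, {words['died']} {died}"


record = LifeDates("Thucydides", birth=-460, death=-400, approximate=True)
for lang in DRIVERS:
    print(render(record, lang))
```

Supporting a further language then means adding one entry to the vocabulary table rather than re-entering the facts themselves.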


Communities that want to make publications in their own languages accessible to wider audiences will have to develop the training sets for documents about classics. The results will not be perfect but readers can then use dictionary lookups and other translation aids to more closely study the original language. Each language needs its own training sets but this approach will not only make publications in the traditional languages of publication accessible to wider audiences but will also open up publications in less widely read languages (e.g. Croatian and Dutch) to much larger audiences. Communities that want to be able to read basic knowledge about the Greco-Roman world in their own language will need machine translation that can be optimized for classics and language specific drivers that can convert the basic knowledge from ontologies into their language, and systems that can exploit the dense linguistic annotations available for major canonical source texts. The creation of knowledge bases designed from the start to flow from language to language would be a radical change from traditional scholarly practice. Nevertheless, there are profound strategic reasons for this new form of scholarship in the two major classes of society that produce scholarship about the Greco-Roman world. Classical Greek and Latin are the foundational languages of Europe and were the languages of high culture and trans-European discourse until relatively recent times — in fact, Turkey, whatever its religious background, would only restore to Europe a region that had been lost to it from the past. The European Union has a commitment to make the cultural heritage of its nations intellectually accessible to the widest possible audience. 
This implies an infrastructure that maximizes what can be learned not only in English, French, German, and Italian, but in all of the other official languages of Europe.49 The United States, Canada, Australia, New Zealand, and South Africa are, however, not only geographically distinct from

49 The challenges of supporting multi-lingual access to Europe’s cultural heritage through the European Digital Library have been discussed in (Agosti 2007).


Europe but are fashioning themselves into cosmopolitan societies, European in origin but creating new identities with roots from every civilization of humanity. The United States has in particular identified Chinese and Arabic as the two strategic languages on which it will concentrate its resources. While Europe concentrates on making its cultural heritage accessible to the speakers of its official languages, American scholars can take the lead in making classical antiquity increasingly accessible to speakers of Chinese, Arabic and other languages. Ultimately, the increased distribution of Greco-Roman cultural materials into many other languages will speed the complementary process of opening up materials in classical Chinese, Arabic, Sanskrit and other languages to speakers of English and other European languages. Our larger goal must be to make the record of humanity accessible to everyone regardless of linguistic and cultural background.

While a linguistically and culturally portable knowledge base about the Greco-Roman world may seem daunting, the tools already at hand allow us to rethink not only who can read and consume primary and secondary sources but who can contribute substantively to the field. Blackwell and Martin’s essay opens this collection by describing how the practices of undergraduates have begun to change. The rise of undergraduate research is arguably the most important and promising development for classics as a discipline since classics lost its privileged position. Before we can appreciate the possibilities of the technology now available but not yet fully exploited, we need to see how much classicists have already begun to accomplish.

Before turning to the prospects for undergraduate and more general non-specialist research in classics, we should emphasize that the essays published here themselves illustrate the greatest achievement of classical philology in this digital world. We now have a critical mass of classicists who are committed to building and exploiting the evolving digital infrastructure upon which all scholarship and teaching in our field will depend. While discussions of digital humanities still revert to the problem of tenure and promotion, several of the contributors to this collection have already earned tenure by pursuing digital projects. All of the authors here are able to review innovative forms of digital scholarship on its intellectual merits, neither penalizing nor rewarding the use of digital technologies per se but assessing the degree to which the new work advances our ancient and unchanging goals to bring the Greco-Roman heritage in general and ancient Greek and Latin in particular ever more fully to life in the minds of the broadest audience possible.

No one showed more vision and patience to create this community than our colleague and beloved friend, Allen Ross Scaife. He showed the way with his own pioneering work on Diotima, a digital representation of women in antiquity. As director of the Stoa from its founding until his death ten years later, Ross always understood that the greatest resource for any field was the people whom it attracted. Ross supported, fostered, encouraged, and advanced careers that will continue now for decades and will shape other careers as well. "Do not lament," the Pericles of Thucydides (1.143.5) tells the Athenians, "houses and land but people, for it is not houses and land that acquire people but people who acquire them." The passing of Ross Scaife wounds the field of classics more deeply than would have the loss of everything that the field as a whole has produced. But the community that Ross fostered with intelligence, patience and love and that produced these essays is greater than any single achievement that their authors could ever produce.

BIBLIOGRAPHY

ACLS 2006 Our Cultural Commonwealth: A Report of the American Council of Learned Societies Commission on Cyberinfrastructure for the Humanities and Social Sciences, 2006. http://www.acls.org/uploadedFiles/Publications/Programs/Our_Cultural_Commonwealth.pdf. Agosti 2007 Agosti, M. et al. "Roadmap for Multilingual Information Access in the European Library." In Proceedings of the ECDL 2007: 136-147. Arms 2007 Arms, W. and R. Larsen. The Future of Scholarly Communication: Building the Infrastructure for Cyberscholarship. Report on a NSF-JISC Workshop, April 17-19 2007. http://www.sis.pitt.edu/~repwkshop/SISNSFReport2.pdf. Audenaert 2008 Audenaert, N. and R. Furuta. "Annotated Facsimile Editions: Defining Macro-Level Structure for Image-Based Electronic Editions." Digital Humanities 2008 Abstracts.


http://www.ekl.oulu.fi/dh2008/Digital%20Humanities %202008%20Book%20of%20Abstracts.pdf. Bamman 2006 Bamman, D. and G. Crane. "The Design and Use of a Latin Dependency Treebank." TLT 2006: Proceedings of the Fifth International Treebanks and Linguistic Theories Conference: 67-78. http://dl.tufts.edu/view_pdf.jsp?pid=tufts:PB.001.002. 00005. Bamman 2007 Bamman, D. and G. Crane. "The Latin Dependency Treebank in a Cultural Heritage Digital Library." In Proceedings of the Workshop on Language Technology for Cultural Heritage Data (LaTech 2007): 33-40. http://dl.tufts.edu/view_pdf.jsp?pid=tufts:PB.001.002. 00002. Bia 2006 Bia, A. et al. "A Multilingual Markup Translation WebService. An Entry Level Solution to Internationalize XML Markup Vocabularies." WEBIST 2006. http://cio.umh.es/ES/publicaciones/ficheros/CIO_20 06_06.pdf. BL 2008 BL/JISC. Information Behavior and the Researcher of the Future. January 11, 2008. Joint Report funded by the BL/JISC. http://www.jisc.ac.uk/media/documents/programmes /reppres/gg_final_keynote_11012008.pdf. Blanke 2006 Blanke, T. et al. "Digital Libraries in the Arts and Humanities Current Practices and Future Possibilities." INSCIT 2006, http://www.slideshare.net/inscit2006/digital-librariesin-the-arts-and-humanities-current-practices-and-futurepossibilities. Borin 2007 Borin, L. et al. "Naming the Past: Named Entity and Animacy Recognition in 19th Century Swedish Literature." In Proceedings of the Workshop on Language Technology for Cultural Heritage Data (LaTeCH 2007): 1-8. http://www.aclweb.org/anthology-new/W/W07/W070901.pdf. Boschetti 2007 Boschetti, F. "Methods to Extend Greek and Latin Corpora with Variants and Conjectures: Mapping Critical Apparatuses onto Reference Text." In CL 2007: Proceedings of the Corpus Linguistics Conference,


http://www.corpus.bham.ac.uk/corplingproceedings07 /paper/150_Paper.pdf. Bradley 2008 Bradley, J. "Pliny: A Model for Digital Support of Scholarship." Journal of Digital Information, 9:26 (2008), http://journals.tdl.org/jodi/article/view/209/198. Buckland 2007 Buckland, M. "The Digital Difference in Reference Collections." Journal of Library Administration, 46:2 (2007): 87-100. Busa 1974 Busa, R. Index Thomisticus. Stuttgart: FrommannHolzboog, 1974. Busa 1980 Busa, R. "The Annals of Humanities Computing: The Index Thomisticus." Computers and the Humanities, 14:2 (1980): 8390. Cantara 2006 Cantara, L. "Long term Preservation of Digital Humanities Scholarship." OCLC Systems & Services, 22:1 (2006): 38-42. Chi 2007 Chi, E. H. et al. "ScentIndex and ScentHighlights: Productive Reading Techniques for Conceptually Reorganizing Subject Indexes and Highlighting Passages." Information Visualization, 6:1 (2007): 32-47. Cosley 2006 Cosley, D. et al. "Using Intelligent Task Routing and Contribution Review to Help Communities Build Artifacts of Lasting Value." CHI '06: Proceedings of the SIGCHI conference on Human Factors in computing systems: 10371046. Crane 2000 Crane, G. and J. A. Rydberg-Cox. "New Technology and New Roles: The Need for Corpus Editors." Proceedings of the 5th ACM Conference on Digital Libraries 2000: 252-253. http://perseus.mpiwgberlin.mpg.de/Articles/corpused.pdf. Crane 2005 Crane, G. "Reading in the Age of Google: Contemplating the Future With Books That Talk to One Another." Humanities, 26:5 (2005), http://www.neh.gov/news/humanities/200509/readingintheage.html. Crane 2006a Crane, G. et al. "Beyond Digital Incunabula: Modeling the Next Generation of Digital Libraries." In Proceedings of the ECDL 2006: 353-366, http://dl.tufts.edu//view_pdf.jsp?urn=tufts:facpubs:gc rane-2006.00002.


Crane 2006b Crane, G., and A. Jones. "Text, Information, Knowledge and the Evolving Record of Humanity." D-Lib Magazine, 12:3 (2006). http://www.dlib.org/dlib/march06/jones/03jones.htm l. Crane 2006c Crane, G. and A. Jones. "The Challenge of Virginia Banks: an Evaluation of Named Entity Analysis in a 19th-Century Newspaper Collection." In JCDL '06: Proceedings of the 6th ACM/IEEE-CS joint conference on Digital libraries: 31-40. http://dl.tufts.edu/view_pdf.jsp?pid=tufts:PB.001.001. 00007. Crane 2007 Crane, G., et. al. "ePhilology: When the Books Talk to Their Readers." In A Companion to Digital Literary Studies (New York, London: Blackwell Publishing, 2007): 2964. http://dl.tufts.edu//view_pdf.jsp?urn=tufts:facpubs:gc rane-2006.00003. Csomai 2006 Csomai, A. and R. Mihalcea. "Creating a Testbed for the Evaluation of Automatically Generated Back-ofthe-Book Indexes." In Conference on Computational Linguistics and Intelligent Text Processing (CICLing), 2006. http://www.cse.unt.edu/~rada/papers/csomai.cicling0 6.pdf. Culotta 2005 Culotta, A. and A. McCallum. "Reducing Labeling Effort for Structured Prediction Tasks." In Proceedings of AAAI 2005. http://www.cs.umass.edu/~mccallum/papers/multich oice-aaai05.pdf. Dalbello 2006 Dalbello, M. et al. "Electronic Texts and the Citation System of Scholarly Journals in the Humanities: Case Studies of Citation Practices in the Fields of Classical Studies and English Literature." In LIDA 2006: Proceedings of Libraries in the Digital Age, http://dlist.sir.arizona.edu/1638/. Dekhytar 2006 Dekhytar, A. et al. "Support for XML Markup of Image-Based Electronic Editions." International Journal of Digital Libraries, 6:1 (2006): 55-69. Dimitriadis 2006 Dimitriadis, A. et al. "Toward A Linguists WorkBench Supporting eScience Methods." In E-SCIENCE


'06: Proceedings of the Second IEEE International Conference on e-Science and Grid Computing: 131-9. http://www.latmpi.eu/papers/papers-2006/escience-sketchfinal2.pdf/view. Don 2007 Don, A. et al. "Discovering Interesting Usage patterns in Text Collections: Integrating Text Mining with Visualization." In CIKM '07: Proceedings of the sixteenth ACM conference on Conference on Information and Knowledge Management: 213-222. http://hcil.cs.umd.edu/trs/200708/2007-08.pdf. Elings 2007 Elings, M.W. and G. Waibel. "Metadata for All: Descriptive Standards and Metadata Sharing across Libraries, Archives, and Museums." First Monday, 12:3 (2007), http://firstmonday.org/issues/issue12_3/elings/index. html. Fitzpatrick 2007 Fitzpatrick, K. "CommentPress: New (Social) Structures for New (Networked) Texts." Journal of Electronic Publishing, 10:3 (2007). http://hdl.handle.net/2027/spo.3336451.0010.305. Ganchev 2007 Ganchev, K. et al. "Semi-Automated Named Entity Annotation." Proceedings of the Linguistic Annotation Workshop. ACL, Prague, Czech Republic, 2007: 53-56. http://www.aclweb.org/anthology-new/W/W07/W071509.pdf. Garrett 2006 Garrett, J. "KWIC and Dirty? Human Cognition and the Claims of Full-Text Searching." Journal of Electronic Publishing, 9:1 (2006), http://hdl.handle.net/2027/spo.3336451.0009.106. Gatos 2006 Gatos, B. et al. "An Efficient Segmentation-Free Approach to Assist Old Greek Handwritten Manuscript OCR." Pattern Analysis & Applications, 8:4 (2006): 305320. Geleijnse 2007 Geleijnse, G. and J. Korst. "Creating a Dead Poets Society: Extracting a Social Network of Historical Persons from the Web." In Proceedings of the Sixth International Semantic Web Conference and the Second Asian Semantic Web Conference (ISWC + ASWC 2007): 156-168. http://iswc2007.semanticweb.org/papers/155.pdf. Genereux 2006 Genereux, M. and D. Arnold. "Preserving Meanings in Multilingual Text Mining for Cultural Heritage."

CRANE, SEALES AND TERRAS

51

In ICS-Forth Workshop: Exploring the Limits of Global Models for Integration and Use of Historical and Scientific Information,2006 http://cidoc.ics.forth.gr/workshops/heraklion_october _2006/genereux_arnold.pdf. Gietz 2006 Gietz, P. et. al. "TextGrid and eHumanities." In ESCIENCE '06: Proceedings of the Second IEEE International Conference on e-Science and Grid Computing. http://www.textgrid.de/fileadmin/TextGrid/veroeffen tlichungen/TextGrid-Amsterdam-2006-final.pdf. Gold 2007 Gold, A. "Cyberinfrastructure, Data, and Libraries, Part 2: Libraries and the Data Challenge: Roles and Actions for Libraries." D-Lib Magazine, 9 (2007). http://www.dlib.org/dlib/september07/gold/09goldpt2.html. Heilman 2008 Heilman, M. et al. "Retrieval of Reading Materials for Vocabulary and Reading Practice." Proceedings of the Third ACL Workshop on Innovative Use of NLP for Building Educational Applications, 2008: 80-88. http://aclweb.org/anthology-new/W/W08/W080910.pdf. Hockx-Yu 2006 Hockx-Yu, H. "Digital Preservation in the Context of Institutional Repositories." Program: Electronic Library & Information Systems, 40:3 (2006): 232-243. Hyman 2008 Hyman, M. D. "Term Discovery in an Early Modern Latin Scientific Corpus." Digital Humanities 2008 Abstracts: 136-137. http://www.ekl.oulu.fi/dh2008/Digital%20Humanities %202008%20Book%20of%20Abstracts.pdf. Jones 2007 Jones, G. J. F. et al. "Multilingual Search for Cultural Heritage Archives via Combining Multiple Translation Resources." In Proceedings of the Workshop on Language Technology for Cultural Heritage Data (LaTeCH 2007): 8188. http://www.aclweb.org/anthologynew/W/W07/W07-0911.pdf. Kelly 2006 Kelly, K. "Scan This Book!" New York Times Magazine, May 14, 2006: 42+. http://www.nytimes.com/2006/05/14/magazine/14pu blish-

52

CHANGING THE CENTER OF GRAVITY

ing.html?ex=1305259200&en=c07443d368771bb8&ei= 5090. Kirschenbaum 2007 Kirschenbaum, M. "The Remaking of Reading: Data Mining and the Digital Humanities." In NGDM 07: National Science Foundation Symposium on Next Generation of Data Mining and Cyber-Enabled Discovery for Innovation. http://www.cs.umbc.edu/~hillol/NGDM07/abstracts /talks/MKirschenbaum.pdf. Kolak 2008 Kolak, O. and B. N. Schilit. "Generating Links by Mining Quotations." In HT '08: Proceedings of the nineteenth ACM conference on Hypertext and hypermedia: 117-126. Kraft 2005 Kraft, J. C., Rapp, G., Gifford, J. and Aschenbrenner, S., "Coastal Change and Archaeological Settings in Elis", in Hesperia 74 (2005): 1-39. Krowne 2003 Krowne, A. "Building a Digital Library the Commons-Based Peer Production Way." D-Lib Magazine, 9:10 (2003). http://www.dlib.org/dlib/october03/krowne/10krown e.html. Lagoze 2006 Lagoze, C. et al. "Metadata Aggregation and Automated Digital Libraries: a Retrospective on the NSDL Experience.". In JCDL '06: Proceedings of the 6th ACM/IEEE-CS joint conference on Digital Libraries: 230239. Lally 2007 Lally, A. M. and C. E. Dunford. "Using Wikipedia to Extend Digital Collections." D-Lib Magazine, 13: 5/6 (2007). http://www.dlib.org/dlib/may07/lally/05lally.html. Lu 2008 Lu, X. et al. "A Metadata Generation System for Scanned Scientific Volumes." In JCDL '08: Proceedings of the 8th ACM/IEEE-CS joint conference on Digital libraries: 167176. Luce 2008 Luce, R. E. "A New Value Equation Challenge: The Emergence of E-Research and Roles for Research Libraries." In No Brief Candle: Reconceiving Research Libraries for the 21st Century. CLIR 2008: 42-50, http://www.clir.org/pubs/reports/pub142/pub142.pdf .

CRANE, SEALES AND TERRAS

53

Marshall 2008 Marshall, C. C. "From Writing and Analysis to the Repository: Taking the Scholars' Perspective on Scholarly Archiving." In JCDL '08: Proceedings of the 8th ACM/IEEE-CS joint conference on Digital libraries: 251260. Moalla 2006 Moalla, I. et al. "Image Analysis for Palaeography Inscription." in DIAL 2006: Document Image Analysis for Libraries: 303-311. Monroy 2007 Monroy, C. et al. "A Multilingual Approach to Technical Manuscripts: 16th and 17th-century Portuguese Shipbuilding Treatises." In JCDL '07: Proceedings of the 2007 conference on Digital libraries: 413-414. Plaisant 2006 Plaisant, C. et al. "Exploring Erotics in Emily Dickinson's Correspondence with Text mining and Visual Interfaces." In JCDL '06: Proceedings of the 6th ACM/IEEE-CS joint conference on Digital libraries: 141150. Ponzetto 2007 Ponzetto, S. P. "Creating a Knowledge Base From a Collaboratively Generated Encyclopedia." In Proceedings of the NAACL-HLT 2007 Doctoral Consortium: 9-12. http://acl.ldc.upenn.edu/N/N07/N07-3003.pdf. Porter 2006 Porter, D. et al. "Creating CTS Collections." Digital Humanities 2006: 269-274. http://www.csdl.tamu.edu/~furuta/courses/06c_689d h/dh06readings/DH06-269-274.pdf. Pritchard 2008 Pritchard, D. "Working Papers, Open Access, and Cyber-infrastructure in Classical Studies." Literary and Linguistic Computing, 23:2 (2008): 149-162. http://ses.library.usyd.edu.au/handle/2123/2226. Riva 2005 Riva, M. and V. Zafrin. "Extending the Text: Digital Editions and the Hypertextual Paradigm." In HYPERTEXT '05: Proceedings of the sixteenth ACM conference on Hypertext and hypermedia: 205-207. Robinson 2000 Robinson, P. "The One Text and the Many Texts." Literary and Linguistic Computing, 15:1 (2000): 5-14. Robinson 2005 Robinson, P. "Current Issues in Making Digital Editions of Medieval Texts or, do Electronic Scholarly Editions have a Future?" Digital Medievalist, 1:1 (2005). http://www.digitalmedievalist.org/journal/1.1/robinso n/.

54

CHANGING THE CENTER OF GRAVITY

Romanello 2008 Romanello, M. "A Semantic Linking Framework to Provide Critical Value- Added Services for EJournals on Classics." In ELPUB2008. Open Scholarship: Authority, Community, and Sustainability in the Age of Web 2.0 - Proceedings of the 12th International Conference on Electronic Publishing. http://elpub.scix.net/cgibin/works/Show?401_elpub2008. Rosenzweig 2006 Rosenzweig, R. "Can History be Open Source: Wikipedia and the Future of the Past?" Journal of American History, 93:1 (2006): 117-146. http://chnm.gmu.edu/resources/essays/d/42. Sankar 2006 Sankar, K. et al. "Digitizing a Million Books: Challenges for Document Analysis." in Document Analysis Systems VII (2006): 425-436, http://cvit.iiit.ac.in/papers/pramod06Digitizing.pdf. Schilit 2008 Schilit, B. N. and O. Kolak. "Exploring a Digital Library through Key Ideas." In JCDL '08: Proceedings of the 8th ACM/IEEE-CS joint conference on Digital libraries: 177186. Schroeter 2007 Schroeter, R. et al. "Annotating Relationships Between Multiple Mixed-Media Digital Objects by Extending Annotea." In Proceedings of ESWC 2007: 533548. http://espace.library.uq.edu.au/view/UQ:151380. Shirky 2008 Shirky, C. "Here Comes Everybody." Retrieved 08/02, 2008, from http://blip.tv/file/855937/. Simeoni 2007 Simeoni, F. et al. "A Grid-Based Infrastructure for Distributed Retrieval." Proceedings of the ECDL 2007: 161-173. Smith 2001 Smith, D.A. and G. Crane. "Disambiguating Geographic Names in a Historical Digital Library." In ECDL '01: Proceedings of the 5th European Conference on Research and Advanced Technology for Digital Libraries: 127-136, http://perseus.mpiwgberlin.mpg.de/Articles/geodl01.pdf. Sperberg 1994 Sperberg-McQueen, C. M. and L. Burnard, Eds. Guidelines for Electronic Text Encoding and Interchange. Chicago and Oxford: Text Encoding Initiative, 1994. Tobin 2008 Tobin, R. et al. 
"Named Entity Recognition for Digitised Historical Texts" in Proceedings of the Sixth International Language Resources and Evaluation Conference

CRANE, SEALES AND TERRAS

55

(LREC'08). http://www.ltg.ed.ac.uk/np/publications/ltg/papers/b opcris-lrec.pdf. Trnkoczy 2006 Trnkoczy, J. et al. "A Grid-Based Architecture for Personalized Federation of Digital Libraries." Library Collections, Acquisitions, and Technical Services, 30:3-4 (2006): 139-53. United 1880 United States. War Dept., United States. War Dept. War Records Office., et al. The War of the Rebellion: a compilation of the official records of the Union and Confederate armies. Washington, Govt. Print. Off., 1880. van 2006 van Gendt, M. et al. "Semantic Web Techniques for Multiple Views on Heterogeneous Collections: A Case Study." In Proceedings of ECDL 2006: 426-437. van den Branden 2007 van den Branden, R. and E. Vanhoutte. 2007. "Through the Reading Glass: Generating an Editorial Microcosm Through Experimental Modelling." Digital Humanities 2007. http://www.digitalhumanities.org/dh2007/abstracts/x html.xq?id=182. Veltman 1999 Veltman, K. "Digital Reference Rooms: Access to Historical and Cultural Dimensions of Knowledge." INET 99. http://www.isoc.org/inet99/proceedings/2b/2b_1.ht m. Vlachos 2006 Vlachos, A. "Active Annotation." In Proceedings of the EACL 2006 Workshop on Adaptive Text Extraction. http://acl.ldc.upenn.edu/W/W06/W06-2209.pdf.

TECHNOLOGY, COLLABORATION, AND UNDERGRADUATE RESEARCH

CHRISTOPHER BLACKWELL
FURMAN UNIVERSITY

THOMAS R. MARTIN
COLLEGE OF THE HOLY CROSS

ABSTRACT
In this article, two professors of Classics present their experiences in incorporating into their professional activity a model of undergraduate research that reflects Ross Scaife’s ideals of collaborative, open scholarship, informed by traditional values, and taking advantage of advances in digital humanities.

INTRODUCTION
We write this essay in memory of our friend and colleague Ross Scaife. Since Classics is what we, the authors, know best, and since Ross Scaife’s contribution to Classics is incalculable, we are focusing on Classics as a sub-discipline of the humanities. We hope that others may find our observations, limited as they necessarily are to our own experience, interesting, useful, or provocative.

To survive and prosper, colleges and universities have to sell themselves to prospective students (or at least to prospective parents) as nurturing environments, where students and faculty work hand-in-hand, ideally on important-seeming and photogenic tasks. When "undergraduate research" gets pitched on websites, mass-mailed DVDs, and glossy brochures, it invariably takes the form of chemical glassware or a string quartet. Students and teachers in the field of Classics, however, when they appear in such promotional materials, are usually shown in the context of advertising an institution’s "dedication to teaching," because in the context of "teaching" it is okay to show the middle-aged professor gesturing at a blackboard covered with scrawled notes. And as Gregory Crane has also described in the introduction to this issue, this picture is not a misleading accident of the demands of superficial marketing but is too often an accurate reflection of an unfortunate truth.1 Faculty in Classics have not, in our experience as students and teachers, been very interested in fostering the collection of activities that now go under the general term "undergraduate research". Classics faculty have traditionally claimed to inspire the life of the mind, to teach critical and subtle thinking, and to exercise students’ intellects through close reading, rigorous philology, and stimulating discussion, and often these claims are entirely justified. But it would be disingenuous to assert that the most insightful reading, the parsing of the most complex syntax, or the most lively conversation is similar in kind or effect to a public performance of Ravel’s "String Quartet: II - Assez vif - Très rythmé," which in the hands of well-trained undergraduates can send shivers down the spines of an audience, or similar in kind and effect to undergraduates working in a laboratory that is engaged in developing new therapies to cure

1 Undergraduate research, and in particular undergraduate research in the humanities, is the subject of enthusiastic study and ongoing publication. For some recent discussions whose scope is perhaps broader than the present subjective account, see, among others: (Hu 2007), (Ishiyama 2007), (Kinkead 2003), (Lopatto 2006), (Malachowski 2003), (Merkel 2003), (Roger 2003), (Wilson 2003).

diseases. We can make comparisons, but they are more metaphors than analogues. The ’cellist or the biochemist, to choose only two examples, can have an immediate and concrete effect on the larger world around them during their undergraduate years, while young students of Classics too frequently do not, and cannot, at least not without some creative thinking on the part of their instructors. Undergraduate research in the Classics, as traditionally practiced, is a diluted version of professional scholarship in the field as it developed during the second half of the 20th century, and as it came to be seen by the turn of the 21st century as an absolute touchstone for appropriate professional activity. We teachers send our students forth to read scholarship and produce argumentative essays, carefully and selectively annotated with citations to primary and secondary sources. Ideally, the primary sources serve our students as evidence, and the citations to secondary sources provide armor against charges of plagiarism. We call this a "diluted" version of professional scholarship for two reasons. A given undergraduate student (let’s say that she is taking her first course in Roman History, a survey course in a subject of tremendous complexity based on literally centuries of cumulative scholarship in multiple disciplines from epigraphy and numismatics to textual criticism and literary theory) is unlikely in the course of a semester or quarter to be able to develop and advance an idea or interpretation that has never before been produced by a professional scholar in the centuries-long history of research in ancient history. For this reason, no responsible teacher of Classics would insist on originality (in this sense of a new and unique idea or interpretation, as opposed to an idea or interpretation that is one’s own and expresses one’s personal intellectual effort) in papers assigned in an introductory survey. 
(A professional scholar in the field, by contrast, must demonstrate an original thesis, in the sense of an idea or interpretation that is not only his or her own but is also new and unique, as the first condition of publication.) Taking advantage of the kinds of papers that we can responsibly expect our undergraduate students in Classics to write is the subject of the first section of this paper, An Audience of More Than One…. The second way in which undergraduate research in Classics tends to deserve the appellation "diluted" is in its approach to citation of sources. Undergraduates cite sources for two reasons: first, because their teacher insisted that they consult a certain number of

sources, as an exercise in "learning to do research," and second, to prove that they did not plagiarize any of the sentences that they have strung together to form an argument. But citation is not a pre-digital anti-plagiarism technology, or at least it should not be in a wholesome intellectual environment. Citation is a pre-digital equivalent to the hyperlink, a way of continuing an ongoing conversation in print, a pathway back from the author’s current words to previous comments on the topic at hand. This idea, in our experience, is often completely unfamiliar to undergraduates, even now that they are subjected to rigorous formal indoctrination against the evils of plagiarism. In the second section of this paper, "When All the Sources are Online," we explore some possibilities for educating our students in a more positive understanding of citation, and we will suggest a mode of scholarship that can make plagiarism less of a temptation for the student and therefore less of a concern for the teacher. When done with insight and method, citation generates the only kind of research that carries conviction: research based on a clearly defined data set whose parameters are unambiguously described, thus opening the way to complementary and, in the best of circumstances, truly collaborative extension of the results. The traditional mode of Classical scholarship (deep and wide reading yielding new insights which are expressed through rigorous argument) is extremely difficult and has in the past not lent itself to collaboration, especially collaboration between young scholars and their teachers. But there are modes of humanist scholarship that do lend themselves to this collaboration, especially when the environment of scholarship is flexible enough to accommodate that collaboration and value it. 
In the third part of this paper, From Each According…, we discuss one category of work—collation and indexing of primary-source material—that has a long tradition as a respected pursuit in Classics and is appropriate for undergraduates. As Gregory Crane asserts in the Introduction, the value of this kind of work has multiplied in the face of technologically mediated scholarship, as our discipline rebuilds itself from the ground up. In the final section of this paper, Shaking the Foundations, we describe at least one way in which undergraduate researchers in Classics are now, thanks to the mediation of technology, in a position to help re-build the discipline of classical philology on a more

sound footing, undertaking projects similar to those done in earlier centuries by figures who are now hailed as giants of Classical scholarship. The late 19th and early 20th centuries saw the publication of authoritative editions of the major Greek and Latin authors. This period can be seen to have culminated with the publication in 1931 of the editio maior of T. W. Allen’s Homeri Ilias, in three volumes, the work of 44 years. Allen’s monumental edition collects in its critical apparatus variant readings from dozens of the most important Byzantine and medieval manuscripts, and hundreds of papyrus witnesses to the text of the Homeric poem. The kind of work that Allen brought forth remains the essential foundation of Classics, now and into the future. Nevertheless, his effort necessarily was defined and limited by the state of technology in his day. Therefore, to extend and amplify Allen’s results, beginning in the winter of 2007, two teams of scholars, university faculty supervising undergraduates, undertook to use his work on the Iliad as the basis for an entirely new approach to a critical edition of Homeric poetry. In the process, these undergraduate researchers became experts in Homeric text-criticism and pushed forward the boundaries of our understanding of that field and of the poetry it concerns.

AN AUDIENCE OF MORE THAN ONE…
A traditional writing assignment in an undergraduate Classics course is an essay in which the student-author argues a thesis with supporting evidence from primary and secondary sources. Such assignments are opportunities for teaching students to articulate their thoughts and use sources appropriately. Properly understood, an essay written for such an assignment constitutes a moment in a conversation, in which the student interprets primary sources, understood through the lens of previous scholarship, and makes a statement to the teacher and sometimes to fellow students as well. Responsible teachers judge their students’ essays according to the criteria of accuracy, clarity, appropriate scope, and significance. Originality in the sense of "something never before thought or expressed in the history of previous scholarship in Classics" is generally not a criterion, because few undergraduates (and not just they!), however bright, can come up with a new and unique thesis concerning Sophocles’ tragedy Oedipus Tyrannos, Homer’s Iliad and Odyssey, the career of Alexander the Great, the end of the Roman

Republic, or any of the other much-studied and much-discussed topics likely to be covered in undergraduate courses in Classics. In the absence of this kind of intellectual originality, undergraduate essays are poor candidates for traditional publication.2 And secondary scholarship in Classics often presents inappropriate models for students. For example, this very interesting article would be of interest to students studying Greek or Roman religion: H. S. Versnel, "The Festival for Bona Dea and the Thesmophoria" (Versnel 1992). But in 25 very closely argued pages, Versnel refers to 38 works of 20th century scholarship, while including only 16 citations to ancient evidence (and these include several citations to the same passage). Are we to assume that our knowledge of both the Greek rites of the Thesmophorion and the Roman festival of the Bona Dea stands on a mere handful of primary sources? Of course not, and the author adds a footnote at the end telling us that the article in Greece & Rome is a "highly abridged" prepublication of a chapter in a then-forthcoming book, which refers "to the ample evidence and argumentation" (Versnel 1992, 55 n.1). But as a publication to be read by students (and why else publish it?) the author’s choices in abridgment are regrettable; the current prepublication emphasizes secondary scholarship over primary sources to an overwhelming extent. Secondary scholarship in Classics aimed at a general readership is notorious for failing to give citations to the actual evidence behind its assertions. Paul Cartledge has written a 29-page chapter on Greek religious festivals that spends eight pages talking about the history of the Olympic Games, including a chart listing athletic events and in what year they were added to the Olympic festival, without ever citing an ancient source for this information (by the way, the source is Pausanias 5.8) (Cartledge 1985).3 We would argue

2 For two recent explorations of challenges and possibilities in undergraduate research and writing, see: (Grobman 2007); (Robillard 2006).
3 Cartledge based his version of the "Olympic Register," we suppose, on a century-old article: J. P. Mahaffy, "On the Authenticity of the Olympian Register" (Mahaffy 1881). This particular problem of history and its sources has been ameliorated by the publication, in 2007, of Paul

that non-specialist readers are those readers most in need of access to the fundamental sources, if for no other reason than that they are least likely to have other, better documented, resources at hand. This is a problem, but it holds an opportunity. Undergraduate students, even those in the earliest stages of their exposure to the ancient world, can undertake the too-often ignored task of ascertaining and explaining the primary evidence underlying particular questions. A paper entitled "What are the ancient sources for the Olympic Register?" would be within the ability of almost any student with guidance from a teacher and some dedicated librarians, especially one who has had at least one semester of Greek. Such an assignment would teach techniques of research, would expose the student to various modes of scholarship, and would be relatively straightforward to organize and write. The results would be useful and at least as worthy of publication as any number of specific arguments about the Olympic Register that may certainly stand on a similar list of sources, but which will not bother to cite them explicitly for the benefit of curious readers. As a publication, even a paper entitled "Some ancient sources for the Olympic Register" that defined the parameters of its data set would have recently been extremely helpful to one of the authors of the present paper, since Mahaffy’s 1881 article does not cite sources not immediately relevant to his argument (see note 3 above). We have experience enlisting undergraduates as authors of scholarship intended for true publication, essays intended to be read not as demonstration-pieces limited in interest by the context of a particular class but as pieces to inform the public and provide a wide audience with accurate and useful insights. 
The online encyclopedia Dēmos: Classical Athenian Democracy, published by The Stoa, aims to bring the sources for our knowledge of Athenian government in the 5th and 4th centuries to the eyes of a general readership.4 One of the editorial principles of the essays published

Christesen, Olympic Victor Lists and Ancient Greek History (Christesen 2007).
4 For a lengthier description of this project, see (Blackwell 2004).

in Dēmos is that every statement must be accompanied by a citation to primary source evidence. There is no claim to exhaustive coverage of sources for a given statement, but at no point does the reader have to take the author’s words for granted. As a further help to readers, Dēmos offers short essays about the evidence it cites. Take for example this statement and its cited evidence:

In times of crisis, the Assembly was responsible for voting to mobilize, and the first step seems to have been a vote that the trierarchs (τριήραρχαι) get their ships ready for sea. (Dem. 50.4)

"Dem. 50.4" will not convey much to most readers outside the discipline of Classics or ancient history. Even if the citation is a hyperlink to an online text of that passage, as it is in Dēmos, and the reader follows the link, she will see this:

On the twenty-fourth day of the month Metageitnion in the archonship of Molon, when an assembly had been held and tidings of many serious events had been brought before you, you voted that the trierarchs (of whom I was one) should launch their ships. It is not necessary for me to go into details regarding the crisis which had at that time befallen the state; you of yourselves know that Tenos had been seized by Alexander, and its people had been reduced to slavery.

In order to be an informed critic and analyst of any assertion with "Dem. 50.4" as its evidence, a reader must have some understanding of the context of that passage. So Dēmos includes, beside its citations, links to short essays describing the nature of the sources. The one that accompanies this citation describes Demosthenes’ speech against Polycles; it begins:

(Demosthenes, Against Polycles; see also Oratory) Although this speech comes down to us under the name of Demosthenes, it was almost certainly written by Apollodorus, who was suing Polycles. Apollodorus is trying to recover some expenses that he, Apollodorus, had incurred after his term of duty as trierarch (τριήραρχος), that is, after his service as a private citizen responsible for outfitting a warship for service in the Athenian navy. Polycles was the man who was supposed to

take over the trierarchy after Apollodorus. Hershal Pleasant, "Demosthenes 50", (Pleasant 2003)

This essay, which continues for four more paragraphs, was written by an undergraduate, a classics major in his third year at Furman University. The research involved consisted of his close reading of the speech, with an eye toward explaining the argument and the issues at stake. The essay links to another essay on Greek Oratory generally, which calls attention to the generic problems with these speeches as sources for history. As an assignment for an upper-level seminar on Greek Prose, this piece of writing served to focus the student’s reading of the speech, to exercise his command of written prose, and to demonstrate to his teacher his understanding of the complexities of a dispute over a trierarchy. But the product of the assignment is a real publication intended to inform a wider audience. While no different in kind from any number of papers written by students, this work makes a real contribution to knowledge: if not to the absolute amount of knowledge in the world, then at least to the knowledge of a potentially large number of nonprofessionals interested in understanding Athenian democracy. Of course, the example above falls under the much-maligned category of "popularizing" scholarship, works aimed at bringing a topic to the less-informed masses. The malignity with which this kind of writing is often regarded in the professional community of Classicists seems to us to arise from the (supposedly) lighter demands it places on its authors. But this kind of writing has its own exacting demands (a perceptive synthesis, a broad knowledge of the general context and primary sources, and clear, concise expression free of jargon), and these demands align perfectly with those of most assignments given to undergraduate students. 
Opportunities for having students contribute the fruits of their research and writing to public forums abound these days—from Wikipedia, whose articles can often profit from educated intervention, to local web-based publications, which can be tightly controlled by their professional editors, and easily discoverable by the world at large, to ongoing, illustrious, and highly regarded web-based projects in collaborative scholarship such as the Suda Online, which has incorporated the work of scores of undergraduate translators, in addition to the millions of words of translation and commentary by professional scholars. (Finkel 2001)

WHEN ALL THE SOURCES ARE ONLINE
If teachers of Classics want to assign students research-based topics for their written work of the kind mentioned above, as opposed to, say, opinion pieces or creative writing, then by the terms of the assignment, the students’ analysis must be based on a defined data set of sources (whether primary or secondary or both). To complete the assignment, the student of course needs access to the relevant data, and therefore the teacher must always pay attention to the availability of sources in making such assignments. In today’s world of Classics, there are numerous and frequent limitations on the availability of the necessary data, whether primary or secondary. In our fields of Classics and ancient history, only a tiny number of college and university libraries even approach (and none actually achieves) the ideal of possessing and making available to their users the full collection in print form of ancient texts and modern scholarship, which is strongly international and multi-lingual in scope. The costs of acquiring, storing, and lending books and articles are too high and the physical problems of deterioration of printed materials too serious for any library to achieve such a goal.5 Now, it is certainly not plausible for a college or university library to possess and make available anything near the totality of primary and secondary sources in our field to serve the needs of teachers in creating research-based assignments in Classics, given the limited time and scholarly expertise of undergraduate students. But the problem of the limitations on the accessibility of data remains pertinent nevertheless. No library, for example, can make available sufficient printed copies of even the most commonly available primary texts to allow a significant number of students to work on the same assignment individually. The recourse is, of course, to ask students to buy copies of books or articles themselves if they are not available freely online. This attempted and always partial solution to the problem of the limited accessibility of sources is itself becoming less and less feasible as the cost of books continues to rise and students continually experience severe financial pressures from the escalation of the expense of attending college or university in general. The issue of the accessibility of secondary sources in our field is much more pervasive, even if we confine ourselves to modern scholarship in English. Classics and ancient history scholars writing in English produce a large number of increasingly expensive books (scholarly volumes priced at a hundred dollars or more are now common, even the norm). These books are published not just in the United States but also in the United Kingdom, Australia, and indeed in non-English-speaking countries (e.g. Denmark, Holland, and India), and this international dispersion of publishing can make acquiring them a challenge sometimes even for those willing to pay the price. The situation regarding scholarly articles is much more complicated and expensive still. Scholars in our fields publish in a wide array of English-language journals published here and abroad. The cost of journals has soared in recent years, and few libraries subscribe to more than a very limited subset of those published around the world. Very few journals in our fields are available electronically, whether by paid subscription or without charge. Reading articles in scholarly journals can require travel to major libraries, use of inter-library loan services, or subscription to expensive electronic databases such as JSTOR (http://www.jstor.org/). In short, access to secondary scholarship is often constrained for reasons that are difficult to overcome. 

5 For discussions of online sources, undergraduate researchers, history, and historiography, see: Christopher Hanlon, "History on the Cheap: Using the Online Archive to Make Historicists out of Undergrads" (Hanlon 2005); John Willison and Kerry O’Regan, "Commonly known, commonly not known, totally unknown: a framework for students becoming researchers" (Willison 2007).

For this reason, it is very challenging to make these sources consistently available to students, especially in a group setting, for assignments that are usually due within relatively short intervals of time. The effect of these constraints on the accessibility of source data in our field is to limit the number and type of assignments that can be given to our students with any reasonable expectation that they will be able to complete the work in the time allowed. This fact in turn limits the intellectual goals that we as teachers can set for our students. "When all the sources are online"—the ideal that we envision through the use of electronic technology—is an idea that would change all this. Whether it will actually turn out to be possible to have absolutely all source data online in the future is

not the point; rather, the issue is how research-based learning for undergraduates in Classics could change if the presently existing constraints on source data were eliminated as far as possible by making primary and secondary sources accessible to any student who has access to the Internet. The first change would be that teachers would be able to make assignments based purely on the intellectual value of the work rather than limit their options to those assignments feasible with the limited print resources at hand. Secondly, it would be possible to teach students how to treat primary sources truly as "primary", meaning to ground arguments and statements in the most foundational evidence extant, while always defining the parameters of the data set being used. Thirdly, students could be taught to examine the evidentiary basis of secondary sources. They would learn the crucial difference between secondary sources that fully and clearly ground their arguments and statements in primary source material and secondary sources that fail to do so. For the former, they could easily access the cited primary sources, so as to control the validity of the interpretation of the primary sources in the secondary source. Exercising this sort of control effectively would naturally require training by experienced teachers who are themselves committed to research-based learning, and it would no doubt take time to learn this skill. But at least the possibility would exist for students to take this major step toward gaining the intellectual power and independence that this approach would support. Admittedly, these sorts of assignments would be onerous work for students and teachers alike, and it is obviously not practical for everyone to track down every primary source for every argument and statement in every case, at least not in reasonable amounts of time. 
But the very possibility of examining primary sources because they are all available online would change the intellectual climate of undergraduate work. The very effort of examining primary sources and thinking about their possible meanings would bring home the reality that scholarship is always research, in the sense of finding, identifying, interpreting, and presenting evidence. Students could operate as scholars, whether through the process of verifying the plausibility of the presentation of evidence by others, or by presenting arguments and interpretations that are in one way or another original, in all the various senses of that word. In this context, secondary sources that concentrated on making clear presentations of primary evidence would become especially valuable to students because they would model the behavior for this approach. When all the sources are online, then we as teachers of Classics can more effectively engage our undergraduate students as collaborators in research, whether in the collection of, for example, themed primary source collections, or in the interpretation of the countless issues in Classics and ancient history that still await effective investigation based on careful analysis of well-chosen and clearly defined data sets rather than impressionistic assertions. When all the sources are online, the way students and teachers do their work together will change dramatically, and for the better.
Another, related point for the near-term future: a concerted and visible effort to put all sources online might help curb one of the most frustrating and careless habits of secondary scholarship in our field. The unthinking arrogance of this habit places onerous and completely unnecessary hurdles in front of students and faculty at the great majority of colleges and universities in their attempts to pursue research-based learning that emphasizes identifying primary sources at the most basic level possible, a research method that requires being able to see the full source of a cited excerpt of text so that its context can be evaluated. That habit is the practice of listing excerpts from "lost" (i.e., no longer directly preserved) ancient works (so-called fragments) and evidence about them or their authors (so-called testimonia) by the arbitrary numbers assigned the fragments and testimonia in modern works collecting the preserved remains of such lost texts.
Giving arbitrary numbers to fragments, such as "Fragment 1 of author such-and-such," is tantamount to asserting that the fragments are somehow free-standing textual entities, when in fact they are in the great majority of cases simply quotations or paraphrases from other, still extant texts. (We leave aside fragments that are indeed fragmentary pieces, found in a partially preserved papyrus or in an inscription.) In truth, of course, the work to which a fragment originally belonged simply no longer exists. It is well and truly lost, unless by some near-miracle a previously unknown copy turns up in a manuscript buried in a library or is recovered from a papyrus found in an archaeological excavation. A fragment of an ancient historian or comedian, for example, that is embedded in, say, the text of Athenaeus’ extant work Sophists at Dinner is in reality only part of Athenaeus, not a free-standing text. Assigning a number to this excerpt in a modern collection of fragments does not alter that basic fact. To study the fragment in the spirit of research-based inquiry that we envision, it is always necessary to consider this bit of text in the context of the larger text of which it is a part. This being the case, it is astonishing and distressing that modern secondary scholarship in Classics still tends to refer to fragments only by their arbitrarily assigned fragment numbers and often neglects to give in addition the reference to the text to which the fragment in fact belongs, the true source of the fragment. This habit makes it difficult to track down the true location of a fragment unless one is in the exceedingly fortunate position of personally owning or having immediate access through a major library to the modern collections of fragments in which the actual sources of the fragments are ultimately revealed. In scholarship on ancient Greek history, to describe the situation in our particular specialty, it remains standard practice to cite fragments from the works of lost ancient Greek historians by the numbers assigned to the fragments by Felix Jacoby in his monumental collection of the remains of "fragmentary ancient Greek historians," Die Fragmente der griechischen Historiker (Jacoby 2004). Therefore, a student or researcher reading a work of modern scholarship on an ancient Greek historical topic is extremely likely to encounter a reference to this collection.
Suppose, for instance, that a student in a course on freedom and tyranny in ancient Greece becomes interested in the colorful and controversial career of Dionysius I, the (in ancient Greek terminology) "tyrant" of Syracuse in Sicily in the classical era. There is controversy over his career concerning just how (in modern terms) "tyrannical" Dionysius I really was.
To learn more in order to address the question of the nature of the rule of Dionysius I and having been in this case warned away from Wikipedia by her solicitous instructor, our enterprising student begins her inquiry by looking up his name in the standard one-volume print encyclopedia on ancient Greece and Rome commonly recommended by teachers and found in nearly every library in the land, the third edition of the Oxford Classical Dictionary (Oxford 1996). The student reads to the very end of the article on Dionysius I because her teacher has impressed on her the crucial need to try to find the sources on

which encyclopedia articles are based. There, on p. 1526, the student finds a reference to "*Timaeus (2)" as a source for the career of the alleged tyrant. Being admirably industrious, the student is aware that the asterisk indicates a cross-reference to another article in the encyclopedia. Turning to the article on Timaeus (pp. 1526-1527), our researcher discovers that Timaeus was "the most important western Greek historian" and that being "a conservative aristocrat, [Timaeus in his work Sicilian History] distorted not only the historical picture of Agathocles [another tyrant], who had exiled him (fr. 124), but also of other tyrants, e.g. *Hieron (1) I and Dionysius I (frs. 29, 105)." The article has nowhere explicitly said that Timaeus’ Sicilian History is a lost work, but our attentive student has deduced this melancholy fact from the earlier comment in the article that Timaeus’ history "is known through 164 fragments, the extensive use of it made by *Diodorus (3) (4-21 for the Sicilian passages), and *Polybius (1)’s criticism in book 12." Where, then, the inquiring mind of our researcher wants to know, are the fragments to be found? Because her teacher has told her about the situation concerning the publication of fragmentary ancient Greek historians, she realizes that the notation "FGrH 566" at the beginning of the bibliography is the key to the mystery of the locations of the fragments. The fragment numbers cited in the article, she realizes, refer to Jacoby’s massive collection. If, to verify the accuracy of the encyclopedia article, she wants to know what Timaeus actually said in allegedly distorting the picture of Dionysius I, all she has to do is read fragments nos. 29 and 105 under historian no. 566 in FGrH. The problem is that her library does not own a copy of this multi-volume reference work.
Why they don’t own it is easy to understand: a quick glance at the web site of its publisher reveals that an institutional license for the CD-ROM version costs US$3,159.00. If she did by chance have access to FGrH, she could easily find out that Timaeus fr. 29 is actually a scholion (a later scholarly comment) on the tenth chapter of the second oration extant from the stylus of the fourth-century BCE Athenian politician Aeschines, while Timaeus fr. 105 is from the second-century CE essayist and biographer Plutarch, in the collection of his essays known as the Moralia, at a location (717C) easily found through the canonical reference system applied to this collection. Finding a text of the scholion would remain challenging, as these scholarly comments

are themselves available only in scarce, hard-to-find print editions. Finding the Plutarch passage would be relatively easy, however, as the Moralia are available in the widely held and affordable Loeb Library (which conveniently includes an English translation along with the Greek original). Being able to find only one of the two primary sources that she needs might be frustrating to our motivated researcher, but it is a considerable improvement over not even knowing what the relevant primary sources are, which is the case when one is confronted with references such as "FGrH 566 frs. 29 and 105." At the very least, scholars who subscribe to the value of the ideal of "When all the sources are online" would, we hope, never fail to cite the underlying primary source when they need to refer to a fragmentary text. To choose as an example a citation from a book written by one of our organizers, this is the way citation of fragmentary ancient Greek authors should be done: "According to Ion of Chios (FGrH 392 fr. 14 = Plut. Kimon 16.8), Kimon 'inspired the Athenians most of all by calling upon them neither to leave Hellas lame nor to stand by and watch their own city lose its yokefellow'" (Crane 1996, 112). Thanks to this form of citation, in which the underlying primary source of a fragment is indicated, readers do not need access to FGrH to learn that they can read the fragment of Ion of Chios in an easily found biography by Plutarch. When authors fail to cite the primary sources for fragmentary works and a researcher has no access to the referenced modern collection of fragments, the only alternative for finding the underlying primary source is to try to identify the fragments by choosing likely words to search for in the Thesaurus Linguae Graecae database, which is a cumbersome and non-comprehensive option, even assuming that the researcher has access to the not-inexpensive TLG.
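The citation practice recommended here amounts to keeping, alongside every arbitrary fragment number, a pointer to the extant text that actually preserves the words. A minimal sketch in Python may make the idea concrete. The `Fragment` record and the `cite` and `lookup` helpers are hypothetical names introduced only for illustration (they belong to no existing tool), and the three entries are simply the fragments discussed in this chapter:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Fragment:
    """One entry in a hypothetical fragment-to-source concordance."""
    collection: str   # modern collection, e.g. "FGrH"
    author_no: int    # number assigned to the lost author in that collection
    fragment: int     # arbitrary fragment number
    source: str       # extant text in which the fragment is embedded


# Illustrative data: only the fragments mentioned in this chapter.
CONCORDANCE = [
    Fragment("FGrH", 566, 29, "scholion on Aeschines 2.10"),
    Fragment("FGrH", 566, 105, "Plut. Moralia 717C"),
    Fragment("FGrH", 392, 14, "Plut. Kimon 16.8"),
]


def cite(frag: Fragment) -> str:
    """Format a citation that names both the fragment and its true source."""
    return f"{frag.collection} {frag.author_no} fr. {frag.fragment} = {frag.source}"


def lookup(collection: str, author_no: int, fragment: int) -> Optional[Fragment]:
    """Return the concordance entry for a bare fragment reference, if known."""
    for frag in CONCORDANCE:
        if (frag.collection, frag.author_no, frag.fragment) == (
            collection, author_no, fragment
        ):
            return frag
    return None


if __name__ == "__main__":
    for frag in CONCORDANCE:
        print(cite(frag))
```

With such a table openly available, a reader who meets a bare reference such as "FGrH 566 fr. 105" could resolve it to Plutarch without consulting the costly collection itself.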
If in all scholarly publications authors took the trouble of listing all primary sources for all fragments, in the spirit of "When all sources are online," then students and faculty, regardless of the state of their library, would have a much better chance of pursuing research-based study of fragments in which they were interested. Why? Because it is much more likely that their libraries would own copies of the basic texts from which many of the fragments come than that they would own extremely costly modern collections of fragments such as FGrH, or, to give another example lest it appear that FGrH is an outlier in price, the collection of fragmentary ancient Greek comedies, Poetae Comici Graeci, which costs US$2699.20.
To conclude this section: a well-publicized effort to reach the time "When all sources are online" might, to be blunt, help shame scholars into doing their utmost to include primary source citation to the greatest degree possible in all their works, whether in the publication itself or in some complementary medium, as, for example, on a web site meant to accompany the main publication (in the spirit of the supplementary "special features" frequently added to films on DVDs these days). With the ideal of "When all sources are online" as an inspiration, we might then hope to avoid situations such as the one in which a recently published book in our field was slammed in a review for its almost total failure to cite primary sources, thereby preventing the reviewer (and future readers, if there are any) from investigating its evidentiary validity (BMCR 2007.09.52). "When all sources are online" can and should be a rallying cry of the kind that the founders of Perseus used twenty years ago when trying to garner external support for work that our discipline at that time condemned and ridiculed: "Democratize access to information!" That is the goal of always thinking, "When all the sources are online."

FROM EACH ACCORDING…
Between 1889 and 1907, the Homeric scholar and editor T. W. Allen published a series of articles, each of which amounted to a list of Homeric manuscripts that he had found and identified in the various libraries of Italy:
1. T. W. Allen, "Notes upon Greek Manuscripts in Italian Libraries," The Classical Review 3, no. 1/2 (February 1889): 12-22.
2. T. W. Allen, "Notes on Greek Mss. in Italian Libraries," The Classical Review 3, no. 6 (June 1889): 252-256.
3. T. W. Allen, "Notes on Greek MSS. in Italian Libraries (Continued)," The Classical Review 3, no. 8 (October 1889): 345-352.
4. T. W. Allen, "Notes on Greek MSS. in Italian Libraries (Continued)," The Classical Review 4, no. 3 (March 1890): 103-105.
5. T. W. Allen, "Manuscripts of the Iliad in Rome," The Classical Review 4, no. 7 (July 1890): 289-293.
6. T. W. Allen, "Recent Italian Catalogues of Greek MSS," The Classical Review 10, no. 5 (June 1896): 234-237.
7. T. W. Allen, "Aristarchus and the Modern Vulgate of Homer," The Classical Review 13, no. 9 (December 1899): 429-432.
8. T. W. Allen, "New Homeric Papyri," The Classical Review 14, no. 1 (February 1900): 14-18.
9. T. W. Allen, "Characteristics of the Homeric Vulgate," The Classical Review 16, no. 1 (February 1902): 1-3.
10. T. W. Allen, "New Homeric Papyri," The Classical Review 18, no. 3 (April 1904): 147-150.
11. T. W. Allen, "The Greek MSS. in the Ambrosian Library," The Classical Review 21, no. 3 (May 1907): 83-85.
This impressive list of publications from a most eminent scholar should alone be enough to justify the compiled index as a legitimate genre of publication, but there is no shortage of other examples, such as the invaluable (and ongoing) publication of: John F. Oates, Roger S. Bagnall, Sarah J. Clackson, Alexandra A. O'Brien, Joshua D. Sosin, Terry G. Wilfong, and Klaas A. Worp, Checklist of Greek, Latin, Demotic and Coptic Papyri, Ostraca and Tablets (Oates n.d.).
Lists and indices are valuable contributions to scholarship, more valuable perhaps than many tightly woven arguments on matters of interpretation. As more resources become more widely available through open-access publication, and as end-use software applications become increasingly able to draw their data from diverse and distributed sources, lists and indices will play an ever more central role in the universe of knowledge. This is a genre to which undergraduate researchers can easily contribute. All they need is guidance and access. Guidance should take the form of a professional scholar and teacher describing a need, since this is something an undergraduate will not be likely to identify alone.
Access can come in many forms, from the more romantic and exotic, such as Allen’s decade-long sojourn among Italian libraries at the end of the 19th century, to the more mundane, such as Oates, et al., combing journals and monograph series for newly published papyri and ostraka.6 In the summer of 2007, as the team from the Biblioteca Nazionale Marciana in Venice, the Center for Hellenic Studies in Washington, DC, and the British Library brought home new, high-resolution digital images of three Homeric manuscripts from the Library of St. Mark, a student in his third year of Greek at Furman University volunteered to produce indices based on this new access to these manuscripts. One of them, the Venetus A [Marcianus Graecus Z. 454 (= 822)], was already well documented; the other two, the Venetus B and the manuscript Allen designated U4, much less so.7 James Lanier produced six indices, two for each manuscript. For each, one index consisted of a series of rows, each with two cells. One cell contained a citation to the line of the Iliad, written as a CTS URN8; the other cell listed the manuscript’s id, the folio, and side:
urn:cts:greekLit:tlg0012.tlg001:1.17 msA-12r
urn:cts:greekLit:tlg0012.tlg001:1.18 msA-12r

The other index associated folio-sides of each manuscript with an image of that folio; so for Venetus A, folio 12 recto, there are four images, the full page in natural light and ultraviolet light, and two details:
VA012RN-0013 msA-12r
VA012RND-0892 msA-12r
VA012RUV-0893 msA-12r
VA012RUVD-0894 msA-12r

6 We are certainly not alone in advocating the search for fruitful areas of collaboration between students and faculty in undergraduate research in the humanities: (Macdorman 2004), (Norcia 2008), (Stephens 2005), (Thomas 2008).
7 Marcianus Graecus Z. 453 (= 821) and what T.W. Allen identified in his edition of the Iliad as U4 [Marcianus Graecus Z. 458 (= 841)].
8 CTS URN: A "Canonical Text Services Universal Resource Name, a concise method of identifying with precision a particular passage of a particular text"; see chs75.harvard.edu/diginc

These indices not only contribute to the online publication of these images (http://chs.harvard.edu/chs/manuscript_images), but serve a variety of other scholarly purposes as well. For instance, from them we can determine at which points the scribe skipped lines when moving from one folio to another [for example, on the Venetus B, Marcianus Graecus Z. 453 (= 821), folio 140 recto ends with Iliad 10.530 and 140 verso begins with 10.532]. From here, it is not difficult to imagine countless other rigorous and challenging tasks, requiring knowledge of Greek but within the capabilities of smart undergraduate students, that might promote and enhance scholarship on these images. One such task would be an index associating regions of each image with discrete passages of text: where, on a given image, can we find the notes to a particular line of the poem? On which folios, and where on the images thereof, are there diagrams? Projects like this can promote a healthy collaboration between students and their teachers, and can be rewarding and exciting to students in direct proportion to the extent to which the value of such work is celebrated. If all scholarship is expected to take the form of an argument, then an accomplishment like Lanier’s becomes menial (but terribly difficult and time-consuming) drudgery. Seen in the context of traditionally honored work such as Allen’s catalogues of Homeric manuscripts in Italian libraries, and acknowledged as a vital contribution to the future of the discipline, projects like this can be thrilling.
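The kind of question such an index can answer mechanically, namely where a scribe skipped lines at a folio boundary, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, the manuscript id `msB` is assumed by analogy with `msA`, and the sample rows are invented to reproduce the one boundary reported here (folio 140 recto ending at Iliad 10.530 and 140 verso beginning at 10.532).

```python
def parse_row(row):
    """Split an index row into (book, line, folio_side).

    A row is assumed to look like the rows shown earlier:
    '<CTS URN> <manuscript id>-<folio><side>'.
    """
    urn, folio_side = row.split()
    passage = urn.rsplit(":", 1)[1]            # e.g. "10.530"
    book, line = (int(part) for part in passage.split("."))
    return book, line, folio_side


def find_gaps(rows):
    """Report Iliad lines missing where the text moves to a new folio side."""
    parsed = [parse_row(row) for row in rows]
    gaps = []
    for (book1, line1, side1), (book2, line2, side2) in zip(parsed, parsed[1:]):
        # Only compare consecutive rows that cross a folio boundary
        # within the same book of the poem.
        if side1 != side2 and book1 == book2 and line2 - line1 > 1:
            missing = [f"{book1}.{n}" for n in range(line1 + 1, line2)]
            gaps.append((side1, side2, missing))
    return gaps


# Illustrative rows only; 'msB' is an assumed id for the Venetus B.
SAMPLE_ROWS = [
    "urn:cts:greekLit:tlg0012.tlg001:10.529 msB-140r",
    "urn:cts:greekLit:tlg0012.tlg001:10.530 msB-140r",
    "urn:cts:greekLit:tlg0012.tlg001:10.532 msB-140v",
    "urn:cts:greekLit:tlg0012.tlg001:10.533 msB-140v",
]

if __name__ == "__main__":
    for before, after, missing in find_gaps(SAMPLE_ROWS):
        print(f"{before} -> {after}: missing {', '.join(missing)}")
```

The sketch deliberately ignores boundaries between books of the poem and omissions in the middle of a page; a fuller check would need the line count of each book.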

SHAKING THE FOUNDATIONS
The introductory page of the Homer Multitext Library site (http://chs.harvard.edu) describes the project thus: "The Homer Multitext project, the first of its kind in Homeric studies, seeks to present the textual transmission of the Iliad and Odyssey in a historical framework. Such a framework is needed to account for the full reality of a complex medium of oral performance that underwent many changes over a long period of time. These changes, as reflected in the many texts of Homer, need to be understood in their many different historical contexts. The Homer Multitext provides ways to view these contexts both synchronically and diachronically." The Homer Multitext Library (hereafter, HMT), in its fundamental set of data, stands one hundred and fifty years of philological scholarship on its head: while enormous effort on the part of classical philologists has been spent comparing manuscript "variants" in an effort to describe an "original text," the HMT seeks out different texts of the Homeric poems and seeks to preserve, highlight, and understand their very points of difference. The scholarly background and philosophical foundation of the HMT are treated elsewhere.9 We are interested in discussing one portion of the work of building this library.
In 2006, Professor Casey Dué of the University of Houston, one of the editors of the HMT, secured a grant to pay undergraduate research assistants to push forward work on the project. Dué secured the collaboration of colleagues at the College of the Holy Cross and Furman University, and set two teams of undergraduates to work on the texts of the Homeric Iliad. These undergraduates, the HMT Fellows, were assigned the task of preparing transcripts of the specific texts of five Byzantine and medieval manuscripts of the Iliad. According to Allen’s sigla, these are:
1. A (= Venetus 454, 10th c.)
2. B (= Venetus 453, 11th c.)
3. T (= British Museum, Burney 86, AD 1059)
4. E3 (= Escorialensis 291, 11th c.)
5. E4 (= Escorialensis 509, 11th c.)
Since it was utterly impossible for the HMT Fellows to work from autopsy of the manuscripts themselves, they secured the so-called Editio Maior ("Greater Edition") of T.W. Allen’s Homeri Ilias (Allen 1931). This three-volume critical edition of the Iliad, in its critical apparatus, notes all significant manuscript variants, albeit in a highly compressed format. The HMT Fellows divided the books of the Iliad between the two teams, and set to work. At Furman University, the students photocopied Allen’s edition and put the photocopies into a three-ring binder; they then interleaved each page of Allen with a lined page for notes. They then began with a close reading of Allen’s apparatus, looking for sigla referring to variants in any of the five

9 See http://chs.harvard.edu, and the chapter in (Nagy 2004).

manuscripts with which they were concerned. These references, when they found them, they marked using colored highlighting pens: pink for B, yellow for T, green for E3, blue for E4, orange for A. Where they found references in the apparatus that noted places where the text of a certain manuscript differs from Allen’s edited text, they noted the difference on the page of notes. The matter was not always straightforward. Allen’s apparatus will sometimes report a variant reading as being "vulg.", for "vulgate", or as appearing in "codd.", for "the (Byzantine and medieval) codices." Variants may appear as corrected text on a manuscript, and be recorded by Allen as "B corr." The HMT Fellows had to master this cryptic discourse. The apparatus is compressed, and the compression is "lossy" at times. For example, for Book 1, line 93, Allen’s apparatus reads, in part, as follows:
93 ... οὔτὰρ A: οὔτ’ ἄρ’ (ἂρ) vulg.

In other words, Manuscript A has οὔτὰρ, while the "medieval vulgate" has either οὔτ’ ἄρ’ or οὔτ’ ἂρ. Allen has lumped together a whole category of manuscripts, without differentiating which have οὔτ’ ἄρ’ and which have οὔτ’ ἂρ, because he does not think the difference is significant enough to preserve, given the economic realities he faced. But anyone interested in a serious study of variant texts among Byzantine and medieval manuscripts might well be very interested in where we see an acute accent and an apostrophe, and where we see a grave accent and no apostrophe. That data is lost, as far as we readers are concerned, although T.W. Allen had that information at his disposal. So the HMT Fellows have been careful to characterize their work as producing facsimiles of A, B, T, E3, and E4 according to the apparatus of Allen’s editio maior. Their transcriptions will necessarily fall short of capturing perfectly the texts of the manuscripts, but should nevertheless serve well as the basis for initial comparisons, and as drafts for further revision, as future scholars gain access to the manuscripts or good images of them. Having marked the variants in the apparatus and written them down on the facing pages of notes, the HMT Fellows entered those changes into their working copies of the Iliad. They began with five identical electronic texts of the Iliad, the text of Allen’s edition,

taken from the Thesaurus Linguae Graecae and edited with a bare subset of the Text Encoding Initiative’s document type definition (Sperberg-McQueen 2004). The Fellows edited each of these according to the variants found in Allen’s apparatus for one manuscript. Where the text of a manuscript was simply different from Allen’s, they made the change in that manuscript’s XML file with no comment or further markup. Where Allen noted correcting hands or other editorial intervention in the original manuscript, the Fellows added the text and markup, following the EpiDoc guidelines wherever possible (Elliott). We will present one example of their work, which should serve both to illustrate the importance of this approach to the Homeric texts and to highlight the depth and rigor of the scholarly contribution that these undergraduate research fellows are making. The example begins with Allen’s apparatus at the entry for Iliad 1.97. At this point in the poem, the seer Calchas is explaining to the Greeks why Apollo has afflicted them with a plague. He says that Apollo is angry over how the Greek king Agamemnon treated one of Apollo’s priests, specifically, that Agamemnon would not return the priest’s captured daughter, even in exchange for a generous ransom. Line 97, in Allen’s edited text, reads:
οὐδ’ ὅ γε πρὶν Δαναοῖσιν ἀεικέα λοιγὸν ἀπώσει
[Apollo] will not drive off the loathsome pestilence from the Danaans until…

In the apparatus, however, we see this long, difficult note:
97 οὕτως Ἀρίσταρχος· καὶ ἡ Μασσαλιωτικὴ καὶ ἡ Ῥιανοῦ [καὶ σχεδὸν πᾶσαι add. Li T] τὸν αὐτὸν ἔχει τρόπον· ἔοικε οὖν ἡ ἑτέρα Ζηνοδότου εἶναι ἡ οὐδ’ ὅ γε πρὶν λοιμοῖο βαρείας χεῖρας ἀφέξει S A T : hanc codd. (λιμοῖο Ca2 O2 P3 Pal(2) R7 U6 Vi1: ἐφέξει C Vi6)

Freed from any constraints of space, we may translate and expand the text and note thus: "[Apollo] will not drive off the loathsome pestilence from the Danaans until…"

On this line, a marginal note, or scholion, on Manuscript "A" and another on Manuscript "T" both have this to say: The 2nd century BCE scholar Aristarchus has the line this way, as just quoted, and so do both the version of the Iliad from Massalia and the edition made by the 2nd century BCE scholar Rhianus (and almost all the others. This last phrase is added by a note that appears on the manuscript "Li" and is echoed on the manuscript "T"). And so it seems that the edition of Zenodotus is the different one, because it has this line: "[Apollo] will not lift his heavy hand of plague until…": and this last version is what all the medieval manuscripts have (with a certain number spelling the word for "plague" differently, and a couple having a slightly different version of the verb).
In yet other words, there are two utterly different versions of this line floating around the ancient and medieval world. Both are Greek; both are poetry; both make sense. Marginal notes on various medieval manuscripts are our evidence for these two lines – notes that preserve the contents of editions and commentaries on Homer that date back to the library at Alexandria. One version appeared, evidently, in the edition of Zenodotus, the earliest Alexandrian scholar of Homer, in the 3rd century BCE. This same line appears in every medieval bound manuscript (that is what the abbreviation "codd." means, "codices"). Another version appeared in the editions of Aristarchus of Samothrace, the 2nd century BCE librarian of Alexandria who was the greatest ancient scholar of Homer. This version also appeared in the city-edition of the Iliad from the city of Massalia, what is now Marseilles. And the edition of Rhianus (2nd century BCE) also contained this line. [Note: the actual scholion says, "Δαναοῖσιν ἀεικέα λοιγὸν ἀπώσει." οὕτως αἱ Ἀριστάρχου. The plural definite article αἱ suggests that the two "editions" (αἱ ἐκδόσεις) compiled by Aristarchus included this reading.]
So, the 3rd century BCE editor of Homer and all the medieval manuscript witnesses say X, while two 2nd century BCE editors and a "city edition" say Y. Allen picks Y, relegating X to the rhetorical "basement" of his apparatus. The Homer Multitext Fellows were faced with no such choice. They "restored" the proper text into its place on each of the manuscripts they are transcribing; the scholiastic texts, which preserve the other valuable reading, will be transcribed independently, with comparison of this "horizontal variant" made accessible by means of end-user applications that draw on all of this data.
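The contrast between a single edited text and a multitext can be illustrated with a small Python sketch. It is not the HMT's own software: the names `WITNESSES` and `group_by_reading` are hypothetical, and the table simply restates the distribution of readings at Iliad 1.97 described above, with every witness keeping its own reading instead of being reduced to an editor's choice.

```python
from collections import defaultdict

# The two versions of Iliad 1.97 discussed above.
READING_X = "οὐδ’ ὅ γε πρὶν λοιμοῖο βαρείας χεῖρας ἀφέξει"
READING_Y = "οὐδ’ ὅ γε πρὶν Δαναοῖσιν ἀεικέα λοιγὸν ἀπώσει"

# Witness labels are informal names used only for this illustration.
WITNESSES = {
    "Zenodotus": READING_X,
    "codd. (medieval manuscripts)": READING_X,
    "Aristarchus": READING_Y,
    "Massaliotic edition": READING_Y,
    "Rhianus": READING_Y,
}


def group_by_reading(witnesses):
    """Group witnesses by the reading they attest, keeping every reading."""
    groups = defaultdict(list)
    for witness, reading in witnesses.items():
        groups[reading].append(witness)
    return dict(groups)


if __name__ == "__main__":
    for reading, names in group_by_reading(WITNESSES).items():
        print(f"{reading}\n    attested by: {', '.join(names)}")
```

An end-user application of the kind envisioned could present both groups side by side, so that neither reading is consigned to the apparatus.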

The editors of the HMT are, of course, deeply interested in the precise contents of the ekdoseis, or editions, of Aristarchus, but they are specifically interested in how that scholar’s editions differ from the medieval vulgate, since such a "drift" of the language of the poem over a millennium supports the notion of an ongoing tradition of multiformity. T.W. Allen’s choice, a choice determined by the conventions of the traditional critical edition, obscures that difference. So, in their XML transcriptions of medieval manuscripts, the work of the HMT Fellows will highlight a problem in the history of the Homeric text, thus contributing a point of conversation and analysis to the ongoing study of the Iliad. The work of assembling transcriptions from Allen’s apparatus is a valuable start to reproducing specific manuscripts as XML files. The next step would be to remove the need to have Allen mediate between our research and its objects. After May of 2007, the HMT Fellows had access to preliminary versions of digital images of the A and B manuscripts, taken at the Biblioteca Nazionale Marciana. These they found helpful in deciphering some of Allen’s more cryptic notations regarding those manuscripts. And they found some places where Allen’s apparatus was not entirely precise. For example, at 11.525, Allen’s text reads:
Τρῶες ὀρίνονται ἐπιμὶξ ἵπποι τε καὶ αὐτοί.
The Trojans were driven in confusion, both their horses and themselves. (Comparetti 1901) (Allen 1931, 11)

Allen’s apparatus notes that Manuscript A has αὐτοί in ras., that is, written over an erasure; other texts, according to the apparatus, have "both the horses and men" (καὶ ἄνδρες), or "both the horses and others" (καὶ ἄλλοι). This notice "in ras." moved the HMT Fellows to look at the image of folio 147-verso of the Venetus A. Here they saw that the words τε καὶ αὐτοί, "and themselves," were indeed written over the erasure, but that the erasure was in fact almost three times as long as that phrase, far longer than necessary if the erased text were either of the alternative texts given in Allen’s apparatus. Allen, too, had noted only the last word, αὐτοί, as having been written over the erasure, which was clearly not the case. So whatever Manuscript A originally had, it was replaced by τε καὶ αὐτοί and contained many more letters than that phrase. Our undergraduate researchers noted that fact and recorded it in their XML transcription. This information, now discoverable and machine-readable, will be new information to anyone who has relied on the century-old Comparetti Facsimile of the manuscript, which, as T.W. Allen says, "imperfectly renders erasures and corrections" (Comparetti 1901), (Allen 1931, 11).

The task of reading and transcribing the texts of specific manuscripts is skilled work, but easily within the abilities of good, advanced students of Greek, once they have some familiarity with the language of Homeric poetry, and have access to some reference materials on Byzantine palaeography. That this work is valuable scholarship needs little argument, and certainly is not limited to advocates for any particular school of interpretation, or to devotees of technological innovation in the humanities.

T.W. Allen, in the closing paragraph of his editio maior of the Iliad, that most imposing monument of traditional scholarship, unintentionally presents an argument for a project precisely like that of the Homer Multitext Fellows. He has listed a number of categories of textual phenomena that he has intentionally ignored in compiling his apparatus criticus; these include things like mute iota, νυ ephelcysticon (that is, the letter "n" added to the end of a word for the sake of euphony), accented versus unaccented τε and ῥα, and questions of accentuation on words such as ἄρ, ἐμε, μιν, νυ, οἱ, τιν’, τις, μευ, σευ, θην, πως, που, πη, accentuation on Aeolisms καγ, καδ, κακ, and καμ, and accentuation on the various forms of the verbs εἰμί and φημί. (Allen 1931, 272, 226-247).
He explains these omissions thus: Various reasons made these omissions necessary: to lighten the apparatus, which would have swelled to almost unprintable proportions; the fact that the collations, though considerable, were not exhaustive and therefore did not admit of statistical conclusions; and that the phenomena belong to the history of medieval Greek accentuation and the usage of Byzantine scribes rather than to the Homeric texts. On these subjects further I have paid little attention to the evidence of quotations, whether in scholia, which being divorced from their context are peculiarly at the mercy of copyists, or of authors, especially in the older editions where the editors may be suspected of adding conventional prosody. (Allen 1931, 272)


Allen limited his apparatus due to the constraints of the printed text, lest it become "unprintable." His scholarship was further limited by the fact that he was its sole author; he could not do exhaustive collation (by himself), and had to rely on secondary scholarship (earlier editions) for manuscripts that he did not collate himself. Allen’s apparatus, therefore, represents a least common denominator, limited by the most careless of the earlier editors on whose work he relied. He admits that the omitted information would be the subject for statistical analysis, were it collected in a systematic way. But his most regrettable criterion for omission is that of "interest": he excluded material that he deemed of interest only for Byzantine palaeography and bibliography, and of interest only to scholars of medieval Greek accentuation. A humanist scholarship that is unwilling to divide itself along strict (but strictly arbitrary) lines, where the "Homeric text" of a manuscript is somehow divorced from questions of "the usage of Byzantine scribes" and "the history of medieval Greek accentuation," should take note of Allen’s list of obstacles and work to overcome them.

The answer, we think, is clear: a model of collaborative research, the products of which are electronic texts (not required to be "printable") in transcription (rather than collation), involving scholars who may be senior professors or juniors at a liberal arts college, working with high-quality images of the primary texts, the papyri, the Byzantine and medieval manuscripts. The rewards of such work are manifold. The results would be subject to statistical analysis, and any other kind of analysis that interested readers might envision, even if they are asking questions that have never occurred to the editors of these electronic texts.
The undergraduate members of such a team have the experience of engaging without mediation the very stuff of philology, the most ancient witnesses to a literary tradition; the task is within their abilities while being extremely challenging, and they know that they are doing real work of real value, not merely exercising an arbitrary set of skills before the judging eyes of a single teacher. The faculty of Furman University and the College of the Holy Cross who have worked with the Homer Multitext Fellows as they meticulously transcribe these texts and explore the problems that those texts reveal have witnessed a degree of dedication and excitement that turns the glowing rhetoric of undergraduate research from a marketing pitch to an honest appraisal.
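
To make concrete what it means for an observation such as the erasure at 11.525 to be "discoverable and machine-readable," here is a minimal sketch. It encodes the line in TEI-style XML and then queries it for erasures. The element names (`subst`, `del`, `add`, `gap`) follow general TEI conventions and are an assumption here; the markup actually used in the Homer Multitext transcriptions is not reproduced in this chapter. The function name `erasures` is likewise hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical TEI-style encoding of Iliad 11.525 as it stands in the Venetus A:
# the erased text is unknown, and the surviving words were written over it.
SAMPLE_LINE = (
    '<l n="11.525">Τρῶες ὀρίνονται ἐπιμὶξ ἵπποι '
    '<subst>'
    '<del rend="erasure"><gap reason="illegible" extent="unknown"/></del>'
    '<add place="overwritten">τε καὶ αὐτοί</add>'
    '</subst>'
    '</l>'
)


def erasures(xml_text):
    """Return (line number, overwriting text) for each erasure in the XML."""
    root = ET.fromstring(xml_text)
    found = []
    for line in root.iter("l"):
        for subst in line.iter("subst"):
            deleted = subst.find("del")
            added = subst.find("add")
            if deleted is not None and deleted.get("rend") == "erasure":
                overwritten = added.text if added is not None else ""
                found.append((line.get("n"), overwritten))
    return found


report = erasures(SAMPLE_LINE)
```

Because the erasure is recorded as structure rather than as a note in an apparatus, any later reader could run a query like this across every transcribed folio, including readers asking questions the transcribers never anticipated.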


CONCLUSION

Because technology has lowered the economic barriers to academic publishing (a reality that too few publishing Classicists have fully understood), it is easy to guide student-writers into becoming student-authors. We who teach Classics can add to our pedagogy the technological tools of the information economy, thus arming ourselves against charges of impracticality and at the same time possibly attracting students whose interests lie outside the Classics. And as digital libraries begin to inter-operate, they breathe new life into largely disregarded scholarly genres and invent entirely new ones: geographic information systems, computational linguistics, and so forth.

We have presented some very specific examples of the kind of work, and the kind of thinking, that we have found to be fruitful in encouraging undergraduates in their research. We believe in scholarship; we believe that scholarship should be rigorous; we believe that scholarship demands precision and dedication; but we also believe that scholarship can assume many useful forms, and we are convinced that scholarship, if done properly, should not seem like the kind of burdensome task the ancient Greeks called a ponos. In the community of professional scholars, each must find for her- or himself the motivations for doing the reading, thinking, and typing necessary to produce an article, a monograph, an edition. But our students have not yet made any such commitment, and many of them never will, preferring to find their lives outside of the academy. But even to those students, and perhaps especially to them, we have an obligation, to help them experience scholarship that is not a ponos.
It is our experience that the closer we can bring our students to the real sources of knowledge (the ancient texts, the archaeological remains, the papyri and parchment) and the real reward of scholarship (the joy of producing a piece of work that one knows will be discovered and read with interest and pleasure by people we may never meet), the closer we can bring students to the experience of being true scholars, working beside other scholars, the more enthusiasm we find. Rather than students writing five-paragraph essays under constant suspicion of plagiarism in order to win the dubious prize of a high grade among already inflated grades, we prefer to see a student reading a speech in Greek and summarizing it for a non-Greek-reading audience, a student compiling perhaps for the first time the primary source texts for a problem in ancient history, a student paging through an 11th century manuscript and noting the text that appears on each folio, a student correcting the text-criticism of one of the great classicists of the 20th century. Seeing these makes our teaching, like our scholarship, seem less a ponos and more a joy.

BIBLIOGRAPHY

Allen 1889 Allen, T. W. "Notes upon Greek Manuscripts in Italian Libraries," The Classical Review 3, no. 1/2 (February 1889): 12-22.
Allen 1889a Allen, T. W. "Notes on Greek Mss. in Italian Libraries," The Classical Review 3, no. 6 (June 1889): 252-256.
Allen 1889b Allen, T. W. "Notes on Greek MSS. in Italian Libraries (Continued)," The Classical Review 3, no. 8 (October 1889): 345-352.
Allen 1890 Allen, T. W. "Notes on Greek MSS. in Italian Libraries (Continued)," The Classical Review 4, no. 3 (March 1890): 103-105.
Allen 1890a Allen, T. W. "Manuscripts of the Iliad in Rome," The Classical Review 4, no. 7 (July 1890): 289-293.
Allen 1896 Allen, T. W. "Recent Italian Catalogues of Greek MSS," The Classical Review 10, no. 5 (June 1896): 234-237.
Allen 1899 Allen, T. W. "Aristarchus and the Modern Vulgate of Homer," The Classical Review 13, no. 9 (December 1899): 429-432.
Allen 1900 Allen, T. W. "New Homeric Papyri," The Classical Review 14, no. 1 (February 1900): 14-18.
Allen 1902 Allen, T. W. "Characteristics of the Homeric Vulgate," The Classical Review 16, no. 1 (February 1902): 1-3.
Allen 1904 Allen, T. W. "New Homeric Papyri," The Classical Review 18, no. 3 (April 1904): 147-150.
Allen 1907 Allen, T. W. "The Greek MSS. in the Ambrosian Library," The Classical Review 21, no. 3 (May 1907): 83-85.
Allen 1931 Allen, T. W. Homeri Ilias. 3 volumes. Oxford: Clarendon Press, 1931.
Blackwell 2004 Blackwell, C. "Opening a Door on Athenian Democracy," in The New England Classical Journal 31.1, February 2004. 23–32.
Cartledge 1985 Cartledge, P., "Greek Religious Festivals," in Greek Religion and Society, eds. P. E. Easterling and J. V. Muir (Cambridge: Cambridge University Press, 1985), 98–127.
Christesen 2007 Christesen, Paul. Olympic Victor Lists and Ancient Greek History (Cambridge University Press, 2007).
Comparetti 1901 Comparetti, Domenico, ed., Homeri Ilias cum Scholiis (A. W. Sijthoff: 1901).
Crane 1996 Crane, Gregory. The Blinded Eye: Thucydides and the New Written Word. Lanham, Maryland: Rowman and Littlefield, 1996.
Elliott Elliott, T., et al. EpiDoc: Epigraphic Documents in TEI XML. http://epidoc.sourceforge.net.
Finkel 2001 Finkel, Raphael, William Hutton, Patrick Rourke, Ross Scaife, Elizabeth Vandiver, edd. 2001. The Suda Online (A. Mahoney and R. Scaife, edd., The Stoa: a consortium for electronic publication in the humanities [www.stoa.org]).
Grobman 2007 Grobman, Laurie. 2007. "Affirming the Independent Researcher Model: Undergraduate Research in the Humanities." CUR Quarterly 28, no. 1. http://www.cur.org/Quarterly/sept07/Fall07Grobman.pdf.
Hanlon 2005 Hanlon, Christopher. 2005. "History on the Cheap: Using the Online Archive to Make Historicists out of Undergrads." Pedagogy 5, no. 1 (January 1): 97-101.
Hu 2007 Hu, Shouping, George Kuh, and Joy Gayles. 2007. "Engaging Undergraduate Students in Research Activities: Are Research Universities Doing a Better Job?" Innovative Higher Education 32, no. 3 (October 24): 167-177.
Ishiyama 2007 Ishiyama, J. 2002. "Does Early Participation in Undergraduate Research Benefit Social Science and Humanities Students?" College Student Journal 36, no. 3: 380-387.
Jacoby 2004 Jacoby, Felix. Die Fragmente der griechischen Historiker. Leiden and Boston: E.J. Brill, 2004 [CD-ROM].
Kinkead 2003 Kinkead, Joyce. 2003. "Learning Through Inquiry: An Overview of Undergraduate Research." New Directions for Teaching and Learning 2003, no. 93: 5-18.
Lopatto 2006 Lopatto, David. 2006. "Undergraduate research as a catalyst for liberal learning." Peer Review 22, no. 1: 22-26.
Macdorman 2004 Mcdorman, Todd. 2004. "Undergraduate Research in the Humanities: Three Collaborative Approaches." CUR Quarterly 25, no. 1: 39-42.
Mahaffy 1881 Mahaffy, J. P. "On the Authenticity of the Olympian Register," The Journal of Hellenic Studies 2 (1881): 164-178.
Mahoney 2003 Mahoney, A. and R. Scaife, edd., The Stoa: a consortium for electronic publication in the humanities [www.stoa.org], edition of January 31, 2003.
Malachowski 2003 Malachowski, Mitchell. 2003. "A Research-Across-the-Curriculum Movement." New Directions for Teaching and Learning 2003, no. 93: 55-68.
Merkel 2003 Merkel, Carolyn. 2003. "Undergraduate Research at the Research Universities." New Directions for Teaching and Learning 2003, no. 93: 39-54.
Nagy 2004 Nagy, G. "The Homeric Text and Problems of Multiformity," in Homer’s Text and Language (University of Illinois Press, 2004), 25-39.
Norcia 2008 Norcia, Megan. 2008. "Out of the Ivory Tower Endlessly Rocking: Collaborating across Disciplines and Professions to Promote Student Learning in the Digital Archive." Pedagogy 8, no. 1 (January 1): 91-114.
Oates n.d. Oates, John F., Roger S. Bagnall, Sarah J. Clackson, Alexandra A. O'Brien, Joshua D. Sosin, Terry G. Wilfong, and Klaas A. Worp, Checklist of Greek, Latin, Demotic and Coptic Papyri, Ostraca and Tablets, http://scriptorium.lib.duke.edu/papyrus/texts/clist.html; last updated 11 September 2008.
Oxford 1996 Oxford Classical Dictionary (Oxford: Oxford University Press, 1996).
Pleasant 2003 Pleasant, Hershal. "Demosthenes 50," in C.W. Blackwell, ed., Dēmos: Classical Athenian Democracy (A. Mahoney and R. Scaife, edd., The Stoa: a consortium for electronic publication in the humanities [www.stoa.org]) edition of January 31, 2003.
Robillard 2006 Robillard, AE. 2006. "Young Scholars Affecting Composition: A Challenge to Disciplinary Citation Practices." College English 68, no. 3: 18.
Roger 2003 Roger, Daniel. 2003. "Surviving the ‘Culture Shock’ of Undergraduate Research in the Humanities." CUR Quarterly 23, no. 3 (March): 132-135.
Sperberg-Mcqueen 2004 Sperberg-McQueen, C.M. and Lou Burnard, edd., TEI P4: Guidelines for Electronic Text Encoding and Interchange, XML-compatible edition (TEI Consortium, 2004): http://www.tei-c.org/release/doc/tei-p4doc/html/.
Stephens 2005 Stephens, Robert, and Josh Thumma. 2005. "Faculty-Undergraduate Collaboration in Digital History at a Public Research University." History Teacher 38, no. 4. http://www.historycooperative.org/journals/ht/38.4/stephens.html.
Thomas 2008 Thomas, Elizabeth, and Diane Gillespie. 2008. "Weaving Together Undergraduate Research, Mentoring of Junior Faculty, and Assessment: The Case of an Interdisciplinary Program." Innovative Higher Education 33, no. 1 (June 1): 29-38.
Versnel 1992 Versnel, H. S. "The Festival for Bona Dea and the Thesmophoria," Greece & Rome 39, no. 1 (April 1992): 31-55.
Willison 2007 Willison, John, and Kerry O’Regan. 2007. "Commonly known, commonly not known, totally unknown: a framework for students becoming researchers." Higher Education Research & Development 26, no. 4: 393-409.
Wilson 2003 Wilson, Reed. 2003. "Researching ‘Undergraduate Research’ in the Humanities." Modern Language Studies 33, no. 1/2: 74-79.

TACHYPAEDIA BYZANTINA: THE SUDA ON LINE AS COLLABORATIVE ENCYCLOPEDIA

ANNE MAHONEY
TUFTS UNIVERSITY

ABSTRACT

The Suda On Line (SOL) is a collaborative translation of a Byzantine Greek encyclopedia. It makes this difficult but useful text available to non-specialists and, with annotations and search facilities, makes the Suda easier to use than it is in print. As a collaboration, SOL demonstrates open peer review and the feasibility of a large, but closely focused, humanities project.

INTRODUCTION

The Suda On Line (SOL) is a translation of the Byzantine Greek Suda, written and edited by a large international group of scholars and students; its address is http://www.stoa.org/sol/ [1].

Figure 1. SOL index page: the gateway to the Suda On Line

The Suda is an encyclopedia of classical learning, written in the 10th century AD by a committee of scholars in Byzantium. It is a surprisingly useful source for classicists, but it is not well known to undergraduates or non-specialists because its style is crabbed and difficult. SOL is apparently the first translation of the Suda into English. I will argue that SOL is useful not only as a case study in scholarly collaboration, but as a tool for scholarly work both in classics and beyond.

SOL was one of the very first collaborative encyclopedias, pre-dating Wikipedia by several years.[2] Because the Suda itself is an unsystematic collection of lore, not all of it necessarily correct,[3] SOL provides commentary and references for each entry. It therefore serves as a full-scale classical encyclopedia, comparable to The New Pauly or the Oxford Classical Dictionary and roughly between those two works in size.

The name "Suda" means "bulwark" or "fortification"; that is, the Byzantine scholars wanted to produce a work that would stave off the destruction of classical knowledge. The work has sometimes been called the "Encyclopedia of Suidas," as if Suidas were a person, but this is now held to be incorrect. The Suda is organized as an encyclopedia, with entries in rough alphabetical order covering the important people, places, and texts of ancient Greece and the Bible. Its authors had access to some texts that are no longer extant, so there is material in the Suda that cannot be found anywhere else. They also had different editions of some of the texts we still read, so quotations in the Suda may reflect variants that are not preserved in our textual tradition; this makes the Suda important for establishing the correct text of some literary works, particularly Greek drama.[4]

Although (as noted above) SOL is apparently the first translation of the Suda into English, it is hardly the first translation out of Greek. That honor goes to Robert Grosseteste, who translated selected entries into Latin in the early thirteenth century. According to (Dionisotti 1990), Grosseteste's project was similar to SOL's: he wanted to make the Suda available and comprehensible to his contemporaries, so he annotated and glossed his translations.

[1] I was involved with SOL from the very beginning, and it was through this project that I first met Ross. I then served as programmer and co-editor for the Stoa for several years, a position in which I could see first-hand his energy, vision, and breadth of knowledge. With Diotima, SOL, and the Stoa, Ross did a great deal for classics; these projects are a successful combination of popularization, accessibility, and scholarship.

[2] Although Wikipedia is perhaps the best known Wiki-format collaborative encyclopedia, WikiWikiWeb came first, as early as 1995. See http://c2.com/cgi/wiki?WikiHistory for its history. Wikipedia itself was created in 2001, according to its own history, at http://en.wikipedia.org/wiki/Wikipedia:About. The name "wiki" is taken from a Hawaiian word meaning "quick" which I have Hellenized for the title of the present article. The reduplicated form "wikiwiki" is a frequentative or intensive.

[3] As L. D. Reynolds and Nigel Wilson put it, "despite a certain amount of dubious or erroneous material [the Suda] transmits much useful information" (Reynolds 1991, 66); they go on to suggest that "the intelligence of the authors cannot be rated very highly," which seems a bit harsh.

[4] For more on the history of the Suda and its importance to classical scholarship, see (Dickey 2007, 90-91).


Since the Renaissance there have been several more editions of the Greek text, along with Latin translations. Aemilius Portus produced the first complete Latin translation in about 1619. Ludolf Küster produced an edition in 1705, continued by Jonathan Toup. Thomas Gaisford's edition of 1834, also following Küster, was reissued along with a Latin translation by Gottfried Bernhardy in 1843. Immanuel Bekker's 1854 edition was the standard until the twentieth century.

The current standard edition of the Suda was edited by Ada Adler (1878-1946) and published in five volumes between 1928 and 1938. Her numeration of the entries has become the standard reference scheme for the text; one refers to the entry "Abraham," for example, as "alpha 69." Scholars continue to look for good ways to work with the Suda. For example, one project which took place just before SOL got started is a database at the Università Cattolica in Milan tabulating all the entries related to Greek and Roman history. Its results are described in (Zecchini 1999).

HISTORY OF THE PROJECT

The SOL project began on 14 January 1998, when Jeffrey Gibson asked on the Classics email list whether there was an English translation of the Suda.[5] In fact, the Suda had never been translated into English, and it was suggested (perhaps tongue-in-cheek at first) that this would be a natural project for a web-based or email-based collaboration. Very quickly people started discussing how such a collaboration might work. The title "Suda On Line" and the acronym SOL (Latin for "sun") were suggested by David Meadows.[6] By the 20th, William Hutton had posted a prototype. The next day, it was announced that a computer science graduate student at the University of Kentucky was interested in working on technical aspects of the project as a master's thesis. Translation and database design had both started by the end of the month, and the first version of what became the SOL system was announced on 12 April 1998.

SOL was implemented as a semi-structured text, in an XML-like markup, though without validation.[7] Wiki technology was not widely known at the time, though as noted above WikiWikiWeb was already on-line; moreover, since Unicode was not yet ubiquitous, display of Greek would have been a problem in a standard Wiki. As a result, the SOL group decided to implement its own system. The translation, commentary, and revision history are stored in a database and converted to HTML on the fly for display. Greek is stored in beta-code and can be displayed in Unicode, transliteration, or any of several popular font encodings, using code graciously supplied by the Perseus Project.[8] The underlying database system is QDDB, a non-relational database developed by Eric H. Herrin II and Raphael Finkel of the University of Kentucky.[9] Code is written in Perl. Virtually all of the programming for the project was done by graduate students Huar En Ng, Mukund Chandak, Shahid Saleem Mohammed, and Kamal Shah. It was agreed very quickly that involving students both as programmers and as translators was desirable, and many students in computer science and classics, both graduate students and undergrads, have made very significant contributions throughout the project.

At this writing (June 2007), SOL boasts 7 managing editors, 61 editors, and 95 translators, coming from a dozen countries. Three people (David Whitehead, Catherine Roth, and Jennifer Benedict) have translated over 4,000 entries each; Benedict translated most of hers while she was an undergraduate at the College of William and Mary. Managing editor William Hutton translated over 1000, while seven more people translated 200 or more. At the other end of the scale, some 40 people translated a single entry each. So far, over 21,000 entries have been translated, more than 2/3 of the total. Nearly all of the entries that have been translated have also gone through at least a first round of editorial vetting. Translation and editing are still going on.

[5] The Classics email list was at that time hosted at the University of Washington; it has since moved to the University of Kentucky. The official home page of the list is http://lsv.uky.edu/archives/classics-l.html.

[6] The first recorded use of the pun on "sudor" (Latin for "sweat") came from Ernie Moncada; members of the team can colloquially be called "sudatores," meaning both "Suda workers" and "those who sweat."

[7] (Finkel 2000) is an overview of the design and implementation of the project, written by the original managing editors.

[8] Beta-code is a plain ASCII encoding for polytonic Greek developed by the Thesaurus Linguae Graecae (TLG) long before Unicode was available, and widely used in classics. It is documented at http://www.tlg.uci.edu/BetaCode.html.

[9] QDDB is available from Herrin Software Development; documentation and downloads are at http://www.hsdi.com/qddb/. (Herrin 1996) is a technical description.
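
The idea of storing Greek in beta-code and converting it for display can be illustrated with a minimal sketch. This is not the Perseus code that SOL uses (which is written in Perl and is not reproduced here); it is a simplified Python illustration covering only the basic letters, the capital marker `*`, the common diacritics, and word-final sigma. The function name `beta_to_unicode` is hypothetical.

```python
import re
import unicodedata

# Basic beta-code letters mapped to lowercase Greek.
LETTERS = dict(zip("abgdezhqiklmncoprstufxyw", "αβγδεζηθικλμνξοπρστυφχψω"))

# Beta-code diacritics mapped to Unicode combining characters.
MARKS = {
    ")": "\u0313",   # smooth breathing
    "(": "\u0314",   # rough breathing
    "/": "\u0301",   # acute
    "\\": "\u0300",  # grave
    "=": "\u0342",   # circumflex
    "|": "\u0345",   # iota subscript
    "+": "\u0308",   # diaeresis
}


def beta_to_unicode(beta):
    """Convert a simplified subset of beta-code to precomposed Unicode Greek."""
    out = []
    pending = []      # diacritics written before a capital letter
    capital = False
    for ch in beta.lower():
        if ch == "*":
            capital = True
        elif ch in MARKS:
            # After "*", beta-code puts diacritics before the letter.
            (pending if capital else out).append(MARKS[ch])
        elif ch in LETTERS:
            letter = LETTERS[ch]
            out.append(letter.upper() if capital else letter)
            out.extend(pending)
            pending, capital = [], False
        else:
            out.append(ch)  # spaces, punctuation, digits pass through
    text = re.sub(r"σ\b", "ς", "".join(out))  # word-final sigma
    return unicodedata.normalize("NFC", text)


first_word_of_iliad = beta_to_unicode("mh=nin")
```

Keeping the stored text in plain ASCII and converting only at display time is what lets one database serve Unicode, transliteration, and legacy font encodings alike; each output format is simply another conversion function over the same stored string.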

TECHNICAL AND SOCIAL INTERFACES

Collaboration within SOL, as in Wikipedia and other similar projects, takes the form of serial editing. Entries are assigned to translators at their request. Some translators may ask for particular entries on a subject they are working on; others take blocks of unassigned entries in numerical order. Translators then produce English versions, assign key words from a controlled vocabulary, and add initial notes and bibliography. As soon as the translator finishes, the entry is published, clearly marked as a draft.

A subset of translators are designated editors and have the authority to change translations. Editors have scholarly qualifications in Ancient Greek; most are college or university faculty members. Their primary task is to augment the bibliography and commentary on the entries. They are also responsible for verifying that the translations are correct. The peculiar style of the Suda occasionally makes this a non-trivial problem. Its Byzantine authors were writing in a dialect somewhere between the classical Greek of the fifth and fourth centuries BC and the native language of the tenth century AD. They occasionally get snarled up in difficult grammar, and frequently use words that cannot be found in standard lexica of classical Greek. As a result, a translator may sail through half a dozen entries with no problem, then run into one that makes almost no sense at all.

Editors are assumed to be more expert in Greek than ordinary translators, and often have particular areas of specialization, like history, poetics, or theater. An editor who updates and enhances a translation may change its vetting status from "draft" to "low" (minimally edited) or "high" (well annotated and of high scholarly quality). At present, some 3/4 of translated entries have "low" vetting status.
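
The vetting workflow just described can be modelled in a few lines. What follows is only a sketch of the rules as stated in the text (three statuses, and only editors may change them); the class and function names are hypothetical, and SOL's own implementation is in Perl on top of QDDB.

```python
from dataclasses import dataclass
from enum import Enum


class Vetting(Enum):
    DRAFT = "draft"  # published as soon as the translator finishes
    LOW = "low"      # minimally edited
    HIGH = "high"    # well annotated and of high scholarly quality


@dataclass
class Entry:
    adler_number: str
    translator: str
    status: Vetting = Vetting.DRAFT


def set_status(entry, user, editors, new_status):
    """Change an entry's vetting status; only designated editors may do so."""
    if user not in editors:
        raise PermissionError(f"{user} is not an editor")
    entry.status = new_status
    return entry


# Example drawn from the history of Alpha 376 described in this chapter.
editors = {"William Hutton", "Ross Scaife", "David Whitehead"}
entry = Entry("alpha 376", "Anne Mahoney")
set_status(entry, "William Hutton", editors, Vetting.LOW)
```

Note that a new entry is visible from the start: "draft" is a label on a published entry, not a gate that holds it back.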


SOL's editorial mechanism, then, is a type of peer review process. The original translator is always credited for the entry, but the editors who have worked on it are also named. When an entry is displayed, its revision history is shown along with it. Previous states of the text are not displayed by the regular display routines, but are available to translators and editors working on the text. SOL's review process is open: the editors know who translated the entry and which other editors have worked on it, and the translator can see who the editors are. In fact, any reader of SOL can see the names of the editors and translators of any entry. This highly transparent process is different from the blind reviewing typical for classics journals: in that system, in general, the author of an article does not know the referees' names, nor do the referees know who the author is. The open review process has been part of SOL all along and no one has objected, or indeed even commented on it. By now the process seems natural, because it is widely used in Wikis and blogs, but as SOL was getting started an open, public peer review system was unusual. What we gain from it is the ability to recognize everyone's participation. Perhaps more important, SOL shows how scholarship progresses. A translation or commentary published in a book appears final and finished; readers are not given any clues about how it came into being. SOL's translations and commentaries show the process of successive refinements, demonstrating that first drafts are almost never perfect, and that even senior scholars' work can benefit from editorial attention. Whenever an editor updates an entry, the SOL system automatically notifies the original translator by email. Translators may then, if they like, return to the entry, inspect the editorial changes, and make further modifications. 
The project also maintains a mailing list for announcements and general discussion, though it has been rather quiet since the last major software changes. Editing and translating both take place in the same web system, very similar to a Wiki but less elaborate, and imposing somewhat more structure upon the translated entries.
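
The open review process combines two mechanisms: a revision history kept with each entry, and an automatic email to the original translator whenever an editor makes a change. The sketch below shows one way these could fit together. All names are hypothetical, the email address is a placeholder, and the mail-sending function is passed in as a parameter rather than tied to any real mail library.

```python
from dataclasses import dataclass, field


@dataclass
class Revision:
    editor: str
    when: str   # free-form date, e.g. "May 2000"
    note: str


@dataclass
class Entry:
    adler_number: str
    translator: str
    translator_email: str
    history: list = field(default_factory=list)


def record_edit(entry, editor, when, note, send_mail):
    """Append a revision to the public history and notify the translator."""
    entry.history.append(Revision(editor, when, note))
    message = f"{entry.adler_number} was updated by {editor} ({when}): {note}"
    send_mail(entry.translator_email, message)


def credits(entry):
    """Everyone who should be named when the entry is displayed."""
    names = [entry.translator]
    for revision in entry.history:
        if revision.editor not in names:
            names.append(revision.editor)
    return names


outbox = []  # stands in for a real mail system
entry = Entry("alpha 376", "Anne Mahoney", "translator@example.org")
record_edit(entry, "William Hutton", "May 2000", "vetted; status set to low",
            lambda to, msg: outbox.append((to, msg)))
```

Because the history names every contributor, the same record that drives the notifications also supplies the credits shown to readers.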


Figure 2. Editing screen: A translator or editor updating an entry sees the original Greek and the current version of the entry. Below these is a copy that can be changed.

Separate fields are provided for a translation of the headword, the translation of the text of the entry, notes, print bibliography, and web links. A menu facilitates construction of links to certain highly-used resources: Diotima, Perseus, and the Bryn Mawr Classical Review. Finally, keywords can be added from a pull-down list. Within the translation and notes fields, certain HTML-like tags are permitted, including tags for italics, for titles, and for Greek in beta-code. Other languages, such as Hebrew, are generally encoded in Unicode and not marked. References to other Suda entries by Adler number are automatically recognized and hyperlinked, just as cross-references are made within a Wiki.

Readers reach the text through a search mechanism. References to the Suda in books or scholarly articles will frequently be by Adler number or by headword, so either of those may be specified as a search term. It is also possible to search for words in the text of the translation or the notes, to search by keyword, or to search for a particular translator or editor; the latter facility was implemented so that editors or translators could conveniently make links to their own contributions from an on-line CV. Finally, the default for the search is a full-text search in the entire entry, regardless of its internal structure. Thus, a reader can look for the entry on the fifth-century BC playwright Sophocles with a headword search, can find all the entries where the Suda itself refers to Sophocles with a search in the translated text, or can find entries that refer to Sophocles either in the text or in the annotations. In fact, there are four entries with "Sophocles" in the headword: on the famous playwright (sigma 815), his grandson (sigma 816), a later descendant (sigma 817), and an epigram (sigma 820). "Sophocles" appears in the translations of 221 entries and in the notes to 609 (including almost all of those where the name appears in the translation). A separate search mechanism allows searching in the original Greek.[10]

Figure 3. Search for Sophocles: Results of a search for the name "Sophocles" in headwords.

[10] The Greek text of Adler's edition was provided to SOL by courtesy of the TLG.
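
The automatic recognition of Adler-number cross-references can be sketched with a regular expression. This is an illustration only, not SOL's actual Perl code. The list of letter names is an assumption, and the link target simply follows the pattern of the search URL quoted elsewhere in this chapter for Alpha 376. The function name `link_adler_refs` is hypothetical.

```python
import html
import re

# Assumed set of letter names used in Adler numbers such as "sigma 815".
LETTER_NAMES = [
    "alpha", "beta", "gamma", "delta", "epsilon", "zeta", "eta", "theta",
    "iota", "kappa", "lambda", "mu", "nu", "xi", "omicron", "pi", "rho",
    "sigma", "tau", "upsilon", "phi", "chi", "psi", "omega",
]

ADLER_REF = re.compile(
    r"\b(" + "|".join(LETTER_NAMES) + r")\s+(\d+)\b", re.IGNORECASE
)

# URL pattern modelled on the search link cited in this chapter.
BASE_URL = "http://www.stoa.org/solbin/search.pl"


def link_adler_refs(text):
    """Wrap each Adler-number reference in an HTML link to that entry."""
    def make_link(match):
        letter, number = match.group(1).lower(), match.group(2)
        url = f"{BASE_URL}?searchstr={letter},{number}&field=adlerhw_gr"
        return f'<a href="{html.escape(url)}">{match.group(0)}</a>'
    return ADLER_REF.sub(make_link, text)


linked = link_adler_refs("See sigma 815 and his grandson, sigma 816.")
```

Because Adler numbers have a fixed shape (a letter name followed by a number), translators and editors can write them as plain text and leave the linking to the display routine.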


As a case study, we may consider the history of Alpha 376, "Agroikos ex asteos," which means "rustic from town"; see http://www.stoa.org/solbin/search.pl?searchstr=alpha,376&field=adlerhw_gr. This entry is fairly typical of the short entries. It was first translated on 27 March 1999; translator Anne Mahoney was a graduate student at the time. Editorial work began in May 2000 when William Hutton first vetted the entry and gave it "low" status. Further work was done by Ross Scaife and David Whitehead, who changed the item's editorial status to "high" in February 2003. Whitehead returned to the item in 2005.

Figure 4. Display of Alpha 376: A typical shorter entry. The display shows the translation, the Greek text, and the editorial history.

The headword is a phrase from Aristophanes' Clouds, and the text of the entry is taken from the scholia to that play. The main character of the play, Strepsiades, is a country man, but he has married an ambitious woman and they are now living in the city, somewhat above their means. He therefore asks the philosopher

Socrates, who happens to be his neighbor, for lessons in clever argument, in hopes of talking his creditors out of calling in his debts. Naturally, it does not work out quite the way Strepsiades expected, and the play ends with an angry Strepsiades setting fire to Socrates' house.11 The original translation accurately represented the Greek, and noted that the headword appears in Clouds at line 47. The headword was first translated as "rustic man from town"; this became slightly more graceful as "a rustic from town." In the course of vetting, Whitehead clarified that the Suda took its text from a scholion to the play. The translator originally assigned the keyword "comedy" since the phrase comes from a comic play; the editors added key words "agriculture," "daily life," "economics," "ethics," "gender and sexuality," and "women," based on the content of the entry. At this point, with "high" editorial status, the entry is considered suitable for reference and citation, but it can still be modified again if an editor finds something further to say about it. Another example is Alpha 100, "Abydenon epiphorema," or "Abydene dessert." This is an obscure phrase, presented by the Suda as if it were a common proverbial saying. The Suda says that an "Abydene dessert" is something bad that happens as a result of someone showing up at the wrong time, and explains that in Abydos, people used to bring out their children after dinner, to the annoyance of their dinner guests. This all makes sense — the entry is coherent as it stands — but a reader might want to know where Abydos is and where this saying comes from. The SOL translation now provides this information: Abydos is in Asia on the shore of the Hellespont, and the saying ultimately comes from an ancient collection of proverbs, which moreover gives a second, completely different explanation for "Abydene dessert." As with other entries, this detailed background was added during the course of editing.

11 This is at least the ending of the version we have; the play was revised when its first performance was unsuccessful. See (Henderson 1998) for the text of the play, in Greek and English, and MacDowell for background (MacDowell 1995, 113-150).

The initial translation, in August 1998, had no annotations at all. The geographical note was added in January 2001, as were the references to sources for the proverb. The translation was improved as well. Anne Mahoney, Eric Nelson, and David Whitehead worked on this entry, which currently has "low" editorial status.

SOL AND OTHER PROJECTS While the Web facilitates distributing a task like the translation of the Suda, nineteenth-century scholars undertook similar tasks with index cards and slips of paper. The idea of pulling together a large, multi-national team for a large task is hardly original to SOL — or to the Internet age. For example, the Oxford English Dictionary has always used readers to track down the usage history of words. Originally, they mailed slips of paper to the dictionary's editors; for the on-going current revision, readers submit citations by email.12 Similarly, the decipherment of the cuneiform scripts used in much of the ancient Near East was performed by a distributed, loosely coordinated group of amateurs: "Much of the ongoing work in deciphering the cuneiform inscriptions was still being carried on by amateurs — army officers posted to Persia or Iraq who fell under the spell of the antiquities there, or rural parish priests with time on their hands." ((Damrosch 2006, 16), referring to a period around 1860) Translating the Suda, by comparison with such undertakings as these, is a relatively small, bounded task. It also requires particular expertise: knowledge of classical Greek. Any reader of English may submit citations to the OED, and anyone may edit entries in projects like Wikipedia. Prospective SOL translators, however, must request authorization and must ask to be assigned specific entries, though virtually everyone who registers is approved. Its model, therefore, is more like that of Citizendium, an offshoot of Wikipedia which adds "gentle editorial oversight" (as it explains on its own home page, http://citizendium.org).

12 Details of the OED Readers Program are at http://dictionary.oed.com/readers/research.html; for the history of the dictionary, see (Winchester 2003).

Wikipedia and Citizendium are new works, created largely from scratch. Although Wikipedia incorporates articles from earlier public-domain encyclopedias, in particular the eleventh edition of Encyclopaedia Britannica (1911) and the Catholic Encyclopedia (1913), and Citizendium incorporates articles from Wikipedia, in both cases these are a kind of bootstrap mechanism, to get a first version in place as a basis for further editing. Other on-line encyclopedias are digital versions of existing print books. These include commercial resources such as Oxford Reference on-line (http://www.oxfordreference.com) and public-domain resources such as the Catholic Encyclopedia (http://www.newadvent.org/), generally older books whose copyright has expired. Within classics, the Perseus Digital Library (http://www.perseus.tufts.edu) includes the Dictionary of Greek and Roman Antiquities, edited by William Smith and others in 1890; the same author's dictionaries of geography and of mythology; Harper's Dictionary of Classical Antiquities, an abridgment of the Smith works; A Topographical Dictionary of Ancient Rome, by Samuel Ball Platner and Thomas Ashby; and The Princeton Encyclopedia of Classical Sites (PECS) by Richard Stillwell, William L. MacDonald, and Marian Holland McAllister. Of these, all but the last are old enough to be out of copyright; the Princeton University Press gave permission to digitize PECS. Unlike these encyclopedias, SOL is neither a completely new work nor a mere reproduction of an older one. It includes the entire text of the tenth-century original, but the entries are enhanced and expanded with modern scholarship. Of course an annotated Suda does not require the Web, but it would be unwieldy in print. Adler's Greek edition is four fair-sized volumes. The translation alone would naturally be about the same size, and SOL's annotations are considerably larger. On line, however, readers can see entries one at a time, or in groups of search results.
Searching and indexing are also considerably easier: it would be difficult, for example, to look through Adler's print edition for the eight entries that use the ancient Greek word for "cat," or to dig out the entries that give etymologies. The indexes of a print edition facilitate the kinds of searching that the author or editor felt would be useful, and there is a necessary trade-off between providing indexes and the time and space required to produce and print them. Automatically generated indexes, on the other hand, can be replicated nearly

infinitely. SOL allows direct access to entries by Adler number, headword, original translator, and editor. It is also possible to search either in specific fields (translation, notes, bibliography) or without restriction, and there is also a search mechanism for the Greek text of all entries, translated or not. As a result, SOL not only provides more information than the plain text of the Suda (in whatever language), it also makes it easier to find and organize that information.

SOL's organization reflects that of the original text. Every entry in SOL comes from the Suda; since the original book had no entry for "Old Comedy," for example (the particular type of comedy written by Aristophanes and others in fifth-century Athens), there is no general article on this genre in SOL. As a result, absolute beginners in classical studies might find SOL a difficult place to get basic orientation. Once a reader knows some of the basic terms of the field, however, it is straightforward to find the Suda's quotations from comedy and biographies of the major playwrights. Within those entries, the annotations and references direct the reader to more current work. The audience for SOL, then, is not only the specialist classical scholars who have always used the Suda in the original Greek, but also scholars in other areas (religion, for example), students, and general readers.13

13 SOL also makes the Suda available to classicists who do not have access to Adler's edition. Large universities with doctoral programs in classics will certainly have these volumes in their libraries, but smaller colleges or high schools probably will not.

If SOL is not quite the same as other on-line encyclopedias, it is also not quite the same as other works of classical scholarship. The Suda itself, as an encyclopedia, is a type of commentary on classical texts. Commentaries as a genre are perhaps more familiar to classicists than to scholars in other literary fields, as Most points out (Most 1999). Writing a commentary is still a prestigious, if nowadays somewhat old-fashioned, project for a classical scholar, and classicists generally read, study, and teach texts with commentaries close at hand.14 SOL makes the commentary nature of the Suda more explicit by identifying and labelling quotations, providing specific references, and fleshing out the Suda's discussions. SOL is therefore almost the inverse of the "Do-It-Yourself Commentary" envisioned by Willard McCarty as one of the ways the Web empowers scholars (McCarty 2002, 376). Instead of sending readers out to make whatever links they want, SOL presents links created by the translators; these may be hyperlinks in the technical sense, or links of the kind long familiar to readers of classical commentaries: references to well-known texts by their standard reference schemes. On the other hand, every SOL translator may add links in the course of annotating an entry: not only the expert editors, but anyone who knows enough Greek to take part in the project. Thus the links within SOL are not restricted to the expected references, produced by classicists highly socialized in the discipline's thought patterns, but may include anything at all that a translator finds useful. Some two dozen entries link to Wikipedia, for example; others link to curious sites like the "Table of Nations" at http://www.mazzaroth.com/TableOfNations/TableOfNations2.htm, purporting to list the descendants of Noah (from Epsilon 38, "Hebrews"). Although the range of relevant references may be quite broad, nonetheless annotating a Suda entry for SOL is very much the same kind of work as writing commentary notes for any other classical text: one identifies quotations and allusions, glosses difficult grammar or obscure words, and perhaps mentions similar passages in other works. That is, SOL's editors extend the commentary already implicit within the Suda, giving SOL's readers help both with reading the Suda itself and with reading the texts the Suda discusses.
Because SOL is also a translation, it is, in Rydberg-Cox's taxonomy (Rydberg-Cox 2006, 22-24), a project based on "providing access to texts" and "helping readers understand scholarship." In translating the Suda, or commenting on translations, contributors

14 In addition to Most, see the articles in (Gibson 2002) on commentaries as a genre.

are in general not producing new knowledge either about the classical world that is the subject of the Suda or about the Byzantine world that produced it. Boyer has called this kind of work "scholarship of integration" (Boyer 1990, 18), and argues that academic work must go beyond the "scholarship of discovery" (Boyer 1990, 17), which is research in the traditional sense. The importance of SOL is not primarily in helping professional classicists make new discoveries, but in making this material more accessible and more comprehensible to a wider audience. SOL is among other things a form of outreach, a priority of the classics profession (see for example the APA's Outreach Division, http://www.apaclassics.org/outreach/outreach.html).

CONCLUSION

As we have already observed, SOL's translation of the Suda is roughly two-thirds done, and virtually all of the translated entries have been raised from "draft" to "low" status by an initial vetting. The project is quite usable already, though Dickey is correct to point out that readers must pay attention to the amount of editorial attention each entry has received (Dickey 2007, 91).

Based on usage statistics and external links, moreover, SOL is being read. According to the usage statistics page at the Stoa (SOL's home), SOL's search engine has received an average of 7589 hits in each of the last six months, with a range from 9591 in January 2007 to 6226 in April.15 Various classics portal pages provide links to SOL, notably the UK's Intute (in the Arts and Humanities section, http://www.intute.ac.uk/artsandhumanities/), and several blogs have mentioned it, for example http://curculio.org and http://gypsyscholarship.blogspot.com. Not surprisingly, the Wikipedia article on the Suda mentions SOL. Medieval and Byzantine studies sites also link to SOL, for example at Notre Dame, at the Australian Association for Byzantine Studies (http://home.vicnet.net.au/~byzaus/links.html), at the University of Amsterdam (http://www.uba.uva.nl/humanities/object.cfm/objectid=93038D9F-F02C-40FA-9959AE8B5DFACDB7), at the University of Chicago (http://www.lib.uchicago.edu/e/ets/efts/Medieval.html), and even in Google's directory for Medieval and Byzantine studies (http://directory.google.com/Top/Society/History/By_Time_Period/Middle_Ages/Byzantine_Empire/).

15 To be precise, the statistics page http://www.stoa.org/stats/ reports 6440 hits on SOL's search program in June 2007, 7128 in May, 6226 in April, 8550 in March, 7597 in February, and 9591 in January. It does not report hits for the editing and vetting routine or for SOL's home page, nor does it distinguish between hits from actual users and those from indexing crawlers.

As a rule, only classics specialists would use the Suda in Greek, because of its uneven quality and un-classical Greek style. The uneven quality is mitigated by editorial annotations, pointing out where the Suda has confused two similar names or the like. The problem of the gnarled Greek, of course, is solved by the translators. Finally, the problem of figuring out where to look in the Suda for potentially relevant information is solved, at least in part, by the search and index mechanisms. SOL therefore makes this resource available to people who could never have used it before.

In less than ten years, then, with minimal funding and largely volunteer labor (the student programmers were paid), this project has gone from a query on an email list to a fairly widely-known resource for the study of the classical world. The technical decisions made by the project were appropriate given available technology in 1998, when implementation began. At that time, most prospective translators did not have access to Unicode, so the project chose beta-code, the established standard encoding for classical Greek. Wikis existed but were not well known. If the project team had been aware of this technology, it could perhaps have adapted a Wiki to manage parallel Greek and English versions of articles, yet it was hardly an irrational decision to write a custom system.
Moreover, implementing and maintaining the system was a useful exercise for several students, exposing them to humanities computing and to classical scholarship. The underlying system is not intended to be general; it was written for the Suda and therefore embodies several assumptions about the project, for example that the base text is in Greek, that it is divided into entries identified by "Adler numbers," and that those entries are generally fairly short. It would be a challenge to retro-fit this code for another project, even one conceptually similar like, for example, translating the Rig Veda from

Sanskrit to English. While this means that SOL has not been technologically seminal, it did make the project simpler to implement. The organizational decisions require even less argument. SOL was from the beginning governed by a small team of managing editors, and in 10 years it has added only two people to that group. Because the project requires knowledge of ancient Greek, it was sensible to check prospective translators' qualifications before giving them access to the text. And because the project was intended to be scholarly, it was appropriate to divide contributors into translators and editors, requiring greater expertise from the latter group. What has made SOL successful, however, is its focus. Instead of setting out to produce a complete reference on all of classical antiquity, the project team chose to translate and annotate a single text — a large, complex one, to be sure, but nonetheless a single, bounded text. As a result, there were natural milestones: first hundred entries done, half the entries done, and so on. This helped keep the project moving, as there is always a goal in sight. Moreover, no one faced the problem of figuring out what to say. The entries already existed, so all the contributors had to do was translate them, then explain them. Translating an existing entry, even a long one like Homer (omicron 251) or Jesus (iota 229), seems much easier than writing an essay about the life and works of such a figure. Citizendium has created what might function as a similar focusing mechanism, dividing the topics it wishes to cover into a series of disciplinary groups and listing the articles to be written in each topic, though not every topic's editors have yet created a comprehensive list. (For classics, for example, the "priority articles" to be written are Herodotus, Latin language, Vergil, Ovid, and Cicero — barely a beginning.) 
Wikipedia, on the other hand, has no set topic list, but accepts entries on everything from classical authors to American professional soccer players. As a result, Wikipedia will never be "finished" in any meaningful sense. It might be possible to have a complete set of articles on classical antiquity within Wikipedia, but the encyclopedia as a whole is deliberately open-ended. The linguistics community has proposed a collaborative project to update the linguistics-related entries in Wikipedia, at least some of which are taken over from Encyclopedia Britannica and have not been updated at all (see http://www.linguistlist.org/issues/18/18-1831.html for the announcement). Classicists, or other disciplinary groups, could do the same. To begin with, such a project would be bounded in the same way as SOL, because scholars would be working on the entries that are already in Wikipedia. But this project could also have the opposite advantage: scholars would not be constrained to only the existing entries and their structure, but could add or reorganize as appropriate. SOL has shown us that this sort of collaboration can work; as Suda translation comes to a close, perhaps translators looking for another project could follow the linguists' lead.

As a tool for scholarly work, SOL makes the Suda accessible as it has not been before. SOL makes the Greek text available to those who can read it and supplies an English translation for those who cannot. Its searching and indexing make the Suda easier to use than it is in print form. The commentaries both within the Suda itself and added by the SOL translators bring the somewhat random 10th-century encyclopedia up to date. As a collaboration SOL demonstrates the feasibility of open peer review and the value of incremental progress.

For readers interested in getting the flavor of the Suda, the following entries are good starting points:

Iota 229, Jesus, translated by Catharine Roth (http://www.stoa.org/solbin/search.pl?searchstr=iota,229&field=adlerhw_gr)
Omicron 251, Homer, translated by Malcolm Heath (http://www.stoa.org/solbin/search.pl?searchstr=omicron,251&field=adlerhw_gr)
Alphaiota 230, ainos (fable), translated by Ross Scaife (http://www.stoa.org/solbin/search.pl?searchstr=Alphaiota,230&field=adlerhw_gr)
Alpha 3932, Aristophanes, translated by Jennifer Benedict (http://www.stoa.org/solbin/search.pl?searchstr=Alpha,3932&field=adlerhw_gr)
Kappa 2287, Constantinople, translated by David Whitehead (http://www.stoa.org/solbin/search.pl?searchstr=kappa,2287&field=adlerhw_gr).

BIBLIOGRAPHY

Adler 1928 Adler, Ada (ed). Suidae Lexicon. Leipzig: Teubner, 1928-1938.

Boyer 1990 Boyer, Ernest L. Scholarship Reconsidered: Priorities of the Professoriate. Carnegie Foundation for the Advancement of Teaching, 1990.

Cancick 2002 Cancick, Hubert and Helmut Schneider (eds). Brill's New Pauly: Encyclopedia of the Ancient World. Leiden and Boston: Brill, 2002.

Damrosch 2006 Damrosch, David. The Buried Book: The Loss and Rediscovery of the Great Epic of Gilgamesh. New York: Henry Holt and Company, 2006.

Dickey 2007 Dickey, Eleanor. Ancient Greek Scholarship: A Guide to Finding, Reading, and Understanding Scholia, Commentaries, Lexica, and Grammatical Treatises, from Their Beginnings to the Byzantine Period. Oxford: Oxford University Press, 2007.

Dionisotti 1990 Dionisotti, A. C. Robert Grosseteste and the Greek encyclopaedia. In Rencontres de Cultures dans la Philosophie Médiévale, ed. Jacqueline Hamesse and Marta Fattori (Louvain-la-Neuve: Publications de l'Institut d'études médiévales, 1990): 337-353.

Finkel 2000 Finkel, Raphael, William Hutton, Patrick Rourke, Ross Scaife, Elizabeth Vandiver. "The Suda On Line." Syllecta Classica 11 (2000): 178-190.

Gibson 2002 Gibson, Roy K. and Christina Shuttleworth Kraus, eds. The Classical Commentary: Histories, Practices, Theory. Leiden and Boston: Brill, 2002.

Henderson 1998 Henderson, Jeffrey. Aristophanes: Clouds, Wasps, Peace. Loeb Classical Library. Cambridge: Harvard University Press, 1998.

Herrin 1996 Herrin, Eric H., II, and Raphael Finkel. "Schema and Tuple Trees: An Intuitive Structure for Representing Relational Data." Computing Systems 9.2 (1996): 93-118.

Hornblower 1996 Hornblower, Simon, and Anthony Spawforth, eds. Oxford Classical Dictionary, 3rd edition. Oxford: Oxford University Press, 1996.

MacDowell 1995 MacDowell, Douglas M. Aristophanes and Athens: An Introduction to the Plays. Oxford: Oxford University Press, 1995.

McCarty 2002 McCarty, Willard. "A Network with a Thousand Entrances: Commentary in an Electronic Age?" In Gibson and Kraus (eds). The Classical Commentary: Histories, Practices, Theory (Leiden and Boston: Brill, 2002): 359-402.

Most 1999 Most, Glenn W. (ed). Commentaries = Kommentare. Göttingen: Vandenhoeck and Ruprecht, 1999.

Pfeiffer 1999 Pfeiffer, Rudolf. History of Classical Scholarship: 1300-1850. Oxford: Oxford University Press, 1976; reprinted Sandpiper Books, 1999.

Reynolds 1991 Reynolds, L. D. and N. G. Wilson. Scribes and Scholars: A Guide to the Transmission of Greek and Latin Literature, third edition. Oxford: Oxford University Press, 1991.

Rydberg-Cox 2006 Rydberg-Cox, Jeffrey A. Digital Libraries and the Challenges of Digital Humanities. Oxford: Chandos Publishing, 2006.

Winchester 2003 Winchester, Simon. The Meaning of Everything: The Story of the Oxford English Dictionary. Oxford: Oxford University Press, 2003.

Zecchini 1999 Zecchini, Giuseppe (ed). Il lessico Suda e la memoria del passato a Bisanzio: Atti della giornata di studio. Bari: Edipuglia, 1999.

EXPLORING HISTORICAL RDF WITH HEML

BRUCE ROBERTSON
MOUNT ALLISON UNIVERSITY

ABSTRACT

The Web, though full of historical information, lacks a means of organizing that information, searching on it or visualizing it. The Historical Event Markup and Linking Project (Heml) was begun six years ago to explore how disparate historical materials on the Internet can be navigated and visualized, and for the past four years has used an XML data format defined in W3C Schemas. This format aims for conforming data that can be quickly parsed but that provides a variety of facets on which to search for historical materials. While the project's graphical visualizations are in some respects successful, they have revealed some deficiencies in the underlying data format: it ought to provide for nested events, it ought to represent relations of causality between events and it ought to express the varieties of scholarly opinion about the attributes of events. By encoding the Heml data in the Resource Description Framework (RDF) it is possible to undertake these improvements. Moreover, an RDF-encoded Heml process makes it easier to bring CIDOC-CRM data into Heml events. Finally, a historical RDF language would simplify the discovery of references to historical events in digitized texts, thereby automating a growing network of historical information on the Web.

INTRODUCTION

Beginning his work which was to be the first called a "history", the Greek author Herodotus describes his task in simple terms: "to preserve the memory of the past by putting on record the astonishing achievements both of our own and of other peoples" (line 1.0). Within the first few chapters, though, it becomes clear that, to Herodotus, this entails something more multivalent than mere storytelling. His very next sentence buttresses the narrative with references to its various, even conflicting, sources (1.1); not much later, he corroborates his stories by referring to physical artifacts and providing careful genealogies (1.7, 1.14); and throughout he attempts to place the notable events of the past within the available chronological outlines (e.g., 16.1). From its origins, then, history has been conducted across a framework of argumentation, evidence, chronology and prosopography. This intellectual approach, begun by Herodotus and his rough contemporaries around the world, has proved a durable and useful way of thinking. Moreover, as the global Internet expands, and the common software technologies of the World Wide Web provide us with unprecedented means of communication, humanity has greater access to more historical statements and more of the raw materials of history than ever before. Indeed, it might be said that today's Web has the makings of an historian's fantasy: it provides worthy encyclopedia entries on a vast array of topics; its textual editions are rapidly improving in quality and amassing in quantity; and it offers historical source material ranging from argumentation in on-line editions of the best journals to the first-hand accounts appearing in blog entries.
Scholarly historical projects published uniquely on the Web are common, supported by academic centres whose purpose is to communicate history effectively in new media, and who are also addressing the general problems of adapting the best standards of scholarship to the Web in projects such as Zotero (http://www.zotero.org/). Behind the scenes, data formats such as the CIDOC-CRM aim to make it possible to interchange and reconcile vast cultural archives. However, what today's Web offers in historical content, it lacks in organization. The multivalent nature of historical thought

that we noted in one of its earliest practitioners eludes the keyword-indexed approach to the Web today on offer through Google and other search engines. Though we can summon up an exhaustive list of Web resources that contain the words "Gallipoli" and "sources", today's Web cannot effectively respond to a basic historical question such as, "which sources attest the Gallipoli Campaign of World War I?" much less a more advanced one such as, "what evidence is there for major architectural projects undertaken in the U.K. during the period of the Boer War, and does anyone think that these projects are influenced by the conflict?" If it were possible to conduct such a query, of course the result would not itself qualify as "history" any more than the results of a Google search constitute knowledge. However, the results of such a hypothetical search would offer a considerable aid to historical research and thinking, just as the indices, maps and, occasionally, tables that appear at the end of most monographs allow the reader rapidly to access matters of interest. Applying this analogy, what is needed in the Web is "universal back-matter", a Web-wide equivalent to the traditional print monograph's helpful conventions.1 Herodotus and his successors have given us a template for such a project. Minimally, it will need to express associations between words describing the event through chronology, geography and prosopography. It will need to provide references to diverse sources and evidence, and it will need to traverse, as much as possible, the boundaries of local language and even local time-keeping

1 It might be observed that since Wikipedia has made great strides in organizing historical information on the Web, the (seemingly inevitable) continued growth of this resource will fulfil this goal. However, as an encyclopedia, Wikipedia does not aim to be exhaustive. Its notability guidelines rightly forbid a great deal of material to which historians might wish to have ready access. Moreover, even those Wikipedia articles which do explain historical events cannot comprise exhaustive lists of references, secondary sources and other evidence. As described later in this paper, the Heml Project has had good experiences using the Semantic MediaWiki extension to its MediaWiki software as a means of RDF data entry.

techniques.2 Indeed, over fifteen years ago, the prescient Historian's Workstation Project employed the same basic types of data (Thaller 1991); however, the Internet has added the challenge of working with a more heterogeneous set of data sources and today's possibilities for the visualization of historical data go far beyond the Workstation's.3 The Historical Event Markup and Linking Project (Heml) has pursued the vision of a system such as this since 2001, when it began to explore the markup and transformation of historical materials on the Internet using XML tools. All schemas and code developed by this project are available at its website, http://www.heml.org/ and complete revision history is provided in an SVN repository. This paper explains Heml's next step toward the goal of an historical Web. It outlines the data format used by the project, shows the shortcomings of our previous XML-based approach to historical markup and describes the potential of an historical markup scheme based around the W3C's Resource Description Framework, or RDF (http://www.w3.org/TR/REC-rdf-syntax/), to interchange historical concepts widely and to express them in a more nuanced manner. It is a pleasure to be able to offer this paper in the collection of works celebrating the achievements of Ross Scaife. Like so many efforts in computing and the Classics, this one has been nurtured by Ross' kind support and encouragement. Furthermore, I have always hoped that Heml would attain the level of openness and helpfulness that are the hallmarks of Ross' work and of his character.

2 (Carrera 2002) defines the set of "cultural semantics" as those concepts which require the localization of language and terminology. (Drucker and Nowviskie 2004) make it clear that the problem of timekeeping and calendars, personal or collective, should be added to the list of "cultural semantics".

3 The Electronic Cultural Atlas Initiative has developed a coordinated network of databases, whose limited access helps to ensure the authority and quality of the aggregated data. A Windows application is available for entering conforming data, and the excellent TimeMap software produces rich historical GIS visualizations in the client's browser.

ROBERTSON


THE HEML DATA MODEL

The goal of a searchable network of historical information can only be reached through a clear separation between the model of the data being searched, the parameters of the search and the resulting visualization of the data. Since the project intends that these visualizations be generated dynamically in response to users' queries, the data model must be specific enough that the visualization can be generated quickly; likewise, since users' queries could return no events, one event, or thousands of events, and the range of values represented in these encoded events could be quite wide, the visualization processes must be written so as to generate useful views without regard for the number of events visualized or how close or far they are in time and space. For this reason, the Heml project has attempted to steer a middle path between pragmatism and markup idealism, especially regarding chronology. Thus, its data schema, on one hand, is not a complete abstraction of any way in which a person might think of the past, since this would make the task of visualization too daunting. In particular, its schema requires machine-readable data for all defined geographical and chronological data. (As shown below, this does not preclude the schema representing uncertainty.) On the other hand, the schema is not merely an API for the various visualizations offered by Heml; it is abstract enough that many other visualizations and uses could be discovered.
At its inception, the Heml server software transformed XML documents one-by-one (Robertson 2002); in versions 0.4 to 0.7.2 the project adopted a distributed XML model, backed up by the eXist (http://www.exist-db.org/) XML database (Robertson 2004), an approach that worked well in a collaboration with Tom Costa's project, Geography of Slavery in Virginia (http://www2.vcdh.virginia.edu/gos/), where a Heml Web application backed by eXist transforms events pertaining to slaves into hundreds of dynamically generated maps and timelines. In either case, the project's data model has remained the same for the past four years, and the following describes the model labelled 2003-09-17. The Heml data format contains a collection of modelled events, each tagged with heml:Event. At their simplest, events bind an event label with a machine-readable span of time and a reference to evidence, which could be as simple as a single URN.


CHANGING THE CENTER OF GRAVITY

Optionally, a heml:Event may also comprise any number of keywords, and a single location, which in turn is defined as a labelled pair of latitude and longitude coordinates. heml:Person elements may be added to events in order to represent those people who were in some way involved in the event. If the nature of this participation is known, the heml:Person element may be bound to a heml:Role element within the context of the event. Persons, roles, locations, and keywords are assigned mandatory URIs so that they may be referred to in multiple events. Finally, one or more heml:Evidence elements must be attributed to each event, and within these there is a means by which different editions and linguistic representations of the same text may be grouped together for the researcher's benefit.
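The constraints just described can be summarised in a small sketch. This is not the project's schema or code: the class names, field names and the example.org identifiers are assumptions made for illustration, and the time span is reduced to a pair of ISO 8601 strings.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Location:
    uri: str            # mandatory URI, so many events can share one place
    label: str
    latitude: float
    longitude: float


@dataclass(frozen=True)
class Person:
    uri: str
    name: str


@dataclass(frozen=True)
class Role:
    uri: str
    label: str


@dataclass(frozen=True)
class Participation:
    person: Person
    role: Optional[Role] = None   # bound only when the nature of the participation is known


@dataclass
class Event:
    label: str
    span: Tuple[str, str]                  # machine-readable start and end (hypothetical encoding)
    evidence: List[str]                    # one or more references, e.g. URNs
    location: Optional[Location] = None    # at most a single location
    participants: List[Participation] = field(default_factory=list)
    keywords: List[str] = field(default_factory=list)   # keyword URIs

    def __post_init__(self) -> None:
        # The text requires at least one piece of evidence per event.
        if not self.evidence:
            raise ValueError("an event needs at least one piece of evidence")


# Hypothetical usage with invented identifiers.
witness = Person("http://example.org/person/1", "Example Person")
event = Event(
    label="Example event",
    span=("2005-03-21", "2005-03-21"),
    evidence=["urn:example:source-1"],
    participants=[Participation(witness)],
)
```

Making the optional parts default to empty values mirrors the text's distinction between what an event must carry (label, time, evidence) and what it may carry.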

CHRONOLOGY

The most complex part of Heml's data model is the heml:Chronology element. Its model is intended to express uncertainty and ranges of time without requiring the visualization engine to have access to all data.4 In Heml markup, chronological concepts are always built using one of the following four elements: heml:DateTime, heml:Date, heml:Year or IntCalDate. The first three of these encapsulate data encoded in the corresponding XML Schema format; the last permits the user to encode with a non-Gregorian calendar. When used alone, these elements are meant to indicate a corresponding span of time: for instance, a heml:Year with the value -31 indicates an event that began on the first second of the first day of 31 BC and ended on the last second of its last day. To express a more extended range of time, the heml:DateRange element is used, and within it mandatory heml:StartingDate and heml:EndingDate elements. This construct is parsed as indicating a span of time beginning at the beginning of the first element and ending at the ending of the last one. To express uncertainty, the heml:BoundedDate element is used. It comprises a heml:TerminusPostQuem and a heml:TerminusAnteQuem element; these express the earliest possible and latest possible time respectively of a span of uncertain time. Used alone, heml:BoundedDate makes no claim about the duration of the encoded event. However, it is permitted to use a heml:BoundedDate within the heml:StartingDate or heml:EndingDate of a heml:DateRange.

4 This precludes the use of the exhaustive representation of temporal expressions in natural language that the TimeML markup language provides.

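<heml:DateRange>
  <heml:StartingDate>
    <heml:DateTime>1995-05-21T21:03Z</heml:DateTime>
  </heml:StartingDate>
  <heml:EndingDate>
    <heml:BoundedDate>
      <heml:TerminusPostQuem>
        <heml:Date>2005-03-21</heml:Date>
      </heml:TerminusPostQuem>
      <heml:TerminusAnteQuem>
        <heml:Date>2005-03-21</heml:Date>
      </heml:TerminusAnteQuem>
    </heml:BoundedDate>
  </heml:EndingDate>
</heml:DateRange>

Example 1. Example of Chronological Information Expressed in Heml's 2003-09-17 Schema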

Thus the XML in Example 1 encodes an event beginning at exactly 21:03 UTC on May the twenty-first 1995 and ending at some time on March the twenty-first 2005.
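The way these chronological elements combine can be sketched in a few lines. The function names below are invented stand-ins for the heml: elements, not the project's code, and the sketch assumes that Example 1 places a bounded date inside the ending date. Python's datetime type cannot represent years before 1 AD, so a value such as -31 is outside what this sketch can handle.

```python
from dataclasses import dataclass
from datetime import datetime, timezone

UTC = timezone.utc


@dataclass(frozen=True)
class Span:
    begin: datetime           # earliest moment covered
    end: datetime             # latest moment covered
    uncertain: bool = False   # True once a bounded date has contributed


def year(y: int) -> Span:
    """A year used alone: first second of its first day to last second of its last day."""
    return Span(datetime(y, 1, 1, tzinfo=UTC), datetime(y, 12, 31, 23, 59, 59, tzinfo=UTC))


def date(y: int, m: int, d: int) -> Span:
    """A date used alone covers the whole day."""
    return Span(datetime(y, m, d, tzinfo=UTC), datetime(y, m, d, 23, 59, 59, tzinfo=UTC))


def date_time(moment: datetime) -> Span:
    """A date-time is treated here as a single instant (an assumption)."""
    return Span(moment, moment)


def bounded_date(terminus_post_quem: Span, terminus_ante_quem: Span) -> Span:
    """Earliest and latest possible time of an uncertain moment."""
    return Span(terminus_post_quem.begin, terminus_ante_quem.end, uncertain=True)


def date_range(starting: Span, ending: Span) -> Span:
    """Begins at the beginning of the first element, ends at the ending of the last."""
    return Span(starting.begin, ending.end, starting.uncertain or ending.uncertain)


EXAMPLE_1 = date_range(
    date_time(datetime(1995, 5, 21, 21, 3, tzinfo=UTC)),
    bounded_date(date(2005, 3, 21), date(2005, 3, 21)),
)
```

The resulting span is the outer extent a visualization would need in order to choose its bounds; the uncertain flag records that the end point is only known to the day.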

SIMILAR SCHEMAS

There are other schemas from other fields that encode representations of labelled spans of time and place. The Dublin Core Metadata standard (http://dublincore.org/) includes date and location tags that, in certain uses, encode certain kinds of historical events, but these naturally revolve around the documents themselves, such as


publication information; they cannot, for instance, encode the battle described therein. Similarly, the P5 edition of the TEI includes a welcome set of biographical and prosopographical tags (http://www.tei-c.org/Activities/Workgroups/PERS/), including one for events. However, at present, all such P5 event elements must appear within a person or place element, expressing that the event information serves to further our information about a given person or place: events that pertain to neither cannot at present be encoded. Finally, the CIDOC-CRM (http://cidoc.ics.forth.gr/) encodes descriptions of cultural artifacts in terms of the events they undergo -- their creation, change of custody, and so forth. Though not the primary focus of this schema, historical events are quite broadly modelled in it. Indeed, as is shown below, the unmodified CIDOC-CRM model is too liberal for the Heml Project's purposes of generating historical visualizations, especially with reference to what entities and datatypes it accepts as chronological predicates. Nevertheless, it is likely that most of the goals of the Heml project can be fulfilled through RDFS subclassing of the CIDOC-CRM model, ensuring more carefully data-typed chronological classes. In any case, the discussion and examples below would apply as well to such a constrained CIDOC-CRM model or to any future schema with similar properties. Indeed, the purpose of the Heml project is decidedly not to promote a particular schema; instead, its schema is intended as a tool which enables us to explore the process of combining all material encoded in all these well-entrenched and widely-adopted schemas so as to increase the pool of inter-related historical data on the Internet. Technologically, there is little challenge in transferring data from one of these representations to Heml. XML representations can be exchanged through XSLT.
In RDF, the SPARQL CONSTRUCT (http://www.w3.org/TR/rdf-sparql-query/#construct) query form produces transformations from one graph to another. For example, (Robertson 2006) shows how useful a simple Heml and Dublin Core mash-up conducted in XSLT could be.

VISUALIZATIONS

Heml data is encoded in order that resources pertaining to a certain historical event may be found in response to a user's query. The query might be expressed with regard to any one of the facets of


historical expression encoded by the schema above, or on some combination of these. The software discovers the corresponding materials, then generates a visualization of these data according to the user's wishes. Visualizations range from simple chronological text lists in HTML to complex animated maps. Of course, since the data has been prepared without reference to any given visualization, not all visualizations are informative representations of all query results. For instance, an animated map would not be helpful if the set of events in the query response all had the same location or did not have locations encoded at all. In the most recently-published version of Heml's server software (v. 0.7.2), visualizations are generated through a Java Web application based on the Cocoon (http://cocoon.apache.org/) Web service engine. Instructions for building and running this software locally are available at http://www.heml.org. Using this approach on locally-stored data takes full advantage of Cocoon's ability to cache server-side processes. However, it is also possible to use the server at http://www.heml.org to transform any conforming document published on the Web without building or installing software. Instructions are provided online. As described above, all the visualization tools developed by the project operate fully automatically: they provide sensible chronological and geographical boundaries based only on the conforming data on their input and lay out the appropriate text without clipping or abbreviation, regardless of the number of events being presented. The graphical timeline view, of which Figure 2 and Figure 3 are examples, lays out event labels at a vertical distance from each other corresponding to their separation in time. It indicates spans of time through vertical coloured lines. This view was designed to be legible and to require as little user interaction as possible, emulating a graphical timeline in print.
In fact, the rendering software iterates through layout scenarios in order to find one that is appropriately compact but whose text is still possible to read. It ensures


that the entire text labelling an event is always visible to the user, and scrolling is not necessary to read the event label. The software also selects an appropriate range of dates or times for the events; these can range from seconds to millennia.5 Rendered in SVG, this view's event labels are linked to popups that lead the reader to the resources pertaining to the event. This approach to timeline-generation differs from other online graphical timeline-drawing, perhaps best represented by the Simile timeline widget (http://www.similewidgets.org/timeline/). At present, the Simile timeline is not autoranging, and does not attempt to optimize layout. Rather, its documentation encourages users to identify so-called "hot zones", temporal ranges comprising unusually large numbers of events, so that the drawing routine can alter the scale in these ranges, thereby producing a more legible layout. The goals of the Heml project, requiring a fully automated rendering, do not make such hand-tweaking possible. However, Simile timeline's excellent use of client-side rendering and DHTML is certainly preferable to the Heml timeline approach. We hope that by modifying the open-source code for Simile timelines we can adapt that project more closely to our needs, perhaps by identifying "hot zones" computationally.
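One plausible reading of autoranging is sketched below. It is not the Heml renderer's algorithm: the unit table, the tick limit of 20 and the representation of events as (start, end) pairs in seconds are all assumptions chosen for illustration.

```python
from typing import Iterable, List, Tuple

YEAR = 31_557_600  # seconds in a year of 365.25 days (an approximation)

# Candidate scale units, from finest to coarsest.
UNITS: List[Tuple[str, int]] = [
    ("second", 1),
    ("minute", 60),
    ("hour", 3_600),
    ("day", 86_400),
    ("year", YEAR),
    ("decade", 10 * YEAR),
    ("century", 100 * YEAR),
    ("millennium", 1_000 * YEAR),
]


def auto_range(events: Iterable[Tuple[float, float]]) -> Tuple[float, float]:
    """Return the earliest start and latest end among (start, end) pairs."""
    events = list(events)
    return min(start for start, _ in events), max(end for _, end in events)


def choose_unit(extent_seconds: float, max_ticks: int = 20) -> str:
    """Pick the finest unit that needs no more than max_ticks divisions."""
    for name, length in UNITS:
        if extent_seconds / length <= max_ticks:
            return name
    return UNITS[-1][0]
```

Under these assumptions an extent of ninety seconds is drawn in minutes and one of three thousand years in millennia, which matches the range of scales the text describes.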

5 See the timeline views of the columbia_accident and greek_prehistory at http://www.heml.org for examples of very brief and very long events rendered in this view.


Figure 1. Image of an SVG Animation Generated by the v0.7.2 Heml Web application

Another event-visualization system pioneered by the Heml Project is the animated map illustrated in Figure 1. Item (1) is a set of user-operated controls familiar from audio-visual apparatus. With these, the user can stop, play, pause, rewind and speed up the animation. Item (2) is a slider control that sets the length of the animation in seconds. Item (3) is a moving marker that runs along the line above the animation from left to right, thereby indicating the moment in time currently represented on the map. Above the small triangle a constantly updating text appears recording the current animation time. The event labels at (4) appear in red text as long as the animation is representing a date during which the labelled event is encoded as having taken place. In order that the user be able to read them, event labels do not necessarily disappear as soon as the corresponding event ends in the animated time; rather they remain on the screen in black text until the text has appeared for at least two seconds. The system designates an appropriate chronological scale to the process, as well as the geographical range of the map.
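The label behaviour described here reduces to a small decision rule. The sketch below is an illustration only: the function names, the linear mapping from historical time to animation time and the state names are assumptions, not the SVG animation's actual code.

```python
def to_animation_time(moment: float, first: float, last: float, duration: float) -> float:
    """Map a historical moment linearly onto an animation of `duration` seconds."""
    return (moment - first) / (last - first) * duration


def label_state(clock: float, starts: float, ends: float, min_visible: float = 2.0) -> str:
    """State of an event label at animation time `clock` (all values in seconds).

    The label is red while the event is under way; once the event has ended it
    stays on screen in black until it has been visible for `min_visible` seconds.
    """
    if clock < starts:
        return "hidden"
    if clock <= ends:
        return "red"
    if clock < starts + min_visible:
        return "black"
    return "hidden"
```

An event that occupies only half a second of animation time is therefore red for that half second and black for a further second and a half, while a long event disappears as soon as it ends.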

AREAS FOR IMPROVEMENT

Although these visualizations have fulfilled some of our goals, they have also pointed out some deficiencies in the underlying data model. Figure 1, for example, presents an apparent mishmash of


events. Perhaps the user undertook a too-broad query, but when dealing with any large dataset this sort of result is common because the XML schema puts all events, no matter how trivial, on the same footing. It has therefore been suggested that events should be ranked, or should nest, with large scale events such as 'The Persian Wars' expressed as parents of their component events, like 'The Battle of Thermopylae'. In response to publications and presentations of the Heml/XML model, others have expressed a desire for more complex historical relations between events to be encoded. Should not Heml express something as simple as causality, such as the fact that Ephialtes' treachery led to the defeat of the Three Hundred? Finally, it has been noted that with Heml/XML markup it is not possible to encode the variations in opinions regarding historical events. When scholars debate the date of the arrival of the Greek-speaking ancestors of the Mycenaeans, what date should be encoded for this event? Some of these issues have been tackled recently by (Nakahira 2007), but in the Heml XML Schema they were intentionally left unaddressed for both theoretical and practical reasons. First, it has been this project's experience that resolving such relations with a language such as XQuery (http://www.w3.org/TR/xquery/) or XSLT was difficult and computationally expensive. Secondly, since the XML Schema expresses matters pertaining to an event by nested elements, it was unclear where a statement linking two elements should appear. Pertaining no more to one element than the other, ideally it should stand outside the heml:Event elements, but this raised the prospect of new schemas always being required to provide metadata for previous schemas. Finally, an XML schema through which scholarly debate could be encoded for any facet of an event seemed likely to be unmanageably complex.

RDF AND HEML

Our work of the past year shows that these problems can be addressed by using the W3C's Resource Description Framework (RDF) as our means of encoding Heml events and related information.6 As a collection of statements expressed through URIs, RDF is well suited to define metadata that expresses associations between events. Because these statements are unordered, these associations can be asserted in any source of data, anywhere. Moreover, RDF defines a method known as "reification" whereby a statement comprising subject, verb and object can itself play the role of subject or object in another, encompassing statement. Example 2 illustrates the process with an example from Greek prehistory. The first line makes a simple statement, without attribution, that the ancestors of the Mycenaeans arrived in the Greek mainland in 1600 BC. The second line indicates the reification of that statement with parentheses, and attributes this chronology to Drewes with the hemlRDF:asserts predicate. The third line refers to the same subject, but it shows that Renfrew assigns that subject a much earlier date.

Statement:

-1600
Reified Statement A:
( -1600)
Reified Statement B:
(
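The idea of attributing a statement to a scholar can be sketched with nested triples. This is a simplification made for illustration, not RDF's own reification vocabulary and not the project's data: the ex: identifiers are invented, only hemlRDF:asserts and the year -1600 come from the text, and the year given for Renfrew is a placeholder, since the text says only that his date is much earlier.

```python
from typing import Dict, List, NamedTuple


class Statement(NamedTuple):
    subject: object     # a URI string, or another Statement
    predicate: str
    obj: object         # a URI string, a literal, or another Statement


ARRIVAL = "ex:ArrivalOfMycenaeanAncestors"   # hypothetical identifier
RENFREW_YEAR = -6000   # placeholder value only; not taken from the source

GRAPH: List[Statement] = [
    # Each outer statement has a whole inner statement as its object.
    Statement("ex:Drewes", "hemlRDF:asserts", Statement(ARRIVAL, "ex:year", -1600)),
    Statement("ex:Renfrew", "hemlRDF:asserts", Statement(ARRIVAL, "ex:year", RENFREW_YEAR)),
]


def opinions(graph: List[Statement], subject: str, predicate: str) -> Dict[object, object]:
    """Collect who asserts which value for the given subject and predicate."""
    found: Dict[object, object] = {}
    for statement in graph:
        inner = statement.obj
        if statement.predicate == "hemlRDF:asserts" and isinstance(inner, Statement):
            if inner.subject == subject and inner.predicate == predicate:
                found[statement.subject] = inner.obj
    return found
```

Because the attributed statements are ordinary members of an unordered collection, a query can gather competing dates for one event without either date being privileged.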

8 == "" t 9 == Verb:

Rules 2-5 introduce the strategy we call overriding, answering a query that is usually answered by a more general node in order to provide specific results for this situation. The left-hand sides of these rules use braces instead of angle brackets, indicating that the order of appearance of the atoms is irrelevant for matching the rules. The atom I that appears on the right-hand sides of these rules is a morphophoneme that either disappears or converts to an i or an e, depending on surrounding context, during a postprocessing step. Rule 7 introduces the lookup strategy, by which particular information is obtained by reference to a special-purpose node. It directs a query such as f a c to the AEE node to convert the a to ē, so the perfective stem of facere becomes fēc. Here is that lookup node:

AEE:
1 == $letter#1 ē $letter#2

This node depends on a definition (not shown) that defines what atoms are in the category "letter." Rule 1 says that any query beginning with a letter, then the atom a, then any letter, should evaluate to the two letters surrounding ē instead.


THE VERB NODE

Queries addressed to Praise are generally deferred to its parent, VerbA, which then defers them further to Verb.

Verb:
1 {$conj34 1 sg future/present imperfective indicative/subjunctive} =+=
2 {future subjunctive} == !
3 {perfective passive} == Sandhi: , ToBe:
4 == Sandhi:

Rule 1 reflects the future indicative to the present subjunctive in verbs of conjugations 3 and 4 (abbreviated by the value $conj34) in the first singular imperfective. This rule is quite specialized, applying, for instance, to dūcam "I will lead / may I lead." Without this rule, we would produce dūcēo "I will lead." Rule 2 indicates that there is no result for a query involving future subjunctive forms; Latin does not have these forms. Rules 3 and 4 reflect to the Sandhi node as a postprocessing step after assembling the components of a verb. The rule with the widest applicability, Rule 4, is the default rule. It combines the results of queries directed to the four nodes StemAspect, SuffixTense1, SuffixTense2, and SuffixPersonVoice, along with the marker wordEnd for postprocessing. Rule 3 generates forms for the perfective passive, which involve a participle, an adjectival ending, and a specific form of the verb esse, which has its own node ToBe.

AUXILIARY NODES

The Verb node invokes several auxiliary nodes.

StemAspect:
1 {imperfective} =+= ""
2 {perfective} =+= ""

FINKEL AND STUMP


The stem depends on the aspect; it either results in the imperfective or the perfective stem. The =+= notation preserves all the elements of the query path, including those otherwise removed by matching the left-hand side of the rules.

SuffixTense1:
1 ==
2 {conj1 present imperfective subjunctive} == ē
3 {present imperfective subjunctive} == ā
4 {perfective} == e r
5 {past perfective subjunctive} == i s s ē
6 {present perfective indicative} == I
7 {3 pl present perfective indicative} == ē r
8 {3 pl present perfective subjunctive} == e r
9 {past imperfective indicative} == b
10 {future imperfective indicative} == b
11 {past imperfective indicative conj3io} == i ē b
12 {past imperfective indicative conj4} == ē b
13 {future imperfective indicative $conj34} ==
14 {future imperfective indicative conj4} == ē
15 {future imperfective indicative conj3io} == ē
16 {past imperfective subjunctive} == r ē

The SuffixTense1 node contains most tense information. It ranges from very specific rules, like Rule 14, to fairly general rules, such as Rule 4, which is overridden by more specific Rules 5-8.

SuffixTense2:
1 {past indicative} == ā
2 {present perfective subjunctive} == ī
3 ==

The second tense suffix is usually empty (Rule 3), but it is occasionally either ā or ī.

SuffixPersonVoice:
1 {2 sg} == I SuffixVoice SuffixPerson
2 {2 pl passive} == I m i n ī
3 == SuffixPerson SuffixVoice

The suffix for person and voice is occasionally quite specific, as in the second person plural passive. In the second person singular, the voice suffix (r for the passive) precedes the person suffix, as in laudāris "you (sg) are praised," whereas the voice suffix usually follows the person suffix, as in laudāmur "we are praised."

SuffixVoice:
1 {passive} == r
2 {passive 3} == u r
3 ==

In general, Rule 3 indicates that there is no suffix for voice. However, there is a suffix for the passive voice, which is generally r (Rule 1) but sometimes ur (Rule 2).

SuffixPerson:
1 {1 sg} == m
2 {1 sg present imperfective indicative} == ō
3 {1 sg future indicative} == ō
4 {1 sg present perfective indicative} == ī
5 == SuffixalVowel Desinence

The personal suffix is usually a vowel and a desinence (Rule 5), but the first person singular is exceptional, with a general suffix m (Rule 1) but sometimes ō or ī.

SuffixalVowel:
1 {future} == I
2 {present imperfective indicative} == I
3 {past imperfective subjunctive} == I
4 {3 pl +2} == u
5 {3 pl future perfective active} == I
6 ==

The vowel has several forms; sometimes it is the morphophoneme I as in laudābit "he will praise," but sometimes u, as in laudābunt "they will praise."

Desinence:
1 {2 sg} == I s
2 {2 sg present perfective indicative} == s t ī
3 {3 sg} == t
4 {1 pl} == m u s
5 {2 pl} == t i s
6 {2 pl present perfective indicative} == s
7 {3 pl} == n t

The desinence provides the final consonants, typically marking person and number, but occasionally influenced by aspect and tense (Rules 2 and 6).
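The overriding pattern used throughout these auxiliary nodes can be imitated in a few lines. The sketch assumes, as a simplification of KATR's matching, that a rule applies when all atoms on its left-hand side occur in the query and that the rule with the most atoms wins; it uses the three SuffixVoice rules as data and is not the KATR interpreter.

```python
from typing import FrozenSet, Iterable, List, Tuple

Rule = Tuple[FrozenSet[str], str]

# The three SuffixVoice rules: left-hand-side atoms and resulting suffix.
SUFFIX_VOICE: List[Rule] = [
    (frozenset({"passive"}), "r"),
    (frozenset({"passive", "3"}), "ur"),
    (frozenset(), ""),               # default rule: empty left-hand side
]


def evaluate(rules: List[Rule], query: Iterable[str]) -> str:
    """Return the value of the most specific rule whose atoms all occur in the query."""
    atoms = frozenset(query)
    matching = [(lhs, rhs) for lhs, rhs in rules if lhs <= atoms]
    # The rule with the largest left-hand side overrides the more general ones.
    return max(matching, key=lambda rule: len(rule[0]))[1]
```

The default rule always matches, so every query receives an answer, and a third-person passive query is answered by the more specific rule rather than by the general passive rule.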

THE SANDHI NODE

After we assemble the entire verb, we apply language-specific sandhi rules to account for phonological alterations.

Sandhi:
1 ==
2 == $letter
3 ==
4 ==
5 ==
6 == >e r>
7 == $longUnrounded Vowel>
8 == e % canIe => cane
9 == % cap i i ē bam -> capieebam
10 == ō>
11 ==
12 ==
13 == % moneeām -> moneām
14 == % audīoo -> audioo
15 == % audīees -> audiees
16 == % audīām -> audiām ->
17 == % audīūnt -> audiūnt
18 == % frūctū-um -> frūctu-um; cornūa->
19 ==
20 ==
21 == % dūceebāunt -> dūceebānt -> ..
22 ==
23 ==
24 ==

Unlike other nodes, Sandhi works strictly from left to right, dealing with a few atoms at a time. The first rule removes the wordEnd marker if that is all that is left. The other rules simplify the beginning of the remaining string of letters and then use angle brackets to direct the modified string back to the Sandhi node. Rule 2 is quite general: if no more specific rule applies, it takes a single letter from the query as output, and directs the remainder of the query back to Sandhi. Rule 3 converts forms like *dūcimusr to *dūcimur. Rules 4-8 deal with the morphophoneme I, typically converting it to i (Rule 5), but sometimes converting it to e or removing it entirely. Rules 9-18 deal with two vowels in a row; typically, the second vowel is retained, and the first either disappears or shortens. Rules 19 and 20 shorten long vowels in certain contexts by applying to the Shorten rule, which we omit here. Finally, Rules 22-24 introduce spelling rules. We include them under the rubric of sandhi.

STRATEGIES FOR BUILDING KATR THEORIES

We have been applying KATR to natural-language morphology for several years. In addition to Latin, we have built a complete morphology of Hebrew verbs (Finkel 2007b), large parts of Sanskrit (and other related languages), and smaller studies of Bulgarian, Swahili, Georgian, Lingala, Spanish, Polish, and Turkish. KATR allows us to represent morphological rules for these languages with great elegance. Writing specifications in KATR is not easy. KATR is capable of representing elegant theories, but arriving at those theories requires considerable effort. Early choices color the structure of the resulting theory, and the author must often discard attempts and rethink how to represent the target morphology. The hardest


choice is often whether to model a form by introducing a sandhi rule or a formative rule. An example is the -mur suffix that marks first person plural passive. We choose to model this ending as mus + r and to reduce the result by sandhi. We could have introduced instead a rule in the Desinence node:

4.5 {1 pl passive} == m u

and let the r appear due to the SuffixVoice node. In this case, we prefer not to introduce a special rule in Desinence, partially because it looks so similar to the existing Rule 4, and therefore seems unparsimonious, and partially because we hypothesize that historically there really was some form *-musr that eventually elided the two final consonants.

AN IMPLICATIVE KATR THEORY FOR LATIN

The Paradigm Chart

We start this analysis by presenting a paradigm of word forms for Latin verbs. Table 2 displays a subset of the entire paradigm that covers only the present indicative active forms. The roots of lexemes are abstracted away from this paradigm.


CONJ      PrIAc1s  PrIAc2s  PrIAc3s  PrIAc1p  PrIAc2p  PrIAc3p
TEMPLATE  4S1C     1S1Cs    1S1Ct    4S1Cmus  1S1Ctis  4S1Cnt
cIa       ō        ā        ā        ā        ā        ā
cIb       ō        ā        ā        ā        ā        ā
cIc       ō        ā        ā        ā        ā        ā
cIIa      eō       ē        ē        ē        ē        ē
cIIb      eō       ē        ē        ē        ē        ē
cIIc      eō       ē        ē        ē        ē        ē
cIId      eō       ē        ē        ē        ē        ē
cIIe      eō       ē        ē        ē        ē        ē
cIIIa     ō        i        i        i        i        i
cIIIb     ō        i        i        i        i        i
cIIIc     ō        i        i        i        i        i
cIIId     ō        i        i        i        i        i
cIIIe     iō       i        i        i        i        i
cIIIf     ō        Ӆ        Ӆ        Ӆ        Ӆ        Ӆ
cIIIs     um       Ӆ        Ӆ        u        Ӆ        u
cIVa      iō       ī        ī        ī        ī        ī
cIVb      iō       ī        ī        ī        ī        ī
cIVc      iō       ī        ī        ī        ī        ī
cIVd      iō       ī        ī        ī        ī        ī

Table 2. Latin paradigm fragment (6 of the 92 columns)

We have expanded the traditional four conjugations into 19 conjugations. They are mostly distinguished by forms not shown here, particularly by the perfect indicative active. For instance, a cIa verb such as iuvāre "help" forms the perfect stem by lengthening the middle vowel, as in iūvī "I helped," whereas a cIb verb such as laudāre "praise" simply adds āvī to the stem to form the perfect 1 singular. For completeness, we even include conjugation cIIIs for the two exceptional verbs esse "be" and posse "be able."


The line marked CONJ simply lists the morphosyntactic property sets for our convenience. The line marked TEMPLATE indicates parts of the chart that are constant within each column. For instance, the second-person singular is marked with 1S1Cs. The 1S part means "place the first stem here." We choose to make the first stem the present stem. Next, 1C means "place the first entry in the column here." Our Latin charts have only a single entry per column in each row. When an entry is empty, we mark it with Ӆ. Finally, s means "place the letter s here". All Latin verbs follow this strategy for building the present indicative active 2nd person singular forms. Our chart requires that each verb have five stems, possibly identical: present, perfect, supine, present first person, and present past subjunctive. For iuvāre "help," these stems are iuv, iū, iū, iuv, and iuv, respectively. For esse "be," the stems are es, fu, es, s, and es, respectively. It often happens that a conjugation regularly refers one stem to another. For instance, class cIa refers the fourth and fifth stems to the first; class cIIIs refers the third to the first. (For consistency, we always refer stems to earlier-numbered stems.) We represent stem referrals as an addendum to our chart, as in Table 3.


REFER cIa 4 , 5 -> 1
REFER cIb 2 - 5 -> 1
REFER cIc 2 - 5 -> 1
REFER cIIa 2 - 5 -> 1
REFER cIIb 2 - 5 -> 1
REFER cIIc 2 - 5 -> 1
REFER cIId 4 , 5 -> 1 ; 3 -> 2
REFER cIIe 4 , 5 -> 1
REFER cIIIa 2 - 5 -> 1
REFER cIIIb 2 - 5 -> 1
REFER cIIIc 2 - 5 -> 1
REFER cIIId 2 - 5 -> 1
REFER cIIIs 3 -> 1
REFER cIIIe 3 - 5 -> 1
REFER cIVa 3 - 5 -> 1
REFER cIIIf 4 - 5 -> 1
REFER cIVb 2 - 5 -> 1
REFER cIVc 2 - 5 -> 1
REFER cIVd 2 - 5 -> 1

Table 3. Latin stem referrals
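The template notation and the stem referrals can be read as a small expansion procedure. The sketch below is an interpretation, not the authors' program: the function names are invented, the stems come from the lexicon entries for "praise" and "be", and the column entries (ā, ō, u) are read off the paradigm fragment, with the u of sumus being one reading of a damaged table.

```python
import re
from typing import Dict, List

# A template is a run of tokens: "<n>S" (stem n), "<n>C" (n-th column entry),
# or a single literal letter.
TOKEN = re.compile(r"(\d+)S|(\d+)C|(.)")


def resolve_stem(number: int, stems: Dict[int, str], referrals: Dict[int, int]) -> str:
    """Follow referrals (e.g. 4 -> 1) until a stem listed in the lexicon is reached."""
    while number not in stems:
        number = referrals[number]
    return stems[number]


def expand(template: str, stems: Dict[int, str], referrals: Dict[int, int],
           entries: List[str]) -> str:
    """Build an unsandhied form from a template such as '1S1Cs'."""
    parts = []
    for stem_no, entry_no, literal in TOKEN.findall(template):
        if stem_no:
            parts.append(resolve_stem(int(stem_no), stems, referrals))
        elif entry_no:
            parts.append(entries[int(entry_no) - 1])
        else:
            parts.append(literal)
    return "".join(parts)


# praise, class cIb: only stem 1 is listed; stems 2-5 refer to stem 1.
PRAISE_STEMS = {1: "laud"}
CIB_REFERRALS = {2: 1, 3: 1, 4: 1, 5: 1}

# be, class cIIIs: stem 3 refers to stem 1.
BE_STEMS = {1: "es", 2: "fu", 4: "s"}
CIIIS_REFERRALS = {3: 1}
```

With these inputs the template 1S1Cs yields laudās and 4S1Cmus yields laudāmus and sumus; forms that still need adjustment, such as ess for the second person singular of esse, are left to the sandhi rules.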

We then represent the lexicon by indicating the conjugation and stems of each verb as another addendum to the chart, as in Table 4.


LEXEME help cIa 1:iuv 2:ded 3:iū
LEXEME bathe cIa 1:lav 2:lāv 3:lau
LEXEME stand cIa 1:st 2:stet 3:sta
LEXEME give cIa 1:d 2:ded 3:da
LEXEME praise cIb 1:laud
LEXEME rattle cIc 1:crep
LEXEME destroy cIIa 1:dēl
LEXEME mourn cIIb 1:lūg
LEXEME order cIIb 1:iub
LEXEME warn cIIc 1:mon
LEXEME see cIId 1:vid 2:vīd
LEXEME arouse cIIe 1:ci 2:cī 3:ci
LEXEME decide cIIIa 1:dēcern
LEXEME nourish cIIIb 1:al
LEXEME lead cIIIc 1:dūc
LEXEME attach cIIId 1:fīg
LEXEME take cIIIe 1:cap 2:cēp
LEXEME carry cIIIf 1:fer 2:tul 3:lāt
LEXEME be cIIIs 1:es 2:fu 4:s
LEXEME be able cIIIs 1:potes 2:possu 4:poss 5:pos
LEXEME go cIIIs 1:i 2:i 4:e 5:ī
LEXEME come cIVa 1:ven 2:vēn
LEXEME hear cIVb 1:aud
LEXEME leap cIVc 1:sal
LEXEME bind cIVd 1:vinc

Table 4. Latin lexicon

The final section of the paradigm chart expresses rules of sandhi, as shown in Table 5.


CLASS finalStop r m t nt
SANDHI s s | => s % esst => est
SANDHI s r => s s % esret => esset
SANDHI s b => r % esbam => eram
SANDHI ā\s*[:finalStop:] | => a $1
SANDHI ī\s*[:finalStop:] | => i $1
SANDHI ē\s*[:finalStop:] | => e $1
SANDHI ō\s*[:finalStop:] | => o $1
SANDHI ū\s*[:finalStop:] | => u $1
SANDHI g s => x % lugsi => luxi
SANDHI b s => s s % iubsī => iussī
SANDHI b t => s s % iubtum => iussum
SANDHI g t => c t % lugtum => luctum
SANDHI c s => x % ducsi => duxi
SANDHI d s => s % vīdsum => vīsum
SANDHI rn t => rt % dēcerntum => dēcertum

Table 5. Latin sandhi rules
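The rules of Table 5 can be approximated as an ordered list of regular-expression rewrites. The sketch rests on two assumptions that the table itself does not settle: that the bar marks the end of the word, and that the rules are applied once each in table order. Under that reading the comment example esst is not covered; the remaining comment examples are.

```python
import re
from typing import List, Tuple

FINAL_STOP = r"(nt|r|m|t)"      # the finalStop class: r, m, t, nt
SHORTEN = {"ā": "a", "ī": "i", "ē": "e", "ō": "o", "ū": "u"}

RULES: List[Tuple[str, str]] = (
    [
        (r"ss$", "s"),          # s s |  => s   (bar read as end of word)
        (r"sr", "ss"),          # s r    => s s
        (r"sb", "r"),           # s b    => r
    ]
    # Long vowels shorten before a word-final stop.
    + [(long + FINAL_STOP + "$", short + r"\1") for long, short in SHORTEN.items()]
    + [
        (r"gs", "x"),
        (r"bs", "ss"),
        (r"bt", "ss"),
        (r"gt", "ct"),
        (r"cs", "x"),
        (r"ds", "s"),
        (r"rnt", "rt"),
    ]
)


def apply_sandhi(form: str) -> str:
    """Apply every rule once, in table order."""
    for pattern, replacement in RULES:
        form = re.sub(pattern, replacement, form)
    return form
```

For example, apply_sandhi("esbam") gives eram and apply_sandhi("laudāt") gives laudat, so the regularised paradigm entries surface as ordinary Latin forms.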

These rules sometimes truly express sandhi, such as the rules for shortening long vowels before a final stop. Others are merely spelling rules, such as converting cs to x. We introduce a few in order to make our paradigm more regular, such as changing sb to r. In many cases, we indicate the situation that led us to introduce the rule by a comment starting with %.

Deriving the Essence of the Paradigm

Before analyzing the paradigm, we reduce it to its essence. The first reduction removes identical conjugations. This situation does not arise in Latin, but it does in French, where 71 of the 149 conjugations listed in the Larousse Dictionary are redundant after we present them on our paradigm form, and another 11 are identical except for stem-referral pattern. The next step is to remove redundant columns (morphosyntactic property sets, or MPSs). For example, although the second and third person verb forms are different in the present indicative


active, the columns associated with those forms are identical. The differences are all covered by the template and sandhi rules. Of the 92 MPSs in Latin, there are only 14 unique ones. The last step is to remove essentially identical columns. Two columns are essentially identical if there is a one-to-one and onto mapping between the exponences (cell contents) found in those columns. For example, the future perfect indicative active 2nd singular MPS is essentially the same as another MPS. Our analysis reduces the chart from 92 MPSs to 9 important ones, which we call distillations. We call the reduced paradigm chart the essence. The essence does not contain actual exponences; we substitute unique symbols instead. Table 6 presents the essence of the Latin paradigm. An entry like e4_2 means that this distillation is based on the fourth MPS (which happens to be PrIAc1p ) and has the second exponence found in that MPS (which happens to be ē).


cIa    e1_1  e2_1  e4_1  e13_1  e25_1  e37_1  e55_1  e58_1  e92_1
cIb    e1_1  e2_1  e4_1  e13_1  e25_1  e37_2  e55_1  e58_2  e92_2
cIc    e1_1  e2_1  e4_1  e13_1  e25_1  e37_3  e55_1  e58_2  e92_3
cIIa   e1_2  e2_2  e4_2  e13_2  e25_2  e37_4  e55_2  e58_3  e92_4
cIIb   e1_2  e2_2  e4_2  e13_2  e25_2  e37_5  e55_2  e58_3  e92_1
cIIc   e1_2  e2_2  e4_2  e13_2  e25_2  e37_3  e55_2  e58_3  e92_3
cIId   e1_2  e2_2  e4_2  e13_2  e25_2  e37_1  e55_2  e58_3  e92_5
cIIe   e1_2  e2_2  e4_2  e13_2  e25_2  e37_6  e55_2  e58_3  e92_1
cIIIa  e1_1  e2_3  e4_3  e13_2  e25_3  e37_1  e55_3  e58_4  e92_1
cIIIb  e1_1  e2_3  e4_3  e13_2  e25_3  e37_3  e55_3  e58_4  e92_1
cIIIc  e1_1  e2_3  e4_3  e13_2  e25_3  e37_5  e55_3  e58_4  e92_1
cIIId  e1_1  e2_3  e4_3  e13_2  e25_3  e37_5  e55_3  e58_4  e92_5
cIIIe  e1_3  e2_3  e4_3  e13_3  e25_4  e37_1  e55_4  e58_5  e92_1
cIIIf  e1_1  e2_4  e4_4  e13_2  e25_3  e37_1  e55_5  e58_1  e92_6
cIIIs  e1_4  e2_4  e4_5  e13_4  e25_5  e37_1  e55_6  e58_6  e92_0
cIVa   e1_3  e2_5  e4_6  e13_3  e25_4  e37_1  e55_4  e58_5  e92_1
cIVb   e1_3  e2_5  e4_6  e13_3  e25_4  e37_7  e55_4  e58_5  e92_7
cIVc   e1_3  e2_5  e4_6  e13_3  e25_4  e37_3  e55_4  e58_5  e92_1
cIVd   e1_3  e2_5  e4_6  e13_3  e25_4  e37_5  e55_4  e58_5  e92_1

Table 6. Essence of the Latin paradigm
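The column reductions that lead to the essence can be sketched as follows. The code is an illustration under stated assumptions, not the authors' program: the toy paradigm and its column names are invented, and two columns are treated as essentially identical when the pairing of their cell contents, row by row, is a one-to-one correspondence. Identical columns are a special case of this test.

```python
from typing import Dict, List


def essentially_identical(column_a: List[str], column_b: List[str]) -> bool:
    """True when the exponences of the two columns correspond one-to-one."""
    forward: Dict[str, str] = {}
    backward: Dict[str, str] = {}
    for a, b in zip(column_a, column_b):
        # setdefault records the first pairing; any later conflict breaks the bijection.
        if forward.setdefault(a, b) != b or backward.setdefault(b, a) != a:
            return False
    return True


def distill(paradigm: Dict[str, List[str]]) -> List[str]:
    """Keep the first column of each group of essentially identical columns."""
    kept: List[str] = []
    for name, column in paradigm.items():
        if not any(essentially_identical(column, paradigm[k]) for k in kept):
            kept.append(name)
    return kept


# Hypothetical toy paradigm: four columns (MPSs) over four conjugations.
TOY = {
    "m1": ["a", "e", "i", "i"],
    "m2": ["a", "e", "i", "i"],   # identical to m1
    "m3": ["x", "y", "z", "z"],   # essentially identical to m1
    "m4": ["x", "x", "z", "z"],   # merges two classes that m1 keeps apart
}
```

On the toy data, distill(TOY) keeps m1 and m4. Each surviving column would then have its exponences replaced by unique symbols in the manner of Table 6.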

Principal Parts

Intuitively, a set of principal parts is a minimal subset of MPSs such that, if one knows the exponences of those MPSs for a particular verb, one can deduce the verb's conjugation, from which one can deduce all the other MPSs of the verb. The practical utility of principal parts for language pedagogy has long been recognized. Generations of Latin students have learned that each verb in Latin has four principal parts (present indicative active first person singular, active infinitive, perfect indicative active first person singular, perfect passive participle (neuter nominative singular)). If one knows laudō, laudāre, laudāvī, laudātum, one knows enough to place the verb in conjugation 1, from which one can determine the exponences of all the other MPSs by reference to the paradigm chart.

We can define several kinds of principal-part systems (Finkel 2007a).

• Static: a set of MPSs that applies for all verbs in the chart. Given the exponences of a verb for those MPSs, one can deduce its conjugation. Static principal parts are equivalent to the traditional understanding.
• Adaptive: a tree of MPSs. Given the exponence of a verb for the MPS at the root of the tree, one can select an appropriate subtree and recurse. The leaves of the tree are conjugations.
• Dynamic: a set of {MPS, exponence} pairs for each conjugation. If a verb agrees with a set of pairs, it belongs to the associated conjugation.

We have built a program that takes the essence of a paradigm and computes its principal-part systems. For Latin, we find that there are, in fact, four static principal parts, with ten variations, as shown in Table 7. It is a bit surprising that the infinitive does not figure into any of these variations. However, the paradigm from which we calculate this result places the infinitive MPS almost at the end. It turns out that the exponences for the infinitive are identical to the exponences for the imperfect subjunctive active first person singular. For laudō, for instance, both show ā, one using that exponence to form the infinitive laudāre, and the other to form the subjunctive laudārem. In turn, the MPS for the imperfect subjunctive active first person singular is essentially identical to the MPS for the present indicative active second person singular. Therefore, the first variation in Table 7 corresponds to the traditional set of principal parts.

1   Present 1 sg, Present 2 sg, Perfect 1 sg, Supine
2   Present 1 sg, Present 1 pl, Perfect 1 sg, Supine
3   Present 2 sg, Imperfect 1 sg, Perfect 1 sg, Supine
4   Present 2 sg, Future perfect 1 sg, Perfect 1 sg, Supine
5   Present 2 sg, Perfect 1 sg, Present subj 1 sg, Supine
6   Present 2 sg, Perfect 1 sg, Present subj 1 pl, Supine
7   Present 1 pl, Imperfect 1 sg, Perfect 1 sg, Supine
8   Present 1 pl, Future perfect 1 sg, Perfect 1 sg, Supine
9   Present 1 pl, Perfect 1 sg, Present subj 1 sg, Supine
10  Present 1 pl, Perfect 1 sg, Present subj 1 pl, Supine

Table 7. Latin static principal parts (indicative active unless otherwise marked)
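A brute-force search over the essence is enough to illustrate how static principal parts can be found. The sketch below is an assumption-laden illustration rather than the authors' program: the function names are hypothetical, and the data are Table 6 transcribed as one digit per distillation, in the column order e1, e2, e4, e13, e25, e37, e55, e58, e92. It looks for the smallest sets of distillations on which no two conjugations agree.

```python
from itertools import combinations

# Distillations in the column order of Table 6.
COLUMNS = ["e1", "e2", "e4", "e13", "e25", "e37", "e55", "e58", "e92"]

# Table 6, one digit per distillation (e.g. cIb has e37_2, e58_2, e92_2).
_RAW = """
cIa   111111111  cIb   111112122  cIc   111113123  cIIa  222224234
cIIb  222225231  cIIc  222223233  cIId  222221235  cIIe  222226231
cIIIa 133231341  cIIIb 133233341  cIIIc 133235341  cIIId 133235345
cIIIe 333341451  cIIIf 144231516  cIIIs 445451660  cIVa  356341451
cIVb  356347457  cIVc  356343451  cIVd  356345451
""".split()
ESSENCE = dict(zip(_RAW[0::2], _RAW[1::2]))


def distinguishes(cols, essence):
    """True if no two conjugations agree on every distillation in cols."""
    signatures = {tuple(row[c] for c in cols) for row in essence.values()}
    return len(signatures) == len(essence)


def static_principal_parts(essence, columns):
    """Return every smallest set of distillations that separates all conjugations."""
    for size in range(1, len(columns) + 1):
        hits = [cols for cols in combinations(range(len(columns)), size)
                if distinguishes(cols, essence)]
        if hits:
            return [[columns[c] for c in cols] for cols in hits]
    return []


for variation in static_principal_parts(ESSENCE, COLUMNS):
    print(", ".join(variation))
```

On the transcribed data the search returns ten sets of four distillations, each containing e37 and e92, which agrees with the four static principal parts and ten variations reported in Table 7.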

Our program computes one of the possible trees representing adaptive principal parts, as shown in Figure 2, which also shows a representative verb from each conjugation. Some conjugations can be determined with only two adaptive principal parts. For instance, if the present second person singular form uses ās and the perfect first person singular form is ī, then the conjugation is cIa, as in iuvāre "help." Others require three adaptive principal parts, such as dūcere "lead." However, no conjugation needs four principal parts. This analysis shows that the most important distinction, the one at the top of the tree, is based on the present indicative active second person singular form, which we noted above is essentially the same as the active infinitive form.

Figure 2. Latin adaptive principal parts (all indicative active)
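An adaptive tree of this kind can be sketched as an exhaustive search that, at every node, tries each distillation and keeps the one whose worst-case depth is smallest. This is a hedged illustration, not the authors' algorithm, and it need not return the same tree as Figure 2; the function names are hypothetical and the data are Table 6 transcribed as one digit per distillation.

```python
from functools import lru_cache

COLUMNS = ["e1", "e2", "e4", "e13", "e25", "e37", "e55", "e58", "e92"]

# Table 6, one digit per distillation, in the column order above.
_RAW = """
cIa   111111111  cIb   111112122  cIc   111113123  cIIa  222224234
cIIb  222225231  cIIc  222223233  cIId  222221235  cIIe  222226231
cIIIa 133231341  cIIIb 133233341  cIIIc 133235341  cIIId 133235345
cIIIe 333341451  cIIIf 144231516  cIIIs 445451660  cIVa  356341451
cIVb  356347457  cIVc  356343451  cIVd  356345451
""".split()
ESSENCE = dict(zip(_RAW[0::2], _RAW[1::2]))


@lru_cache(maxsize=None)
def best_tree(names):
    """Return (depth, tree) with minimal worst-case depth for a tuple of conjugations.

    A tree is either a conjugation name (leaf) or a pair
    (distillation, {exponence digit: subtree}).
    """
    if len(names) == 1:
        return 0, names[0]
    best = None
    for col, label in enumerate(COLUMNS):
        groups = {}
        for name in names:
            groups.setdefault(ESSENCE[name][col], []).append(name)
        if len(groups) == 1:
            continue  # this distillation does not split the group
        subtrees = {value: best_tree(tuple(group)) for value, group in groups.items()}
        depth = 1 + max(d for d, _ in subtrees.values())
        if best is None or depth < best[0]:
            best = (depth, (label, {v: t for v, (_, t) in subtrees.items()}))
    if best is None:
        raise ValueError("conjugations cannot be told apart: %r" % (names,))
    return best


def leaves(tree):
    """List the conjugations at the leaves of a tree."""
    if isinstance(tree, str):
        return [tree]
    return [leaf for subtree in tree[1].values() for leaf in leaves(subtree)]


depth, tree = best_tree(tuple(ESSENCE))
print(depth, tree[0])
```

On the transcribed data the minimal worst-case depth is three, which is consistent with the observation that no conjugation needs four adaptive principal parts.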

We also compute dynamic principal parts. Table 8 displays one set for each conjugation; in general, each conjugation has several variations. This analysis shows that many conjugations can be completely determined by a single principal part. For example, if the perfect indicative active 1 sg form of a verb is -āvī, the verb is in conjugation cIb. Only one conjugation, cIIIc, requires three exponences to determine all its forms.

cIa    Present 2 sg, Present subjunctive 1 pl
cIb    Perfect 1 sg
cIc    Present subjunctive 1 pl, Supine
cIIa   Perfect 1 sg
cIIb   Present 1 sg, Perfect 1 sg
cIIc   Present 1 sg, Supine
cIId   Present 1 sg, Perfect 1 sg
cIIe   Perfect 1 sg
cIIIa  Perfect 1 sg, Present subjunctive 1 sg
cIIIb  Perfect 1 sg, Present subjunctive 1 sg
cIIIc  Perfect 1 sg, Present subjunctive 1 sg, Supine
cIIId  Present subjunctive 1 sg, Supine
cIIIe  Present 1 sg, Present 2 sg
cIIIf  Present 1 pl
cIIIs  Present 1 sg
cIVa   Present 2 sg, Perfect 1 sg
cIVb   Perfect 1 sg
cIVc   Present 2 sg, Perfect 1 sg
cIVd   Present 2 sg, Perfect 1 sg

Table 8. Latin dynamic principal parts (indicative active unless otherwise marked)
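Dynamic principal parts can be illustrated in the same brute-force style: for each conjugation, look for the smallest set of {distillation, exponence} pairs that no other conjugation shares. The sketch is an illustration only, with hypothetical function names and Table 6 transcribed as one digit per distillation; it returns the first smallest set it finds, whereas each conjugation may have several variations.

```python
from itertools import combinations

COLUMNS = ["e1", "e2", "e4", "e13", "e25", "e37", "e55", "e58", "e92"]

# Table 6, one digit per distillation, in the column order above.
_RAW = """
cIa   111111111  cIb   111112122  cIc   111113123  cIIa  222224234
cIIb  222225231  cIIc  222223233  cIId  222221235  cIIe  222226231
cIIIa 133231341  cIIIb 133233341  cIIIc 133235341  cIIId 133235345
cIIIe 333341451  cIIIf 144231516  cIIIs 445451660  cIVa  356341451
cIVb  356347457  cIVc  356343451  cIVd  356345451
""".split()
ESSENCE = dict(zip(_RAW[0::2], _RAW[1::2]))


def dynamic_principal_parts(target):
    """Smallest list of (distillation, exponence digit) pairs unique to target."""
    row = ESSENCE[target]
    for size in range(1, len(COLUMNS) + 1):
        for cols in combinations(range(len(COLUMNS)), size):
            rivals = [name for name, other in ESSENCE.items()
                      if name != target and all(other[c] == row[c] for c in cols)]
            if not rivals:
                return [(COLUMNS[c], row[c]) for c in cols]
    return None  # only reached if another conjugation has an identical row


for conjugation in ESSENCE:
    print(conjugation, dynamic_principal_parts(conjugation))
```

On the transcribed data six conjugations are identified by a single pair (cIb by e37_2, for example), and only cIIIc needs three pairs, in line with Table 8.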

Grouping

Computing the adaptive principal parts produces one way to see the interrelation of the conjugations, as shown in Figure 2. We can compute the interrelation in a more direct way by using an algorithm based on Huffman encoding (Huffman 2007). We define the distance between two conjugations as the number of distillations on which they disagree. We repeatedly find the two conjugations of minimum distance, delete them from the set of conjugations, combine them, and insert the result, a pseudo-conjugation, back into the set of conjugations. A pseudo-conjugation has a compound value for those distillations where the two conjugations disagree. We consider a compound value to be of distance 0 from any superset or subset. This algorithm leads to multiple possible analyses, because we may be able to choose among several minimum pairs. Each analysis is a taxonomic tree. Figure 3 shows one tree that our program produces. The entries like e13_2 refer to the essence in Table 6. They show in what way each node in the tree is distinguished from its siblings. For instance, conjugations cIIb and cIIe are distinguished by e37, which is perfect indicative active first person singular.

Figure 3. Latin conjugation groups
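The grouping procedure can be sketched as a greedy pairwise merge. The code below is an illustration under stated assumptions, not the authors' implementation: function names are hypothetical, ties between equally close pairs are broken by taking the first pair found (so it yields only one of the possible analyses and need not match Figure 3), and a merged pseudo-conjugation simply takes the union of its parents' values for every distillation. The data are Table 6 transcribed as one digit per distillation.

```python
from itertools import combinations

# Table 6, one digit per distillation (columns e1, e2, e4, e13, e25, e37, e55, e58, e92).
_RAW = """
cIa   111111111  cIb   111112122  cIc   111113123  cIIa  222224234
cIIb  222225231  cIIc  222223233  cIId  222221235  cIIe  222226231
cIIIa 133231341  cIIIb 133233341  cIIIc 133235341  cIIId 133235345
cIIIe 333341451  cIIIf 144231516  cIIIs 445451660  cIVa  356341451
cIVb  356347457  cIVc  356343451  cIVd  356345451
""".split()
ESSENCE = dict(zip(_RAW[0::2], _RAW[1::2]))


def to_sets(row):
    """Represent each distillation value as a (possibly compound) set."""
    return tuple(frozenset(value) for value in row)


def distance(a, b):
    """Count distillations where neither value is a subset of the other."""
    return sum(1 for x, y in zip(a, b) if not (x <= y or y <= x))


def merge(a, b):
    """Pseudo-conjugation: compound (union) values, an assumption of this sketch."""
    return tuple(x | y for x, y in zip(a, b))


def group(essence):
    """Repeatedly merge the closest pair; return a nested-tuple taxonomic tree."""
    nodes = {name: to_sets(row) for name, row in essence.items()}
    trees = {name: name for name in essence}
    while len(nodes) > 1:
        a, b = min(combinations(list(nodes), 2),
                   key=lambda pair: distance(nodes[pair[0]], nodes[pair[1]]))
        label = f"({a}+{b})"
        nodes[label] = merge(nodes.pop(a), nodes.pop(b))
        trees[label] = (trees.pop(a), trees.pop(b))
    return next(iter(trees.values()))


def leaves(tree):
    """List the conjugations at the leaves of a nested-tuple tree."""
    return [tree] if isinstance(tree, str) else leaves(tree[0]) + leaves(tree[1])


print(group(ESSENCE))
```

As a check on the distance measure, cIIb and cIIe differ only in e37 in Table 6, so their distance is 1, and the pseudo-conjugation formed from them is at distance 0 from either parent.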

Figure 3 verifies that the usual conjugation nomenclature is reasonable. All three cI conjugations are close to each other, although cIa is a slight outlier. Conjugations cIIIa–d are very close to each other, but cIIIf (ferre) is significantly farther away. Conjugation cIIIs (esse) is still farther. Strangely, conjugation cIIIe (capere) is grouped with the cIV conjugations; apparently, i-stem third conjugation verbs share more connection with the fourth conjugation than with the third.

Generating a KATR Theory

We can generate a KATR theory directly from the paradigm of Table 2 along with the stem-referral rules of Table 3 and the lexicon of Table 4. We can take advantage of the grouping (Figure 3) to generate a fairly compact KATR theory. Figure 4 shows a fragment of the computed KATR theory. The Help node introduces the three stems required by conjugation cIa. It refers all other requests to the CONJcIa node, which refers the remaining stems to the first stem. It also provisions the version of distillation e58 to have variant 1. It refers other requests to a chain of grouping nodes, here shown as Join12, Join15, Join17, and Join18, each of which provisions some distillations and hands off other requests to the next node in the chain. Finally, node Join18 refers all morphological queries to EXPAND. This node, of which we show only a small piece, combines the result of referring to nodes MPS1 through MPS92, for each one looking up the appropriate exponence. MPS1, which generates the present indicative active first person singular form, invokes node T02 with a parameter that depends on the value of the e1 distillation. Finally, the T02 node looks up the appropriate stem (in this case, stem 4) and combines it with the given ending.

Figure 4. Automatically generated KATR theory (fragment)

Table 9 shows some of the output that KATR generates for this theory. We use such output to verify that we have correctly captured the original paradigm in our chart of Table 2.

iuvō    iuvās   iuvat   iuvāmus   iuvātis   iuvant
laudō   laudās  laudat  laudāmus  laudātis  laudant
moneō   monēs   monet   monēmus   monētis   monent
dūco    dūcis   dūcit   dūcimus   dūcitis   dūcunt
sum     es      est     sumus     estis     sunt

Table 9. Output of automatically generated KATR theory (fragment)

CONCLUSION

This exercise demonstrates that both the realizational and the implicative approaches to defining language morphology lead to effective descriptions, as evidenced by the KATR theories they produce. An advantage of the realizational approach is that it allows us to apply language-specific knowledge and insight to create a default inheritance hierarchy that captures the morphological structure of the language, with slots pertaining to different morphosyntactic properties. However, as we have noted elsewhere (Finkel 2007b), writing KATR specifications requires considerable effort. Early choices color the structure of the resulting theory, and the author must often discard attempts and rethink how to represent the target morphology. We have built KATR theories for verbs in Hebrew, Slovak, Polish, Spanish, and Lingala (a Bantu language of the Congo), as well as for parts of Hungarian, Sanskrit, and Pali.

The implicative approach is much more automatic. One still needs to manually construct the initial paradigm, decide how many stems are needed (for Latin, we use five; for French, we use 15), and abstract as much information as possible into the templates for each MPS. After that, we can use automatic methods that reduce the paradigm chart to its essence, group conjugations, and generate an effective KATR theory. These steps take only a few seconds to complete (on a 1.8 GHz Intel Pentium running Linux, only one second). It is a simple (albeit tedious) matter to verify that all the forms the KATR theory generates are accurate. The KATR theory itself is fairly compact, taking advantage of grouping. However, it is about twice the size of the hand-built theory (measured in characters). More important, it doesn't clearly delineate the slots of the exponences. It is therefore somewhat less satisfying, and somewhat less informative, than the KATR theory we build manually following the realizational approach.
We have applied the implicative approach to French (both as spelled and as pronounced), Hebrew, and Yiddish, as well as some lesser-known languages, such as Comaltepec Chinantec (Oto-Manguean, spoken in Oaxaca, Mexico), Fur (Nilo-Saharan, in Darfur), and Sora (Austro-Asiatic, India). The implicative approach also has the advantage that it allows us to analyze the principal parts of the language based solely on the exponences in the paradigm chart. We have taken advantage of that ability elsewhere to characterize languages based on properties of their principal parts (Finkel 2007a). For example, Latin, along with Sanskrit, but in strong contradistinction to Comaltepec Chinantec, has a very orthogonal set of principal parts: Each principal part tends to govern a disjoint set of MPSs. As Latin developed into the Romance languages, its scheme of conjugations and principal parts evolved. Initial investigation of French, following the same implicative approach as that shown here for Latin, shows that 9 static principal parts are needed to distinguish the 67 distinct conjugations. Ignoring spelling and considering only pronunciation, we have been able to reduce this total to 7 principal parts distinguishing 35 conjugations. This investigation continues.

ACKNOWLEDGMENTS

We would like to thank Lei Shen and Suresh Thesayi, who were instrumental in implementing our Java™ version of KATR. Nancy Snoke assisted in implementing our Perl/Prolog version. This work was partially supported by the National Science Foundation under Grants IIS-0097278 and IIS-0325063 and by the University of Kentucky Center for Computational Science. Any opinions, findings, conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the funding agencies.

GLOSSARY

Cell: A position in a full table of word forms, where the row is the inflection class (such as conjugation 1), and the column represents a set of morphosyntactic properties (such as first person singular present indicative active).

Desinence: An inflectional ending, usually added to a stem according to its syntactic context. For example, amat "he/she loves" has the desinence -t.

Diacritic: A marker of a particular morphophonological property. For example, the fact that a verb is in conjugation 4 is a diacritic.

Exponence: The contents of a cell for a given lexeme, such as amat.

Morphophoneme: A phonological unit whose phonemic expression depends on its context. For example, in our Latin KATR theory, we use I as the phonological unit in conjugation 3 (i-stem) that is either expressed as the phoneme i (as in capiō "I grab"), as the phoneme e (before r, as in capere "to grab"), or disappears entirely (such as before ī, as in cēpī "I have grabbed").

Node: A set of rules in a KATR theory to which a query is directed. The particular rule to apply depends on the query. Some nodes refer to others, leading to a hierarchical node structure.

Sandhi: Rules of euphony or spelling. For example, ēō is pronounced eō, as in videō "I see," and cs is spelled x, as in dūxī "I have led."

BIBLIOGRAPHY

Anderson 1992
  Anderson, S. R. 1992 Amorphous morphology, Cambridge University Press.
Blevins 2005
  Blevins, J. P. 2005 "Word-based Declensions in Estonian," in G. Booij & J. van Marle (eds.), Yearbook of Morphology 2005, Springer, Dordrecht, pp. 1–25.
Blevins 2006
  Blevins, J. P. 2006 "Word-based Morphology," Journal of Linguistics 42: 531–573.
Corbett 1993
  Corbett, G. G. & Fraser, N. M. 1993 "Network Morphology: A DATR Account of Russian Nominal Inflection," Journal of Linguistics 29: 113–142.
Evans 1989
  Evans, R. & Gazdar, G. 1989 "Inference in DATR," Proceedings of the Fourth Conference of the European Chapter of the Association for Computational Linguistics, Manchester, pp. 66–71.
Finkel 2002
  Finkel, R., Shen, L., Stump, G. & Thesayi, S. 2002 "KATR: A Set-based Extension of DATR," Technical Report 346-02, University of Kentucky Department of Computer Science, Lexington, KY. ftp://ftp.cs.uky.edu/cs/techreports/346-02.pdf.
Finkel 2007a
  Finkel, R. A. & Stump, G. T. 2007a "Principal Parts and Morphological Typology," Morphology 17: 39–75.
Finkel 2007b
  Finkel, R. & Stump, G. 2007b "A Default Inheritance Hierarchy for Computing Hebrew Verb Morphology," Literary and Linguistic Computing 22(2): 117–136. dx.doi.org/10.1093/llc/fqm004.
Finkel 2008
  Finkel, R. & Stump, G. 2008 "Principal Parts and Degrees of Paradigmatic Transparency," in J. P. Blevins & J. Blevins (eds.), Analogy in Grammar: Form and Acquisition, Oxford University Press, Oxford.
Huffman 2007
  Huffman coding. 2007 http://en.wikipedia.org/wiki/Huffman_coding
Matthews 1972
  Matthews, P. H. 1972 Inflectional Morphology, Cambridge University Press.
Stump 2001
  Stump, G. T. 2001 Inflectional Morphology, Cambridge University Press, Cambridge, England.
Zwicky 1985
  Zwicky, A. M. 1985 "How to Describe Inflection," Proceedings of the 11th Annual Meeting of the Berkeley Linguistics Society, pp. 372–386.

COMPUTATIONAL LINGUISTICS AND CLASSICAL LEXICOGRAPHY

DAVID BAMMAN
TUFTS UNIVERSITY

GREGORY CRANE
TUFTS UNIVERSITY

ABSTRACT

Manual lexicography has produced extraordinary results for Greek and Latin, but it cannot in the immediate future provide for all texts the same level of coverage available for the most heavily studied materials. As we build a cyberinfrastructure for Classics in the future, we must explore the role that automatic methods can play within it. Using technologies inherited from the disciplines of computational linguistics and computer science, we can create a complement to these traditional reference works – a dynamic lexicon that presents statistical information about a word’s usage in context, including information about its sense distribution within various authors, genres and eras, and syntactic information as well.

...Great advances have been made in the sciences on which lexicography depends. Minute research in manuscript authorities has largely restored the texts of the classical writers, and even their orthography. Philology has traced the growth and history of thousands of words, and revealed meanings and shades of meaning which were long unknown. Syntax has been subjected to a profounder analysis. The history of ancient nations, the private life of the citizens, the thoughts and beliefs of their writers have been closely scrutinized in the light of accumulating information. Thus the student of to-day may justly demand of his Dictionary far more than the scholarship of thirty years ago could furnish. (Advertisement for the Lewis & Short Latin Dictionary, March 1, 1879.)

The “scholarship of thirty years ago” that Lewis and Short here distance themselves from is Andrews’ 1850 Latin-English lexicon, itself largely a translation of Freund’s German Wörterbuch published only a decade before. As we design a cyberinfrastructure to support Classical Studies in the future, we will soon cross a similar milestone: the Oxford Latin Dictionary (1968-1982) has begun the slow process of becoming thirty years old (several of the earlier fascicles have already done so) and by 2012 the eclipse will be complete. Founded on the same lexicographic principles that produced the juggernaut Oxford English Dictionary, the OLD is a testament to the extraordinary results that rigorous manual labor can provide. It has, along with the Thesaurus Linguae Latinae, provided extremely thorough coverage for the texts of the Golden and Silver Age in Latin literature and has driven modern scholarship for the past thirty years. Manual methods, however, cannot in the immediate future provide for all texts the same level of coverage available for the most heavily studied materials, and as we think toward Classics in the next ten years, we must think not only of desiderata, but also of the means that would get us there.

Like Lewis and Short, we can also say that great advances have been made over the past thirty years in the sciences underlying lexicography; but the “sciences” that we group in that statement include not only the traditional fields of paleography, philology, syntax and history, but computational linguistics and computer science as well. Lexicographers have long used computers as an aid in dictionary production, but the recent rise of statistical language processing now lets us do far more: instead of using computers to simply expedite our largely manual labor, we can now use them to uncover knowledge that would otherwise lie hidden in expanses of text. Digital methods also let us deal well with scale. For instance, while the OLD focused on a canon of Classical authors that ends around the second century CE, Latin continued to be a productive language for the ensuing two millennia, with prolific writers in the Middle Ages, Renaissance and beyond. The Index Thomisticus (Busa 1974-1980) alone contains 10.6 million words attributed to Thomas Aquinas and related authors, which is by itself larger than the entire corpus of extant classical Latin.[1] Many handcrafted lexica exist for this period, from the scale of individual authors (cf. Ludwig Schütz’ 1895 Thomas-Lexikon) to entire periods (e.g., J. F. Niermeyer’s 1976 Mediae Latinitatis Lexicon Minus), but we can still do more: we can create a dynamic lexicon that can change and grow when fed with new texts, and that can present much more information about a word than reference works bound by the conventions of the printed page.

In deciding how we want to design a cyberinfrastructure for Classics over the next ten years, there is an important question that lurks between “where are we now?” and “where do we want to be?”: where are our colleagues already? Computational linguistics and natural language processing generally perform best in high-resource languages – languages like English, on which computational research has been focusing for over sixty years, and for which expensive resources (such as treebanks, ontologies and large, curated corpora) have long been developed. Many of the tools we would want in the future are founded on technologies that already exist for English and other languages; our task in designing a cyberinfrastructure may simply be to transfer and customize them for Classical Studies. Classics has arguably the most well-curated collection of texts in the world, and the uses its scholars demand from that collection are unique.
In the following I will document the technologies available to us in creating a new kind of reference work for the future – one that complements the traditional lexicography exemplified by the OLD and the TLL and lets scholars interact with their texts in new and exciting ways.

[1] The Bibliotheca Teubneriana BTL-1 collection, for instance, contains 6.6 million words, covering Latin literature up to the second century CE. For a recent overview of the Index Thomisticus, including the corpus size and composition, see Busa (2004).

WHERE ARE WE NOW?

In answering this question, I am mainly concerned with two issues: the production of reference works (i.e., the act of lexicography) and the use that scholars make of them. All of the reference works available in Classics are the products of manual labor, in which highly skilled individuals find examples of a word in context, cluster those examples into distinguishable “senses,” and label those senses with a word or phrase in another language (like English) or in the source language (as with the TLL). In the past thirty years, computers have allowed this process to be significantly expedited, even in such simple ways as textual searching. Rather than relying on a vast network of volunteer readers to read through scores of books and write down “apt” sentences as they come across them (as with the OED), we can simply search our electronic corpora, find all examples of a word in context, and winnow through them sequentially to find those that most clearly illuminate the meaning of any given sense. This approach has been exploited most recently by the Greek Lexicon Project[2] at the University of Cambridge, which has been developing a New Greek Lexicon since 1998 using a large database of electronically compiled slips (with a target completion date of 2010). Here the act of lexicography is still very manual, as each dictionary sense is still heavily curated, but the tedious job of citation collection is not. We can contrast this computer-assisted lexicography with a new variety – which we might more properly call “computational lexicography” – that has emerged with the COBUILD project (Sinclair 1987) of the late 1980s. The COBUILD English Language Dictionary (1987) is a learner’s dictionary centered around a word’s use in context, and is created from an analysis of an evolving English

[2] See http://people.pwf.cam.ac.uk/blf10/GLP/Greek_Lexicon_Project.htm.

textual corpus (the Bank of English, on which current editions of the COBUILD dictionary are based, was officially launched in 1991 and now includes 524 million words[3]). This corpus evidence allows lexicographers to include frequency information as part of a word’s entry (helping learners concentrate on common words) and also to include sentences from the corpus that demonstrate a word’s common collocations – the words and phrases that it frequently appears with. By keeping the underlying corpus up to date, the editors are also able to add new headwords as they appear in the language, and common multi-word expressions and idioms (such as bear fruit) can be uncovered as well. This corpus-based approach has since been augmented in two dimensions. On the one hand, dictionaries and lexicographic resources are being built on larger and larger textual collections: the German elexiko project (Klosa et al. 2004), for instance, is built on a modern German corpus of 1.3 billion words, and we can expect much larger projects in the future as the web is exploited as a corpus.[4] At the same time, researchers are also subjecting their corpora to more complex automatic processes to extract more knowledge from them. While word frequency and collocation analysis is fundamentally a task of simple counting, projects such as Kilgarriff’s Sketch Engine (Kilgarriff et al. 2004) also enable lexicographers to induce information about a word’s grammatical behavior. In their ability to include statistical information about a word’s actual use, these contemporary projects are exploiting advances in computational linguistics that have been made over the past thirty years. Before turning, however, to how we can adapt these technologies in the creation of a new and complementary reference work, we must first address the use of such lexica. Like the OED, Classical lexica generally include a list of citations under each headword, providing testimony by real authors for

[3] See http://www.collins.co.uk/books.aspx?group=153.
[4] In 2006, for example, Google released the first version of its Web 1T 5-gram corpus (Brants and Franz 2006) – a collection of n-grams (n=1-5) and their frequencies calculated from 1 trillion words of text on the web.

each sense. Of necessity, these citations are usually only exemplary selections, though the TLL provides comprehensive listings by Classical authors for many of its lemmata. These citations essentially function as an index into the textual collection. If I am interested in the places in Classical literature where the verb libero means to acquit, I can consult the OLD and then turn to the source texts it cites: Cic. Ver. 1.72, Plin. Nat. 6.90, etc. For a more comprehensive (but not exhaustive) comparison, I can consult the TLL. This is what we might consider a manual form of “lemmatized searching.” The Perseus Digital Library[5] and the Thesaurus Linguae Graecae[6] both provide a form of lemmatized searching for their respective texts, but it is a fuzzier variety than that presented here: a user can search for a word form such as edo (to eat) and simultaneously search the texts for all of its various inflections, but ambiguity is rampant – a lemmatized search for edo would also search for est, which is also an inflection of the far more common sum (to be). The search results are thus significantly diluted by a large number of false positives. The advantage of the Perseus and TLG lemmatized search is that it gives scholars the opportunity to find all the instances of a given word form or lemma in the textual collections they each contain. The TLL may be built on a comprehensive collection of 10 million slips containing all of Latin literature up to 200 CE and selections beyond, but that complete collection can only be found housed in their archives; what we have in print and on CD-ROM is still only a sample. The TLL, however, is impeccable in precision, while the Perseus and TLG results are dirty. What we need is a resource to combine the best of both.
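The dilution caused by ambiguous forms can be made concrete with a toy example. The sketch below is an illustration only and does not describe how Perseus or the TLG implement their searches: the form lists are small, hand-typed and without macrons, the function names are hypothetical, and the five annotated tokens are invented to show how precision would be measured against hand-disambiguated data.

```python
# Tiny hand-typed form lists; a real system would use a morphological analyzer.
FORMS = {
    "edo": {"edo", "edis", "es", "edit", "est", "edimus", "editis", "estis", "edunt"},
    "sum": {"sum", "es", "est", "sumus", "estis", "sunt"},
}


def lemmatized_search(lemma, tokens):
    """Return positions of tokens whose form could belong to the lemma."""
    return [i for i, form in enumerate(tokens) if form.lower() in FORMS[lemma]]


def precision(lemma, annotated):
    """Share of search hits whose hand-assigned lemma really is the one sought.

    annotated is a list of (form, gold lemma) pairs.
    """
    hits = lemmatized_search(lemma, [form for form, _ in annotated])
    if not hits:
        return 0.0
    correct = sum(1 for i in hits if annotated[i][1] == lemma)
    return correct / len(hits)


# Invented sample: each token paired with the lemma a human reader would assign.
SAMPLE = [("est", "sum"), ("edunt", "edo"), ("est", "sum"), ("est", "edo"), ("sunt", "sum")]

print(lemmatized_search("edo", [form for form, _ in SAMPLE]))
print(precision("edo", SAMPLE))
```

In this sample a search for edo returns four hits, but only two of them are really forms of edo; the other two are instances of est from sum, which is the kind of false positive described above.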

WHERE DO WE WANT TO BE?

The OLD and TLL are not likely to become obsolete anytime soon; as the products of highly skilled editors and over a century of labor, the sense distinctions within them are highly precise and well

[5] See http://www.perseus.tufts.edu/hopper/.
[6] See http://www.tlg.uci.edu/.

substantiated. What we can provide in the near future, however, is a complement to these resources, one that presents statistics about a word’s actual usage in texts – and not only in texts from the Classical period, but from any era for which we have electronic corpora. Heavily curated reference works provide great detail for a small set of texts; our complement is to provide lesser detail for all texts. In order to accomplish this, we need to consider the role that automatic methods can play within our emerging cyberinfrastructure. I distinguish cyberinfrastructure from the vast corpora that exist for modern languages not only in the structure imposed upon the texts that comprise it, but also in the very composition of those texts: while modern reference corpora are typically of little interest in themselves (as mainly newswire), Classical texts have been the focus of scholars’ attention for millennia. The meaning of the word child in a single sentence from the Wall Street Journal is hardly a research question worth asking, except for the newspaper’s significance in being representative of the language at large; but this same question when asked of Vergil’s fourth Eclogue has been at the center of scholarly debate since the time of the emperor Constantine.[7] We need to provide traditional scholars with the apparatus necessary to facilitate their own textual research. This will be true of a cyberinfrastructure for any historical culture, and for any future structure that develops for modern scholarly corpora as well. We therefore must concentrate on two problems. First, how much can we automatically learn from a large textual collection using machine learning techniques that thrive on large corpora? And second, how can the vast labor already invested in handcrafted lexica help those techniques to learn? What we can learn from such a corpus is actually quite significant.
With a large bilingual corpus, we can induce a word sense inventory to establish a baseline for how frequently certain definitions of a word are manifested in actual use; we can also use the context surrounding each word to establish which particular definition is meant in any given instance. With the help of a treebank (a

[7] See (Bourne 1916) for an overview of puer in Ec. IV.

handcrafted collection of syntactically parsed sentences), we can train an automatic parser to parse the sentences in a monolingual corpus and extract information about a word’s subcategorization frames (the common syntactic arguments it appears with – for instance, that the verb dono (to give) requires a subject, direct object and indirect object), and selectional preferences (e.g., that the subject of the verb amo (to love) is typically animate). With clustering techniques, we can establish the semantic similarity between two words based on their appearance in similar contexts. If we leverage all of these techniques to create a lexicon for both Latin and Greek, the lexical entries in each reference work could include the following:

• a list of possible senses, weighted according to their probability;
• a list of instances of each sense in the source texts;
• a list of common subcategorization frames, weighted according to their probability; and
• a list of selectional preferences, weighted according to their probability.

In creating a lexicon with these features, we are exploring two strengths of automated methods: they can not only analyze very large bodies of data but also provide customized analysis for particular texts or collections. We can thus not only identify patterns in one hundred and fifty million words of later Latin but also compare which senses of which words appear in the one hundred and fifty thousand words of Thucydides. Figure 1 presents a mock-up of what a dictionary entry could look like in such a dynamic reference work. The first section (“Translation equivalents”) presents items 1 and 2 from the list, and is reminiscent of traditional lexica for classical languages: a list of possible definitions is provided along with examples of use.
The main difference between a dynamic lexicon and those print lexica, however, lies in the scope of the examples: while print lexica select one or several highly illustrative examples of usage from a source text, we are in a position to present far more.
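The four components of a lexical entry listed above can be pictured as a small data structure. The following is only a minimal sketch: the class and field names are hypothetical, and the probabilities in the example are invented placeholders rather than values induced from any corpus.

```python
from dataclasses import dataclass, field


@dataclass
class Sense:
    """One translation equivalent with its estimated probability."""
    gloss: str
    probability: float
    # Canonical citations of passages where this sense occurs (left empty here).
    citations: list = field(default_factory=list)


@dataclass
class LexiconEntry:
    """Hypothetical container for one headword in a dynamic lexicon."""
    headword: str
    senses: list = field(default_factory=list)
    # Subcategorization frame label -> probability, e.g. "SBJ OBJ" -> 0.6.
    subcat_frames: dict = field(default_factory=dict)
    # Argument slot -> {filler lemma -> probability}.
    selectional_prefs: dict = field(default_factory=dict)

    def ranked_senses(self):
        """Return the senses ordered from most to least probable."""
        return sorted(self.senses, key=lambda s: s.probability, reverse=True)


# Placeholder weights, for illustration only.
entry = LexiconEntry(
    headword="oratio",
    senses=[Sense("prayer", 0.3), Sense("speech", 0.7)],
)
print([s.gloss for s in entry.ranked_senses()])
```

Because the weights are stored rather than hard-coded into prose, the same entry can be regenerated whenever the underlying corpus or sub-corpus changes.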

BAMMAN AND CRANE

Figure 1. Mock-up of a sample entry in a dynamic lexicon


HOW DO WE GET THERE?

We have already begun work on a dynamic lexicon like that shown in Figure 1 (Bamman and Crane 2008). Our approach is to use already established methods in natural language processing; as such, our methodology involves the application of three core technologies:
• identifying word senses from parallel texts;
• locating the correct sense for a word using contextual information; and
• parsing a text to extract important syntactic information.

Each of these technologies has a long history of development both within the Perseus Project and in the natural language processing community at large. In the following I will detail how we can leverage them all to uncover large-scale usage patterns in a text.

Word Sense Induction

Our work on building a Latin sense inventory from a small collection of parallel texts in our digital library is based on that of Brown et al. 1991 and Gale et al. 1992, who suggest that one way of objectively detecting the real senses of any given word is to analyze its translations: if a word is translated as two semantically distinct terms in another language, we have prima facie evidence that there is a real sense distinction. So, for example, the Greek word archê may be translated in one context as beginning and in another as empire, corresponding respectively to LSJ definitions I.1 and II.2.

Finding all of the translation equivalents for any given word then becomes a task of aligning the source text with its translations, at the level of individual words. The Perseus Digital Library contains at least one English translation for most of its Latin and Greek prose and poetry source texts. Many of these translations are encoded under the same canonical citation scheme as their source, but must further be aligned at the sentence and word level before individual word translation probabilities can be calculated. The workflow for this process is shown in Figure 2.


Figure 2. Alignment workflow

Since the XML files of both the source text and its translations are marked up with the same reference points, “chapter 1, section 1” of Tacitus' Annales is automatically aligned with its English translation (step 1). This results (for Latin at least) in aligned chunks of text that are 217 words long. These chunks are then aligned on a sentence level in step 2 using Moore’s Bilingual Sentence Aligner (Moore 2002), which aligns sentences that are 1-1 translations of each other with a very high precision (98.5% for a corpus of 10,000 English-Hindi sentence pairs (Singh and Husain 2005)). In step 3, we then align these 1-1 sentences at the word level using GIZA++ (Och and Ney 2003). Prior to alignment, all of the tokens in the source text and translation are lemmatized, so that each word is replaced with all of the lemmas from which it can be inflected (for example, the Latin word est is replaced with sum1 edo1 and the English word is is replaced with be). This word alignment is performed in both directions in order to discover multi-word expressions (MWEs) in the source language.
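The lemmatization step that precedes word alignment can be sketched as follows. This is an illustration under stated assumptions: the lookup table `LEMMAS` is a hypothetical stand-in for a real morphological analyzer, and keeping unknown tokens unchanged is a choice made for this sketch, not something the workflow above specifies.

```python
# Hypothetical lookup table: inflected form -> every lemma it could come from.
LEMMAS = {
    "est": ["sum1", "edo1"],
    "esse": ["sum1", "edo1"],
    "is": ["be"],
}


def lemmatize(tokens, table):
    """Replace each token with the list of all its candidate lemmas.

    Tokens missing from the table are kept as they are (an assumption
    of this sketch).
    """
    return [table.get(tok.lower(), [tok.lower()]) for tok in tokens]


def flatten(lemmatized):
    """Join the candidate lemmas into the string passed on to the aligner."""
    return " ".join(lemma for group in lemmatized for lemma in group)


print(flatten(lemmatize(["est"], LEMMAS)))
print(flatten(lemmatize(["is"], LEMMAS)))
```

Keeping every candidate lemma, rather than guessing one, defers the choice between them to the alignment step.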


Figure 3. Sample word alignment from GIZA++

Figure 3 shows the result of this word alignment (here with English as the source language). The original, pre-lemmatized Latin is salvum tu me esse cupisti (Cicero, Pro Plancio, chapter 33). The original English is you wished me to be safe. As a result of the lemmatization process, many source words are mapped to multiple words in the target – most often to lemmas which share a common inflection. For instance, during lemmatization, the Latin word esse is replaced with the two lemmas from which it can be derived – sum1 (to be) and edo1 (to eat). If the word alignment process maps the source word be to both of these lemmas in a given sentence (as in Figure 3), the translation probability is divided evenly between them. From these alignments we can calculate overall translation probabilities, which we currently present as an ordered list, as in Figure 4.
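The counting behind such an ordered list can be sketched in a few lines. This is a minimal illustration, assuming the aligner's output has already been reduced to (source lemma, list of aligned target lemmas) pairs; the function name and the toy links are hypothetical, and the counts are invented for the example.

```python
from collections import Counter, defaultdict


def translation_probabilities(links):
    """Turn word-alignment links into ranked translation probabilities.

    links: iterable of (source_lemma, [target_lemmas]), one per aligned
    source token. When a source token is linked to several target lemmas,
    its single count is divided evenly between them.
    """
    counts = defaultdict(Counter)
    for source, targets in links:
        if not targets:
            continue  # unaligned source token: nothing to count
        share = 1.0 / len(targets)
        for target in targets:
            counts[source][target] += share

    table = {}
    for source, target_counts in counts.items():
        total = sum(target_counts.values())
        table[source] = sorted(
            ((target, count / total) for target, count in target_counts.items()),
            key=lambda pair: pair[1],
            reverse=True,
        )
    return table


# Toy links, invented for illustration.
links = [
    ("oratio", ["speech"]),
    ("oratio", ["speech"]),
    ("oratio", ["speech"]),
    ("oratio", ["prayer"]),
    ("be", ["sum1", "edo1"]),
]
table = translation_probabilities(links)
print(table["oratio"])
print(table["be"])
```

Restricting `links` to the alignments from a chosen set of texts yields a sense inventory for that sub-corpus alone.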

Figure 4. Sense inventory for oratio induced from parallel texts

The weighted list of translation equivalents we identify using this technique can provide the foundation for our further lexical work. In the example above, we have induced from our collection of parallel texts that the headword oratio is primarily used with two senses: speech and prayer. The granularity of the definitions in such a dynamic lexicon cannot approach that of human labor: the Lewis and Short Latin Dictionary, for instance, enumerates fourteen subsenses in varying degrees of granularity, from “speech” to “formal language” to the “power of oratory” and beyond. Our approach, however, does have two clear advantages which complement those of traditional lexica: first, this method allows us to include statistics about actual word usage in the corpus we derive it from. The use of oratio to signify prayer is not common in classical Latin, but since the corpus we induced this inventory from is largely composed of the Vulgate of Jerome, we are also able to mine this use of the word and include it in this list as well. Since the lexicon is dynamic, we can generate a sense inventory for an entire corpus or any part of it – so that if we were interested, for instance, in the use of oratio only until the second century CE, we could exclude the texts of Jerome from our analysis. And since we can run our word alignment at any time, we are always in a position to update the lexicon with the addition of new texts.

Figure 5. Sense inventory for the multi-word expression res publica induced from parallel texts

Second, our word alignment also maps multi-word expressions, so we can include significant collocations in our lexicon as well. This allows us to provide translation equivalents for idioms and common phrases such as res publica (republic) or gratias ago (to give thanks).

Word Sense Disambiguation

Approaches to word sense disambiguation generally come in three varieties:
• knowledge-based methods (Lesk 1986, Banerjee and Pedersen 2002), which rely on existing reference works with a clear structure such as dictionaries and Wordnets (Miller 1995);
• supervised corpus methods (Grozea 2004), which train a classifier on a human-annotated sense corpus such as Semcor (Miller et al. 1993) or any of the SENSEVAL competition corpora (Mihalcea and Edmonds 2004); and
• unsupervised corpus methods, which train classifiers on “raw,” unannotated text, either a monolingual corpus (McCarthy et al. 2004) or parallel texts (Brown et al. 1991, Tufis et al. 2004).

Corpus methods (especially supervised methods) generally perform best in the SENSEVAL competitions – at SENSEVAL-3, the best system achieved an accuracy of 72.9% in the English lexical sample task and 65.1% in the English all-words task.8 Manually annotated corpora, however, are generally cost-prohibitive to create, and the problem is especially acute for sense-tagged corpora, for which the human inter-annotator agreement is often low.

8 At the time of writing, the SEMEVAL-1/SENSEVAL-4 (2007) competition is underway.

Since the Perseus Digital Library contains two large monolingual corpora (the canon of Greek and Latin classical texts) and sizable parallel corpora as well, we have investigated using parallel texts for word sense disambiguation. This method uses the same techniques we used to create a sense inventory to disambiguate words in context. After we have a list of possible translation equivalents for a word, we can use the surrounding Latin or Greek context as an indicator for which sense is meant in texts where we have no corresponding translation. There are several techniques available for deciding which sense is most appropriate given the context, and several different measures for what definition of “context” is most appropriate itself. One technique that we have experimented with is a naive Bayesian classifier (following Gale et al. 1992), with context defined as a sentence-level bag of words (all of the words in the sentence containing the word to be disambiguated contribute equally to its disambiguation).

Bayesian classification is most commonly found in spam filtering. A filtering program can decide whether or not any given email message is spam by looking at the words that comprise it and comparing it to other messages that are already known to be spam – some words generally only appear in spam messages (e.g., viagra, refinance, opt-out, shocking), while others only appear in non-spam messages (archê, subcategorization), and some appear equally in both (and, your). By counting each word and the class (spam/not spam) it appears in, we can assign it a probability that it falls into one class or the other.

We can also use this principle to disambiguate word senses by building a classifier for every sense and training it on sentences where we do know the correct sense for a word. Just as a spam filter is trained by a user explicitly labeling a message as spam, this classifier can be trained simply by the presence of an aligned translation. For instance, the Latin word spiritus has several senses, including spirit and wind. In our texts, when spiritus is translated as wind, it is accompanied by words like mons (mountain), ala (wing) or ventus (wind). When it is translated as spirit, its context has (more naturally) a religious tone, including words such as sanctus (holy) and omnipotens (all-powerful).
If we are confronted with an instance of spiritus in a sentence for which we have no translation, we can disambiguate it as either spirit or wind by looking at its context in the original Latin.
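A naive Bayesian classifier of this kind can be sketched compactly. The version below is an illustration rather than the system described here: the class name, the add-one smoothing, the decision to ignore context words never seen in training, and the four toy training sentences are all assumptions made for the example.

```python
import math
from collections import Counter, defaultdict


class NaiveBayesSenseClassifier:
    """Bag-of-words naive Bayes over the lemmas of the surrounding sentence."""

    def __init__(self, smoothing=1.0):
        self.smoothing = smoothing              # add-one smoothing by default
        self.sense_counts = Counter()           # sense -> number of training sentences
        self.word_counts = defaultdict(Counter) # sense -> context lemma -> count
        self.vocab = set()

    def train(self, examples):
        """examples: iterable of (sense, context_lemmas); the sense label
        would come from the aligned translation."""
        for sense, context in examples:
            self.sense_counts[sense] += 1
            for lemma in context:
                self.word_counts[sense][lemma] += 1
                self.vocab.add(lemma)

    def classify(self, context):
        """Return the sense with the highest log posterior for this context."""
        total = sum(self.sense_counts.values())
        best_sense, best_score = None, float("-inf")
        for sense, n in self.sense_counts.items():
            score = math.log(n / total)  # log prior
            denom = (sum(self.word_counts[sense].values())
                     + self.smoothing * len(self.vocab))
            for lemma in context:
                if lemma not in self.vocab:
                    continue  # unseen words carry no evidence in this sketch
                score += math.log(
                    (self.word_counts[sense][lemma] + self.smoothing) / denom)
            if score > best_score:
                best_sense, best_score = sense, score
        return best_sense


# Toy training data built from the context words mentioned above.
clf = NaiveBayesSenseClassifier()
clf.train([
    ("wind", ["mons", "ventus"]),
    ("wind", ["ala", "ventus"]),
    ("spirit", ["sanctus", "omnipotens"]),
    ("spirit", ["sanctus", "testis"]),
])
print(clf.classify(["mons", "ala"]))
print(clf.classify(["sanctus"]))
```

Working in log space avoids numerical underflow when a sentence contains many context words.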


Latin context word    English translation    Probability of accompanying spiritus = wind
Mons                  Mountain               98.3%
Commotio              Commotion              98.3%
Ventus                Wind                   95.2%
Ala                   Wing                   95.2%

Table 1. Latin contextual probabilities where spiritus = wind.

Latin context word    English translation    Probability of accompanying spiritus = spirit
Sanctus               Holy                   99.9%
Testis                Witness                99.9%
Vivifico              Make alive             99.9%
Omnipotens            All-powerful           99.9%

Table 2. Latin contextual probabilities where spiritus = spirit.

Word sense disambiguation will be most helpful for the construction of a lexicon when we are attempting to determine the sense for words in context for the large body of later Latin literature for which there exists no English translation. By training a classifier on texts for which we do have translations, we will be able to determine the sense in texts for which we don’t: if the context of spiritus in a late Latin text includes words such as mons and ala, we can use the probabilities we induced from parallel texts to know with some degree of certainty that it refers to wind rather than spirit. This will enable us to include these later texts in our statistics on a word’s usage, and link these passages to the definition as well.

Parsing

Two of the features we would like to incorporate into a dynamic lexicon are based on a word’s role in syntax: subcategorization and selectional preference. A verb’s subcategorization frame is the set of possible combinations of surface syntactic arguments it can appear with. In linear, unlabeled phrase structure grammars, these frames take the form of, for example, NP PP (requiring a direct object + prepositional phrase, as in I gave a book to John) or NP NP (requiring two objects, as in I gave John a book). In a labeled dependency grammar, we can express a verb’s subcategorization as a combination of syntactic roles (e.g., OBJ OBJ).

A predicate’s selectional preference specifies the type of argument it generally appears with. The verb to eat, for example, typically requires its object to be a thing that can be eaten and its subject to have animacy, unless used metaphorically. Selectional preference, however, can also be much more detailed, reflecting not only a word class (such as animate or human), but also individual words themselves. For instance, the kind of arguments used with the Latin verb libero (to free) are very different in Cicero and Jerome: Cicero, as an orator of the republic, commonly uses it to speak of liberation from periculum (danger), metus (fear), cura (care) and aes alienum (debt); Jerome, on the other hand, uses it to speak of liberation from a very different set of things, such as manus Aegyptorum (the hand of the Egyptians), os leonis (the mouth of the lion), and mors (death).9 These are syntactic qualities since each of these arguments bears a direct syntactic relation to its head as much as it holds a semantic place within the underlying argument structure.

In order to extract this kind of subcategorization and selectional information from unstructured text, we first need to impose syntactic order on it. One option for imposing this kind of order is through manual annotation, but this option is not feasible here due to the sheer volume of data involved – even the best-resourced of such endeavors (such as the Penn Treebank (Marcus et al. 1994) or the Prague Dependency Treebank (Hajič 1999)) take years to complete. A second, more practical option is to assign syntactic structure to a sentence using automatic methods.
Great progress has been made in recent years in the area of syntactic parsing, both for phrase structure grammars (Charniak 2000, Collins 1999) and dependency grammars (Nivre et al. 2006, McDonald et al. 2005), with labeled dependency parsing achieving an accuracy rate approaching 90% for English (a high-resource, fixed word order language) and 80% for Czech (a relatively free word order language like Latin and Greek). Automatic parsing generally requires the presence of a treebank – a large collection of manually annotated sentences – and a treebank’s size directly correlates with parsing accuracy: the larger the treebank, the better the automatic analysis.

9 See (Bamman and Crane 2007) for a summary of this work.

We are currently in the process of creating a treebank for Latin, and have just begun work on a one-million-word treebank of Ancient Greek. Now in version 1.5, the Latin Dependency Treebank10 is composed of excerpts from eight texts, including Caesar, Cicero, Jerome, Ovid, Petronius, Propertius, Sallust and Vergil. Each sentence in the treebank has been manually annotated so that every word is assigned a syntactic relation, along with the lemma from which it is inflected and its morphological code (a composite of nine different morphological features: part of speech, person, number, tense, mood, voice, gender, case and degree). Based predominantly on the guidelines used for the Prague Dependency Treebank, our annotation style is also influenced by the Latin grammar of Pinkster (1990), and is founded on the principles of dependency grammar (Mel’čuk 1988).

10 See http://nlp.perseus.tufts.edu/syntax/treebank/.

Dependency grammars differ from phrase-structure grammars in that they forgo non-terminal phrasal categories and link words themselves to their immediate heads. This is an especially appropriate manner of representation for languages with a free word order (such as Latin and Czech), where the linear order of constituents is broken up with elements of other constituents. A dependency grammar representation, for example, of ista meam norit gloria canitiem (Propertius I.8.46) – “that glory would know my old age” – would look like the following:


Figure 6. Dependency grammar representation of ista meam norit gloria canitiem ("that glory would know my old age")
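A dependency analysis of this sentence can also be written down as a simple data structure, from which a verb's arguments and subcategorization frame are read off directly. The sketch below is illustrative only: the tuple layout, the function names and the relation labels (PRED, SBJ, OBJ, ATR, in the style of Prague-type annotation) are assumptions made for the example and may not match the treebank's actual file format or tagset.

```python
# Each token: (form, lemma, position of its head, relation to that head).
# Positions are 1-based; a head of 0 marks the root of the sentence.
SENTENCE = [
    ("ista", "iste", 4, "ATR"),          # "that", modifies gloria
    ("meam", "meus", 5, "ATR"),          # "my", modifies canitiem
    ("norit", "nosco", 0, "PRED"),       # "would know", the root
    ("gloria", "gloria", 3, "SBJ"),      # "glory", subject of norit
    ("canitiem", "canities", 3, "OBJ"),  # "old age", object of norit
]

# Relations treated as arguments in this sketch, in canonical order.
ARGUMENT_RELATIONS = ("SBJ", "OBJ")


def arguments_of(sentence, head_position):
    """Return (relation, lemma) for each argument attached to the given head."""
    found = []
    for relation in ARGUMENT_RELATIONS:
        for form, lemma, head, rel in sentence:
            if head == head_position and rel == relation:
                found.append((relation, lemma))
    return found


def subcat_frame(sentence, head_position):
    """Express the head's arguments as a frame label such as 'SBJ OBJ'."""
    return " ".join(rel for rel, _ in arguments_of(sentence, head_position))


print(arguments_of(SENTENCE, 3))
print(subcat_frame(SENTENCE, 3))
```

Note that ista and gloria are linked even though meam and norit stand between them, which is the kind of discontinuity a dependency representation handles without phrasal categories. Counting such frames and argument lemmas over many parsed sentences is what gives the weighted lists of frames and selectional preferences.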

While this treebank is still in its infancy, we can already use it to train a parser to parse the volumes of unstructured Latin in our collection. Our treebank is still too small to achieve state-of-the-art results in parsing, but we can nonetheless induce valuable lexical information from its output by using a large corpus and simple hypothesis testing techniques to outweigh the noise of the occasional error (Bamman and Crane 2008). The key to improving this parsing accuracy is to increase the size of the annotated treebank: the better the parser, the more accurate the syntactic information we can extract from our corpus.

BEYOND THE LEXICON

These technologies, borrowed from computational linguistics, will give us the grounding to create a new kind of lexicon, one that presents information about a word’s actual usage. This lexicon resembles its more traditional print counterparts in that it is a work designed to be browsed: one looks up an individual headword and then reads its lexical entry. The technologies that will build this reference work, however, do so by processing a large Greek and Latin textual corpus. The results of this automatic processing go far beyond the construction of a single lexicon.

I noted earlier that all scholarly dictionaries include a list of citations illustrating a word’s exemplary use. As Figure 1 shows, each entry in this new, dynamic lexicon ultimately ends with a list of canonical citations to fixed passages in the text. These citations are again a natural index to a corpus, but since they are based in an electronic medium, they provide the foundation for truly advanced methods of textual searching – going beyond a search for individual word form (as in typical search engines) to word sense.

Searching by word sense

Figure 7. Mock-up of a service to search Latin texts by English word sense

The ability to search a Latin or Greek text by an English translation equivalent is a close approximation to real cross-language information retrieval. Consider scholars researching Roman slavery: they could compare all passages where any number of Latin “slave” words appear, but this would lead to separate searches for servus, serva, ancilla, famulus, famula, minister, ministra, puer, puella etc. (and all of their inflections), plus many other less-common words. By searching for word sense, however, a scholar can simply search for slave and automatically be presented with all of the passages for which this translation equivalent applies. Figure 7 presents a mock-up of what such a service could look like.

Searching by word sense also allows us to investigate problems of changing orthography – both across authors and time: as Latin passes through the Middle Ages, for instance, the spelling of words changes dramatically even while meaning remains the same. So, for example, the diphthong ae is often reduced to e, and prevocalic ti is changed to ci. Even within a given time frame, spelling can vary, especially from poetry to prose. By allowing users to search for a sense rather than a specific word form, we can return all passages containing saeculum, saeclum, seculum and seclum – all valid forms for era. Additionally, we can automate this process to discover common words with multiple orthographic variations, and include these in our dynamic lexicon as well.

Searching by selectional preference

The ability to search by a predicate’s selectional preference is also a step toward semantic searching – the ability to search a text based on what it “means.” In building the lexicon, we automatically assign an argument structure to all of the verbs. Once this structure is in place, it can stay attached to our texts and thereby be searchable in the future, allowing us to search a text for the subjects and direct objects of any verb. Our scholar researching Roman slavery can use this information to search not only for passages where any slave has been freed (i.e., when any Latin variant of the English translation slave is the direct object of the active form of the verb libero), but also who was doing the freeing (who in such instances is the subject of that verb). This is a powerful resource that can give us much more information about a text than simple search engines currently allow.
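Both kinds of search can be sketched over a small parsed corpus. Everything in the example is hypothetical: the sense index, the function names, the tuple layout and the two toy Latin sentences are invented for illustration and are not drawn from the collection, and the check for the active voice of the verb is omitted for brevity.

```python
# Hypothetical sense index: English translation equivalent -> Latin lemmas.
SENSE_INDEX = {
    "slave": {"servus", "serva", "ancilla", "famulus", "famula"},
}

# Toy parsed corpus. Each token: (form, lemma, head position, relation);
# positions are 1-based and a head of 0 marks the root.
CORPUS = [
    [("dominus", "dominus", 3, "SBJ"),
     ("servum", "servus", 3, "OBJ"),
     ("liberat", "libero", 0, "PRED")],
    [("ancilla", "ancilla", 2, "SBJ"),
     ("cantat", "canto", 0, "PRED")],
]


def search_by_sense(corpus, sense, index=SENSE_INDEX):
    """Return the indices of sentences containing any lemma with this sense."""
    lemmas = index.get(sense, set())
    return [i for i, sentence in enumerate(corpus)
            if any(token[1] in lemmas for token in sentence)]


def search_by_argument(corpus, verb_lemma, object_sense, index=SENSE_INDEX):
    """Find clauses where the verb's direct object has the given sense.

    Returns (sentence index, subject lemmas, object lemma) for each match.
    """
    lemmas = index.get(object_sense, set())
    hits = []
    for i, sentence in enumerate(corpus):
        for position, (form, lemma, head, rel) in enumerate(sentence, start=1):
            if lemma != verb_lemma:
                continue
            dependents = [t for t in sentence if t[2] == position]
            subjects = [t[1] for t in dependents if t[3] == "SBJ"]
            for t in dependents:
                if t[3] == "OBJ" and t[1] in lemmas:
                    hits.append((i, subjects, t[1]))
    return hits


print(search_by_sense(CORPUS, "slave"))
print(search_by_argument(CORPUS, "libero", "slave"))
```

Because both searches operate on lemmas, spelling variants such as saeculum and seculum would be matched once they are mapped to the same lemma, without the user listing each form.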

CONCLUSION

Manual lexicography has produced fantastic results for Classical languages, but as we design a cyberinfrastructure for Classics in the future, our aim must be to build a scaffolding that is essentially enabling: it must not only make historical languages more accessible on a functional level, but intellectually as well; it must give students the resources they need to understand a text while also providing scholars the tools to interact with it in whatever ways they see fit. In this a dynamic lexicon fills a gap left by traditional reference works. By creating a lexicon directly from a corpus of texts and then situating it within that corpus itself, we can let the two interact in ways that traditional lexica cannot.

Even driven by the scholarship of the past thirty years, however, a dynamic lexicon cannot yet compete with the fine sense distinctions that traditional dictionaries make, and in this the two works are complementary. Classics, however, is only one field among many concerned with the technologies underlying lexicography, and by relying on the techniques of other disciplines like computational linguistics and computer science, we can count on the future progress of disciplines far outside our own.

BIBLIOGRAPHY

Andrews 1850 Andrews, E. A. (ed.) A Copious and Critical Latin-English Lexicon, Founded on the Larger Latin-German Lexicon of Dr. William Freund; With Additions and Corrections from the Lexicons of Gesner, Facciolati, Scheller, Georges, etc. New York: Harper & Bros., 1850.

Bamman and Crane 2007 Bamman, David and Gregory Crane. "The Latin Dependency Treebank in a Cultural Heritage Digital Library", Proceedings of the ACL Workshop on Language Technology for Cultural Heritage Data (2007).

Bamman and Crane 2008 Bamman, David and Gregory Crane. "Building a Dynamic Lexicon from a Digital Library", Proceedings of the 8th ACM/IEEE-CS Joint Conference on Digital Libraries (2008).

Banerjee and Pedersen 2002 Banerjee, Sid and Ted Pedersen. "An Adapted Lesk Algorithm for Word Sense Disambiguation Using WordNet", Proceedings of the Conference on Computational Linguistics and Intelligent Text Processing (2002).

Bourne 1916 Bourne, Ella. "The Messianic Prophecy in Vergil’s Fourth Eclogue", The Classical Journal 11.7 (1916).

Brants and Franz 2006 Brants, Thorsten and Alex Franz. Web 1T 5-gram Version 1. Philadelphia: Linguistic Data Consortium, 2006.

Brown et al. 1991 Brown, Peter F., Stephen A. Della Pietra, Vincent J. Della Pietra and Robert L. Mercer. "Word-sense disambiguation using statistical methods", Proceedings of the 29th Conference of the Association for Computational Linguistics (1991).

Busa 1974-1980 Busa, Roberto. Index Thomisticus: sancti Thomae Aquinatis operum omnium indices et concordantiae, in quibus verborum omnium et singulorum formae et lemmata cum suis frequentiis et contextibus variis modis referuntur quaeque / consociata plurium opera atque electronico IBM automato usus digessit Robertus Busa SI. Stuttgart-Bad Cannstatt: Frommann-Holzboog, 1974-1980.

Busa 2004 Busa, Roberto. "Foreword: Perspectives on the Digital Humanities", Blackwell Companion to Digital Humanities. Oxford: Blackwell, 2004.

Charniak 2000 Charniak, Eugene. "A Maximum-Entropy-Inspired Parser", Proceedings of NAACL (2000).

Collins 1999 Collins, Michael. "Head-Driven Statistical Models for Natural Language Parsing", Ph.D. thesis. Philadelphia: University of Pennsylvania, 1999.

Freund 1840 Freund, Wilhelm (ed.). Wörterbuch der lateinischen Sprache: nach historisch-genetischen Principien, mit steter Berücksichtigung der Grammatik, Synonymik und Alterthumskunde. Leipzig: Teubner, 1834-1840.

Gale et al. 1992 Gale, William, Kenneth W. Church and David Yarowsky. "Using bilingual materials to develop word sense disambiguation methods", Proceedings of the 4th International Conference on Theoretical and Methodological Issues in Machine Translation (1992).

Glare 1982 Glare, P. G. W. (ed.). Oxford Latin Dictionary. Oxford: Oxford University Press, 1968-1982.

Grozea 2004 Grozea, Christian. "Finding Optimal Parameter Settings for High Performance Word Sense Disambiguation", Proceedings of Senseval-3: Third International Workshop on the Evaluation of Systems for the Semantic Analysis of Text (2004).

Hajič 1999 Hajič, Jan. "Building a Syntactically Annotated Corpus: The Prague Dependency Treebank", Issues of Valency and Meaning. Studies in Honour of Jarmila Panevová. Prague: Charles University Press, 1999.

Kilgarriff et al. 2004 Kilgarriff, Adam, Pavel Rychly, Pavel Smrz, and David Tugwell. "The Sketch Engine", Proceedings of EURALEX (2004).

Klosa et al. 2004 Klosa, Annette, Ulrich Schnörch, and Petra Storjohann. "ELEXIKO – A Lexical and Lexicological, Corpus-based Hypertext Information System at the Institut für deutsche Sprache, Mannheim", Proceedings of the 12th Euralex International Congress (2006).

Lesk 1986 Lesk, Michael. "Automatic Sense Disambiguation Using Machine Readable Dictionaries: How to Tell a Pine Cone from an Ice Cream Cone", Proceedings of the ACM-SIGDOC Conference (1986).

Lewis and Short 1879 Lewis, Charles T. and Charles Short (eds.). A Latin Dictionary. Oxford: Clarendon Press, 1879.

Liddell and Scott 1940 Liddell, Henry George and Robert Scott (eds.). A Greek-English Lexicon, revised and augmented throughout by Sir Henry Stuart Jones. Oxford: Clarendon Press, 1940.

Marcus et al. 1994 Marcus, Mitchell P., Beatrice Santorini, and Mary Ann Marcinkiewicz. "Building a Large Annotated Corpus of English: The Penn Treebank", Computational Linguistics 19.2 (1994).

McCarthy et al. 2004 McCarthy, Diana, Rob Koeling, Julie Weeds and John Carroll. "Finding Predominant Senses in Untagged Text", Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics (2004).

McDonald et al. 2005 McDonald, Ryan, Fernando Pereira, Kiril Ribarov, and Jan Hajič. "Non-projective Dependency Parsing using Spanning Tree Algorithms", Proceedings of HLT/EMNLP (2005).

Mel’čuk 1988 Mel’čuk, Igor A. Dependency Syntax: Theory and Practice. Albany: State University of New York Press, 1988.

Mihalcea and Edmonds 2004 Mihalcea, Rada and Philip Edmonds (eds.). Proceedings of Senseval-3: Third International Workshop on the Evaluation of Systems for the Semantic Analysis of Text (2004).

Miller 1995 Miller, George. "Wordnet: A Lexical Database", Communications of the ACM 38.11 (1995).

Miller et al. 1993 Miller, George, Claudia Leacock, Randee Tengi, and Ross Bunker. "A Semantic Concordance", Proceedings of the ARPA Workshop on Human Language Technology (1993).

Moore 2002 Moore, Robert C. "Fast and Accurate Sentence Alignment of Bilingual Corpora", AMTA '02: Proceedings of the 5th Conference of the Association for Machine Translation in the Americas on Machine Translation (2002).

Niermeyer 1976 Niermeyer, Jan Frederick. Mediae Latinitatis Lexicon Minus. Leiden: Brill, 1976.

Nivre et al. 2006 Nivre, Joakim, Johan Hall, and Jens Nilsson. "MaltParser: A Data-Driven Parser-Generator for Dependency Parsing", Proceedings of the Fifth International Conference on Language Resources and Evaluation (2006).

Och and Ney 2003 Och, Franz Josef and Hermann Ney. "A Systematic Comparison of Various Statistical Alignment Models", Computational Linguistics 29.1 (2003).

Pinkster 1990 Pinkster, Harm. Latin Syntax and Semantics. London: Routledge, 1990.

Schütz 1895 Schütz, Ludwig. Thomas-Lexikon. Paderborn: F. Schoningh, 1895.

Sinclair 1987 Sinclair, John M. (ed.). Looking Up: an account of the COBUILD project in lexical computing. Collins, 1987.

Singh and Husain 2005 Singh, Anil Kumar and Samar Husain. "Comparison, Selection and Use of Sentence Alignment Algorithms for New Language Pairs", Proceedings of the ACL Workshop on Building and Using Parallel Texts (2005).

TLL Thesaurus Linguae Latinae, fourth electronic edition. Munich: K. G. Saur, 2006.

Tufis et al. 2004 Tufis, Dan, Radu Ion, and Nancy Ide. "Fine-Grained Word Sense Disambiguation Based on Parallel Corpora, Word Alignment, Word Clustering and Aligned Wordnets", Proceedings of the 20th International Conference on Computational Linguistics (2004).

CLASSICS IN THE MILLION BOOK LIBRARY

GREGORY CRANE
TUFTS UNIVERSITY
GREGORY.CRANE@TUFTS.EDU

ALISON BABEU
TUFTS UNIVERSITY
[email protected]

DAVID BAMMAN
TUFTS UNIVERSITY
[email protected]

THOMAS BREUEL
TECHNICAL UNIVERSITY OF KAISERSLAUTERN
[email protected]

LISA CERRATO
TUFTS UNIVERSITY
[email protected]

DANIEL DECKERS
HAMBURG UNIVERSITY
[email protected]

ANKE LÜDELING
HUMBOLDT-UNIVERSITY, BERLIN
[email protected]

DAVID MIMNO
UNIVERSITY OF MASSACHUSETTS, AMHERST
[email protected]

RASHMI SINGHAL
TUFTS UNIVERSITY
[email protected]

DAVID A. SMITH
UNIVERSITY OF MASSACHUSETTS, AMHERST
[email protected]

AMIR ZELDES
HUMBOLDT-UNIVERSITY, BERLIN
[email protected]

ABSTRACT

In October 2008, Google announced a settlement that will provide access to seven million scanned books while the number of books freely available under an open license from the Internet Archive exceeded one million. The collections and services that classicists have created over the past generation place them in a strategic position to exploit the potential of these collections. This paper concludes with research topics relevant to all humanists on converting page images to text, one language to another, and raw text into machine actionable data.

CRANE ET AL


INTRODUCTION

In a long span of time it is possible to see many things that you do not want to, and to suffer them, too. I set the limit of a man's life at seventy years; these seventy years have twenty-five thousand, two hundred days, leaving out the intercalary month. But if you make every other year longer by one month, so that the seasons agree opportunely, then there are thirty-five intercalary months during the seventy years, and from these months there are one thousand fifty days. Out of all these days in the seventy years, all twenty-six thousand, two hundred and fifty of them, not one brings anything at all like another. (Herodotus, Histories 1.32, tr. Godley)

In the first book of Herodotus’ Histories, the Athenian statesman Solon calculates that an average human life of seventy years contains roughly 25,000 days. If we could read a book every day of our lives, it would take a thousand years – almost forty generations – to work our way through one million books. It would take 10,000 years or four hundred generations to work through the ten million or so unique books that the original Google library partners contained in their collections.1 On October 28, 2008, Google announced an agreement with publishers that would allow libraries to provide, largely under a subscription basis, access through Google book search to some seven million books, including copyrighted materials.2 Google is providing immense scale but the scholarly significance is not so great as it might be: there is at present no way to understand what subset of the world’s knowledge that seven million volumes represents. Even if there were, scholars have no way of understanding in more than the most general way how the services that extract information from that collection work – what is missed? What biases are embedded in the system? Scholarship depends upon transparency, and we must be careful that we do

This number can be found in (Lavoie 2005). See http://googleblog.blogspot.com/2008/10/new-chapter-forgoogle-book-search.html. 1 2

326

CHANGING THE CENTER OF GRAVITY

not, in pursuing our immediate research projects, compromise our fundamental commitment to transparency. A day before the momentous Google announcement, another and arguably even more important milestone was crossed. On October 27, 2008, the number of books available from the Internet Archive exceeded 1,000,000. While the million books is only a fraction of the size of the seven million that Google boast, the million books available from the Internet Archive are freely downloadable – anyone can analyze them and publish the results. The collection available from the Internet Archive provides the foundation for transparent services and, even more important, transparent discourse. Open source services, carefully evaluated and publicly documented, applied to open content, freely downloadable by anyone without restrictions, embody the goals of scholarly and scientific practice. A million books alone would support a book-of-the-day club for almost 3000 years. Thus, even if we restrict ourselves to digitized printed books available for public download in a single location, the scale of content available has already passed that which any single human mind could comprehend. As a physical collection, a million books is hardly remarkable. As a store of knowledge for human analysis, the scale of 1,000,000 books has already passed human scale and is as abstract as the distance between galaxies or the number of insects in the world. Only machines can process the collections to which we already in late 2008 have access. What can we do with a million books with the tools now at our disposal and which we could build? What are the research questions that emergent huge collections raise for the historians, literary critics, and other humanists who study their contents and for the computer and information scientists who develop methods with which to process digital information in general? 
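The book-a-day arithmetic above is easy to check. The following Python sketch is an illustration added for the reader, not part of the study itself. It assumes one book per day, a 365.25-day calendar year, and Solon's lifetime of seventy 360-day years (25,200 days); the function names are hypothetical.

```python
# Back-of-the-envelope check of the "book a day" figures.
DAYS_PER_YEAR = 365.25            # assumption: modern calendar year
SOLON_LIFETIME_DAYS = 70 * 360    # Solon's 25,200 days, before intercalary months


def years_to_read(books: int, books_per_day: float = 1.0) -> float:
    """Calendar years needed to read `books` at a steady daily rate."""
    return books / books_per_day / DAYS_PER_YEAR


def lifetimes_to_read(books: int, lifetime_days: int = SOLON_LIFETIME_DAYS) -> float:
    """Seventy-year lifetimes needed at one book per day."""
    return books / lifetime_days


if __name__ == "__main__":
    for label, books in [("one million", 1_000_000), ("ten million", 10_000_000)]:
        print(f"{label} books: {years_to_read(books):,.0f} years, "
              f"{lifetimes_to_read(books):,.1f} lifetimes")
```

Changing `books_per_day` shows how little the conclusion depends on the exact reading rate: even at ten books a day, a million volumes would occupy several lifetimes.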
This paper summarizes research, supported by a grant from the Mellon Foundation, into the challenges and opportunities confronting the humanities in general and classical studies in particular as we shift from small, carefully edited and curated digital collections to very large, industrially produced collections that, in their fullest instantiation, aim to subsume whole libraries. We view classical studies as a special case of the more general question that we have termed "what do you do with a million books?" The authors of this paper come from Europe as well as North America,

CRANE ET AL


from classical philology, computer science, corpus linguistics and library and information science, but many others contributed substantively to the work reported here during workshops that we conducted between November 2006 and March 2008.3 Previous publications have addressed some of the general issues that the humanities as a whole face.4 This paper explores the particular case of classics within the million-book library. When we began this study in 2006, we planned to focus upon materials related to the Greco-Roman world, early modern Europe, and the 19th century United States so that we could examine the varying problems and opportunities associated with print materials relating to each. As our work progressed, the advantages of focusing on classical studies became progressively clear. Classics includes not only the Greco-Roman world but the subsequent scholarship about the Greco-Roman world and a vast body of material written in Latin on virtually every subject. Beginning with early printed editions in the 1470s and continuing through the present, Classical scholarship brings us to every corner of Europe, North America, and the Middle East. Classical studies do not, of course, touch the same audiences as Shakespeare or the American Civil War, and there are not nearly so many Classicists as there are experts in English Literature and American History, but Classics has produced the largest coherent community of scholars engaged with the digital infrastructure for their field. Classical studies became a logical focus for our work: if we could understand how to build a comprehensive collection of classical scholarship from the beginning of print culture to the present, we would know how to work with centuries of print publications on every aspect of human society and in every discipline and from every corner of Europe and North America.

3 Workshops took place at the University of Chicago (November 2006), Tufts University (May 2007), the Council on Library and Information Resources (Washington, DC, November 2007), Imperial College London (March 2008) and Humboldt University in Berlin (March 2008).
4 (Crane 2006b); (Crane 2006); (Crane 2008)


This paper begins by stressing that we have moved beyond islands of digital content in a vast sea of print. Where our first-generation collections were autonomous, carefully curated, discipline-specific islands, we now see emerging a world where we dynamically generate collections of heterogeneous materials from vast and constantly expanding digital libraries over which no individual discipline or project exercises control. We thus cannot rely upon a centralized editorial structure to guarantee for us the consistency of what we find. We need tools that can help us assess how representative our automatically extracted corpus is (e.g., what biases are there in the distribution of Latin texts available for searching?) and the accuracy of our analytical tools (e.g., the precision and recall of named entity systems that search for Salamis in Cyprus vs. the Salamis near Athens, the error rates in Latin text that Optical Character Recognition (OCR) extracts from various editions printed in different places and times). Our discussion then moves to the services that humanists need to exploit very large collections. These include not only advanced services for information extraction, multilingual technologies, and visualization but simple access to the scanned page images with which to support domain-optimized document analysis. These services require the rise of a new, fourth generation of digital corpora. Our first digital corpora included accurate transcriptions with markup of surface features (e.g., we simply indicate that a word is in italics). A second generation began to add semantic markup (e.g., a phrase is in italics because it is the title of a work or a Latin quotation). The third generation created much larger collections by shifting the focus of manual labor from carefully edited typing to industrial scanning of page images.
We need fourth-generation collections that can seamlessly integrate image-books, accurate transcriptions, and machine actionable knowledge in various formats. These fourth generation collections are a qualitatively new phenomenon. They allow us to design collections that are not only more comprehensive but more diverse than we could ever produce in print culture. These collections are unbounded and can include not only texts but every category of data about their subjects – high resolution images, three-dimensional models, geographic data sets, and anything that we can represent in digital form. Even if we restrict ourselves to linguistic data, fourth generation collections are a qualitative advance over print: we can include not only images of


neatly printed modern books but non-print representations of language such as three-dimensional models of words engraved on stone and digital sound recordings. For classics, the most important such project is what we have termed the apographeme of classical Greek and Latin – an analogy to the genome, representing the complete record of all Greek and Latin textual knowledge preserved from antiquity, ultimately including every inscription, papyrus, graffito, manuscript, printed edition and any other writing-bearing medium. This apographeme constitutes a superset of the capabilities and data that we inherit from print culture but it is a qualitatively different intellectual space. In the mature apographeme, every canonical text is a multitext, with dynamic editions linked to visual representations of the manuscripts, inscriptions, papyri and other sources. In the mature apographeme, each source is linked to the background data that we need to understand it – a transcription, information about the particular type of Greek or Latin script and its abbreviations, about the monastery, print shop or Egyptian village that produced it, etc.
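One way to picture the linkage just described is as a small data model in which a work is represented by all of its witnesses rather than by a single edition. The Python sketch below is purely illustrative: the class names, field names and identifiers are assumptions introduced here, not a schema used by any existing project.

```python
from dataclasses import dataclass, field


@dataclass
class Source:
    """One physical witness to a text (all field names are hypothetical)."""
    source_id: str
    medium: str                 # e.g. "manuscript", "papyrus", "inscription", "printed edition"
    image_refs: list[str]       # pointers to page images or 3D models
    transcription: str = ""
    script: str = ""            # type of Greek or Latin script, if known
    provenance: str = ""        # monastery, print shop, village, ...


@dataclass
class Multitext:
    """A canonical work represented by every witness, not one chosen edition."""
    work: str
    sources: list[Source] = field(default_factory=list)

    def witnesses(self, medium: str) -> list[Source]:
        """Return all witnesses recorded on the given medium."""
        return [s for s in self.sources if s.medium == medium]


if __name__ == "__main__":
    iliad = Multitext("Homer, Iliad", [
        Source("manuscript-1", "manuscript", ["images/manuscript-1/0001.jpg"]),
        Source("papyrus-1", "papyrus", ["images/papyrus-1/recto.jpg"]),
    ])
    print([s.source_id for s in iliad.witnesses("papyrus")])
```

The point of the sketch is only that images, transcription and background data travel together with each source, so that an edition can be generated dynamically from whichever witnesses a reader selects.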

FROM CURATED COLLECTIONS TO DYNAMIC CORPORA

The methods whereby we assemble digital content are very different now from those that were available when, a generation ago, the first pioneers began designing digital tools for the humanities. In the 1970s and 1980s, most scholars considered digital resources – insofar as they considered them at all – as instruments with which to navigate a paper sea of information. The Thesaurus Linguae Graecae (TLG) and the Database of Classical Bibliography (DCB), two of the pioneering efforts within classical studies, were, in effect, indices and depended upon the ability to pay human beings to read and to type. The Thesaurus Linguae Graecae began work in 1972 and can boast in 2008 almost 100,000,000 words of cleanly transcribed Greek text.5 By the opening of the twenty-first century, of course, the technologies available began to open up very different approaches.

5 http://www.tlg.uci.edu/, accessed October 19, 2008.


Between 2001 and the end of 2005, one of the authors of this paper, Gregory Crane, developed a 55,000,000 word collection of 19th century American English.6 He personally scanned 400 volumes, applied OCR to the scanned page images, applied automated post-processing and shipped the results to a data entry firm. A handful of reference works with complicated formatting required traditional manual data entry but for the vast majority of this corpus the OCR-generated text provided a solid foundation and avoided the need for typing. The contractor checked for errors and added basic structural markup, tagging such elements as footnotes, quotations, figures at a cost of under $100,000 for 500,000,000 characters or about $200/book. The corpus was not an end in itself but rather an instrument with which to study problems of automatically analyzing large collections. Most of Crane’s effort went to the production of a system that could automatically identify people, places, organizations, and other named entities in unstructured text.7 That research became the foundation for a project entitled "Scalable Named Entity Services for Classical Studies" that would, with support from the National Endowment for the Humanities (NEH) and the Institute for Museum and Library Services, adapt this system for use with documents from classical studies. The situation has changed even further in the past several years. The primary medium for human intellectual life is now irrevocably digital. The most heavily funded academic disciplines use paper to print digital resources on demand. Print-only resources are now archival materials. Consider the following developments that have intensified since Crane began work on the Perseus American Collection in 2001: 1. Massive scanning. 
In December 2004, Google seized the initiative to create a vast library of scanned books, with text generated by OCR, but the library community has the resources to convert its print holdings into digital form: the 123 North American libraries who belong to the Association of Research Libraries

6 (Crane 2006a)
7 (Smith 2001); (Crane 2005)


spent more than a billion dollars on their collections in the 2005-06 academic year.8 Of course, most libraries will claim that they are under-funded and cannot maintain their existing collections, much less consider a major new initiative. Some of us are old enough to remember hearing that the costs of print collections would never allow for libraries to make digital materials accessible. 2. Scanning on demand. The Open Content Alliance (OCA) has created an infrastructure whereby individuals could, by 2007, select particular books for scanning and then inclusion in the larger OCA collection. The quality is high and the cost is low: $.10US per page plus handling costs of $5US per book – about $40 for a standard book with about 300 pages. It costs no more to create a high resolution scan that anyone attached to the internet can read than it does to buy a single printed book ($52 in 2005-06). Support from the Mellon Foundation has allowed the Cybereditions Project at Tufts University to begin creating within the OCA an open source library of Greek and Latin that will contain at least one text or fragment of every major surviving Greek and Latin author and a range of reference materials, commentaries and core publications. 3. The growth of open access and open source licensing. In 2008, open access publication became the dominant model for academic publication. The US National Institutes of Health (NIH) have established a public access policy. As of April 2008, this policy "requires scientists to submit final peer-reviewed journal manuscripts that arise from NIH funds to the digital archive PubMed Central upon acceptance for publication."9 The policy

8 For a list of ARL members, see http://www.arl.org/arl/membership/members.shtml, accessed August 25, 2008. For statistics from 2005-06, see (Kyrillidou 2008).
9 http://publicaccess.nih.gov, accessed August 25, 2008.


further requires scientists funded by the NIH to include in their papers citations to the open access copies of previous publications in the open access PubMed Central web site. The NIH provides more than 22 billion dollars in funding for medical research.10 Publishers in the most heavily funded research area in all of academia must now develop business models that assume open access that precedes publication. The US NEH, by contrast, requested just under $145 million in funding for 2009 – less than 1% as much as the NIH invests. 4. Improved OCR. Traditionally, classical Greek has been a huge barrier for classicists – there was no useful OCR. All classical Greek required manual data entry and such specialized work usually cost much more than data entry for English. By early 2008, Google had begun to generate initial OCR text from page images of classical Greek. The software is evidently based on OCR designed for modern Greek and contains errors, but clever search software could ameliorate this problem. If classicists have access to the scanned page images and can optimize OCR software for classical Greek, we can achieve character level accuracy (99.94%) comparable to the standard quality for manual data entry (99.95%). In a preliminary analysis of printed scholarly editions, we found that 13% of the unique Greek words on a page, on average, appeared only in the textual notes. Restricting our analysis to older volumes from the Loeb Classical Library (which traditionally provided a minimal number of textual variants), we found that only 97% of the unique Greek words on a given page appeared in the main text. Thus, collections that contain perfect

10 http://nih.gov/about/index.html, accessed August 25, 2008, listed an NIH budget of $27.8 billion dollars and stated that "NIH distributes 80% of its funding to research grants."


transcriptions of the reconstructed text but no textual notes offer at most 97% – and, if we use fuller editions, 86% – of the relevant data. Even the worst OCR accuracy that we measured (98%) matches the overall recall rate of perfect transcriptions of text alone.11 Once we enter multiple editions of the same text, we can begin using each scanned edition to identify OCR errors and intentional textual variants. 5. A new generation of text mining and quantitative analysis. The DCB contains 600,000 bibliographic entries from 1949 to 2005 and adds 12,500 new items each year.12 By contrast, the CiteSeer system, upon which computer scientists depend, was developed more than a decade ago in 1997 at the NEC Research Institute in Princeton, NJ, and offers an automatically generated index of 767,000 publications, including automatically extracted bibliographic citations.13 Research continues and new generations of automated bibliographic systems, based on the automated analysis of on-line publications, have begun to appear. The Rexa System, developed at the University of Massachusetts, had assembled a collection of almost 1,000,000 publications in 2005 (Mimno 2007). David Mimno, one of the authors of this paper, is a member of that research group and has support from the Cybereditions project at Tufts University to begin in 2008-09 applying that research to publications from classics.
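The remark under item 4 about comparing multiple scanned editions can be made concrete with a very small collation routine. The Python sketch below is an illustration only, not the method used in the work reported here: it aligns two word sequences with the standard library's `difflib.SequenceMatcher` and lists the places where they disagree. The function name and the sample strings (including the simulated OCR confusion of "m" with "rn") are assumptions made for the example.

```python
from difflib import SequenceMatcher


def disagreements(edition_a: str, edition_b: str) -> list[tuple[str, str]]:
    """Align two transcriptions word by word and return the differing spans."""
    a, b = edition_a.split(), edition_b.split()
    matcher = SequenceMatcher(a=a, b=b, autojunk=False)
    diffs = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":  # "replace", "delete" or "insert"
            diffs.append((" ".join(a[i1:i2]), " ".join(b[j1:j2])))
    return diffs


if __name__ == "__main__":
    first = "arma virumque cano Troiae qui primus ab oris"
    second = "arma uirumque cano Troiae qui prirnus ab oris"  # simulated second scan
    for left, right in disagreements(first, second):
        print(f"{left!r} <-> {right!r}")
```

A routine of this kind only locates disagreements. Deciding whether a given difference is an OCR error, an orthographic convention (u for v) or a genuine textual variant requires further evidence, such as agreement among additional editions or a lexicon of attested forms.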

11 This work with classical Greek and OCR has been reported in (Stewart 2007).
12 http://www.annee-philologique.com/aph/, accessed August 25, 2008.
13 The number of publications indexed was taken from http://citeseer.ist.psu.edu/, accessed August 25, 2008. For basic information on CiteSeer, see http://citeseer.ist.psu.edu/citeseer.html, accessed August 25, 2008. For the initial description of the CiteSeer system, see (Bollacker 1998).


We might summarize the current situation as follows: Google has begun creating on-line a digital collection that would be more comprehensive than the greatest university libraries ever produced – and the university libraries themselves control the resources needed to do the job were Google to falter: our retrospective collections are being digitized. The OCA has created a public, scalable infrastructure whereby we can, in fact, build high quality collections within the existing library infrastructure: if massive projects miss anything, smaller efforts can fill in the gaps and create curated collections. The US government, under a conservative, pro-business administration, has made the most profitable monopolies on which publishers had depended illegal and declared open access a condition of its most generous funding agency: the richest publishers must learn to make money under open access. Advances in OCR technology have made it possible for scholars in fields such as Classics to generate very serviceable searchable text for nonstandard character sets such as Greek: once we scan editions, we can more comprehensively search primary sources and, for the first time, secondary sources that quote Greek. A new generation of text mining can provide new methods with which to trace ideas and research topics that appear in millions of publications: we can design bibliographic databases that incorporate features of particular interest to classics (e.g., the ability to determine whether "Th. 1.38" designates line 38 of Theocritus’ first Idyll, or chapter 38 of book 1 of Thucydides) with the common features of academic publication (e.g., footnotes and bibliographic citations). New services feasible in such an environment include:
• Multitexts: Scholars have grown accustomed to finding whatever single edition a particular collection has chosen to collect. In large digital collections, we can begin to collate and analyze generations of scholarly editions, generating dynamically produced diagrams to illustrate the relationships between editions over time. We can begin to see immediately how and where each edition varies from every other published edition.
• Chronologically deeper corpora: We can locate Greek and Latin passages that appear anywhere in the library, not just in those publications classicists are accustomed to reading. We can identify and analyze


quotations of earlier authors as these appear embedded in texts of various genres.
• New forms of textual bibliographic research: We can automatically identify key words and phrases within scholarship, cluster and classify existing publications, generate indices of particular people and places (e.g., Antonius the triumvir vs. one of the many other figures of that name, Salamis on Cyprus vs. the Salamis near Athens). Such searches can go beyond the traditional disciplinary boundaries, allowing students of Thucydides, for example, to analyze publications from international relations and political philosophy as well as classics.

In this world, we need to recognize that we are – as indeed classicists have always in large measure been – corpus linguists. All classicists can articulate, in some measure, the relationship between the texts that survive and the subject that we are studying. If we work with Sophocles, we know that only seven plays survive and we have only fragments and even titles for the rest. If we study Alexander the Great, we must first understand the fact that our most comprehensive surviving Greek sources were composed centuries later and depend upon earlier histories that are now lost. Consider three topics that we might pursue in a very large collection: the usage of a Latin word over two thousand years (e.g., oratio, which can, for example, in different contexts designate a speech or a prayer), the reception of Euclid’s Elements, and the reputation of Alexander the Great. In each case, we can assemble far more information than we could ever collect in print culture. The next section will touch upon some of the services with which we can make the sprawling corpora relevant to each of these topics intellectually accessible. But even before we begin our analysis, we need to understand the limitations of the corpus that we have assembled:
• How representative is the corpus? Is all of a given corpus available on-line?
(e.g., have all the published volumes of a series been scanned?) Can we estimate the percentage of the corpus that survives? (e.g., what percentage of Sophocles’ output do the seven plays and other fragments constitute?) What biases are inherent in our data? (e.g., do we have any accounts produced by women or by members of

every national/ethnic group involved in a topic? If we find 100,000 instances of the Latin word oratio, what are the periods, genres, locations, and (in the case of later Latin) original languages of the authors?)14 And are there correlations between these parameters? These may in fact be automatically discernible from the data, even if the human eye doesn’t notice them in the forest of data.
• How accurate are the digital surrogates for each object? We may have a satisfactory corpus of print materials but these materials may yield very different results to automated services such as OCR, named entity identification, cross-language information retrieval, etc. Readers of Jeff Rydberg-Cox’s contribution to this collection will realize that OCR software will, at least in the immediate future, extract much less usable text from early modern printed editions than from editions printed in the early twentieth century. We need automatically generated metrics for the precision and accuracy of each automated process on which we depend.
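The kind of metric called for here can be illustrated with the two Salamis example used earlier. The Python sketch below is a minimal illustration under stated assumptions: the mention identifiers, the labels and the hand-checked "gold" sample are all invented for the example, and the function name is hypothetical. Precision is the share of a system's assignments of a label that are correct; recall is the share of the true instances of that label that the system found.

```python
def precision_recall(gold: dict[str, str], predicted: dict[str, str], label: str) -> tuple[float, float]:
    """Precision and recall of `predicted` for one label, judged against `gold`."""
    true_pos = sum(1 for m, p in predicted.items() if p == label and gold.get(m) == label)
    predicted_pos = sum(1 for p in predicted.values() if p == label)
    gold_pos = sum(1 for g in gold.values() if g == label)
    precision = true_pos / predicted_pos if predicted_pos else 0.0
    recall = true_pos / gold_pos if gold_pos else 0.0
    return precision, recall


if __name__ == "__main__":
    # Hypothetical hand-checked sample: which Salamis does each mention mean?
    gold = {"m1": "Cyprus", "m2": "Attica", "m3": "Attica", "m4": "Cyprus", "m5": "Attica"}
    # Hypothetical output of an automated named entity system.
    predicted = {"m1": "Cyprus", "m2": "Attica", "m3": "Cyprus", "m4": "Cyprus", "m5": "Attica"}
    for label in ("Cyprus", "Attica"):
        p, r = precision_recall(gold, predicted, label)
        print(f"Salamis ({label}): precision={p:.2f} recall={r:.2f}")
```

Reporting such figures per edition, period or genre, rather than one global number, would show where an automated process can be trusted and where it cannot.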

SERVICES FOR THE HUMANITIES IN VERY LARGE COLLECTIONS

If humanists are to exploit large new collections to their fullest, they need, as a minimum, the following services:
• Access to images of the physical sources: This includes access to particular copies of a document, any pagination or naming scheme with which to address the individual pages, and a coordinate system with which to describe regions of interest on a given page. Many born-digital publications do not provide such access – logical "page 12" of a report (as printed as a page number) may physically be page 21 of the PDF document (after adjusting for front matter, a table of contents etc.). Coordinate systems must

14 The challenges of corpus design and representativeness have been explored by many authors, including (Biber 1993) and (Douglas 2003).

have sufficient abstraction so that they can address relationships of the printed page even if the paper has been cropped or varies from one printing to another: coordinates for one First Folio should be useful with others.15
• Access to transcriptional data: At the least, we need to be able to analyze the words and symbols that are encoded on the physical page.16 The rough "bag-of-words" approach, where systems ignore the location of words on the page and even their word order, has proven remarkably useful. This level of service is fundamental to everything that follows. Conventional OCR software has traditionally provided no useful data from historical writing systems such as classical Greek. Latin is much more tractable but OCR software expecting English will introduce errors (e.g., converting t-u-m, Latin "then," into English t-u-r-n). Even earlier books with clear print will contain features that confuse contemporary OCR (e.g., the long ‘s’ which looks like an ‘f,’ such that words such as l-e-s-t become l-e-f-t).17
• Access to basic areas of a page such as header, main text, notes, marginalia: Even transcription depends upon basic page layout if it is to achieve high accuracy: we cannot transcribe individual words unless we can automatically resolve hyphenation and this in turn implies that we can distinguish multi- from single-column text and footnotes, headers, marginalia, etc. from main text.18 We need, however, to recognize basic scholarly document layouts: thus, we should be able to search for either the reconstructed text or the textual notes at the bottom of the

15 Research into document image analysis and retrieval within historical digital libraries is a growing area of research, for example see (Marinai 2007). 16 For some promising work in this area please see (Faure 2007). 17 For some recent state-of-the-art work in OCR for historical text collections, please see (Reynaert 2008). 18 For more on this research area, see for example, (Ramel 2007).

page. This stage corresponds roughly to WYSIWYG markup. At this stage, the system can distinguish the main text from the notes in Figure 5 and Figure 10 but it does not recognize that one set of notes is a commentary and the other constitutes an apparatus criticus.
• Access to visually labeled structures within the text: Explicit labeling in this case includes headwords of dictionaries and encyclopedias and canonical citations such as book/chapter/verse/line. These structures draw upon typographical conventions: e.g., bold and indenting to show headwords, numbers in the margins or embedded in the text with brackets to illustrate citations. This stage would recognize where index entries begin along with their headwords and easily recognized citations. This stage corresponds to semantically meaningful structural markup, e.g., descriptive structures about the text.19 At this stage, the system recognizes that the notes in Figure 5 are a commentary and contain comments on agro vectigali, cum et maxima … ageretur, and tibique … indicium within the text above.
• Access to knowledge dynamically generated from analysis of explicitly labeled knowledge: This process can begin with very coarse analysis: if we recognize when various encyclopedia articles describing several dozen figures named Antonius or Alexander begin and end, then we can analyze the vocabulary of each article to begin deciding which Alexander is meant in running text. This stage includes the lemmatization and morphological analysis to support the lookups and searches familiar to classicists for more than a decade (e.g., query fecisset and learn that it is the pluperfect subjunctive of facio, "to do, make"; query facio and retrieve inflected forms such as fecisset).20 We also need at this stage translation services (e.g., a service that

19 Recent successful efforts in extracting structural markup on a large scale from volumes within the OCA have been reported by (Lu 2007). 20 See (Schmid 2008) and (Fitschen 2008).

determines whether a given instance of the Latin word oratio more likely corresponds to "oration," "prayer" or some other usage). At this point, knowledge based services augment general text mining (e.g., being able to cluster usages of the dictionary entry, facio, as a whole – or of facio as it is used in the subjunctive etc. – rather than treating each form of facio as a separate entity.)21
• Access to linguistically labeled, machine actionable knowledge: This overlaps with the analysis of visual structures but implies a greater emphasis on the analysis of natural language, e.g., "Y, son of Z"; "perf. feci" → the perfect stem is fec-; "b. July 2, 1887" → the subject of this encyclopedia entry was born in 1887 and any references to people by the same name that predate 1887 cannot describe this person; etc. This stage corresponds to encoding information for particular ontologies, i.e., prescriptive structures separate from the text.22 At this point, we should be able to pose queries such as "encyclopedia entries for Thucydides who is son_of Olorus or has_occupation historian, etc.", "dictionary word senses is_cited_in Homer or has_voice passive;" "Book 1, lines 11-21 from all translations_of Homer_Iliad that have_language German."
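Queries of the kind listed in the final service above presuppose that statements extracted from reference works are stored as subject/predicate/object triples. The Python sketch below shows the idea with a tiny in-memory index. It is an illustration only: the entry identifiers and the function names are hypothetical, the predicate names simply echo the examples in the text, and a production system would use an established ontology and triple store rather than a dictionary.

```python
from collections import defaultdict

# Hypothetical statements extracted from two encyclopedia entries named "Thucydides".
TRIPLES = [
    ("entry:thucydides-1", "son_of", "Olorus"),
    ("entry:thucydides-1", "has_occupation", "historian"),
    ("entry:thucydides-2", "son_of", "Melesias"),
    ("entry:thucydides-2", "has_occupation", "politician"),
]


def build_index(triples):
    """Map each (predicate, object) pair to the set of subjects asserting it."""
    index = defaultdict(set)
    for subject, predicate, obj in triples:
        index[(predicate, obj)].add(subject)
    return index


def query(index, conditions, require_all=False):
    """Return subjects matching any (default) or all of the given conditions."""
    matches = [index.get(condition, set()) for condition in conditions]
    if not matches:
        return set()
    return set.intersection(*matches) if require_all else set.union(*matches)


if __name__ == "__main__":
    index = build_index(TRIPLES)
    # "Thucydides who is son_of Olorus or has_occupation historian"
    print(query(index, [("son_of", "Olorus"), ("has_occupation", "historian")]))
```

Even this toy version shows why the stage matters: once "son of Olorus" has been turned into a structured statement, the historian can be separated from other bearers of the same name without any further reading of the text.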

Techniques exist to address all of the services outlined above. Computer scientists strive for completely general approaches and are willing to accept error rates as a cost to achieve the benefit of scalability. Traditional humanists by contrast manually analyze and, where they feel it necessary, justify the results (i.e., results that may be controversial but for which experts can make reasonable arguments) and are willing to accept labor as a cost to achieve a level of

21 For a recent exploration of text mining in humanities documents, please see (Don 2007).
22 The automatic markup of humanities texts with relevant ontologies (such as CIDOC-CRM) has a large and varied body of research, for some recent discussions please see (Doerr 2008) and (Lin 2008).


transparency. The grand challenge lies in integrating these two sources of energy: scholars need to be able to build on the results of automated processes but automated processes need to be able to build on scholarly data as well.

FOURTH-GENERATION COLLECTIONS

We need collections that can support a core set of interlocking services. Core services such as morphological and syntactic analysis, citation identification, word sense disambiguation, word sense discovery, cross-language information retrieval, and named entity identification are, however, data-driven and, for optimal performance, require substantial amounts of carefully encoded knowledge and the largest possible bodies of unstructured data. To support these services, we need a new generation of collections. Within the humanities, we need a new, fourth generation of digital collections. While classicists have digitized texts for a generation and accurate transcriptions exist for selected editions of almost every author, we do not have the developed, scalable, sustainable knowledge base with which to represent the core primary sources that have survived to us in textual form from Greco-Roman antiquity. We have already touched upon the first generation of digital primary sources. Classicists still depend primarily upon the Thesaurus Linguae Graecae and Packard Humanities Institute collections of Greek and Latin texts, which follow designs from the 1970s. These first generation digital collections concentrated on accurate transcription of the reconstructed text with structural markup showing where works begin and end. They capture general page layout and approximate citation information: if a number in the margin of the original print edition indicates a line or section begins somewhere in the adjacent line, the human reader is left to determine where the break occurs. They do not contain any of the introductory materials, back matter such as indices and appendices or any textual notes.

Second generation collections (such as those available within the Perseus Digital Library) also emphasize carefully produced transcriptions but include explicit semantic markup that follows the Text Encoding Initiative (TEI) Guidelines.
These collections reflect the conditions of the late 1980s where image capture and storage remained expensive. They thus do not include page images of the original source texts and only occasionally include textual notes.

Second generation collections may apply more sophisticated techniques to automate transcription and tagging but their design still assumes an expensive initial, centralized editing process with small fixes for residual errors after the initial production phase.

Third generation collections, popularized by projects such as the Making of America and JSTOR, emerged in the 1990s when storage costs had declined to the point where page images for large collections of books could be kept on-line.23 Third generation systems minimize manual labor and emphasize automatic analysis of page images – especially the use of OCR software to generate searchable text. As OCR software increases in accuracy, texts can be rescanned and the searchable text can improve. First and second generation collections worked from the inside of the book outwards, focusing on subsets of printed books for digital conversion. Third generation collections, by contrast, work from the outside of the book, starting with book-level library metadata that may be extended with analytical cataloguing for articles within books.

All of the features that characterize the fourth generation have existed in one form or another – our group at the Perseus Project has been developing some aspects of this plan for more than twenty years. What distinguishes fourth generation collections is the integration of a small body of data, carefully curated and laboriously structured by semi-automated or even wholly manual methods, with an arbitrarily large collection for which automated analysis alone is feasible. The semantically encoded data of second generation digital collections becomes the machine actionable reference rooms from which automated systems learn how to structure the vast third generation collections of page images:

• Fourth-generation collections contain images of all source writings, whether these are on paper, stone or any other medium: Like third generation collections, the Cybereditions project sets out to incorporate page images of all print originals. Our goal is to help classical scholars shift

23 Making of America: http://quod.lib.umich.edu/m/moagrp/; http://cdl.library.cornell.edu/moa/, accessed October 20, 2008. JSTOR: http://www.jstor.org/, accessed October 20, 2008.

the center of gravity for textual scholarship to a networked, digital environment. Scholars should not have to consult paper originals of scanned print editions to see what was on the original page.

• Fourth-generation collections manage legacy structures derived from physical books and pages but focus primarily upon logical structures that exist within and across pages and books: Even when fourth-generation collections depend upon page images and exploit legacy book-page citations, they are fundamentally oriented towards the underlying logical structures within the documents. A great deal of emphasis is placed upon page layout analysis so that we can isolate not only tables of contents, bibliographic references and indices but dictionary and encyclopedia entries and critical scholarly document types such as commentary notes and textual apparatus. Cross language information retrieval hunts for translations of primary sources. Alignment services align OCR generated text to XML editions of the same works with established structural metadata. Quotation identification services spot commentaries by recognizing sequences of quotations from the same text at the start of paragraphs.

• Fourth-generation collections integrate XML transcriptions of original print data as these become available: All digital editions are, at the least, re-born digital. The best work published so far cannot convert the elliptical and abbreviated conventions by which scholars represent textual data in print into machine actionable data – we cannot even reliably link the textual notes to the chunks of text which they cover, much less convert these notes into machine actionable formats so that we could automatically compare the readings from one MS against those of another. Fourth generation collections naturally integrate page images with XML representations of varying sophistication. XML representations may, like first generation collections, capture basic page layout and they may have advanced structural and basic semantic markup (e.g., careful tagging for each speaker in a play). They may encode no textual notes, textual notes as simple footnotes (free text associated with a point in the reconstructed text) or

as fully machine actionable variants (e.g., variants associated with spans of source text, such that we can, among other things, compare the text in various editions or witnesses).

• Fourth-generation collections contain machine actionable reference materials: Our digital collections should be tightly and automatically embedded in a growing web of machine actionable reference materials. If a new prosopography or lexicon appears, links should appear between its articles and references to the people or words in the primary sources. Commentaries should align themselves automatically to multiple editions of their subject work. To the extent possible, these links should bear human readable and machine actionable information: humans should be able to see from a link what the destination is about (e.g., "Thucydides the Historian" rather than "Thucydides-3," ἀρχή-"empire" rather than "ἀρχή-sense2"). Equally important, these links should point to machine actionable information: a named entity system should be able to mine the entries in the biographical encyclopedia to distinguish Thucydides the Historian, Thucydides the mid-fifth century Athenian politician and various other people by that name; a word sense disambiguation system should be able to use the lexicon entries to find untagged instances where ἀρχή corresponds to "empire" or "beginning." Editions should be self-collating – when a new edition of a text comes on-line, we should see immediately how it differs from its predecessors.

• Fourth generation collections learn from themselves: Even the simplest digital collection depends upon automated processes to generate text from page images or indices from text. Clustering and other text mining techniques discover meaning in unstructured textual data. Fourth-generation collections, however, can also learn from the machine actionable reference materials that they contain so that they apply increasingly sophisticated analytical and visualization services to their content. In effect, they use a small body of structured data – training sets, machine actionable dictionaries, linguistic databases,

encyclopedias and gazetteers with heuristics for classification – to find structure within the much larger body of content for which only OCR-generated text and catalogue level metadata are available. In a fourth generation collection, structured documents are programs that services compile into machine actionable code: Aeneid, book 2, line 48 in a dozen different editions already on-line as image books with OCR generated text.

• Fourth generation collections learn from their users: Even third generation systems depend upon the ability of OCR software to classify markings into distinct letters and words. Fourth generation systems include an increasing number of classification systems such as named entity analysis, word sense disambiguation, syntactic analysis, morphological analysis, citation and quotation identification. Where there are simple decidable answers (e.g., to which Alexandria does a particular text refer?) we want users to be able to submit corrections. Where the answers are less well-defined (e.g., expert annotators do not agree on word sense assignment and some passages are simply open to multiple interpretations), we need to be able to manage multiple annotations. Human annotators need to be able to own their contributions and readers should be able to form conclusions about their confidence in individual contributors. Automated systems need to be able to make intelligent use of human annotation, determining how much weight to apply to various contributions, especially where these conflict. We therefore need a multilayer system that can track contributions, by both humans and automated systems, through different versions of the same texts.

• Fourth generation collections adapt themselves to their readers, both according to specific recommendations (customization) and by making inferences from observed user behavior (personalization): Fourth-generation collections can process knowledge profiles that model the backgrounds of particular users: e.g., one user may be an expert in early modern Italian, who has read extensively in Machiavelli, but have only a few semesters of classical Greek with which to read Thucydides and Plato. The

fourth-generation collection can determine with tolerable accuracy what words in a new Italian or Greek text will be new and/or of interest, given the differing backgrounds but consistent research interests of the professor. At the same time, the system can infer from the reader’s behavior what other resources may be of interest.

• Fourth-generation collections enable deep computation, with as many services applied to their content as possible: No monolithic system can provide the best version of every advanced service upon which scholarship depends. Google, for example, has a growing number of publications about ancient Greece but currently produces only limited searchable text from classical Greek. Different groups should be able to apply various systems for morphological and syntactic analysis, named entity identification, and various text mining and visualization techniques with minimal, if any, restrictions. These groups should include commercial service providers as well as individual scholars and scholarly teams.
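The notion of a self-collating edition in the list above can be made concrete with a small sketch. The Python below aligns two tokenized witnesses of a single line using the standard library's difflib and reports where they differ. The second witness and its reading are invented for illustration, and the whitespace tokenization is an assumption; a working collation service would presumably start from TEI-encoded editions and normalize orthography first.

```python
from difflib import SequenceMatcher


def collate(base, witness):
    """Return (operation, base_reading, witness_reading) for each difference."""
    matcher = SequenceMatcher(a=base, b=witness, autojunk=False)
    variants = []
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op != "equal":
            variants.append((op, " ".join(base[i1:i2]), " ".join(witness[j1:j2])))
    return variants


# Aeneid 2.48, tokenized and stripped of punctuation.
edition_a = "aut aliquis latet error equo ne credite Teucri".split()
# Hypothetical witness with an invented reading, for illustration only.
edition_b = "aut aliquis latet terror equo ne credite Teucri".split()

variants = collate(edition_a, edition_b)
for op, reading_a, reading_b in variants:
    print(f"{op}: {reading_a!r} -> {reading_b!r}")
```

Each tuple pairs a span of the base text with the corresponding span of the witness, which is roughly the information a machine actionable textual note would need to carry.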

THE CLASSICAL APOGRAPHEME

Fourth-generation collections allow us to design corpora that go far beyond limitations that we internalized in print culture. To describe comprehensive fourth-generation collections we use the term apographeme, derived from the Greek word for copy (apographê). The apographeme echoes the term genome because an apographeme contains, in its mature form, a complete record of every surviving linguistic source for a particular corpus. For classicists, an apographeme of Greek and Latin would contain representations of every written version of every piece of writing from Greco-Roman antiquity. This includes images of every page of every inscription, papyrus, graffito, manuscript, and printed edition – the entire surviving record of the linguistic output for classical Greece and Rome – and the knowledge base whereby machines can intelligently process and humans productively decipher, insofar as existing knowledge and probing intellect can, every written word in every witness. In a library grounded on images of writing, there is no fundamental reason not to integrate, at the base level, images of writing from all surfaces. Inscriptions, papyri, and manuscripts may not be suitable fodder with which OCR software can generate useful text, but neither Google nor OCA can produce much useable output for even the best printed classical Greek and little, if any, useable text from early modern books.

The Cybereditions project at Tufts has begun preliminary work for this massive task, focusing on the texts that have survived from Greco-Roman antiquity through manuscript tradition. These literary texts are, however, designed from the start to become part of a larger collection that will include documentary materials that survive on stone and papyrus (see Hugh Cayless's article in this collection) as well as manuscripts (see Casey Dué and Mary Ebbott's article in this collection). While developing the underlying bibliography is a major and on-going task of the Cybereditions project, we currently estimate that this apographeme would contain the following (because page images would be the first stage of collection, we use "books" as a rough initial unit of measure). Major work for the Cybereditions project will be (1) to complete a first cut of the bibliography below, (2) to begin creating the apographeme, with particular attention to the published editions, and (3) to make progress on the services that will convert these image pages into machine actionable data, with particular attention to the problem of high accuracy OCR for Greek and Latin. We will not be able to create a comprehensive apographeme for classical Greek and Latin for many years but we can establish a solid foundation from that portion of the apographeme represented by texts that have survived in manuscript tradition. The figures associated with each element reflect very preliminary estimates for broad, illustrative coverage sufficient to model a more mature system that can evolve over time.

• c. 500 "book-length" authors/collections. Hundreds and thousands of ancient Greek and Latin authors survive as names or with a small number of fragments preserved in quotations of later authors or on papyrus. F. W.
Hall’s Handbook to Classical Texts lists 133 entries in its survey of the "chief classical writers" – including portmanteau works with many authors (e.g., the Greek Anthology) and authors with very large corpora (e.g., Aristotle and Cicero).24 The Loeb Classical Library does not contain comprehensive editions for massive authors such as Galen or the early Church Fathers but its 500 volumes contain Greek and Latin texts as well as English translations for most surviving authors and works. If we assume that Galen and early church fathers would double the size of the Loeb, then we would have c. 500 volumes worth of Greek and Latin source text. Measured by word count, the corpora of classical Greek and Latin are closer to 100 and 20 million words respectively.25

• c. 1,000 manuscripts (MSS) and an undetermined number of papyri, many very small fragments of literary works. Based on a survey of summary data from Richard and Olivier's Répertoire des bibliothèques et des catalogues de manuscrits grecs (1995), we possess more than 30,000 medieval manuscripts that contain at least parts of Greek classical texts (there are nearly 1,200 manuscripts for Aristotle alone). Since the number of extant Latin manuscripts is conventionally assumed to be 5 to 10 times that of Greek manuscripts, there might be as many as 150,000 to 300,000 manuscripts for Classical Latin. Nevertheless, a small subset of these provides most of the textual information relevant to the authors and editions of the most commonly studied authors. Hall’s early twentieth-century Handbook to Classical Texts summarized the major MS sources for major classical authors and contains c. 650

24 (Hall 1913)

25 Most surviving classical Latin was composed after antiquity. Johannes Ramminger had, as of 2008, assembled more than 200 million words of Latin in digital form (http://www.neulatein.de/, accessed October 19, 2008). The Thesaurus Linguae Latinae (TLL) is based on an archive of 10 million slips, which contain, for the older texts, a slip for each occurrence of a word (http://www.thesaurus.badw.de/english/index.htm, accessed October 19, 2008). The Packard Humanities Institute CD ROM of Latin, which is fairly comprehensive through 200 CE and contains some later materials, contains c. 7.5 million words.

readily identifiable MS sigla (e.g., patterns of the form "A = Parisinus 7794"). While editors have since added additional manuscripts of importance for most authors, Hall provides a reasonable estimate for the number of the manuscripts on which our editions primarily depend. Some authors do not have a few very authoritative MSS and editors must examine large numbers of MSS of roughly equal authority, and these will inflate the total. Assuming that this list underestimates the whole by 50-100%, we are still left with the evidence that a database of 1,000 MSS would represent the majority of textual knowledge preserved for us by MS transmission.

• c. 5,000 major editions over the five centuries extending from the editiones principes of the early modern period to the start of the twenty-first century. This figure assumes, at the high end, that each author has c. 10 volumes worth of major editions. Multi-work canonical authors will have many editions of individual and selected works. At the very high end, the New Variorum Shakespeare series chooses c. 50 editions of each play as worth collation and this may represent an upper bound for canonical texts outside the Bible.

• c. 5,000 translations in European languages such as English, French, German and Italian. These are important because we can use parallel text analysis to infer translation equivalents and word senses and then use advanced language services (e.g., syntactic analysis, named entity analysis) on the translations and then project this backwards onto the original. Such a technique can, for example, add 15% to our current ability to analyze Latin syntax (e.g., from 54% to 70%).

• c. 5,000 modern commentaries, author lexica etc. These are useful for human readers and may lend machine actionable data as well.

• c. 1,000 general reference works such as lexica, grammars, encyclopedias, indices and other entry/labeled paragraph reference works with high concentrations of citations and, in some cases, elaborate knowledge bearing hierarchical structures.

• c. 1,000 specialized studies of Greek and Latin language in a sufficiently structured format for high precision information extraction.
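The arithmetic behind these estimates can be laid out in a few lines of Python. This is only a back-of-envelope sketch: it treats each "c." figure as exact, treats every entry as a count of roughly book-sized units (an assumption extending the use of "books" as a rough unit of measure), and reads "underestimates the whole by 50-100%" as meaning the true total is 1.5 to 2 times Hall's count.

```python
# Rough unit counts taken from the list above ("c." figures treated as exact).
apographeme_estimate = {
    "book-length authors/collections": 500,
    "manuscripts": 1_000,
    "major editions": 5_000,
    "translations": 5_000,
    "commentaries and author lexica": 5_000,
    "general reference works": 1_000,
    "specialized language studies": 1_000,
}
total_units = sum(apographeme_estimate.values())

# Manuscript reasoning: Hall's c. 650 sigla, assumed to undercount by 50-100%.
hall_sigla = 650
ms_low, ms_high = round(hall_sigla * 1.5), hall_sigla * 2

# Latin manuscripts, conventionally taken as 5 to 10 times the Greek total.
greek_mss = 30_000
latin_low, latin_high = greek_mss * 5, greek_mss * 10

print(total_units, (ms_low, ms_high), (latin_low, latin_high))
```

On these assumptions the adjusted manuscript range brackets the 1,000 MSS proposed above, which is why that figure serves as a working target rather than a census.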

THREE TECHNICAL CHALLENGES

The implications of very large collections for the humanities are profound. We can transform existing research agendas, render content physically and intellectually accessible to new audiences and make human inquiry possible over barriers of language, culture and sheer volume. An immense amount must be – and is being – done. Within this context, we offer three strategic areas of development that are both essential for the humanities and not, to our knowledge, currently covered by industrially driven research. These areas of interest include the need to transform page images into machine-readable text, machine-readable text into machine actionable knowledge, and text from one language into another. Each of these areas of development reflects the particular needs of humanities scholarship and would benefit from targeted support.

• Leverage the fact that many historical texts quote documents for which excellent transcriptions exist in machine-readable form. Thus, the tenth century Venetus A manuscript (Figure 1 and Figure 2) and Jensen’s 1475 incunabulum (Figure 3 and Figure 4) contain texts of Homer and Augustine. We need systems that can use their knowledge that a given document represents texts for which transcriptions exist to decode the writing system of the document, to separate text from headings, notes, and other annotations, to recognize and expand idiosyncratic abbreviations of words within the text, to distinguish variants from errors, and to provide alignments between the transcribed text and their probable equivalents on the written page. Even if we only succeed in general alignments between a canonical text and sources such as early modern printed books and manuscripts, the results will be

significant.26 If we can improve our ability to collate manuscripts or extract useful text from otherwise intractable sources, the results will be powerful. This task requires very different OCR technology from that currently in use. In this case, we assume that our texts contain many passages for which we possess good transcriptions. The problem becomes (1) finding those quotations, (2) learning what written symbols correspond to various components of transcription, and (3) comparing multiple versions of the same passage to distinguish variants and errors. The OCR system uses a library of known texts to learn new fonts, idiosyncratic abbreviations and even handwriting. There are two measures for this category of OCR. First, there is the overall character accuracy of transcriptional output from documents that the OCR software produces by training itself with recognized quotations. Second, the ability to locate quotations of existing texts is an important scholarly task in and of itself.27 Two of the prime tasks in the German eAqua Classics Text Mining Project focus on identifying undiscovered quotations of Plato and of Greek Fragmentary Historians.28 The apparatus criticus for the Ahlberg Sallust (Figure 9 and Figure 10), for example, includes not only textual variants but testimonia – places where later authors have quoted Sallust. Such manually constructed lists of testimonia provide us with instruments with which to measure precision and recall for automated methods.

• Use propositional data already available to decode the formats in which unrecognized knowledge has been stored.

26 (Cheng 2001); (Cheng 2002)

27 Google Books offers a "popular passages" feature that seeks to identify and link quotations, work that was recently reported in (Schilit 2008).

28 http://www.eaqua.net/, accessed October 20, 2008

Printed reference works contain an immense body of information that can be converted into machine actionable knowledge. The Perseus Digital Library, to take one example, has tagged hundreds of thousands of propositional statements within reference works originally published on paper. The Liddell-Scott-Jones Greek-English (Figure 13 and Figure 14) and Lewis and Short Latin-English lexica, for example, contain 422,000 tagged citations to Greek authors and 303,000 to Latin authors (i.e., citations tagged with author numbers from the TLG and PHI canons of Greek and Latin authors). Since the structure of the dictionary articles has also been tagged, many of these citations represent propositional statements of the form SENSE-M of DICTIONARY-WORD-N appears in CITATION-P of AUTHOR-Q.

The works of many Greek and Roman authors survive only insofar as other authors have quoted or described them. These fragmentary texts are published as lists of excerpts (Figure 12). Thus, fragment 116 of the historian Ephorus in Mueller’s edition contains an excerpt from chapter 12 of Plutarch’s Life of Cimon. Each such excerpt represents the propositional statement "EXCERPT-A from CITATION-C of AUTHOR-D refers to (fragmentary) AUTHOR-X." Note that not all citations refer to the author: thus, fragment 113 of Ephorus includes a cross-reference for background information on a historical event in Herodotus, who wrote before Ephorus.

Grammars also contain well-structured information: citations within a section on contrary to fact conditionals, for example (Figure 15 and Figure 16 through Figure 18), can be converted into propositional form: e.g., GRAMMATICAL-STRUCTURE CONTRARY-TO-FACT occurs at Xenophon’s Cyropaedia, book 1, chapter 2, section 16. Fine-grained analysis of the print content can also extract quotations and their English translations that appear throughout reference grammars and lexica. Smyth’s Greek Grammar, the German Kühner-Gerth reference Greek Grammar, and the Allen and Greenough Latin Grammar contain 5,300, 21,000 and 2,000 tagged citations, respectively, within labeled sections.

Citations in indices of proper names and in encyclopedias about people and places provide similar propositional data to disambiguate references to ambiguous names: thus, the print index to Rawlinson’s Herodotus (Figure 20) distinguishes passages where Herodotus cites Alexander, a king of Macedon, from Alexander, the son of Priam who appears in the Trojan War. Encyclopedias (Figure 22) contain citations from many different sources and many different people and places with the same name. By converting the citations to links and then extracting the contexts in which different Alexanders appear, machine learning algorithms can be used to find patterns with which to distinguish one Alexander from another elsewhere. Smith’s biographical and geographical dictionaries contain 37,000 tagged citations for 20,000 entries on people and 26,000 tagged citations for 10,000 entries on places. The Perseus Encyclopedia, integrating entries from originally separate print indices, contains 69,000 citations for 13,000 entries.

A great deal of information remains to be mined from the print record and we need to be able to leverage the information already extracted to extract even more from the much larger body of reference materials available only as page images. Extraction contains at least two dimensions. In each case, we need more scalable methods:

Parsing the structure of individual documents: Even if we can recognize that "Th. 1.33" represents a citation to a text, we need to determine whether this cites book 1, chapter 33 of Thucydides’ Peloponnesian War or Idyll 1, line 33 of Theocritus. The indices shown at Figure 20, Figure 21, Figure 15, etc. illustrate some of the varying formats with which different works encode similar information.

Aligning information from different documents: Author indices distinguish different people and places with the same name in the same document, but aligning information from multiple author indices is not easy.
Is Alexander the son of Amyntas

in Herodotus the same person as Alexander the father of Perdiccas in Thucydides?

• Use existing translations of source texts to generate multilingual services such as cross language information retrieval, word sense disambiguation and other searching/translation services. There are already English translations aligned by canonical citation to more than 5,000,000 words of Greek and Latin available in the Perseus Digital Library. These provide enough parallel text to support basic multi-lingual services such as contextualized word glossing (e.g., recognizing in a given context whether oratio is more likely to correspond to "prayer," "oration" or some other word sense), cross language information retrieval (e.g., being able to generate "prayer" and "oration" as possible English equivalents of Latin oratio), and semantic searching (e.g., find all Latin and Greek words that probably correspond to the English word "prayer" in particular passages). The larger our collections of parallel text and translation, the more powerful the services can become.

We need methods to locate more translations of Greek and Latin and then to align these with their sources. In some cases, library metadata will allow us to identify translations of particular Greek and Latin works. In other cases, however, we will need to depend upon cross-language information retrieval to find translations where no machine actionable cataloguing exists (e.g., anthologies, quotations of excerpts or smaller works). Once we have identified a translation, we need automated methods to align translation and text. Figure 11 shows a best case scenario: a book where the modern translation and classical source text are printed side by side. In this case, the modern translation shares the chapter number of the Latin source text (both have "LXIV" to indicate that they include chapter 64), but the English translation does not include the finer grained section numbers in the Latin text.
We need automated methods to align the many translations now appearing in large image book collections.
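To illustrate how aligned translations can yield candidate English equivalents for a Latin word such as oratio, the following Python sketch scores word pairs by how often they co-occur in aligned passages, using the Dice coefficient. The choice of Dice is ours for the sake of a short example and is not presented as the method used by any existing system; the toy passages and the stopword list are invented.

```python
from collections import Counter
from itertools import product

STOPWORDS = {"the", "a", "of", "to", "was"}  # illustrative English stopwords


def dice_scores(aligned_pairs, stopwords=STOPWORDS):
    """Score source/target word pairs by co-occurrence in aligned passages."""
    src_count, tgt_count, pair_count = Counter(), Counter(), Counter()
    for src_tokens, tgt_tokens in aligned_pairs:
        src_set = set(src_tokens)
        tgt_set = set(tgt_tokens) - stopwords
        src_count.update(src_set)
        tgt_count.update(tgt_set)
        pair_count.update(product(src_set, tgt_set))
    return {
        (s, t): 2 * n / (src_count[s] + tgt_count[t])
        for (s, t), n in pair_count.items()
    }


def best_equivalents(scores, source_word, top=2):
    """Return the top-scoring target words for one source word."""
    ranked = sorted(
        ((score, t) for (s, t), score in scores.items() if s == source_word),
        key=lambda item: (-item[0], item[1]),  # best score first, then A-Z
    )
    return [t for _, t in ranked[:top]]


# Invented toy passages, not drawn from any real edition or translation.
toy_corpus = [
    ("oratio consulis longa".split(), "the oration of the consul was long".split()),
    ("oratio ad deos".split(), "a prayer to the gods".split()),
    ("oratio brevis".split(), "a short oration".split()),
    ("consulis domus".split(), "the house of the consul".split()),
    ("oratio pia".split(), "a pious prayer".split()),
]

scores = dice_scores(toy_corpus)
print(best_equivalents(scores, "oratio"))
```

With real parallel text the same counts would be taken over passages aligned by canonical citation, and the scores could then feed the glossing and cross-language retrieval services described above.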

CONCLUSION

Comprehensive collections of industrially scanned written materials provide historic new instruments with which to better understand and to make intellectually accessible the record of human existence. These comprehensive collections of scanned print materials are, however, not an end in themselves but instead provide the foundation on which new collections, integrating images of writing with machine actionable data, will support a new generation of services for a new generation of intellectual projects.

APPENDIX: SAMPLE PAGE IMAGES

Primary Sources

The 10th Century Venetus A MS of Homer

Figure 1. The 10th Century Venetus A MS of Homer: U4 (Allen): Marcianus Graecus Z. 458 (= 841) - the back (verso) of folio 15 (available under a Creative Commons license from Harvard’s Center for Hellenic Studies: http://chs.harvard.edu/chs/manuscript_images). The knowledge based OCR project recommended in this report would allow us to work with manuscripts as well as printed materials.

Figure 2. Detail of the Venetus A showing scholia and text.

The 1475 Jensen printing of Augustine’s De Civitate Dei

Figure 3. A page from Nicholas Jensen’s 1475 printing of Augustine’s De Civitate Dei available for public download from the Open Content Alliance (http://www.archive.org/details/augustinidecivitatedei00jensuoft/).

Figure 4. Detail of Jensen's Augustine.

Tyrrell’s Edition of Cicero’s Letters

Figure 5. Tyrrell’s text and commentary of Cicero’s Letters.

Figure 6. A detail showing Tyrrell’s commentary on the page above.

Figure 7. Textual notes in Tyrrell stored in an appendix rather than at the bottom of the page.

Figure 8. Abbreviations used in the textual notes and commentary.

Ahlberg’s 1919 Edition of Sallust

Figure 9. The opening of Sallust’s Catiline in Axel Ahlberg’s 1919 Editio Major.

Figure 10. The apparatus criticus from the page above.

Translations

Figure 11. Facing Latin text and English translation (B. O. Foster, from the first volume of the Loeb Classical Library Livy series (Cambridge 1919)).

Editions of Fragmentary Authors and Works

Figure 12. Typical page from Mueller's Fragmenta Graecorum Historicorum. Above we see an edition of a fragmentary Greek author – quotations of and allusions to the Greek historian Ephorus, whose works have been lost. Each fragment contains one or more citations to works that provide information about a particular passage in Ephorus. The format is Fragment number – Citation – Excerpt. Latin translations of the Greek excerpts appear at the bottom of the page.

Reference works

Lexica

Liddell Scott Greek-English Lexicon

Figure 13. A typical page from an edition of the Liddell Scott Greek English Lexicon (available from the Open Content Alliance: http://www.archive.org/details/greekenglishlex00liddrich/).

Figure 14. Detail from the Liddell-Scott Greek-English Lexicon

Grammars

Goodwin and Gulick’s Greek Grammar

Figure 15. A typical page from Goodwin and Gulick.

Figure 16: A paragraph with number and alphabetic section from the section on contrary to fact conditionals. This paragraph happens to appear on page 297, but the proper reference would be to paragraph 1410a.

Rutherford’s First Greek Syntax

Figure 17. A page from Rutherford’s First Greek Grammar (downloaded from Google Books).

Figure 18: The index to Rutherford’s First Greek Grammar: note that citations point to the numbered paragraphs rather than the page numbers. The index appears at the end of the book and an automated system could infer that pages were not the citation scheme because almost all of the numbers in the text above are greater than the current page (174).
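The inference described in this caption can be sketched as a simple heuristic. In the Python below, the function name, the tolerance parameter and the sample numbers are all hypothetical; only the index page number (174) comes from the caption.

```python
def numbers_are_page_refs(index_numbers, index_page, tolerance=0.1):
    """Guess whether the numbers in a back-of-book index are page references.

    An index sits at the end of a book, so genuine page references should not
    exceed the page on which the index itself is printed.
    """
    if not index_numbers:
        raise ValueError("no numbers to classify")
    too_large = sum(1 for n in index_numbers if n > index_page)
    return too_large / len(index_numbers) <= tolerance


# Invented sample of reference numbers; only the index page (174) is sourced.
sample_refs = [212, 305, 98, 341, 276, 190, 333]
print(numbers_are_page_refs(sample_refs, index_page=174))
```

A small tolerance is allowed because OCR errors may inflate an occasional number even in a genuinely page-based index.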

Information about People, Places, Organizations and other Named Entities

Figure 19. Section from the index to the Loeb Edition of Thucydides. In this case, the index uses the canonical book/chapter/section citation scheme, using upper case Roman numerals for books, lower case Roman numerals for chapters and Arabic numbers for sections.
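A citation scheme of this kind is regular enough to be parsed mechanically. The following sketch assumes that a citation is printed as book, chapter and section separated by full stops (for example "V. xliii. 2"); that punctuation, the function names and the sample entry are assumptions of the example and are not transcribed from the Loeb index.

```python
import re

ROMAN_VALUES = {"i": 1, "v": 5, "x": 10, "l": 50, "c": 100, "d": 500, "m": 1000}

# Assumed printed form: upper-case Roman book, lower-case Roman chapter,
# Arabic section, e.g. "V. xliii. 2".
CITATION = re.compile(r"\b([IVXLCDM]+)\.\s*([ivxlcdm]+)\.\s*(\d+)")


def roman_to_int(numeral):
    """Convert a Roman numeral (either case) to an integer."""
    total = 0
    previous = 0
    # Read right to left: a smaller value before a larger one is subtracted.
    for char in reversed(numeral.lower()):
        value = ROMAN_VALUES[char]
        if value < previous:
            total -= value
        else:
            total += value
            previous = value
    return total


def parse_citations(index_entry):
    """Return (book, chapter, section) triples found in one index entry."""
    return [
        (roman_to_int(book), roman_to_int(chapter), int(section))
        for book, chapter, section in CITATION.findall(index_entry)
    ]


# Invented index entry, for illustration only.
print(parse_citations("Headword, V. xliii. 2; VI. viii. 2"))
# [(5, 43, 2), (6, 8, 2)]
```

Because the pattern relies on letter case to tell books from chapters, it would need adjusting for an index that prints both in the same case.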


Figure 20. Index to Rawlinson’s Herodotus. In this case, the citations point to the particular volumes and page numbers of this translation rather than to the conventional book and chapter references. The conventional references are, however, printed on the pages themselves, and we could convert the idiosyncratic citations above to a more standard format by checking vol. 3, page 187, for example, to determine that Alexander appears in Herodotus, book 5, chapter 17.
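One way to picture this conversion is as a lookup against a table recording the page on which each chapter begins. The sketch below is hypothetical throughout: the table `CHAPTER_STARTS`, its page numbers and the function `to_canonical` are invented for illustration, and only the target of the example (vol. 3, page 187, corresponding to book 5, chapter 17) comes from the caption.

```python
from bisect import bisect_right

# Hypothetical table: for each volume, the page on which each chapter begins,
# paired with its canonical (book, chapter) reference. The page numbers are
# placeholders, not read from Rawlinson's translation.
CHAPTER_STARTS = {
    3: [(183, (5, 15)), (185, (5, 16)), (186, (5, 17)), (189, (5, 18))],
}


def to_canonical(volume, page):
    """Map a (volume, page) citation to the chapter in progress on that page."""
    starts = CHAPTER_STARTS[volume]
    first_pages = [first_page for first_page, _ in starts]
    # Find the last chapter that begins on or before the cited page.
    position = bisect_right(first_pages, page) - 1
    if position < 0:
        raise ValueError(f"page {page} precedes the first recorded chapter")
    return starts[position][1]


print(to_canonical(3, 187))  # (5, 17)
```

A page can contain the end of one chapter and the start of the next, so a real conversion would return candidates for a human reader to confirm rather than a single answer.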

Figure 21. Page from vol. 1 of Smith’s Dictionary of Greek and Roman Biography (1848).


Figure 22. Detail from Smith’s.


Figure 23. Detail from the article on Alexander I from Smith’s Dictionary above.



CONCLUSION: CYBERINFRASTRUCTURE, THE SCAIFE DIGITAL LIBRARY AND CLASSICS IN A DIGITAL AGE

CHRISTOPHER BLACKWELL
FURMAN UNIVERSITY

GREGORY CRANE
TUFTS UNIVERSITY

ABSTRACT

We can already begin to envision research projects that were scarcely, if at all, feasible in print culture. The papers in this collection allow us as well to enumerate the services and publication types on which emerging scholarship depends. We also need models for publication that meet the needs and realize the potential of the digital media and we describe here the Scaife Digital Library, a concrete example of true digital publication.

I look upon the discontent of the literary class, as a mere announcement of the fact, that they find themselves not in the state of mind of their fathers, and regret the coming state of mind as untried, as a boy dreads the water before he has learned that he can swim. If there is any period one would desire to be born in,–is it not the age of Revolution; when the old and the new stand side by side, and admit of being compared; when the energies of all men are searched by fear and by hope; when the historic glories of the old, can be compensated by the rich possibilities of the new era? This time, like all times, is a very good one, if we but know what to do with it. (Emerson, The American Scholar).

Every human individuality is an idea rooted in actuality, and this idea shines forth so brilliantly from some individuals that it seems to have assumed the form of an individual merely to use it as a vehicle for expressing itself. When one traces human activity, after all its determining causes have been subtracted there remains something original which transforms these influences instead of being suffocated by them; in this very element there is an incessantly active drive to give outward shape to its inner, unique nature. (Wilhelm von Humboldt, "Lecture to the Prussian Academy," 1821)

When Emerson addressed Harvard’s Phi Beta Kappa Society in 1837, slavery was still an established institution. Those in Massachusetts who favored its abolition, such as William Lloyd Garrison, were the dangerous radicals of their day, and those who, like the author Lydia Maria Child, suggested racial equality found the doors of polite society slamming shut in their faces. Many twenty-first-century readers will note the linguistic assumption that scholars are boys, fathers and men. Revolution has its own logic and revolutionaries should never forget that the critical pose which they apply to the present and the past will turn itself upon them when they have themselves passed into history – if, of course, they are so fortunate as to touch the historical memory of succeeding generations. If, in decades and generations to come, students of the ancient world read these words, we cannot now say where they may pause to wonder at how prescient the members of this early generation had been or where they may cringe and squirm. But all of those who contributed to this collection have dedicated their lives to a love for the past and that love allows us to embrace the future. The authors of this collection cannot predict what course events will assume or how they will appear to those who follow, but they have recognized the revolution of their own time and all have taken action to carry this revolution forward.

Emerson does not really define the title of his talk, but for those of us who contributed to this collection, whether we happen to live in the United States or not, Ross Scaife embodied the best qualities that a phrase such as the "American Scholar" might suggest. Of course, Ross happens to have lived his life in the United States: born and educated in Virginia, trained as a scholar in Texas, Ross fashioned a home in Kentucky – and the many who had the privilege of visiting that home know how much literal truth there is in that statement. But Ross was a man unmoved by social convention or established authority. For him, the future was one of boundless possibility and the past was not a burden, but a foundation on which to build. He feared neither change nor continuity but evaluated each on its own merits according to the values that had grown strong within his heart. And like every true scholar from every nation and every period, he loved both grand ideas and the people around him with equal warmth.

A generation from now, the course that classical studies and the humanities in general have taken may seem to have been a natural outgrowth of the early twenty-first century. And, indeed, we cannot say to what extent the larger forces at work within society may constrain the shape that our field will assume. But those of us who knew Ross also saw a man who anticipated far ahead of his fellows the importance of making our ideas accessible to the widest possible audience. The original proposal that secured funding for the Stoa called for a new generation of publications that would be designed from the start to be intellectually as well as physically accessible to an audience far beyond the narrow channels of twentieth-century academic discourse. Blackwell and Martin in this collection articulate how this vision was, in fact, realized: Stoa publications such as Blackwell’s Demos1 bring the broader public directly into contact both with his interpretation of Athenian democracies and with the primary sources on which his interpretations are based.
Ross was among the first to recognize the importance of making our publications fully open – it is not enough to provide a single perspective via a single web site with primary and secondary sources. We need to make the source materials accessible – others

1 http://www.stoa.org/projects/demos/home


need to be able to download what we produce, apply their own analytical methods, and even build new derivative works on what others have done. It is already difficult for us to remember how radical and far-sighted Ross was years ago. He had the vision to see what was obviously wrong at the time but would become obviously correct in the future. Ross embodied that profound originality that Humboldt describes in those who produce the times of which we are all products. In this conclusion, we synthesize some of the themes outlined and work described in the previous papers. We recall the categories of ePhilology and eClassics, first discussed in the introduction, and use these two categories to characterize two fundamental advances now becoming possible: our ability to begin increasingly complex intellectual projects with greater command of the underlying data and to answer finally the challenge, articulated in Plato’s Phaedrus, that written words cannot explain themselves. We then shift to describe some of the basic services and collections that must be a part of any Cyberinfrastructure for classics and humanities. From there, we list the requirements for publication for a Cyberinfrastructure in which automated systems and broad based communities interact in novel, complex ways with our primary and secondary sources. We then describe the Scaife Digital Library (SDL), an open effort that integrates primary and secondary sources, has an immediate core of classical materials but also can manage content from many disciplines, and embodies more broadly and perfectly than any other effort with which we are familiar the needs of advanced research. While the SDL represents, in our view, a major step forward for classical studies and ultimately, we hope, for other disciplines, the SDL builds directly upon foundations that Ross Scaife laid over his decade of work on the Stoa Publishing Consortium.2

2 http://www.stoa.org/

OPPORTUNITIES: EPHILOLOGY AND ECLASSICS

And one day they taught Hesiod glorious song while he was shepherding his lambs under holy Helicon, and this word first the goddesses said to me – the Muses of Olympus, daughters of Zeus who holds the aegis: Shepherds of the wilderness, wretched things of shame, mere bellies, we know how to speak many false things as though they were true; but we know, when we will, to utter true things. So said the ready-voiced daughters of great Zeus, and they plucked and gave me a staff, a shoot of sturdy laurel, a marvelous thing, and breathed into me a divine voice to celebrate things that shall be and things that were before; and they bade me sing of the race of the blessed gods that are eternally, but ever to sing of themselves both first and last. (Hesiod, Theogony 21-34, after Evelyn-White)

The Muses gave Hesiod a staff, and for the poet that is enough – few, if any, have produced poetry that has exerted such a spell over so many people from so many periods of time and disparate cultures as have the works of Hesiod and the Homeric Epics. All of us who live the life of the mind, whether we are poets or professors, follow our Muses. The staff that we have now taken into our hands is still rough and we are learning its balance and heft, but already we can begin to glimpse the stories that we will be able to see when the inspiration of our new muses takes full hold. The introduction to this collection distinguished two goals within a digital world. On the one hand, ePhilology emphasizes the role of the linguistic record in producing and organizing ideas and information about the ancient world. We use eClassics, by contrast, to describe Greek and Latin languages and literatures, wherever and whenever produced, as they live within our physical brains, touch our less tangible hearts and shape our actions in the world around us. We return now to these topics, suggesting how a Cyberinfrastructure, including both comprehensive collections and advanced, domain optimized services, can advance each of these goals. Memographies allow philologists to explore vast topics far too large for individual scholars in print culture. Plato’s challenge allows us to appreciate the magnitude of the opportunities before us now, as we can finally begin to address a critique of the static written word that is more than two thousand years old.


ePhilology and Memographies

My mother Thetis tells me that there are two ways in which I may meet my end. If I stay here and fight, I shall lose my safe homecoming but I will have a glory that is unwilting: whereas if I go home my glory will die, but it will be a long time before the outcome of death shall take me. (Achilles’ choice, Homer, Iliad 9.410-416, tr. Butler/Nagy)

It is easy to see how we can, in a digital environment, pursue our research topics more extensively than was previously possible. We have also described how we can make the sources of antiquity intellectually accessible to new audiences. We now turn to the question of what research questions we can pursue that would not have been feasible without collections that are, if not exhaustive, at least large enough to be representative of the published record available in print. Consider a monolingual printed corpus such as English language newspapers in the 19th century United States. The 1869 Rowell Newspaper Directory3 for the United States and Canada lists more than 5000 newspapers that were printing more than 20,000 unique pages a week and thus more than 1,000,000 pages per year. If we take 5,200 pages of one newspaper (the Civil War era Richmond Times Dispatch4) as a rough indicator of words on a typical newspaper page (c. 5,000), North American English language newspapers printed perhaps 50 billion words each year in the late 1860s. If we simply analyzed these newspapers, we could open up whole new lines of inquiry, tracking a range of topics: Which newspapers reprinted stories from which? What sorts of things did people say in newspapers from different parts of the country with different party

3 George P. Rowell and Company’s American Newspaper Directory. New York: Geo. P. Rowell & Co. 1869. http://www.perseus.tufts.edu/hopper/text.jsp?doc=Perseus:text:2001.05.0301
4 See http://www.perseus.tufts.edu/hopper/collection.jsp?collection=Perseus:collection:RichTimes


affiliations about slavery over the course of time? What poetry and fiction appeared in these newspapers? What products were advertised? All of these are eminently tractable problems: we don’t need perfect transcriptions or perfect services to begin identifying the trends behind these topics. If we begin to think about 19th century newspapers in other languages around the world, the challenges and opportunities become even greater. Clearly we can begin to pursue topics that require analysis of much more data than any human being can see, much less contemplate. We can begin to trace topics that have a life in human tradition that goes beyond any single period or immediate context. Such topics have lives of their own. We can now write histories or (to pursue the metaphor of living things) biographies of these topics. The geneticist Richard Dawkins coined the term meme in 1976 to describe the cultural counterpart to biological genes: memes include any thoughts or behaviors that can be passed from one person to another and examples include "thoughts, ideas, theories, gestures, practices, fashions, habits, songs and dances."5 The term meme provides a useful concept because it stresses the autonomy of ideas as they circulate through our biological brains and storage technologies. The concept of a meme allows us to consider both information about a historical topic that existed in the material world (e.g., the life of the historical Alexander the Great) and topics that have a life of their own (e.g., Alexander as a hero of Iranian folk tales). We use the term memography to describe the history of a meme within a larger body of material. While the term meme may be new, the underlying concept is not. The continuous tradition of European literature begins with the Iliad and Achilles’ choice. He knows from his mother, the goddess Thetis that he may choose a long but unremarkable life, soon forgotten, or he may die young but win undying fame. 
Undying fame means that others will speak about him and what he accomplished at Troy forever. He can trade life as a biological entity for

5 (Dawkins 1976). The above is drawn from the Wikipedia entry on the topic: http://en.wikipedia.org/wiki/Meme (accessed August 23, 2008).


life as an idea that the songs of the epic tradition will pass from one mind to another. The physical Achilles will pass sooner or later. But this new entity – this object of thought and memory – will never die. Achilles knows ahead of time that his death would secure for him the goal of all great heroes: he would become a meme – a meme that succeeded because it jumped from the medium of oral transmission into a network of material information technologies. The biological Plato, likewise, vanished more than two thousand years ago but his writings have been copied ever since and the historical Plato continues to exist as the topic of discourse. Scholars could, in print culture before the advent of searchable texts, laboriously track down many Platonic testimonia, e.g., the explicit quotations and most obvious allusions to particular passages in Plato. German classicists have begun to apply text mining algorithms to search for quotations and allusions that previous generations missed.6 If we wanted to understand the role of Plato and the ways in which others have quoted and used his dialogues, we would need to work in every language where Plato was influential. This would include not only such common languages of classical philology as Latin, English, French, German and Italian, but virtually every European language that left behind a substantial body of written discourse. If we then consider that Plato has had a major presence within Islamic thought and realize that we will need to consider Arabic and Persian as well, it quickly becomes clear that no single scholar can create from the primary sources a global overview of Plato’s influence from antiquity through the present. The nineteenth-century newspapers mentioned above present just another component from the sources that shed light on who said what about Plato. 
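The scale of that newspaper component can be made concrete by multiplying out the figures quoted earlier in this section (more than 20,000 unique pages a week, roughly 5,000 words a page). The sketch below is only back-of-envelope arithmetic; the function name and the 52-week year are assumptions of the example. Taken at face value, these inputs give about 5.2 billion words a year, roughly a tenth of the 50 billion suggested earlier, so either the inputs are conservative lower bounds or the larger total should be treated with caution.

```python
def estimate_words_per_year(pages_per_week, words_per_page, weeks_per_year=52):
    """Return (pages per year, words per year) for a newspaper corpus."""
    pages_per_year = pages_per_week * weeks_per_year
    return pages_per_year, pages_per_year * words_per_page


# Inputs as quoted in the text: 20,000 pages a week, c. 5,000 words a page.
pages, words = estimate_words_per_year(pages_per_week=20_000, words_per_page=5_000)
print(f"{pages:,} pages per year")   # 1,040,000 pages per year
print(f"{words:,} words per year")   # 5,200,000,000 words per year
```

Even the lower figure is far more text than any reader could examine, which is the point the argument depends on.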
6 For more on the eAQUA project, see their website http://www.eaqua.net/ (accessed October 1, 2008).

In an age of very large collections, we can, however, begin to design systems that will provide automatic visualizations of topics such as Plato and Plato’s works.

• Named entity analysis finds passages that refer to Plato the philosopher, filtering out those passages that refer to other figures of the same name (e.g., the Athenian Comic poet named Plato).
• Quotation identification finds direct quotations and paraphrases of passages in Plato.
• Cross language information retrieval extends named entity and quotation identification to multiple languages (e.g., Arabic, Chinese, Latin, English, French, German, Italian, Russian and other languages for which major cross-lingual resources are available).
• Text mining identifies words and phrases that appear in conjunction with references to and quotations of Plato. These words and phrases allow us to discover common ideas associated with Plato across different genres and periods.
• Machine translation links similar words and phrases associated with Plato in multiple languages, identifying cross-lingual cultural units.
• Visualization systems allow readers to track, for example, where and how often Plato’s Republic has been discussed, what passages have been most examined, and what sorts of things people have said about Plato, whether in Berlin or the Iranian university city of Qom.
• Customization and personalization services then provide individual analysts with relevant materials in languages that they understand as well as machine translation and interactive translation support services to help them with languages in which they have little or no fluency. Thus, the system might present scholars of Islamic thought with translations of Plato and translation support geared to their particular knowledge of Greek.

Each of the above and similar processes is analogous to the sensors by which scientists track data in the material world. Each of the above processes will produce noise as well as a usable signal. The results will not, of course, be scholarship, but rather data within which patterns can emerge to stimulate scholarship – in the end, human beings will have to contemplate what the systems have found.
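To make one of these services concrete, the following sketch shows a deliberately simple form of quotation identification: it reports the runs of four consecutive words that a later passage shares with a source text. This is a stand-in technique chosen for brevity, not the approach of eAQUA or of any system named here; the function names and the "later" sentence are invented, and the source sentence is the English wording of Phaedrus 275d as quoted in this chapter.

```python
import re


def word_ngrams(text, n=4):
    """Return the set of n-word sequences in `text`, ignoring case and punctuation."""
    words = re.findall(r"\w+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def shared_passages(source, candidate, n=4):
    """Return the word n-grams that `candidate` has in common with `source`."""
    return word_ngrams(source, n) & word_ngrams(candidate, n)


source = ("if you question them, wishing to know about their sayings, "
          "they always say only one and the same thing")
# Invented sentence standing in for a later author who echoes the passage.
later = ("As Socrates warns, written words always say only one and "
         "the same thing to every reader.")

overlap = shared_passages(source, later)
print(len(overlap))  # 5
print(("one", "and", "the", "same") in overlap)  # True
```

Exact overlap of this kind finds only verbatim reuse within one language, and it produces exactly the noise described above (common phrases will match by accident); paraphrase and cross-language quotation need the further services listed in this section.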
They will refine the questions that they ask, contemplate the results again, and then repeat their analysis in an iterative process. But, despite all the noise within the system, we will quickly start to see patterns about who has said what at various times about which passages of Plato in a variety of languages.

If we consider established genres of reference work such as lexica, grammars, encyclopedias and editions, we can see that a wide range of topics constitute memes that we could now begin to study.

• People and places: Any major person (Shakespeare, Abraham Lincoln) and place (Rome, Athens) has a history within human imagination that goes far beyond anything that we could analyze with traditional means.
• Languages: Few scholars study Latin in its entirety, much as they may wish to: no one can become sufficiently familiar with all the communities who wrote in Latin for more than 2,000 years to describe the language as a whole. We can, however, already begin to track patterns of syntax, style, and lexicography as they change in different genres and periods.
• Abstract concepts: Some concepts aggressively attempt to transcend language and cultural barriers. Thus official Catholic doctrines are in theory designed to be comprehensible to any speaker of any language. The Pythagorean theorem both points to a mathematical concept and comprises a metaphor for mathematical knowledge with its own history.
• Texts: The Greek texts of Plato’s Republic and the Christian New Testament are both textual entities that have their own existence, in an open-ended set of language versions, as complete versions, as the source for quotation, and as a foundation for allusion.

No one will ever be able to see, much less read and contemplate over time, the primary sources underlying broad topics such as the history of Latin over two thousand years or even the reception of Plato. Of course, this is hardly new: no living humanist publishing on major canonical authors such as Homer or Shakespeare can claim to have read and pondered more than a subset of conventional published scholarship in the conventional languages of European and American scholarship.
But the rise of large collections and emergent systems with which to analyze those collections allows us to shift our stance away from the limits of what we can read with our two eyes and towards the challenges of working with


machines that can scan large bodies of material and then (as we will see through the discussion of Plato’s challenge below) allow us to focus in detail on passages in more languages and from more contexts than was possible before. A memography contains elements that are deeply traditional in form and general purpose, even if it represents an engagement between author, reader and source materials so quantitatively broader in scope as to constitute a radical change. Demos has only begun to adapt the scholarly monograph to a digital form but it already illustrates the increased connection between argument and primary source that we expect from a memography: Demos covers a major topic – Athenian Democracy – which had grown so heavily worked that many publications on the topic stopped providing direct citations to the primary sources on which their conclusions were based. Demos provides, wherever possible, not only citations to the primary sources on which each statement is based but also explanatory information – briefing materials, in effect – to support critical analysis of the sources once found. The history of Athenian democracy is a tractable subject. The history of the reception of Athenian democracy that includes Islamic and Western views of Athenian democracy is a memography. Thomas Martin’s Overview of Greek Civilization (also published separately as a book by Yale University Press)7 was published as an on-line product for the original Perseus CD ROM publications and constitutes a subject so vast as to justify a memography. A memography, in effect, applies the same principles to even larger topics and immediately requires automated methods. Demos exploited the digital environment of the early 21st century to more fully realize the ancient goals of the monograph. 
When Thucydides invented the form of the scholarly monograph, there were neither libraries of stable written sources nor, had such collections existed, systems whereby he could reliably cite these sources. In the historic passage where he describes his methodology, Thucydides reports that he has sifted the evidence and generated from this his best analysis of what happened (Thuc. 1.22).

7 (Martin 2006)


During the course of antiquity we find authors who begin to quote other authors and we begin to find references to previous works by author, work and even chapter-length "books" (the amount that a papyrus scroll could conveniently store). Print technology allowed us to refine these citations so that we could describe precise variations between multiple editions of a single work. Demos set out in the twenty-first century to restore to discussion of Athenian Democracy the connection between statements and primary sources afforded by the citation system on which Edward Gibbon, the historian of Rome, could already rely in the eighteenth century.

Characteristics of a memography include:

• Citation: A memography contains citations between statements and the evidence on which they are based. A memography differs from a traditional monograph because in a memography we know that authors have only been able to scrutinize a subset of the evidence cited. Citations in a memography include versioned queries: we can thus see what evidence was available at the time when the memography was completed and how that evidence has subsequently changed as new sources come on-line, existing analytical tools become more powerful or wholly new services emerge.
• Scale: A project becomes a memography as its scope brings in more primary materials than a single human author can effectively analyze. Topics so vast that authors in print culture needed to focus their work on synthesizing specialized studies and could not base their work primarily upon the primary sources would be subjects for memographies. The author must depend upon techniques such as sampling and automated analyses. A memography of George Washington would, for example, require, as one foundational dataset, the relative frequency of references to George Washington in multiple periods, genres, languages and cultural contexts. Such figures would require automated named entity analysis applied to very large collections. The memography would include a human author’s assessment of the accuracy of the automatically generated data.
• Heterogeneity: Memographies include not only more content than authors can review but content that assumes


more categories of background knowledge than individual authors can expect to acquire. Such barriers can be language, cultural background, mathematics and any other topic. The history of mechanics could thus justify a memography because it requires not only a substantial understanding of mathematics and physics but sources produced over millennia and across Europe, North Africa and the Middle East in Greek, Latin, Arabic and every European language. Memographies thus require scalable, automated systems that can provide customized background information with which readers can examine and manually analyze any given object referenced. Thus, readers without training in Arabic but familiar with other languages and with the underlying scientific contexts can use automated morphological analyses, links to an on-line dictionary, and existing translations in languages that they do understand to pull apart Arabic source texts and determine which words are used in particular contexts to describe key concepts.8 Whether we are producing or reading (or both), most memographies will force us to interrogate primary materials from more contexts, linguistic, cultural or both, than we can expect to have studied in detail – the most powerful memes will work their way across time, genre, language and culture and it is this very quality that leaves a trail too long and complex for any single human mind. We must look to machines which can find and preprocess material relevant to a given meme through immense bodies of data.

The heterogeneity of background knowledge brings us again to the need for a Cyberinfrastructure. The German-US Archimedes Project was able to assemble the machine readable dictionaries, online source texts, morphological analyzers, annotation systems and other resources needed to explore the history of mechanics. Scholars without training in Arabic were, for example, able to work effectively with materials in Arabic. Almost two decades ago, a formal evaluation of students using the first generation of Perseus reading tools had already demonstrated that students with no knowledge of Greek could produce analyses of Greek texts that, in the view of external evaluators, matched the performance of students with advanced training in the language.9 One major purpose of a Cyberinfrastructure is to generalize these results, providing a platform in which an increasing number of topics receive increasingly sophisticated services with even more dramatic results than those obtained by Perseus, Archimedes and other individual projects.

8 http://archimedes2.mpiwgberlin.mpg.de/archimedes_templates/project.htm; http://archimedes.fas.harvard.edu/

New technologies can help us locate relevant currents in the vast oceans of source material but we will still need to descend from our overview and think carefully about some subset of the sources. While we will never be able to read everything, it becomes all the more important for us to ponder a few things, carefully selected, in great detail. In the past, practical issues such as language were fundamental barriers: if we found a text in a language that we could not read and did not have a human informant or translation, then we could literally do nothing. That condition has begun to change. This leads us to the topic of eClassics and Plato’s Challenge.

eClassics and Plato’s Challenge

Socrates: Writing, Phaedrus, has this strange quality, and is very like painting; for the creatures of painting stand like living beings, but if one asks them a question, they preserve a solemn silence. And so it is with written words; you might think they spoke as if they had intelligence, but if you question them, wishing to know about their sayings, they always say only one and the same thing. (Plato, Phaedrus 275d)

Socrates: When one says "iron" or "silver" we all understand the same thing, do we not?

9 (Marchionini 1994)


Phaedrus: Surely.
Socrates: What if he says "justice" or "goodness"? Do we not part company, and disagree with each other and with ourselves?
Phaedrus: Certainly. (Plato, Phaedrus 263a)

In a famous paper, published in 1950, Alan Turing proposed what has since been called the Turing test: a machine demonstrates intelligence when we cannot tell whether we are conversing with a human or a machine.10 We propose a simpler challenge based upon a critique posed in Plato’s Phaedrus. In that dialogue, Plato’s Socrates critiques writing as inert and voiceless – we can no more ask the written word to explain itself than we can carry on a conversation with a painting, however lifelike it may appear. In a digital world, however, we can begin to address this ancient critique: manually edited hyperlinks and search engines are only initial instruments by which readers can make digital texts less opaque than their print counterparts. Named entity analysis addresses questions such as "to which Alexander does this particular passage refer?" and enables services such as plotting the right Alexandria for a given passage on a map for the relevant chronological period. Simple dictionary look-up tools answer questions such as "what does this word mean?" Word sense disambiguation systems allow us to determine the probability of a particular word sense in a given context (e.g., Latin oratio as "speech" vs. "prayer"). Text mining systems elicit key words and phrases by which documents can begin to describe what they are about. We may be a long way from a meaningful answer to the Turing test, but even relatively simple technologies have allowed us to make progress against the challenge that Plato leveled against information technology two and a half millennia ago.

Addressing Plato’s challenge has important implications for the problems that humanists choose to address. In 1972, Jacques

10 (Turing 1950)


Derrida published an essay, translated into English in 1981 as "Plato’s Pharmacy",11 that featured the critique of writing in the Phaedrus. In that dialogue, one speaker rejects the claim that writing aids memory – writing is not a medicine but a poison, encouraging us to depend upon writing and weakening our memories. Derrida’s essay probes the limitations of what we can express in language and thus, innovative as it may have seemed, reinforces the traditional scholarly focus upon questions that are obscure and may indeed have no final answer (e.g., topics prominent in the Phaedrus such as love and truth). If we address the Phaedrus test, however, we find ourselves looking at scholarship from the opposite direction. Derrida pondered the limitations of language and logic. More traditional scholars such as the Latinist D. R. Shackleton Bailey pondered the best variant reading in a given text or to which Antonius a particular passage referred. Both focused upon the extremes, where language or the historical evidence at our disposal had not been sufficient for human analysis to generate a final decision. Scholarship largely focused on outliers. In addressing Plato’s challenge, we focus less upon the 2% of instances where we cannot readily determine to which Antonius an author refers than upon the other 98% where any reader, familiar with the context, can determine the intended referent. To address Plato’s challenge, we need to maximize a machine’s ability to recognize the dizzying number of simple referents that expert readers understand without conscious effort. We shift from pondering the un-decidable to representing deceptively simple operations in machine actionable form that we can apply billions and billions of times. While we will continue to ponder the meaning of concepts such as "justice" and "goodness," we now need systems that can reliably distinguish "iron" as metal from the verb by which we press clothing. 
In classics, we could use a lexicon with more up-to-date information on the various meanings of the Greek word ἀρχή, but we need systems that can, with reasonable accuracy, distinguish where Greek ἀρχή corresponds more closely to English "beginning" or "empire."

The introduction to this collection has already called for a Cyberinfrastructure, including both collections and services, that can make an ever increasing body of knowledge about the Greco-Roman world intellectually, as well as physically, accessible to an ever widening global audience, supporting many languages and cultural backgrounds. To accomplish this goal, we need not only clever software and well-curated knowledge sources but vast collections from which we can harvest increasingly large amounts of machine actionable knowledge.

11 (Derrida 1981)

CLASSICS AND CYBERINFRASTRUCTURE

The articles in this collection document a range of efforts, each of which is farther along today because of Ross Scaife’s patient and indeed loving support. We see no other field within the humanities that has made comparable material progress towards – or, even more important, fostered a comparable community to develop and then use – the infrastructure on which all of the humanities must depend in a digital world. In this section, we outline a plan forward and argue that any Cyberinfrastructure for the humanities as a whole should begin with classics.

The center of gravity for intellectual life in every developed or developing society is now digital and humanity has already begun to arrange an infrastructure around that new center. The term Cyberinfrastructure, however, emerges from the National Science Foundation (NSF) of the United States and it was the NSF that funded the workshop from which this collection emerges.12 We therefore begin this section by engaging with a discussion that may

12 The NSF first began to use Cyberinfrastructure as a strategic term in a 2003 report often called the "Atkins Report", Atkins, Daniel E., et al. (2003). "Revolutionizing Science and Engineering Through Cyberinfrastructure: Report of the National Science Foundation Blue-Ribbon Advisory Panel on Cyberinfrastructure." http://www.nsf.gov/od/oci/reports/atkins.pdf (Accessed on October 1, 2008).


at first appear peculiar to the United States. In fact, our argument applies as well to Europe, China and every nation, large and small: if we are to prosper in the present, we must better understand the pasts of every community with whom we come in contact. Each nation needs Cyberinfrastructure that can not only preserve, augment, and export ideas about its own cultural heritage but that can import and make as intellectually accessible as possible the cultures, histories, and languages from the rest of humanity. Europe may preside over more languages and a longer historical record, but Europe is not the world. Nor is China, or India, or the Middle East – or all of these together. Every point on the globe is connected. Every cultural community must be prepared to interact with every other, whether history has, for better or worse, already bound them together for millennia or the contingencies of history have kept them apart. Within this larger context Greco-Roman antiquity provides a logical starting point for development. Several reasons stand out: First, Greco-Roman antiquity provides a cultural heritage that is fundamentally international. The Greco-Roman world physically stretched from Ukraine to Spain, from Morocco to Iraq, and from England to the Sahara. Intellectually, the Greco-Roman world provides a foundation for the entire Western Hemisphere. The two largest entities within this space, the United States and the European Union, must collaborate with each other and with every other group that can contribute. A focus upon Greco-Roman antiquity can thus balance the focus upon cultural heritages for which particular nation states must take responsibility. In the United States, we run the risk of replicating in our cultural infrastructure the Anglophone, geographically isolated, culturally leveling tendencies of our history and not preparing for the multi-lingual, physically interconnected, culturally complex world in which we actually live. 
Any Cyberinfrastructure for classics should draw seamlessly and naturally upon resources scattered across the globe. Second, though this collection has focused primarily upon the textual record, the vast body and variety of data about the ancient world come from archaeology. The study of the Greco-Roman world demands new international practices with which to produce and share information. The next great advances in our understanding of the ancient world will come from mining and visualizing the full record, textual as well as material, that survives from or talks


about every corner of the ancient world. Individual nations will be best able to document the physical remains within their borders by integrating locally produced data in international networks of interoperable data. Cyberinfrastructure for Greco-Roman antiquity provides strong, constructive motives for individual ministries of culture and similar institutions to think globally as well as locally. Third, beyond the influence of any one nation there exists today a finite textual corpus that has exerted and continues to exert, directly and indirectly, an immense influence upon human life. Much of this textual corpus and an increasing body of machine actionable knowledge associated with it are already available under open licenses. Fourth, Greco-Roman antiquity demands a general architecture for many historical languages. Even if we focus upon Greek and Latin, once we begin to contextualize these languages, we will find that we need to work with materials about the ancient Near East of which Greece was one component and thus with languages such as Sumerian, Akkadian, Hittite, Old Persian, Coptic and Hebrew. As we consider the reception and influence of Greco-Roman culture, we must work with Syriac and Arabic, as well as with every language of Europe. To work with so many historical languages, we must develop an architecture that can integrate language specific content and services with general services. While we may focus initially on the languages and cultures of the Mediterranean and the Near East, these subjects, daunting as they may be, provide only a component of an environment that must include the historical languages and cultures of the Indian subcontinent, Asia and the rest of the world. Fifth, contemporary classical scholarship is multilingual. Many scientific disciplines manage the language problem by concentrating their publications in English. 
North American and European classicists alike are conventionally responsible for anything written in, as a minimum, English, French, German and Italian, while classical scholarship appears in Spanish, Modern Greek, Russian, Croatian, Dutch, and any other language spoken by classical scholars.


Technologies such as cross language information retrieval (CLIR) are well-established and would be essential in a field such as classics, where scholars want to pose queries in one language to retrieve results in at least four modern languages for which they are officially responsible.13 Classics is one of the most fundamentally multilingual intellectual communities in the academy and provides the best humanities community within which to explore genuinely multilingual infrastructures. Sixth, our knowledge of the Greco-Roman world casts light upon residents of areas that were at some point part of the Greco-Roman world who are not professional academics. We have natural audiences who speak not only every language of Europe but Arabic, Farsi and Turkish. We must address the challenges not only of professional academics with extensive linguistic training in a handful of languages but of general audiences as well. Seventh, classical scholarship begins the continuous tradition of European literature and continues through the present. Classicists have in recent years led projects on topics such as the history and topography of London, multitexts of Marlowe and Shakespeare, the history of science, 19th century newspapers, and the American Civil War. These have provided us with tangible grounds to argue that the problems of classical studies raise a superset of issues that appear in the humanities before the rise of time-based media such as films and sound. An infrastructure that provides advanced services for primary and secondary sources on classical Greek and Latin includes inscriptions, papyri, medieval manuscripts, early modern printed books, and mature editions and reference works of the 19th and 20th centuries. Even if we restrict ourselves to textual sources, those textual sources provide heterogeneous data about the ancient world. If we include the material record, then we need to manage videos and sound about the ancient world as well. 
A major classics development project should have allied projects, sharing the same infrastructure in representative domains (e.g., the History of Science, early modern studies, 19th century Anglo-American history and literature). Eighth, classicists have already devoted a generation to developing collections and services. They need a more robust environment and are ready to convert project-based efforts into a shared, permanent infrastructure. They have begun to outgrow the physical systems which they can, as projects, reasonably support. We thus shift discussion to the collections and the services that have already been developed to describe what is now feasible in this field.

13 For a survey of recent work, see the Cross Language Evaluation Forum, which has been evaluating CLIR since 2000: http://www.clefcampaign.org/

Services for eClassics

Services define what we can accomplish. We develop collections in conjunction with services – even if that service consists solely of a mechanical lookup (e.g., call up a particular passage by chapter and verse). We cannot call up Homer, Iliad, book 9, lines 44-48 unless we have a digital text and structural markup for books and lines of Homer’s Iliad. Backend services capture those processes that are available automatically for all textual materials. Every classification service implies browsing, search and visualization services: i.e., if we identify commentaries among OCR-generated text, we can search for all commentaries; if we recognize that fecit is a form of facio (Latin "to do, make"), then we can query facio and retrieve fecit. These services provide data on which customization and personalization services draw and to which users respond with corrections and additions. Much fundamental work remains to be done on discovering and perfecting services relevant to classical studies that technology already enables. Ultimately, the decisions become social – technology establishes what is possible but only those engaged in the study of classics can assign, whether by conscious decision or default action, relative values to the services that could be built. 
Nevertheless, after decades of collection development within the field of classics, a number of services have begun to emerge, some of them actively used for years. The following services represent a


core set, and should be components in any cyberinfrastructure for classical studies.14 Each can be built with the technologies available today, and each addresses established problems relevant to classicists in particular and to many humanists in general. The services below largely address the problem of classification, i.e., applying a set of criteria to find and/or to label materials. Different annotation tasks admit of different levels of certainty: human readers can identify the correct transcription for print on a modern page but lexicographers will disagree on the senses of a given word. Nevertheless, these services aim at more or less deterministic, right-or-wrong answers. We do not include below clustering and other techniques that can detect patterns that require new categories. The services below reflect basic tools on which more open-ended research depends.

Canonical Text Services (CTS)

Canonical text services allow us to call up canonical texts by standard chapter/verse citation schemes. Christopher Blackwell and Neel Smith, working in conjunction with Harvard’s Center for Hellenic Studies (CHS), have developed a general protocol for canonical text services that provides essential functions for any system that serves classicists – or any scholarly community working with canonical texts.15 Early modern books or MSS that defy current OCR technology can be indexed by conventional citation (e.g., this page of the Venetus A manuscript contains the following lines of the Iliad).
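The idea of retrieval by canonical citation can be illustrated with a minimal sketch. It assumes a toy in-memory edition keyed by book and line; the name `get_passage`, the `book.first-last` citation string and the placeholder line texts are hypothetical illustrations and do not reproduce the CTS protocol itself.

```python
from typing import Dict, List, Tuple

# Hypothetical toy edition: (book, line) -> text of that line.
# Placeholder strings stand in for the Greek of a real digital edition.
TOY_ILIAD: Dict[Tuple[int, int], str] = {
    (9, n): f"[Iliad 9.{n}]" for n in range(40, 51)
}


def get_passage(edition: Dict[Tuple[int, int], str], citation: str) -> List[str]:
    """Resolve a citation such as '9.44-48' (book.first-last) or '9.44'."""
    book_part, _, line_part = citation.partition(".")
    first_part, _, last_part = line_part.partition("-")
    book = int(book_part)
    first = int(first_part)
    last = int(last_part) if last_part else first
    # Only lines present in the edition are returned; gaps are skipped.
    return [edition[(book, n)] for n in range(first, last + 1) if (book, n) in edition]


print(get_passage(TOY_ILIAD, "9.44-48"))
```

The lookup only works because the toy edition carries structural information for books and lines, which is the dependency the surrounding text describes.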

14 For a discussion of cyberinfrastructure and classical studies, albeit with a focus on scholarly publishing, please see (Pritchard 2008).
15 For more on CTS, please see (Porter 2006) and (Romanello 2008).


Optical Character Recognition and Page Layout Analysis

Transcription captures the keystrokes. Page layout analysis captures the logical structures implicit in the page.16 These logical structures include not only header, footnote, chapter title, encyclopedia/index/lexicon entry etc., but more scholarly forms such as commentary and textual notes. All disciplines have used tables to represent structured data and we need much better tools with which to convert tabular data into semantically analyzed machine actionable data.17 Much of the work in the Mellon-funded Cybereditions Project will focus on this stage of the workflow, in particular on the problem of mining highly accurate data from OCR output of scholarly editions in Greek and Latin.

Morphological Analysis

Morphological analysis takes an inflected form (e.g., fecit) and identifies its possible morphological analyses (e.g., 3rd sg perfect indicative active) and dictionary entries (e.g., Latin facio, "to do, make"). David Packard developed the first morphological analyzer for classical Greek, Morph, over a generation ago.18 Gregory Crane began the initial work on what would become the core morphological analyzer for Greek and Latin in Perseus in 1984. Neel Smith and Joshua Kosman, then graduate students at Berkeley, extended this work and created a library of subroutines that remain part of the current code base for Morpheus. Morpheus is written in C, has been compiled on a range of Unix systems over the course of more than twenty years, and contains extensive databases of Greek and Latin inflections and stems. Of all the classics specific services with which we are familiar, Morpheus is the most mature and well developed. The goal has long been to create an open source version of

16 For an overview of the state of the art in this area, please see (Sankar 2006), and for some recent experiments with texts from the Open Content Alliance, please see (Lu 2007).
17 Some promising initial work in table extraction from digital documents has been reported by (Liu 2007).
18 For more on Packard’s system, see (Packard 1973), and for a discussion of Crane’s Morpheus, see (Crane 1991).


Morpheus. Desiderata include new documentation, modern XML formats for the stems and endings and a distributed environment whereby users can add new stems and endings.
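The stem-and-ending approach described above can be sketched in a few lines. This is an illustration only: the `STEMS` and `ENDINGS` tables, their format and the function `analyze` are hypothetical and far simpler than the databases and code of Morpheus, which is written in C.

```python
from typing import Dict, List, Tuple

# Hypothetical stem table: stem -> (dictionary entry, stem type).
STEMS: Dict[str, Tuple[str, str]] = {
    "fec": ("facio", "perfect"),
    "am": ("amo", "present"),
}

# Hypothetical ending table: ending -> [(stem type it attaches to, analysis)].
ENDINGS: Dict[str, List[Tuple[str, str]]] = {
    "it": [("perfect", "3rd sg perfect indicative active")],
    "at": [("present", "3rd sg present indicative active")],
}


def analyze(form: str) -> List[Tuple[str, str]]:
    """Return every (dictionary entry, analysis) compatible with the form."""
    analyses = []
    # Try each split of the form into a candidate stem and ending.
    for cut in range(1, len(form)):
        stem, ending = form[:cut], form[cut:]
        if stem not in STEMS or ending not in ENDINGS:
            continue
        lemma, stem_type = STEMS[stem]
        for required_type, label in ENDINGS[ending]:
            if required_type == stem_type:
                analyses.append((lemma, label))
    return analyses


print(analyze("fecit"))
```

Because the stems and endings live in data tables rather than in code, users could in principle contribute new entries without touching the analyzer, which is the kind of distributed extension listed among the desiderata.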

Syntactic Analysis

Syntactic analysis identifies the syntactic relationships between words in a sentence; it allows us to provide quantitative data about lexicography (e.g., which nouns are the subjects and objects of particular verbs), word usage (e.g., which verbs take dative indirect objects? where do we have indirect discourse using the infinitive vs. a participle vs. a conjunction?), style (e.g., hyperbaton, periodic composition), and linguistics (e.g., changes from SOV to SVO word order). Even relatively coarse syntactic analysis can yield valuable results when applied to a large corpus: working with our morphological analyzer and a tiny Latin Treebank of 30,000 words with which to train a syntactic analyzer, we were able to tag only 54% of the untagged words correctly, but the correct analyses provided a strong enough signal for us to detect larger lexical patterns.19 More robust syntactic analysis based on very large treebanks can yield accuracies of 80 to 85%. Human annotators can build upon preliminary automated analysis to create treebanks, where every word’s function has been examined and accounted for. Treebanks provide not only training data for automated parsing but also explanatory data whereby readers can see the underlying structure of complex sentences – a valuable instrument to support interdisciplinary researchers from fields such as Philosophy or the History of Science who are not specialists in Latin and Greek.
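The lexicographic query mentioned above (which nouns are the subjects and objects of particular verbs) can be sketched against a toy dependency treebank. The `Token` layout, the relation labels `SBJ`, `OBJ` and `PRED`, and the one-sentence treebank are assumptions made for illustration, not the annotation scheme of the Latin treebank discussed here.

```python
from collections import Counter, namedtuple
from typing import List

# Hypothetical token record: head is the id of the governing word (0 = root).
Token = namedtuple("Token", "id form lemma head relation")

# Toy treebank with one annotated sentence: "puella rosam amat".
TREEBANK: List[List[Token]] = [
    [
        Token(1, "puella", "puella", 3, "SBJ"),
        Token(2, "rosam", "rosa", 3, "OBJ"),
        Token(3, "amat", "amo", 0, "PRED"),
    ]
]


def dependents_of(treebank: List[List[Token]], verb_lemma: str, relation: str) -> Counter:
    """Count the lemmas that stand in the given relation to the given verb."""
    counts: Counter = Counter()
    for sentence in treebank:
        by_id = {token.id: token for token in sentence}
        for token in sentence:
            head = by_id.get(token.head)
            if head is not None and head.lemma == verb_lemma and token.relation == relation:
                counts[token.lemma] += 1
    return counts


print(dependents_of(TREEBANK, "amo", "SBJ"))
print(dependents_of(TREEBANK, "amo", "OBJ"))
```

Run over a large corpus, counts of this kind are the quantitative lexical signal that even a noisy parser can supply.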

Word Sense Discovery

Word sense discovery automatically identifies distinctive word usage in electronic corpora. Even without syntactic analysis, collocation analysis can reveal words that are closely associated (e.g., phrases such as the English "ham and eggs") and thus identify

19 The development of the Latin treebank has been documented in (Bamman 2006) and (Bamman 2007).


idiomatic expressions.20 Jeff Rydberg-Cox developed collocational analysis for the Greek and Latin texts in Perseus and the results are visible as part of the on-line Greek and Latin lexica in Perseus 3.0.21 Access to translations aligned to the original allows us to identify distinct senses: e.g., oratio corresponds in some instances to English "oration" but in others to English "prayer." At Perseus, we have been experimenting with this technique since 2005 and have begun a project, funded by the NEH Research and Development Program, to explore methods for a Dynamic Lexicon for Greek and Latin.
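Collocation analysis of the kind described above can be approximated, in its simplest form, by counting which words fall within a fixed window of a target word. The sketch below makes that assumption; the function `collocates`, the window size and the sample sentence are illustrative, and production systems typically replace raw counts with a statistical association measure.

```python
from collections import Counter
from typing import List


def collocates(tokens: List[str], target: str, window: int = 2) -> Counter:
    """Count the words appearing within `window` positions of each target occurrence."""
    counts: Counter = Counter()
    for position, token in enumerate(tokens):
        if token != target:
            continue
        start = max(0, position - window)
        end = min(len(tokens), position + window + 1)
        for neighbour in range(start, end):
            if neighbour != position:
                counts[tokens[neighbour]] += 1
    return counts


sample = "ham and eggs for breakfast then ham and eggs again".split()
print(collocates(sample, "ham").most_common(2))
```

Even on this tiny sample, "and" and "eggs" stand out as the most frequent neighbours of "ham".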

Named Entity Identification

Named entity identification provides semantic classification (e.g., is Salamis a place or a Greek nymph by that name?) and then associates names with particular entities in the real world (e.g., if Salamis is a place, is it the Salamis near Athens, Salamis in Cyprus or some other Salamis?).22 We have developed a serviceable named entity identification system for English and have support from the Advancing Knowledge IMLS/NEH Digital Partnership (http://www.neh.gov/news/archive/20080826.html) to extend this work to documents about Greco-Roman antiquity.23 We expect more general named entity systems to supersede the system that we developed and we are therefore focusing our efforts on creating knowledge sources that will allow these more general systems to perform effective named entity identification on classical

20 For some relevant research in this area see (Cardey 2006), and for a recent overview of word sense discovery please see (Pantel 2002).
21 The entry for the Latin word ira ("anger") provides: Some Words that Regularly Appear with ira. In Latin Prose (255 total): indignation stimulo furor ex-ardeo saevio. In Latin Poetry (48 total): suscito saevio saevus Juno exerceo. In Latin Texts (356 total): indignation stimulo furor ex-ardeo Saturnius.
22 For an overview of named entity research, see (Nadeau 2007) and for an example of named entity identification in historical texts, see (Byrne 2007).
23 The named entity system for English has been documented in (Crane 2005), and its results on one collection evaluated in (Crane 2006).


materials. Our work focuses on creating (1) a labeled training set, based on print indices, with place and personal names identified, (2) a multilingual list of 60,000 Greek and Latin names in Greek, Latin, English, French, German, Italian, and Spanish, and (3) contextual information, or in other words, which authors mention which people and places in which passages, extracted from the 19th century encyclopedias of biography and geography edited by William Smith.

Metrical Analysis

Metrical analysis both discovers and analyzes the underlying metrical forms of digital texts. Metrical analysis provides information about vowel quantity that can improve performance of morphological, syntactic and named entity analysis. Metrical analysis is particularly important for areas such as post-classical Latin, which have very large bodies of poetic materials that will never receive the manual analysis applied to Homer, the Athenian Dramatists, Vergil and other canonical authors.24

Translation Support

Translation support aims at fluent translation of full text but can provide useful results at a much earlier stage of development. Thus, word sense disambiguation, a component within machine translation, helps translate words and phrases: e.g., given an instance of the Latin word oratio, word sense disambiguation identifies when that word most likely corresponds to "oration", "prayer" or some other English word or phrase.25 The same service also supports semantic queries such as "list all Latin words that correspond to the English word ‘prayer’ in particular contexts."
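A rough sketch of how aligned translations might feed word sense disambiguation follows. It assumes a handful of invented training instances for oratio, each pairing context lemmas with the English rendering found in an aligned translation; the data, the name `sense_probabilities` and the simple overlap-voting scheme are illustrative assumptions, not the method used at Perseus.

```python
from collections import Counter
from typing import Dict, List, Set, Tuple

# Invented aligned instances of Latin "oratio":
# (lemmas in the surrounding context, English rendering in the translation).
ALIGNED: List[Tuple[Set[str], str]] = [
    ({"senatus", "habeo"}, "oration"),
    ({"forum", "habeo"}, "oration"),
    ({"deus", "fundo"}, "prayer"),
]


def sense_probabilities(context: Set[str], aligned: List[Tuple[Set[str], str]]) -> Dict[str, float]:
    """Estimate sense probabilities from context words shared with aligned instances."""
    votes: Counter = Counter()
    for instance_context, rendering in aligned:
        shared = len(context & instance_context)
        if shared:
            votes[rendering] += shared
    total = sum(votes.values())
    # An empty result means the context shares nothing with the training data.
    return {sense: count / total for sense, count in votes.items()} if total else {}


print(sense_probabilities({"deus"}, ALIGNED))
print(sense_probabilities({"habeo", "deus"}, ALIGNED))
```

The output is a probability per rendering rather than a single answer, which matches the framing of disambiguation as identifying the most likely correspondence.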

24 For recent work in metrical analysis see (Eder 2007).
25 For more on word sense disambiguation, please see (Carpuat 2005), and also (Ide 1998).


Cross Language Information Retrieval (CLIR)

Cross language information retrieval (CLIR) allows users to pose a query in one language (e.g., English) and retrieve results in other languages (e.g., Arabic or Chinese). For classics, CLIR is an extremely important technology because classicists are expected to work with materials not only in Greek and Latin but, at a minimum, in English, French, German and Italian. CLIR is a mature technology where the cross language queries in some competitions perform better than the monolingual baseline systems (e.g., you get better results searching Arabic with an English query than if you searched with Arabic).26 Classicists should be able to type queries for secondary sources in various languages such as English, French, German or Italian.
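One simple way to approach CLIR is dictionary-based query translation: expand the query into its equivalents in each target language and then search as usual. The sketch below assumes that approach; the `LEXICON` entries, the document titles and the function `cross_language_search` are hypothetical, and the systems evaluated in CLIR competitions are considerably more sophisticated.

```python
from typing import Dict, List

# Hypothetical multilingual lexicon: English term -> equivalents by language.
LEXICON: Dict[str, Dict[str, str]] = {
    "tragedy": {"fr": "tragédie", "de": "tragödie", "it": "tragedia"},
}

# Hypothetical secondary literature, identified by invented keys.
DOCUMENTS: Dict[str, str] = {
    "en-1": "Greek tragedy and ritual",
    "de-1": "Die griechische Tragödie",
    "fr-1": "La tragédie grecque",
    "en-2": "Roman roads in Britain",
}


def cross_language_search(query: str, documents: Dict[str, str], lexicon: Dict[str, Dict[str, str]]) -> List[str]:
    """Return ids of documents containing the query or any of its translations."""
    term = query.lower()
    variants = {term} | set(lexicon.get(term, {}).values())
    return [
        doc_id
        for doc_id, text in documents.items()
        if any(variant in text.lower() for variant in variants)
    ]


print(cross_language_search("tragedy", DOCUMENTS, LEXICON))
```

A single English query here retrieves the German and French titles as well as the English one.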

Citation Identification

Citation identification is a particular case of named entity identification that focuses on recognizing particular citations: e.g., determining whether the string "Th. 1.33" refers to book 1, chapter 33 of Thucydides, line 33 of the first Idyll of Theocritus or something else. Are numbers floating in the text such as "333" or "1.33" partial citations and, if so, what are the full citations? Primary source citations tend to be shorter and more variable in form than the bibliographic citations found in scientific publications. Perseus has, over the course of more than twenty years, extracted millions of citations from thousands of documents but the citation extractors tend to be ad hoc systems tuned for the subtly different formats by which publications represent these already brief and cryptic abbreviations. In the million book world, we need citation extractors that can recognize the underlying citation conventions of arbitrary documents and then match them to known citations on the fly (e.g., observe numerous references to Thucydides and then infer that strings such as "T. 1,33" describe Thucydides, Book 1, Chapter 33).
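The inference described in the last example (use explicit mentions of an author to resolve an ambiguous abbreviation) can be sketched as follows. The abbreviation table, the regular expression and the function `resolve_citations` are assumptions for illustration; they handle only the two abbreviations quoted above and are not the Perseus citation extractors.

```python
import re
from collections import Counter
from typing import List, Tuple

# Hypothetical table of ambiguous abbreviations and their candidate authors.
CANDIDATES = {
    "Th.": ["Thucydides", "Theocritus"],
    "T.": ["Thucydides", "Theocritus"],
}

# Abbreviation, then two numbers separated by a full stop or a comma.
CITATION = re.compile(r"\b(Th\.|T\.)\s*(\d+)[.,](\d+)")


def resolve_citations(text: str) -> List[Tuple[str, int, int]]:
    """Resolve abbreviated citations to the candidate author named most often in the text."""
    mentions = Counter()
    for names in CANDIDATES.values():
        for name in names:
            mentions[name] = len(re.findall(re.escape(name), text))
    resolved = []
    for match in CITATION.finditer(text):
        abbreviation, first, second = match.groups()
        # Ties fall to the first candidate listed for the abbreviation.
        author = max(CANDIDATES[abbreviation], key=lambda name: mentions[name])
        resolved.append((author, int(first), int(second)))
    return resolved


sample = "Thucydides describes the debate (Th. 1.33); compare T. 1,44."
print(resolve_citations(sample))
```

The sketch returns the raw numbers only; deciding whether they denote book and chapter or poem and line would require knowledge of each work's citation scheme.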

26 For some recent overviews of the potential of CLIR for digital libraries and technical issues still to be solved, please see (Jones 2007) and (Petrelli 2006).


Quotation Identification

Quotation identification can recognize where one text quotes – either precisely or with small modifications – another even when there is no explicit machine actionable citation information: e.g., it can recognize "arma virumque cano" as a quotation from the first line of the Aeneid. The fundamental problem is analogous to plagiarism detection.27 Support from the Mellon-funded Classics in the Million Book Library study allowed us to begin work on exploring quotation identification techniques.28
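A minimal sketch of quotation identification slides a window over the candidate text and scores each window against the source phrase with an approximate string match. The function `find_quotation`, the 0.8 threshold and the sample sentences are assumptions for illustration; Python's standard `difflib.SequenceMatcher` stands in for whatever similarity measure a real system would use.

```python
import re
from difflib import SequenceMatcher
from typing import List, Optional, Tuple


def normalize(text: str) -> List[str]:
    """Lowercase the text and keep only alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def find_quotation(source: str, candidate: str, threshold: float = 0.8) -> Optional[Tuple[str, float]]:
    """Return the window of `candidate` most similar to `source`, if similar enough."""
    source_words = normalize(source)
    candidate_words = normalize(candidate)
    size = len(source_words)
    target = " ".join(source_words)
    best_window, best_ratio = None, 0.0
    for start in range(max(1, len(candidate_words) - size + 1)):
        window = " ".join(candidate_words[start:start + size])
        ratio = SequenceMatcher(None, target, window).ratio()
        if ratio > best_ratio:
            best_window, best_ratio = window, ratio
    if best_window is not None and best_ratio >= threshold:
        return best_window, best_ratio
    return None


print(find_quotation("arma virumque cano", "As Vergil wrote, Arma virumque cano, and so the epic begins"))
print(find_quotation("arma virumque cano", "he sang arma virosque cano loudly"))
```

The second call shows a quotation with a small modification still being matched, while unrelated text falls below the threshold and yields no result.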

Translation Identification

Translation identification builds on both CLIR and quotation identification to identify translations, primarily but not exclusively of Greek and Latin texts, that are on-line in large digital collections.29 These translations may be of entire works or of small excerpts.

Text Alignment

Text alignment services most commonly align translations with their source texts and are components of word sense disambiguation systems.30 Text alignment, however, serves also to create human readable links between source texts and translations that do not have machine actionable book/chapter/section/verse or other citation markers or between source texts that are tagged with different citation schemes. Text alignment is one of the priorities of the Mellon-funded Cybereditions Project at Tufts University.

27 For more on plagiarism detection and quotation identification, see (Zavlavsky 2001). Google Books has also recently launched a quotation identification and tracking feature, see (Schilit 2008).
28 The results of this work can be found in (Ernst-Gerlach 2008).
29 Recent work in translation detection has been conducted by (Pouliquen 2003).
30 For recent work in text alignment see (Deng 2006).


Version Analysis

Version analysis services can collate transcriptions of manuscript sources or of different printed editions of the same work.31 Version analysis can also be used for automated error correction: when two versions of a text differ and one version contains a word that does not generate a valid Greek or Latin morphological analysis, we flag that word as a possible error and associate the parseable word from the other text with it as a possible correction.32
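The error-correction step described above can be sketched by collating two versions word by word and checking each differing word against a morphological analyzer. In the sketch below the set `KNOWN_FORMS` and the function `parses` are hypothetical stand-ins for a real analyzer, and `suggest_corrections` is an illustrative name; the collation itself uses Python's standard `difflib.SequenceMatcher`.

```python
from difflib import SequenceMatcher
from typing import List, Tuple

# Hypothetical stand-in for a morphological analyzer: the forms it can parse.
KNOWN_FORMS = {"arma", "virumque", "cano", "troiae", "qui", "primus", "ab", "oris"}


def parses(word: str) -> bool:
    """Pretend morphological check: True if the analyzer knows the form."""
    return word.lower() in KNOWN_FORMS


def suggest_corrections(version_a: str, version_b: str) -> List[Tuple[str, str]]:
    """Return (suspect word, suggested correction) pairs where the versions differ."""
    words_a, words_b = version_a.split(), version_b.split()
    suggestions = []
    matcher = SequenceMatcher(None, words_a, words_b)
    for tag, a_start, a_end, b_start, b_end in matcher.get_opcodes():
        # Only one-for-one substitutions can be paired off word by word.
        if tag != "replace" or (a_end - a_start) != (b_end - b_start):
            continue
        for word_a, word_b in zip(words_a[a_start:a_end], words_b[b_start:b_end]):
            if not parses(word_a) and parses(word_b):
                suggestions.append((word_a, word_b))
            elif not parses(word_b) and parses(word_a):
                suggestions.append((word_b, word_a))
    return suggestions


print(suggest_corrections("arma virurnque cano", "arma virumque cano"))
```

Differences in which both words parse are left alone, since those are genuine variants for a human editor to weigh rather than likely transcription errors.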

Markup Projection

Markup projection services, implicit in many of the services above, automatically associate machine actionable data from one source with the same passage in another source. Thus, an index might state that a reference to Salamis in passage A describes Salamis near Athens but that the reference in passage B is to Salamis of Cyprus. Markup projection services would associate those statements with all references to Salamis in various versions of passages A or B, including not only full scholarly editions but also quotations of those passages that appear in journal articles or monographs.

Collections for ePhilology

The fifteen basic services described above provide mechanisms whereby human beings can think about the ancient world. Services are dynamic processes that depend upon the algorithmic processing of pre-existing materials. Google and similar comprehensive organizations succeed insofar as they have identified very general algorithms that can generate useful results over thousands of domains to millions of users. Algorithms are the core of computer science. Computer scientists seek to maximize what can be computed and to minimize the pre-existing knowledge that a system needs. In this context, if we can associate 90% of the geographic names in 90% of the English language internet with the locations

31 For relevant work in this area, please see (Toselli 2007). 32 Influential work in this area that has greatly informed our own research has been conducted by (Feng 2006).


to which they refer, we may decide that the problem has been solved. Much of the work underway focuses upon such first order approximations which are good enough for many people in many contexts. The remaining 10% or 5% or even 1% may, however, be the space in which the most interesting intellectual work takes place and thus the locus of that value which a digital environment can offer. First, we may be most interested in finding the uncommon instances that are much harder to find. Thus, it is easy to score well on an ambiguous name such as Washington if we are looking for George Washington or Washington state but much harder if we are looking for Washington, MA, or Washington, GA. Second, we need to consider the issues of context. The patterns that we find in English language documents from India and South Africa will, of course, differ from those that we find in documents produced in the US and the UK. If we remain focused on the United States, the 1855 Harper’s Gazetteer of the World lists more than 150 places named Washington. The early 21st century version of the Getty Thesaurus of Geographic Names (TGN) with which Perseus researchers worked contained only 90 Washingtons. Thus even if we are working with American materials in English but we shift our attention a century and a half into the past, the services optimized for the present rapidly degrade. If we push back into English collections from the 18th century the problem worsens. If we work with early modern documents in English before standardized spelling, the problem grows more complex still. And if we are working with materials in other languages, our generic services may not only degrade but be useless – how many place names can we find in Latin, much less Greek or Syriac? Scholarship has always begun where obvious conclusions are not available or, on deeper inspection, prove inadequate. 
In most cases, readers within a scholarly community can automatically identify the people and places cited by a text but in a small percentage of instances, these references are unclear. Scholars have spent generations trying to decide to which Antonius a particular text refers or which variant reading among the manuscripts (if any) most probably reflects what Aeschylus composed. We may well be able to identify what texts of Plato people have read in dozens of languages over thousands of years and see in a form that we can understand the sorts of things that people have said about Plato as a


whole, a particular work of Plato or a particular passage. But such automated analyses and visualizations provide only the starting point for meaningful interpretation. In this digital age, a major – and indeed, perhaps the most important – portion of our work must center on the space between where the machines can bring us and where our intellectual aspirations lead. As technology advances, some scholarly tasks become wholly automated and are thus obsolete as effective instruments of scholarship. We may print the results of word searches as keywords in context but the production of print concordances is at best a problematic activity: we are better off creating an electronic text and then shuffling the words via various algorithms. If we want to create more sophisticated visualizations, we are better served marking the source text (e.g., identifying each dictionary entry) to create a particular view of that data (e.g., a dictionary organized by dictionary entry rather than inflected form).

The following categories of document provide some, though by no means necessarily all, of the foundational data on which we base our work with primary sources. Each constitutes a structured environment through which we human authors communicate with other authors and with automated systems. Each category of document can play the following roles:

• Training data: Many systems depend upon a training set in which human annotators classify phenomena (e.g., “bank” in passages x, y, z corresponds to a financial institution, but to a “river bank” in passages a, b, c). Part of each training set is set aside to serve as a gold standard: we test various learning algorithms by training on one part of the training set and then comparing how well each performs on the part that we set aside. Training data thus does not have to be perfect to be useful – in fact, perfection is not a relevant category. In reality, training sets include at least some ambiguous examples and a mature environment must be able to distinguish levels of certainty/community agreement.

• Corrections and augmentations: All of the services outlined above include some element of probabilistic analysis. We may be able to identify all variations across multiple versions of a text but still need to refine the ways in which our system classifies differences (e.g., two texts may


differ because of an OCR or data entry error rather than because of an editorial change). Our reference works should, insofar as possible, draw upon and then refine and augment an initial automated analysis, thus allowing us to focus time on those instances where we want to change or add to what the machines have done. Our reference works thus provide a place to store our response to the automated work. The audience for these reference works will include both human readers and automated systems which will use the reference works as a training source with which to provide better results.

• Models and argumentation: In the end, human authors will continue to analyze, reflect and pose arguments. Print editions, lexicon entries, and even indices of people and places contain models for what an author wrote, what words mean, and what we think we know about the people and places in a document. In a digital environment, these models must be explicit and, where appropriate, encoded into a machine actionable form. Their accompanying arguments must build upon automated methods not available in print when these methods are relevant. We need reasoned arguments and these will retain a familiar expository structure but accompanying data sets may be what have the greatest impact upon intellectual life. The next monographic study of a Greek word, for example, should include annotations that link the findings of and arguments behind that study with the passages to which they are relevant.

The following sections describe some of the document types that we need in a digital environment. To some extent they all reflect components of comprehensive digital editions and each contributes to the roles that textual data can play in a digital environment.33

33 The literature on the nature of digital editions is quite extensive, for some recent explorations of the topic see (Robinson 2005) and (Dekhtyar 2006).
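The training-data role described above, in which part of an annotated set is held back as a gold standard, can be illustrated with a deliberately small Python sketch. Everything in it is an assumption made for illustration: the annotated "bank" examples are invented, and the word-voting classifier is a toy chosen for brevity rather than a method used by any particular project.

```python
from collections import Counter, defaultdict

# Hypothetical annotated examples: (context, sense label for "bank").
ANNOTATED = [
    ("deposit money bank account", "financial"),
    ("bank loan interest rate", "financial"),
    ("river bank muddy water", "river"),
    ("fishing from the river bank", "river"),
    ("bank account interest", "financial"),
    ("muddy bank of the river", "river"),
]


def train(examples):
    """Count how often each context word co-occurs with each sense."""
    counts = defaultdict(Counter)
    for text, sense in examples:
        for word in text.split():
            counts[word][sense] += 1
    return counts


def classify(counts, text):
    """Pick the sense that receives the most votes from the context words."""
    votes = Counter()
    for word in text.split():
        votes.update(counts.get(word, {}))
    return votes.most_common(1)[0][0] if votes else None


def evaluate(examples, held_out_size):
    """Train on the first part and score against the held-out gold standard."""
    training, gold = examples[:-held_out_size], examples[-held_out_size:]
    model = train(training)
    correct = sum(classify(model, text) == sense for text, sense in gold)
    return correct / len(gold)


print(evaluate(ANNOTATED, held_out_size=2))
```

The score measures agreement with the human annotators on the held-out examples, which is why the annotations need to be useful rather than perfect.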


Multitexts

The contribution of Dué and Ebbott in this collection outlines the concept of a multitext. We use the term multitexts here to describe methods to track multiple versions of a text across time. The term multitext does not mean that editors cannot produce their best attempt to reconstruct a source text no longer available to us – we can represent a multitext as a network of versions with a single, reconstructed root. We may well find that the new linguistic and analytical resources at our disposal – especially resources such as treebanks and other categories of linguistic annotation – will allow editors to place old questions on a fundamentally new foundation and to provide new insights into the editions that classical authors produced of their works. The term multitext does, however, insist upon our ability to track and compare versions over time. In many cases, the original words of an author are as relevant as the Hubble telescope was to Galileo. Petrarch and Machiavelli did not read Teubner Editions or Oxford Classical Texts. We are in a position to begin modeling the texts of our authors as they appeared at different points of time and even the textual universes in which different actors worked. Scholars in early modern studies, for example, need systems that can show us at a glance how various sixteenth and seventeenth century editions of classical authors differ from the modern editions that they have laboriously read.

First, digital editions are designed from the start to include images of the manuscripts, inscriptions, papyri and other source materials, not only those available when the editor is at work but those which become available even after active work on the edition has ceased.34 This is possible because a true digital edition will include a machine actionable set of sigla. 
Even if we do not yet have an internationally recognized set of electronic identifiers for manuscripts, the print world has often produced unique names (e.g., LIBRARY + NUMBER) that can later be converted into whatever standard identifiers appear. A mature digital library system managing the digital edition will understand the list of witnesses and automatically search for digital exemplars of these witnesses, associating them with the digital edition if and when they come on-line. If the digitized exemplars have associated citation data (e.g., page X in MS Y corresponds to lines M to N of the Iliad, segment A,B,C,D of a given page corresponds to line 38 etc.), then the digital library system can automatically select the page or page segment relevant to a given section of the edition. If that metadata is not present, then the reader will simply have to find the relevant section by flipping through the electronic pages of the witness.

Second, multitexts are versioned: they are not limited to one reconstructed edition produced by one editor but are designed from the start to represent multiple editions.35 Any reader should at any time be able to call up visualizations and analyses of multiple editions, seeing which editions are more closely related, which editions had the greatest impact on subsequent editions, which editions are more dependent on particular witnesses, etc.

Third, multitexts include multiple apparatus critici, but these apparatus critici are machine actionable. Machine actionable means that textual comments are encoded in such a way that readers can compare the text with readings from MS A vs. MS B and/or select their own readings. While there can be multiple apparatus critici, each apparatus criticus must build upon the same set of common identifiers: a machine must be able to determine that B in one apparatus criticus corresponds to V in another.

34 For further discussion of this issue, see (Monella 2008).
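The requirement that a machine recognize B in one apparatus criticus as V in another amounts to resolving each editor's sigla against a shared set of identifiers. The following Python sketch shows one way this might look. The shelfmark-style identifiers, the sigla tables, the citation "1.5" and the placeholder readings are all hypothetical and stand for no real manuscripts or editions.

```python
# Hypothetical identifiers of the LIBRARY + NUMBER kind; none of these
# correspond to real manuscripts.
SIGLA_EDITOR_1 = {"A": "LibraryX:MS-454", "B": "LibraryY:MS-12"}
SIGLA_EDITOR_2 = {"V": "LibraryY:MS-12", "T": "LibraryZ:MS-3"}

# Each apparatus maps a citation to the readings its editor reports,
# keyed by that editor's own sigla (placeholder readings).
APPARATUS_1 = {"1.5": {"A": "reading-alpha", "B": "reading-beta"}}
APPARATUS_2 = {"1.5": {"V": "reading-beta", "T": "reading-alpha"}}


def normalize(apparatus, sigla):
    """Re-key every reading by the shared manuscript identifier."""
    return {
        line: {sigla[siglum]: reading for siglum, reading in readings.items()}
        for line, readings in apparatus.items()
    }


def merge(*normalized):
    """Combine apparatus entries, collecting the readings reported per witness."""
    merged = {}
    for apparatus in normalized:
        for line, readings in apparatus.items():
            for witness, reading in readings.items():
                merged.setdefault(line, {}).setdefault(witness, set()).add(reading)
    return merged


def same_witness(siglum_1, siglum_2):
    """True if the two editors' sigla resolve to the same manuscript."""
    return SIGLA_EDITOR_1[siglum_1] == SIGLA_EDITOR_2[siglum_2]


combined = merge(normalize(APPARATUS_1, SIGLA_EDITOR_1),
                 normalize(APPARATUS_2, SIGLA_EDITOR_2))
print(same_witness("B", "V"), sorted(combined["1.5"]))
```

Collecting readings in a set per witness means that two editors who report different readings for the same manuscript leave a visible disagreement rather than silently overwriting one another.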

Parallel Texts

The multitext as described above only covers versions of a text within a single language. In many cases, however, literary texts have exerted their influence in translations that were one or more languages removed from the original. Shakespeare worked with Thomas North’s translation of Plutarch, but Thomas North translated Jacques Amyot’s French translation of Plutarch, rather than

35 For more on the importance of supporting versioned multitexts, see (Schreibman 2003).


Plutarch’s Greek. We have to remember that many Greek texts exerted much of their influence when they circulated in Latin or Arabic translation. We need parallel texts of multiple linguistic versions.

The contribution of Bamman and Crane to this collection introduces the concept of parallel texts and their application to lexicography. Parallel texts can include a single edition and translation (like the Loeb and Budé series) but can also include multiple translations in multiple languages aligned with multiple editions (e.g., an Italian translation of Aeschylus that contains variant translations for a number of major editions). Parallel texts assume some level of common citation schemes: e.g., chapter 86 of book one of Thucydides in an English translation roughly corresponds to the Greek in chapter 86 of book one of Thucydides in standard editions. The more numbered sections, the more precisely citation schemes can align source texts and translations. Parallel text analysis and automatic alignment software can, however, discover many instances where words in the translation correspond to words in the source text. Even if we restrict ourselves to high probability correspondences, we can align our texts far more closely than any traditional citation system. Put another way, once we have page sized chunks of text and translation aligned, automatic alignment can do a better job than manually added structures such as section markers. Such section markers are probably most useful for human readers who want to extract logical chunks. Automatic alignments would be familiar to those who work with Plato and Aristotle, where editions use the page breaks and page sections of particular editions rather than the logical structure of the text itself. 
Once we have established the correspondences between different linguistic versions of the text, we need automated methods to help identify likely locations where those versions diverge, whether because a translator misunderstood the original or because the idea of translation was looser than that of later periods. Finally, we need methods whereby scholars can annotate these differences according to the patterns which they determine are significant.
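As a minimal sketch of the first two steps, the Python below pairs chunks that share a citation and then flags pairs that look anomalous. The length-ratio heuristic is our own simplifying assumption, far cruder than word-level alignment, and the citation keys and placeholder tokens are invented; it is meant only to show where an automated method could direct a scholar's attention.

```python
from statistics import median


def align_by_citation(source, translation):
    """Pair chunks that share a citation key such as '1.86'."""
    shared = sorted(set(source) & set(translation))
    return [(key, source[key], translation[key]) for key in shared]


def flag_divergences(pairs, tolerance=0.5):
    """Flag pairs whose length ratio strays far from the median ratio.

    A translation chunk that is much shorter or longer than usual,
    relative to its source, is a candidate for omission or expansion.
    """
    ratios = {key: len(t.split()) / len(s.split()) for key, s, t in pairs}
    typical = median(ratios.values())
    return [key for key, ratio in ratios.items()
            if abs(ratio - typical) > tolerance * typical]


# Placeholder tokens stand in for real source and translation text.
source = {"1.1": "a b c d", "1.2": "a b c d e", "1.3": "a b c d"}
translation = {"1.1": "x x x x x x", "1.2": "x x x x x x x",
               "1.3": "x x", "1.4": "x"}

pairs = align_by_citation(source, translation)
print(flag_divergences(pairs))
```

Chunks present in only one version (here the hypothetical "1.4") fall out of the alignment altogether and would need separate reporting in any real system.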

WordNets and Machine-Actionable Dictionaries

The contribution of Bamman and Crane in this collection also introduced some of the possibilities for dynamic lexicography in a digital environment. WordNet and EuroWordNet are pragmatic


examples of semantic networks, associating words with similar meanings into hierarchical classes.36 WordNet in particular has emerged as a major tool within computational linguistics and similar resources for Greek, Latin and other historical languages would be an important contribution.37 Machine actionable dictionaries may resemble traditional lexica in format but differ in that they contain far more citations than could ever be printed, they can be updated continuously, and their information is from the start structured to support morphological, syntactic and semantic queries. True machine actionable dictionaries must articulate word senses in such a way as to help both human and machine readers to recognize these senses as precisely as possible.

Treebanks, Linguistic Annotations, and Machine-Actionable Grammars

Treebanks are databases that label the syntactic role of each word in a set of sentences. These syntactic tags constitute parse trees (hence the name) that can be used to analyze lexical, syntactic and even rhetorical patterns.38 Treebanks tend to have fairly compact tagsets – they might not encode purpose clauses per se but allow users to query for patterns such as ut followed by a subjunctive. Syntax is important but by no means the only subject of linguistic annotation. Co-reference annotation maps pronouns to their referents (e.g., "he" in passage X refers to Julius Caesar). Annotation languages have emerged to capture higher level semantic phenomena such as temporal expressions (TimeML).39

We use the term machine actionable grammars to describe resources comparable to print grammars. These may have hundreds or thousands of observations, each roughly corresponding to the numbered paragraphs of their print predecessors. But in a machine-actionable grammar, each paragraph would include not only citations but a set of patterns (e.g., ut heading a subordinate clause followed by the subjunctive) and some indication of the precision (how many false hits the pattern would retrieve) and recall (how many correct hits the pattern would miss). The machine-actionable grammar would thus build on the treebank. Where the treebank would stress use of a smaller number of categories to describe the relations of individual words, machine-actionable grammars would suggest an open-ended set of more complex phenomena inferred from the corpus.

36 http://wordnet.princeton.edu/; http://www.illc.uva.nl/EuroWordNet/ 37 A bibliography of the research publications using WordNet can be found at http://lit.csci.unt.edu/~wordnet/. 38 Bamman and Crane in this collection; for sample Treebank data and bibliography, see http://nlp.perseus.tufts.edu/syntax/treebank/ 39 http://www.timeml.org/.
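To make the idea of a pattern carrying its own precision and recall concrete, the following Python sketch evaluates the ut-plus-subjunctive pattern against a toy treebank. The four sentences were composed for this illustration and the annotation scheme (a form and a mood per token, plus a human judgment on whether the sentence contains a purpose clause) is a hypothetical simplification of real treebank data. Precision and recall are computed in the standard way, as the share of hits that are correct and the share of correct instances that are found.

```python
# Toy treebank: each sentence is a list of (form, mood) tokens paired with
# a human judgment on whether it contains a purpose clause.
TREEBANK = [
    ([("venit", "indicative"), ("ut", None), ("videat", "subjunctive")], True),
    # A result clause: the pattern matches, but it is not a purpose clause.
    ([("tam", None), ("fortis", None), ("erat", "indicative"),
      ("ut", None), ("vinceret", "subjunctive")], False),
    ([("ut", None), ("dixi", "indicative")], False),
    # A negative purpose clause with ne: the pattern misses it.
    ([("venit", "indicative"), ("ne", None), ("videatur", "subjunctive")], True),
]


def ut_then_subjunctive(tokens):
    """True if some 'ut' is followed later in the sentence by a subjunctive."""
    for i, (form, _mood) in enumerate(tokens):
        if form == "ut" and any(m == "subjunctive" for _f, m in tokens[i + 1:]):
            return True
    return False


def precision_recall(pattern, treebank):
    """Score a pattern against the human judgments in the treebank."""
    tp = sum(1 for toks, gold in treebank if pattern(toks) and gold)
    fp = sum(1 for toks, gold in treebank if pattern(toks) and not gold)
    fn = sum(1 for toks, gold in treebank if not pattern(toks) and gold)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


print(precision_recall(ut_then_subjunctive, TREEBANK))
```

Even on four sentences the pattern both over-retrieves (the result clause) and under-retrieves (the clause introduced by ne), which is exactly the information a grammar paragraph would need to record alongside the pattern.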

Machine-actionable indices of people, places, organizations, etc.

The contribution by Elliott and Gillies in this collection outlines the major issues surrounding geographic information in classical studies. We also need to represent information about people, organizations, technical/scientific terms and other entities with regular features. The underlying principle of machine actionable indices is the same as that of their print antecedents. Machine actionable indices differ in at least two ways. First, the structure of the index entries is explicit: we can extract headwords, hierarchical structures (e.g., "Athens, (1) Religion …. (2) Government …"), descriptive labels (e.g., "born at X," "stood for consul in Y"), and associated citations. Second, index headwords contain the most general possible identifiers. Thus, we don’t simply cite Athens, Greece, or Thucydides the Historian, but add identifiers such as the numbers for Athens (TGN 7001393) and Thucydides (TLG 0003) in the Getty Thesaurus of Geographic Names (TGN) and the Thesaurus Linguae Graecae Canon (TLG) respectively.40

40 See http://www.getty.edu/research/conducting_research/vocabularies/tgn/; http://stephanus.tlg.uci.edu/canon/fontsel
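The two differences named above, explicit structure and general identifiers, can be pictured as a small data structure. The Python sketch below is one possible shape, offered under clear assumptions: the class and field names are our own, and the citations are placeholders. Only the identifiers for Athens (TGN 7001393) and Thucydides (TLG 0003) come from the text.

```python
from dataclasses import dataclass, field


@dataclass
class SubEntry:
    """One branch of the hierarchy under a headword, with its citations."""
    label: str
    citations: list = field(default_factory=list)


@dataclass
class IndexEntry:
    """A headword, its external identifiers, and its explicit substructure."""
    headword: str
    identifiers: dict
    subentries: list = field(default_factory=list)


# Citations are placeholders; the identifiers are those given in the text.
INDEX = [
    IndexEntry("Athens", {"TGN": "7001393"},
               [SubEntry("Religion", ["passage-A"]),
                SubEntry("Government", ["passage-B", "passage-C"])]),
    IndexEntry("Thucydides", {"TLG": "0003"},
               [SubEntry("Historian", ["passage-D"])]),
]


def find_by_identifier(index, scheme, value):
    """Look an entry up by external identifier rather than by headword."""
    for entry in index:
        if entry.identifiers.get(scheme) == value:
            return entry
    return None


def all_citations(entry):
    """Flatten the citations of every subentry, in order."""
    return [c for sub in entry.subentries for c in sub.citations]


athens = find_by_identifier(INDEX, "TGN", "7001393")
print(athens.headword, all_citations(athens))
```

Looking entries up by identifier rather than by headword is what lets two indices that spell or translate a name differently still be merged.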


Propositional Knowledge

Propositional knowledge includes standard database fields: e.g., author=Thucydides + Title=History-of-the-Peloponnesian-War in effect states that Thucydides is the author of the History of the Peloponnesian War. Propositional data is, however, designed to support reasoning: e.g., if two people share the same two parents, then we can infer that they are also siblings; if someone was born after an author died, then the works of that author cannot refer to that person. Such propositional reasoning rapidly becomes computationally complex. More significantly, the underlying propositions rapidly become idiosyncratic, as each observer creates slightly different categories and our propositional knowledge becomes internally inconsistent – as soon as computer scientists began converting print reference works such as the Oxford English Dictionary to digital form, they discovered that human editors were never fully consistent.41 The Historical Event Markup and Linking (HEML) project, which Bruce Robertson describes in his contribution to this collection, illustrates the measured use of an ontology to do a great deal but not too much – HEML did much to shape the newest extensions in the Text Encoding Initiative (TEI) methods for representing names, dates, people and places.42 If we restrict ourselves as much as possible, however, to established ontologies (a common set of propositions), then we can build off the work of others. Insofar as we can share the same ontologies with broader communities, we have a chance to create propositional knowledge that can be integrated with propositional knowledge from other sources, creating a much larger and more powerful knowledge base than any single project could develop. Put another way, a large number of propositions describing a finite set of well-defined phenomena will probably yield far more useful results. Individuals may extend shared vocabulary with their own categories but retain a common set of

41 (Raymond 1987). 42 See http://heml.mta.ca/heml-cocoon/; http://www.tei-c.org/release/doc/tei-p5-doc/en/html/ND.html


categories by which at least part of their data can interact with other systems. All of the reference works listed above depend upon propositional knowledge of the form "A has-property B": the string "Arma virumque cano, Troiae qui primus ab oris" has-citation Vergil-Aeneid-book-1-line-1; fecit has-language Latin and fecit has-morphological-analysis; archê-in-passage-X has-sense "empire." A treebank contains compound propositional statements such as agricola is-a noun and agricola is-subject-of fecit. We include propositional knowledge as a separate category to emphasize categories not included above. Thus, the CIDOC-CRM ontology includes a wide range of categories for art and archaeological objects and HEML provides a vocabulary for describing people, places and events in time.43
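The two inferences mentioned in this section, siblings from shared parents and the impossibility of an author referring to someone born after the author's death, can be sketched over statements of the form "A has-property B". The Python below is a toy under stated assumptions: the individuals, dates (negative numbers for years BCE) and predicate names are hypothetical and belong to no established ontology.

```python
from itertools import combinations

# Hypothetical "A has-property B" statements about invented individuals.
FACTS = {
    ("PersonA", "has-parent", "ParentX"),
    ("PersonA", "has-parent", "ParentY"),
    ("PersonB", "has-parent", "ParentX"),
    ("PersonB", "has-parent", "ParentY"),
    ("PersonC", "has-parent", "ParentX"),
    ("AuthorQ", "died-in-year", -400),
    ("PersonA", "born-in-year", -380),
}


def objects(facts, subject, predicate):
    """All values B such that 'subject predicate B' is asserted."""
    return {o for s, p, o in facts if s == subject and p == predicate}


def infer_siblings(facts):
    """Two people who share the same two parents are siblings."""
    people = sorted({s for s, p, _o in facts if p == "has-parent"})
    inferred = set()
    for a, b in combinations(people, 2):
        parents_a = objects(facts, a, "has-parent")
        parents_b = objects(facts, b, "has-parent")
        if len(parents_a) == 2 and parents_a == parents_b:
            inferred.add((a, "is-sibling-of", b))
    return inferred


def can_refer_to(facts, author, person):
    """False if the person was born after the author died."""
    died = objects(facts, author, "died-in-year")
    born = objects(facts, person, "born-in-year")
    if not died or not born:
        return True  # no dates recorded, so nothing can be ruled out
    return min(born) <= max(died)


print(infer_siblings(FACTS), can_refer_to(FACTS, "AuthorQ", "PersonA"))
```

The sibling rule already checks every pair of people, a small sign of how quickly such reasoning grows as the number of propositions and rules increases.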

Commentaries

A true digital commentary must build judiciously upon all of the tools listed above. Full commentaries should include annotations identifying every phenomenon of interest to their intended audience: every word should be morphologically disambiguated, every sentence should have its syntactic data encoded; every major variant should be labeled; every person and place should have at least one identifier from a general work or a label indicating that this is a place/person/institution not yet in available reference works and a new identifier. Put another way, if scholars have developed a widely recognized classification scheme (word senses in a lexicon, numbered paragraphs in a standard grammar, metrical analyses), then fully commented texts will have categorized every instance of each relevant phenomenon in a text. And, of course, commentaries must from the start allow commentators to include variant explanations for the same phenomenon (e.g., prosopographic disputes about which Antonius is meant, textual arguments about which reading is correct).

43 http://cidoc.ics.forth.gr/


PUBLICATION FOR A CYBERINFRASTRUCTURE

An Athenian citizen does not neglect the state because he takes care of his own household; and even those of us who are engaged in business have a very fair idea of politics. We alone regard a man who takes no interest in public affairs, not as a harmless, but as a useless character; and if few of us are originators, we are all sound judges of a policy. The great impediment to action is, in our opinion, not discussion, but the want of that knowledge which is gained by discussion preparatory to action. (Thuc. 2.40.2, after Crawley)

For us, public affairs are not confined to the individual decisions of a particular government but extend to all discussion. We may be professional academics, privileged to earn a living by working on the subjects to which we have dedicated our lives, but we enjoy that privilege because we serve the broader interests of humanity. Our work within the academy is only a means towards the greater goal of supporting intellectual life and the general understanding of the past. Before discussing some of the essential features that characterize true publication in a digital age, we distinguish, in the context of this discussion, archives and libraries. For our purposes, libraries provide the foundation on which public discourse takes place. Libraries constitute the most advanced and efficient space in which society is able to conduct discourse that extends across time and space and that depends upon preservation of, and access to, the terms of discussion.

Archives, Libraries and Intellectual Discourse

He had also, says he, such a library of ancient Greek books, as to exceed in that respect all those who are remarkable for such collections; such as Polycrates of Samos, and Pisistratus who was tyrant of Athens, and Euclides who was himself also an Athenian, and Nicorrates the Samian, and even the kings of Pergamos, and Euripides the poet, and Aristotle the philosopher, and Nelius his librarian; from whom they say that our countryman Ptolemaeus, surnamed Philadelphus, bought them all, and transported them with all those which he had collected


at Athens and at Rhodes to his own beautiful Alexandria. (Athenaeus, Deipnosophistae 1.1, tr. Yonge)

Our varied conceptions of a library are both descriptive and prescriptive: these conceptions shift as material culture changes the methods with which we can manage information. In the Greco-Roman world, Alexandria had the most famous library and every lover of Greek literature sighs to think of the tragedies of Aeschylus, Sophocles, and Euripides, the poems of Sappho and the other works that once lay among its holdings and are now lost. The library at Alexandria was based upon miraculous technologies such as papyrus production and sea-borne travel as well as writing.44 Popular conceptions of institutions such as libraries evolve along with the capabilities of their enabling technologies. The ancient library at Alexandria was not the instantiation of a Platonic ideal but the best use of the most advanced methods of the time. The library at Alexandria brought texts from around the Greek world into a single location. In the industrialized world, we have used industrialized print technologies to create hundreds of large libraries around the world, in effect protecting long-term access by maintaining multiple copies of the same work in widely separate locations. In the digital world we can not only create far more numerous copies and greater redundancy but our libraries are no longer inherently limited to physical locations.45 They can at any moment reach any point on the earth. Twenty-first century collections become libraries only insofar as they fulfill the need to provide access over time and across space. Long term preservation and global access are foundational challenges for our new information infrastructure.46 The passage quoted attributes to an intellectual of the second century CE the claim that he had assembled an unparalleled collection of ancient Greek books. Two features from the underlying

44 (Berti 2007). 45 For more on this theme see (Campell 2006) and (Pomerantz 2007). 46 This issue has been discussed by many, including (Smith 2008) and (Johnson 2007).


Greek are worth noting. First, no word corresponding to "library" actually appears: the Greek phrase (bibliôn ktêsis) describes the "possession of books" and does not designate either a place or an organization. Second, the passage above speaks in terms of individuals and collectors. The one exception, Nelius, is not a librarian: the Greek text probably includes an error but the term applied to Nelius (diatêrêsanta) states that he preserved the books of Aristotle and does not designate a generalized occupation such as the term librarian implies. We have left the nineteenth century translation unchanged to illustrate how easily we all project the categories of the present into sources from the past. A collection of hand-written documents, however, did not fit the dominant conceptions of libraries that took shape in print culture. We still call the ancient manuscript collections of Europe libraries because they bore this name, but in the massive libraries that emerged in the 19th century, manuscripts, pamphlets and everything that did not fit the exacting demands of academic publication were preserved in special collections and archives. There, these documents would await the scholar who would cull them for information or create printed editions of them that could circulate and play an active role in the mainstream of intellectual life. For each surviving ancient text in Greek and Latin, the editio princeps, the first printed edition, no matter how problematic its contents, represented a milestone and a new birth, marking the transition from handwritten manuscript into the new technology of print. Works still available only in manuscript were, in print culture, the material for published editions and printed facsimiles. They had not yet been published in print and thus were not yet a part of the citable record upon which general human discourse could depend. In the past decade, the academic library system has quietly shifted again. 
The print libraries of the 19th and 20th century have, in effect, become the archives of the 21st century, as publication and discourse in the most heavily supported disciplines have shifted entirely to a digital medium. The debate about print and digital information may continue but the infrastructure of mainstream intellectual discourse is now digital. The hotter the scientific discipline, the shorter the half-life of its publications – the last five or ten years of published material is enough to support many and probably most cutting edge research projects. Biologists studying changes in flora and fauna need access to as much historical data as


possible – for them observations from the 18th century provide foundational data. The Biodiversity Heritage Library may be the last major historical collection to be digitized within the sciences.47 With this project, the last major community of scientists is leaving the print world – and even these scientists maintained their own separate print library infrastructure: all ten of the institutions participating in the Biodiversity Heritage Library draw on specialized libraries that were already distinct from the libraries upon which humanists depend (e.g., the Harvard University Botany Libraries rather than Widener Library).48 The disciplines in which the advanced nations invest the most now, in effect, print what they need on demand. But just because information is on-line does not mean that that information has exploited the full potential of the digital medium. The debate has shifted instead to the question of open vs. closed access. The extraordinary cost increases for scientific journals have done more than anything else to drive the principle of open access — roughly one quarter of the entire acquisition budget for the Tufts University library in 2007, for example, went to a single scientific publisher, which does not invest any significant sums in the research that it publishes.49 In 2008, decades of rhetoric fi-

See http://www.biodiversitylibrary.org/ American Museum of Natural History (New York, NY); The Field Museum (Chicago, IL); Harvard University Botany Libraries (Cambridge, MA); Harvard University, Ernst Mayr Library of the Museum of Comparative Zoology (Cambridge, MA); Marine Biological Laboratory / Woods Hole Oceanographic Institution (Woods Hole, MA); Missouri Botanical Garden (St. Louis, MO); Natural History Museum (London, UK); The New York Botanical Garden (New York, NY); Royal Botanic Gardens, Kew (Richmond, UK); Smithsonian Institution Libraries (Washington, DC). 49 (Mobley 1998) and (Panitch 2005). For other on-line sources on the serials crisis, see (Parrott 2004). The phrase "serials crisis" is sufficiently well-established that it has spawned a Wikipedia entry (http://en.wikipedia.org/wiki/Serials_crisis). 47 48

418

CHANGING THE CENTER OF GRAVITY

nally led to action.50 Even under a pro-business Republican administration, the status quo has been intolerable. In April 2008, the National Institutes of Health (NIH) instituted an open access legal mandate that all publications produced with NIH support be deposited in the open access PubMed repository within twelve months of publication.51 The massive library collections at Harvard University have been a magnet for scholars and the university has traditionally been quite conscious of the investment it has made and the advantages which that investment confers upon it – the Boston Library Consortium is often described as "everyone but Harvard." Nevertheless, Harvard University surprised many observers by taking a dramatic stance in favor of open access. The Faculty of Arts and Sciences at Harvard University voted in February 2008 "to give the University a worldwide license to make each faculty member's scholarly articles available and to exercise the copyright in the arti-

50 Crane first heard a report about the serials crisis and a presentation arguing that publishers were gouging the market during a meeting held by the Harvard library in 1988.
51 At http://www.library.cornell.edu/nihmandate/index.html, Cornell University Library posted this summary of the NIH mandate: "Recipients of funding from the National Institutes of Health (NIH) should be aware of a new reporting requirement (http://grants.nih.gov/grants/guide/notice-files/NOT-OD-08-033.html) that went into effect on April 7, 2008. Principal investigators must ensure that electronic versions of any peer-reviewed manuscripts arising from NIH funding and accepted for publication after that date are deposited in PubMed Central (PMC), NIH's digital archive of biomedical and life sciences journal literature. Full text of the articles will then be made freely available to the public no later than 12 months after publication. The requirement applies to any NIH direct funding, including grants, contracts, training grants, subcontracts, etc. In addition, beginning May 25, 2008, anyone submitting an application, proposal, or progress report to NIH must include the PMC or NIH Manuscript Submission Reference Number when citing applicable articles that arise from their NIH-funded research."

BLACKWELL AND CRANE


cles, provided that the articles are not sold for a profit."52 The ruling automatically applies to all faculty publications and individuals must "request a waiver of the license for particular articles where this is preferable" – faculty cannot, according to the language of the press release, simply refuse to exempt themselves but must request waivers on a case by case basis. Steven E. Hyman, Provost at Harvard University framed the new policy in terms of responsibility: "The goal of university research is the creation, dissemination, and preservation of knowledge. At Harvard, where so much of our research is of global significance, we have an essential responsibility to distribute the fruits of our scholarship as widely as possible." Harvard is, of course, only a single institution but the actions of its faculty and administration provide a powerful example of how conventional thought has begun to shift. Google may ultimately solve the problem of access to the earlier print record. Through its Google Books project, Google has already digitized millions of books (and a striking amount of 19th century classical scholarship).53 The University of Michigan, for example, has entered into a partnership to digitize the entire print collection of the University Library – collections which "number over 7 million volumes, covering thousands of years of civilization, from papyri to reports of the latest advances in science and medicine."54 The legal agreement between Google and the University of Michigan contains a clause entitled "searching free to the public" that asserts that Michigan content be made available at "no direct cost to end users."55 Google is not asserting open source – Google

52 Harvard University News Release, February 12, 2008: http://www.fas.harvard.edu/home/news-and-notices/news/pressreleases/release-archive/releases-2008/scholarly-02122008.shtml. 53 See http://books.google.com/ 54 http://www.lib.umich.edu/mdp/: accessed August 16, 2008; http://www.lib.umich.edu/libinfo/stats.html: accessed August 16, 2008. 55 "Searching Free to the Public: Google agrees that to the extent that it or its successors make Digitized Available Content searchable via the Internet, it shall provide an interface for both searching and a display of search results that shall have no direct cost to end users. Violations of this subsection, 4.3, not cured within thirty days of notification by U of M


does not allow commercial competitors to build services on top of the books that it paid to digitize and legal issues remain to be resolved. Nonetheless the logic behind the vast Google digitization effort moves academia much farther towards open access for a global audience.56 Classicists have already begun taking steps to make their core primary materials available in the interoperable formats and open licenses needed for teaching and research in a digital world. The Perseus Digital Library released the TEI-compliant XML source files for all of its primary sources and accompanying translations in March 2006 under a Creative Commons license. Harvard’s Center for Hellenic Studies (CHS) has also undertaken to extend this effort and announced in August 2008 a plan to create a digital library of new TEI-compliant XML editions for the first thousand years of classical Greek, including "at least one version of every Greek text known to us from manuscript transmission from the beginning of alphabetic writing in Greece through roughly the third century CE."57 Support from the Mellon Foundation has allowed Perseus to begin building a comprehensive collection of scanned critical editions for every major Greek and Latin author. The initial results of this work are already available for public download at the Internet Archive and under a license that allows anyone to create new derivative works using their own OCR or text mining software and publishing the results in their own services.58

shall terminate U of M's obligations under section 4.4.": "COOPERATIVE AGREEMENT between Google and the Regents of the University of Michigan/University Library." Retrieved August 16, 2008, from http://www.lib.umich.edu/mdp/umgooglecooperativeagreement.html
56 Google’s approach to mass digitization and open access has been both critiqued and defended; for recent examinations of some of the major issues, see (Grafton 2007) and (Kaufman 2007).
57 http://chs.harvard.edu/, accessed September 30, 2008.
58 For example, we have begun to have a number of early editions of Thucydides scanned and made available, such as this edition from 1735: http://www.archive.org/details/tucidideistorico00thucuoft


If we are to understand what form we would like our libraries to assume, we must first consider what we expect from the publications that will populate these libraries.

Features of Publication in a Digital World

Socrates: And every word, when once it is written, is bandied about, alike among those who understand and those who have no interest in it, and it does not know with whom to speak or not to speak; when ill-treated or unjustly reviled it always needs its father to help it; for it has no power to protect or help itself.
Phaedrus: You are quite right about that, too.
(Plato, Phaedrus 275de, after Fowler)

Scholars have written about the ancient world since antiquity itself, and we build upon more than half a millennium of the scholarship that print made possible. A great deal of material about the Greco-Roman world exists in digital form, but only a small subset of that material can fulfill its potential in a digital world. The essential criteria for true publication are different in the digital world because the digital world supports services that are not feasible in print and can reach audiences millions of times larger than academic print publications could reach. The fact that a resource exists in a digital format is a necessary but not sufficient condition: just because an object of potential relevance to classics is digital does not mean that it is useful. Not only the print volumes that sit upon our library shelves but the digitized publications to which commercial entities sell access have all become, within the digital world, archival materials, tied to a few discrete points on the earth and membership in specialized organizations. Whatever the merits of their content, the essays in this collection are important because, despite the vast body of existing scholarship, they are among the first original works of classical scholarship to meet the minimal criteria for publication in a digital age.
Scholarly publication in a digital age must satisfy at least the following four conditions. These four conditions overlap, of course, with those familiar from five centuries of print culture, but,


of course, they also must adapt to the digital foundation on which all shared intellectual expression already depends. First, the content must be of interest to someone other than its producers. In academia, we have developed peer review as an instrument to assert that a particular intellectual production has sufficient value to warrant a permanent place in the scholarly record and we used traditional peer review in this collection as well. Peer review is, of course, no guarantee – and readers will come to their own conclusions about what is published here, as they do about everything that they read. Other models exist to achieve the same goal and we should not confuse the instrument of peer review with its purpose.59 Second, the content must be in a format that we can preserve and use for long periods of time. Print culture developed for the organization of books and articles conventions that have proven so successful as to become almost invisible: we take tables of contents, chapters, footnotes, indices, bibliographies and other conventions for granted. In a digital environment, machines are the first and essential readers of all published materials – where more is written than any one person can digest, we depend upon what machines can extract to identify those few objects on which we can focus the limited attention and intellectual capacity of the human brain. The articles in this collection express their basic structures in a standardized format that machines can understand. More sophisticated documents will surely emerge but these are likely to enhance, rather than abandon, the structures within this collection. By investing in the XML markup we have conformed to the best practices of the present so that the digital librarians in future generations can manage these articles within their digital collections. Third, the content must have at least one reliable long-term home. In print publication, authors needed publishers to put their work into circulation. 
Publishers committed, however, only to pro-

59 For a look at how institutional repositories could help to create new models for journals and peer review systems, see (Bankier 2008); for a good discussion of the need to reevaluate all aspects of scholarly publishing see (Hahn 2008).


vide very short-term access. Preservation in print culture has always been the task of libraries. Even if war or natural catastrophe destroyed one library, other libraries preserved separate copies of each work and these could be reprinted or reproduced with increasing facility. In a digital age, distribution is trivial – any web page could in early 2008 reach more than half a billion machines.60 Preservation is, however, a major challenge. Classicists know that they can usually track down copies of the most obscure 19th century dissertations somewhere because libraries have worked hard to preserve academic publications. A 2005 study found that half of the URLs cited in a 1995 issue of D-Lib Magazine, a major venue for publication, no longer worked (McCown 2005). Libraries have, however, moved to address this situation and have created institutional repositories with which to fulfill in this new digital world their ancient mandate of preserving what they collect.61 The articles in this collection are part of the permanent collection of the University of Illinois at Urbana Champaign (UIUC) libraries. Fourth, the content can circulate freely – it is, indeed, truly public and thus published. A decade ago, this idea was radical and unnerving to many of us, but the Stoa Publishing Consortium always supported open access from its creation in 1997. In the quotation that opens this section, Plato’s Socrates expresses anxiety that information, once represented in a physical medium is separate from its producers and begins a life of its own. In the end, we have overwhelming reasons to leave these anxieties behind. First, we need both our primary and secondary sources to be open for analysis by as many systems as possible if we are to exploit the full power of the digital world and to fulfill our professional obligations as scholars. Second, each scholar, department, discipline, college, and university is, at some level, locked in a Hobbesian war of all against all. 
College and university web sites are very expensive to

60 The Internet Systems Consortium published for January 2008 an estimate of c. 600,000,000 machines: https://www.isc.org/, accessed August 16, 2008.
61 For more on the growing recognition of the importance of institutional repositories, see (Hockx-Yu 2006).


produce and maintain but they are freely accessible because each institution is competing for exposure. Subscription revenues do not pay for scholarship. Third, we have plenty of money in the system to pay the costs. During 2005, the 123 members of the Association of Research Libraries invested more than 1.1 billion dollars in their collections.62 Our interest lies in maximizing exposure. We need to shift from importing the products of third parties and towards exporting the productions of our scholars, departments, and disciplines. Some of the authors in this collection remember hearing that it would be impossible for libraries to provide access to electronic materials – they didn’t have enough resources to collect print. Likewise, we heard that universities could never support web sites – they were too expensive and the budget was already overstressed. Nothing can ever change – but everything always does in the end. The first three reflect a narrowly construed Hobbesian model of self-interest, but they all support the fourth and most important reason. We have a moral obligation as scholars to preserve, expand and disseminate, as broadly as possible, as much of the human record to as much of humanity as possible. For this reason, we have adopted a Creative Commons license not only for the publications in this collection but for all of our work. Peer review, the Digital Humanities Quarterly (DHQ) XML style-sheet, institutional repositories and Creative Commons licenses are the four instruments by which we address ideals of content, form, stability and openness inherent in true digital publication.

62 ARL Statistics 2005-06: http://www.arl.org/bm~doc/arlstats06.pdf, accessed September 30, 2008.

THE SCAIFE DIGITAL LIBRARY (SDL)

The advice of Themistocles had prevailed on a previous occasion. The revenues from the mines at Laurium had brought great wealth into the Athenians' treasury, and when each man was to receive ten drachmae for his share, Themistocles persuaded the Athenians to make no such division but to use the money to build two hundred ships for the war, that is, for the war with Aegina. This was in fact the war the outbreak of which saved Hellas by compelling the Athenians to become seamen. The ships were not used for the purpose for which they were built, but later came to serve Hellas in her need. (Herodotus 7.144, tr. Godley)

Themistocles somehow convinced his fellow citizens to forego a windfall payment and to invest instead in a navy. Even then, the nominal object of the navy – a war with the nearby island of Aegina – masked the vastly greater, but inconceivably distant, Persian threat. Aegina looms as a presence visible from the Acropolis. Herodotus elsewhere (Hdt. 5.53) reports that the Persian capital at Susa was a three-month journey from Ephesus on the West coast of modern Turkey. While most of us remained focused upon publishing our own work under our own name and building digital resources that would serve our own projects, Ross Scaife early realized that there were bigger issues at stake than a few drachmas of scarce prestige in a small academic field. The idea behind the Scaife Digital Library (SDL) reflected Ross’s own long-term interests: a 1997 grant from the Fund for the Improvement of Postsecondary Education helped Ross Scaife found the Stoa Publishing Consortium to pioneer new models of publication to enhance learning and intellectual life.63 The SDL is a new, virtual collection designed to support the digital publications that meet the four criteria outlined above. The first plans for the SDL were presented at the beginning of a two-day workshop on "What do you do with a million books?" at Humboldt University in Berlin on March 17, 2008, two days after Ross Scaife died in Kentucky. On August 6, 2008, the Institute for the Study of the Ancient World, based at New York University, funded a planning meeting hosted at Harvard’s CHS in Washington, DC.

63 (Marchionini 2000).


The first release of the SDL was announced on November 6 of the same year, at the TEI Annual Meeting at King’s College London. The SDL contains durable digital objects that satisfy the four criteria of digital publication outlined above:

1. The content has been judged worth preserving. Peer review is the most established mechanism for reaching this judgment.
2. The content is in a defined, approved format suited for preservation over long periods of time. Examples include XML documents encoded according to the Guidelines of the TEI and of EpiDoc.64
3. Each object has a long-term institutional home separate from the individual or group that produced it. Digital repositories at Brown University, NYU, Tufts University, and UIUC among other institutions currently store the initial objects in the SDL.
4. Each object is available under an open license. Where authors create documents to present a particular scholarly voice at a particular time, an open access license should allow third parties to quote and republish the document but not to change its content. Where authors create works designed to encapsulate general and evolving points of view (e.g., lexica, commentaries, editions), then an open source license is necessary so that third parties can, in fact, modify the content. In this case, versioning systems track and identify who was responsible for each change.

64 http://epidoc.sourceforge.net/; see also Hugh Cayless’ piece in this collection.

The SDL is simultaneously an idea, a concrete collection, and an organization to produce new content. Any digital objects that satisfy the four criteria of publication automatically belong to the SDL – thus every article already published by the DHQ can be treated as part of the SDL because each DHQ article satisfies all four criteria. Ross Scaife was a classicist and classics offers the initial center of gravity for the SDL, but we exclude nothing relevant to the humanities. The SDL is also a concrete collection: it includes a catalogue of known objects and the information needed for automated services to collect each digital object from its home repository. We hope to see objects from the SDL in a range of locations and organizations: with Internet giants such as Google, at particular computational and storage Grids, and on local computing clusters. Finally, the SDL is an organization designed to produce new content. The production of new SDL content can be a simple decision that any digital object produced by a particular third party (e.g., DHQ) automatically becomes part of the SDL – in this, the SDL mirrors the standing subscriptions by which libraries traditionally purchased every publication from particular publishers in print culture. The SDL also provides editorial review of original content. The SDL does not, however, provide services for end users. The SDL may include the code for those services that only humanists can be expected to provide (e.g., an advanced morphological analyzer for classical Greek) but the SDL does not plan to provide those services. The SDL provides a long term home for the objects which others can analyze or make accessible in various systems. We require that each object have an approved format so that as many groups as possible will develop the largest possible number of services with which to make SDL objects useful to the widest possible audience. In addition, we require that each object have a long term home, which, in effect, states that we have entrusted libraries to apply their traditional functions of preservation and access for SDL objects.
The requirement that each object have an open license reduces our dependence on any one institution: we hope that there will be many copies of each object from the SDL, both under formal preservation systems (such as LOCKSS) and in thousands of informal collections.65

65 LOCKSS stands for "Lots of Copies Keeps Stuff Safe" a program "based at Stanford University Libraries, is an international community initiative that provides libraries with digital preservation tools and support


The SDL thus answers questions of production and preservation but questions remain. The digital environment allows us to rethink not only publication but who can publish and how we divide labor in the scholarly world.

THE WORK OF SCHOLARSHIP: NEW DIVISIONS OF LABOR IN THE WORLD OF GOOGLE AND WIKIPEDIA

Theban Herald: Who is the despot of this land? To whom must I announce the message of Creon who rules over the land of Cadmus, since Eteocles was slain by the hand of his brother Polyneices, at the sevenfold gates of Thebes.
Theseus: You have made a false beginning to your speech, stranger, in seeking a despot here. For this city is not ruled by one man, but is free. The people rule in succession year by year, allowing no preference to wealth, but the poor man shares equally with the rich.
(Euripides, Suppliants 399-408, tr. Kovacs)

Master Tyndale happened to be in the company of a certain divine, recounted for a learned man, and, in communing and disputing with him, he drove him to that issue, that the said great doctor burst out into these blasphemous words, "We were better to be without God's laws than the pope's." Master Tyndale, hearing this, full of godly zeal, and not bearing that blasphemous saying, replied, "I defy the pope, and all his laws," and added, "If God spared him life, ere many years he would cause a boy that driveth the plough to know more of the Scripture than he did." (Foxe 1965)

The papers in this collection have focused upon the practices of scholarship. In this section we consider the work of scholarship and the associated division of labor. The center of gravity for intellectual life has not only shifted, decisively and forever, to a digital

so that they can easily and inexpensively collect and preserve their own copies of authorized e-content." Retrieved from http://www.lockss.org/lockss/Home/.


medium but the relative position of professional humanists has changed as well. To some extent, that division of labor has already begun to shift. The scholarly practices for which we award PhDs, tenure and promotion may have remained largely unchanged but new practices of intellectual life have exploded onto the scene. Most of us like to think of ourselves as a progressive force, but we, in the eyes of many, more closely resemble the bullying Theban Herald of Euripides’ Suppliants. Worse, we may appear to have become like the Athens of Thucydides, a turannos polis, a city-state in which only holders of PhDs, or even only those with professional academic appointments, have the right to speak and contribute. The Tyndales of the twenty-first century maintain blogs, work for the Open Content Alliance (OCA), write for Wikipedia, produce content under Creative Commons open licenses and drive explosive growth of other, novel forms of intellectual production. All of those who have written for this collection feel a profound obligation to address this gulf between the work that we do as professional scholars and the messy, passionate, unruly, intense streams of activity that have carried Wikipedia and other efforts so far. Professional academics have played, insofar as we can tell, almost no direct role within this historic movement. The authors of this conclusion do not know of any academic who has included Wikipedia contributions along with their conventional publications in their yearly reviews. We do know that, as of the end of August 2008, Wikipedia contains more than two and one half million entries. And we know that this resource has proven astonishingly useful, its flaws real but, when systematically analyzed, no worse than those of conventional, centralized reference works.66 No one knows how much labor the various language versions of Wikipedia have absorbed – in part because volunteers have contributed the vast majority of the labor and volunteers do not track billable hours.
Wikipedia does cost money — the 2005 budget for Wikipedia was $739,200, while the overall Wikimedia foundation

66 See (Rosenzweig 2008) (also available at http://chnm.gmu.edu/resources/essays/d/42) and (Giles 2005).


reported a budget of 4.6 million dollars for 2007-2008.67 The aggregate cost will thus represent well under 40 million dollars (i.e., the cost if they had spent $5,000,000/year each year since 2001 when they began). Clay Shirky, however, recently estimated that Wikipedia represented 100,000,000 hours of labor – thus representing at least 1 billion dollars in labor. The ratio of volunteer to paid labor is thus at least 20 to 1, and probably very much higher. The National Endowment for the Humanities (NEH), by contrast, requested a budget of less than 145 million dollars for fiscal year 2009 – it would take almost seven years of the entire NEH budget to produce $1,000,000,000. The labor power unleashed by this one new mode of intellectual production is extraordinary. Scholarly publications incorporate a great deal of accumulated labor. In classics, the language barriers make such embedded labor relatively easy to identify – classicists need expertise in the Greek and Latin languages, familiarity with the ancient core texts of at least one of these languages, and enough knowledge to work comfortably with book-length studies in English, French, German and Italian. If we consider four years of undergraduate education and six years of doctoral studies as one model of scholarly apprenticeship, each scholarly publication represents years of embedded labor. When a faculty member devotes a month or two in the summer to a new publication, we thus need to consider not only the hundreds of hours invested during that summer but all the years of work on which that scholar is drawing. Wikipedia and other forms of community-driven intellectual production ultimately increase the audience for – and thus the realizable value of – advanced scholarship. Professional academics need to decide how they wish to respond to this vast audience.
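The arithmetic behind the Wikipedia labor estimates above can be sketched in a few lines. This is only an illustration under stated assumptions: the hourly value of 10 dollars is simply the rate implied by equating 100,000,000 hours with 1 billion dollars, and counting 2001 through 2008 as eight years of spending is our own reading of "each year since 2001." All names in the sketch are hypothetical.

```python
# Figures quoted in the text (US dollars and hours).
PAID_CEILING_PER_YEAR = 5_000_000        # generous upper bound on yearly spending
YEARS_SINCE_FOUNDING = 2008 - 2001 + 1   # assumption: 2001 through 2008 inclusive
VOLUNTEER_HOURS = 100_000_000            # Shirky's estimate as reported in the text
ASSUMED_HOURLY_VALUE = 10                # assumption implied by "at least 1 billion dollars"
NEH_BUDGET_FY2009 = 145_000_000          # upper bound on the NEH request


def labor_estimates():
    """Recompute the rough comparisons made in the text."""
    paid_ceiling = PAID_CEILING_PER_YEAR * YEARS_SINCE_FOUNDING
    volunteer_value = VOLUNTEER_HOURS * ASSUMED_HOURLY_VALUE
    return {
        "paid_ceiling": paid_ceiling,
        "volunteer_value": volunteer_value,
        # Value of volunteer labor relative to the ceiling on paid costs.
        "volunteer_to_paid": volunteer_value / paid_ceiling,
        # Years of the whole NEH budget needed to match the volunteer value.
        "neh_years": volunteer_value / NEH_BUDGET_FY2009,
    }


if __name__ == "__main__":
    for name, value in labor_estimates().items():
        print(f"{name}: {value:,.1f}")
```

Because the paid figure is a deliberately generous ceiling, the resulting ratio is a lower bound on how far volunteer labor outweighs paid labor.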
67 See http://wikimediafoundation.org/wiki/Budget/2005 and http://wikimediafoundation.org/wiki/Planned_Spending_Distribution_2007-2008.

Many of us are products of a print culture in which our publications simply could not reach beyond a few hundred or, at best, a few thousand research libraries. We had no reason to write for audiences that our publications would never reach. Furthermore, the professionalized incentives of academia rewarded us for producing work that would impress our colleagues and facilitate tenure, promotion, and other signs of academic success. We now have, however, radically new technologies and social practices with which to advance the intellectual life of humanity as a whole. Twentieth century print culture produced scholarship that required a great deal of training to produce and almost as much training to understand, much less appreciate. We now see a world emerging with much lower barriers for entry.

• Tangible contributions. Automated methods can do an immense amount but they benefit as well from very large amounts of skilled human labor. Many basic tasks reflect the strengths of human intelligence and provide opportunities for students and non-professionals to contribute tangibly to the infrastructure on which the study of classics depends. The essays by Blackwell and Martin and by Elliott and Gillies document areas in which students can quickly begin contributing tangibly to our understanding of the ancient world. Bamman and Crane describe the emerging role of syntactic databases – treebanks – for the study of classical Greek and Latin. Even if we have a treebank with millions of words already analyzed to serve as a training set for an automated syntactic analyzer, the best automated systems do not, at present, provide more than 87 or 88% accuracy – enough for many analytical purposes but not perfect. Greek classes at Brandeis, Tufts, Furman and elsewhere have already begun to integrate the production of syntactic data into their curricula. The method is straightforward. Treebanks use their own tags and methodologies but, in essence, the production of treebanks depends upon ancient practices of reading – we need to identify the main verb, its subjects, objects, etc.
Two students can, for example, each analyze the same sentence; the class can then discuss the points at which they differ and produce carefully analyzed sentences that may include variant interpretations. When given a particular set of tags and relationships, most readers will agree on the syntactic relationships between most words in most texts. Some Greek sentences, however, support multiple interpretations, whether because we are not sure what the author originally wrote or because the text that we have reconstructed is fundamentally ambiguous. Ultimately, the syntactic analysis for some words in our surviving texts remains an object of research. Other tasks that are in most cases straightforward can be the object of research as well: in some cases we cannot determine to which Antonius or Cleopatra a particular passage alludes and we depend upon skilled prosopographical analysis to rank the possibilities. We find place names where we do not know for sure the original location. Word sense disambiguation depends upon the senses that we ascribe to a word and thus upon semantic analysis that can become complex for common words. We thus see a gradient of tasks. In many cases, students and undergraduate classes can improve upon the results of automated processes and/or provide the initial training data from which, in turn, automated methods can analyze much larger bodies of material. In some cases, the answers to conceptually simple questions (e.g., who is the Antonius in this passage? What is the structure of this sentence?) are not immediately clear and have historically provided scope for some of the most skillful classical scholarship. The patterns visible from the many passages that are not controversial will, when aggregated and analyzed, allow us to place discussions of ambiguous instances on a more explicit and quantified footing. We may even find scholarly consensus advancing as new scholarly instruments, developed in large measure by students and the general public, allow us to shed light upon old problems. Thus, we have a space that provides ample room for contributors at various levels of expertise.

• Undergraduate research. Once we have large databases of information we can begin to see patterns that were not visible before.
We rely upon automated methods of analysis to direct our attention to interesting patterns and thus to serve as the starting point for, rather than a conclusion to, analysis. It is important to emphasize that we do not need perfect data to identify major patterns – a recent


study conducted by David Bamman showed that even when automated syntactic analysis generated results that were as low as 50% accurate, some significant linguistic patterns were visible despite the noise of a 50% error rate.68 New sources of data open up possible research topics to which our advanced undergraduates can realistically aspire. The Homer Multitext Project, for example, has published high resolution images of the most important manuscript of Homer, the 10th century Venetus A, making visually accessible scholia and readings that have never been published, much less translated. Students are well able to produce initial diplomatic editions with basic contextual information and English translation. Published in standard formats under open licenses and in long term institutional repositories, such works can provide the foundation for a new generation of editions. Generations of students can productively provide the intellectual apparatus needed to understand the detailed page images already being produced in Europe and North America for manuscripts of Homer and other classical authors, fundamentally changing the role that these source materials can play in intellectual life. Likewise, the creation of treebanks allows us to see patterns of word usage, linguistic practice and individual style. Even now, as we develop large automated treebanks, students can create treebanks for individual works and control samples to produce original research: thus, given a treebank and the ability to find Greek words corresponding to English, students could undertake valuable systematic studies that were not practical before (e.g., the semantics of words for "power" in Herodotus and Thucydides). The results of their research can be published through our university repositories, connected to every passage on which they shed light, and

68 (Bamman 2008).


preserved, as permanent contributions, long after their youthful authors have passed from the scene. It would be hard to overstate the possible opportunities of practical undergraduate research for classics and the humanities in general.

The field of classics – and, indeed, every field within the humanities – needs to adapt itself to the challenges and opportunities, some realized, others emergent though visible in outline, that this digital environment has thrust upon us.

First, all classicists are digital classicists. Insofar as the practices of our work advance research projects imagined within the limitations and for the tiny academic audiences of print culture, we are antiquarians. We may not believe in particular ideas such as the "judgment of history," but we do believe in conventional ideas and are confident that the implicit assumptions about what constituted scholarship in the twentieth century will give way to new conventional ideas. Each of us working now for an audience in the future is making bets about what those conventional ideas will assume. The authors of this conclusion are not so sanguine as to believe that the culture and languages of ancient Greece and Rome will inevitably flow outwards into the hearts and minds of humanity as vigorously as we hope. Technology constrains and enables the space within which we move. How well and how quickly we in classics and the humanities adapt to the niches within this space depends upon the decisions that we make (however unpredictable the outcomes of those decisions may be). We do not know yet what common technological knowledge classicists must share. We cannot all be accredited system administrators or application programmers. On the other hand, it is hard to accept complaints that the TEI Guidelines or the underlying structures of treebanks are too complicated for scholars who work with six languages.
The services outlined above can use textual and syntactic markup to enable new forms of scholarship and of reading support, but such data structures are, fundamentally, surface expressions of traditional ideas. Habits from the past and anxiety about the future are the major barriers. Those who have succeeded in the traditional tasks of classical philology will, if they can muster the necessary labor, find themselves in a world that allows them to pursue their traditional tasks more fully. If they can read Pericles’ Funeral Oration in the original Greek, they are well able to master any general technological system. Classicists need only to exploit the analytical tools and conventions of intellectual discourse available to them to achieve their goals. For us, the blogs, wikis, assorted web pages and other digital tools simply challenge us to adapt the complementary goals of rhetorical power and intellectual discipline. We hope that others will more fully realize these goals than has been possible so far for us.

Second, classicists need some scholars who have more advanced knowledge of the technology. We do not have the resources to sustain a subfield such as bioinformatics, but the broadening textual collections and treebanks now starting to emerge for Greek and Latin build upon many of the same techniques used to find patterns in the human genome. The most important philologists now at work may well be the classicists who have joined the field of computer science and are now laying the foundations on which all philological research will depend. Rising scholars such as David Smith, David Mimno, Ryan Gabbard, and Gabriel Weaver, originally trained as classicists, are now conducting work in machine translation, text mining and general natural language processing that is foundational for classical studies. We may not be able to imagine the shape that our field will assume in the centuries to come, but future change does not absolve us from the obligation to understand what is already possible. None of the PhD programs with which we are familiar has addressed the challenge of producing and supporting those scholars who can show us how to pursue the ancient goals of our field in the rapidly shifting technological spaces within which we live.

Third, we need new institutions to provide access to the results of our work. Neither the libraries nor the publishers of the early twenty-first century serve the needs that emerge in this collection.
While libraries may survive and indeed flourish as an institution, they will do so by subsuming and transforming the functions that we entrusted to publishers in print culture. We need a small number of library-publishers that can help classicists produce new content and then maintain that content over time. And that content must include not only relatively static documents but, at least for now, a minimal set of executable code: every discipline will probably need at least some services that only experts in the field can create and that are part of the field’s core infrastructure. Morphological analysis and lemmatization, mentioned above, are fundamental processes that should be applied automatically to every digital word of Greek and Latin. Classicists may need to develop these systems, but the systems, once developed, need to be preserved as active services along with the XML texts, 2D images, GIS datasets and stable collections. The seeds of these new organizations are visible in the Digital Knowledge Center in the Johns Hopkins University Library system and the California Digital Library, but we do not yet see in operation a mature model that can serve our needs in the present and expand over time. The Perseus Digital Library thus still finds itself compelled to maintain its own servers as best it can, maintaining services that were innovative a decade ago but are still beyond the capacity of any systems with which we are familiar.

Google is moving very quickly in this vacuum. The academic library system failed to address the legal, technical, and financial challenges of converting its retrospective print holdings into digital form. Google Books is rapidly filling the vacuum of collections and services that libraries left. Perhaps it was impossible for our library system, rich in the aggregate, to organize itself. If so, libraries may evolve into a handful of repositories, acting as wholesalers to provide the content by which the Googles, Microsofts, Yahoos and their brethren support the intellectual life of humanity. If the commercial world can generate revenue by providing access to content that anyone can download, then the market may function well enough to provide universal access.69
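To make concrete what lemmatization as a preserved service involves, the following is a toy sketch only, not the Perseus morphological analyzer or any existing system. The table `FORMS` and the functions `analyze` and `candidate_lemmas` are hypothetical names introduced for illustration, and the handful of Latin analyses listed is deliberately incomplete. The sketch shows the basic idea of mapping each inflected form to its possible dictionary headwords, including a form that is ambiguous between two lemmas.

```python
# Toy lookup-based lemmatizer. The table below is hand-built for
# illustration and is NOT exhaustive; all names here are hypothetical.
FORMS = {
    "puellam": [("puella", "noun, acc. sg.")],
    "puellae": [
        ("puella", "noun, gen. sg."),
        ("puella", "noun, dat. sg."),
        ("puella", "noun, nom. pl."),
    ],
    "amat": [("amo", "verb, 3rd sg. pres. act. ind.")],
    # One surface form, two possible lemmas:
    "reges": [
        ("rex", "noun, nom./acc. pl."),
        ("rego", "verb, 2nd sg. fut. act. ind."),
    ],
}


def analyze(token):
    """Return every (lemma, analysis) pair known for a token, or [] if unknown."""
    return FORMS.get(token.lower(), [])


def candidate_lemmas(text):
    """Pair each whitespace-separated token with its sorted candidate lemmas."""
    return [
        (token, sorted({lemma for lemma, _ in analyze(token)}))
        for token in text.split()
    ]


if __name__ == "__main__":
    for token, lemmas in candidate_lemmas("reges puellam amat"):
        print(token, lemmas)
```

A real service would generate analyses from stems and endings rather than list them by hand, and would need context to choose among candidates; the point of the sketch is only that the output of such a service is data that can be stored and queried alongside the texts themselves.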

69 In the public information Google reports (http://books.google.com/googlebooks/history.html, accessed August 31, 2008): "As part of this fact-finding mission, Larry Page reaches out to the University of Michigan, his alma mater and a pioneer in library digitization efforts including JSTOR and Making of America. When he learns that the current estimate for scanning the university library's seven million volumes is 1,000 years, he tells university president Mary Sue Coleman he believes Google can help make it happen in six." If the source for this figure had imagined the ARL libraries alone dedicating 1% of the $1,000,000,000 collections budget to digital conversion, the $10,000,000 would pay for roughly 300,000 books per year, or roughly 16 years for 5,000,000 volumes, with the Open Content Alliance Workflow. The library community simply did not think that its retrospective collections were worth the technical, political, and legislative trouble. It will be interesting to see how many observers, a generation from now, will view the leadership of the early twenty-first-century libraries with sympathy, much less admiration.

CONCLUSION: BLOOD FOR THE SHADES

I see here the spirit of my dead mother; she sits in silence near the blood, and does not look upon the face of her own son or speak to him. Tell me, prince, how she may recognize that I am he. [145] So I spoke, and he straightway answered, and said: Easy is the word that I shall say and put in your mind. Whomsoever of those that are dead and gone you shall allow to draw near the blood, he will tell you true things; but whoever you refuse, he surely will go back again. [150] So saying the spirit of the prince, Teiresias, went back into the house of Hades, when he had declared his prophecies; but I remained there steadfastly until my mother came up and drank the dark blood. At once then she knew me.
(Odysseus and Teiresias, Homer Odyssey 11.145-153, after A. T. Murray)

Fifty sons I had, when the sons of the Achaeans came; nineteen were born to me of the self-same womb, and the others women of the palace bore. Of these, many as they were, furious Ares hath loosed the knees, and he that alone was left me, that by himself guarded the city and the men, him you slew, just now as he fought for his country, even Hector. For his sake have I now come to the ships of the Achaeans to win him back from you, and I bear with me ransom past counting. Nay, have awe of the gods, Achilles, and take pity on me, remembering your own father. See, I am more piteous far than he, and have endured what no other mortal on the face of earth hath yet endured, to reach forth my hand to the face of him that hath slain my sons. So he spoke, and in Achilles he roused desire to weep for his father; and he took the old man by the hand, and gently put him from him. So the two thought of their dead, and wept.
(Priam and Achilles, Homer Iliad 24.495-507, after A. T. Murray)

A new, digital infrastructure provides the explicit subject for this collection of essays. We can now create collections that are larger than any Ptolemy or Cleopatra could have imagined for their Alexandria. We have ever more sophisticated services that can analyze and combine these collections in new ways and even generate the stuff of new knowledge. And the material systems on which these services are based simply did not exist half a century ago and cost 100,000 times less now than they did a quarter century ago.70 And if the essays published here have focused upon what we can learn from our textual record, these collections capture sound, images, and data that human hands alone can never transcribe. Indeed, the writing on inscriptions, papyri, and manuscripts now appears as images, open for humans to read and machines to analyze, ready to reveal long forgotten aspects of the living world that produced them. But if everything that we use as a tool is different, nothing that we truly value is new. Like Odysseus in the Underworld, we bring blood to the shades and seek, insofar as possible, to let those who have gone before us converse with us in their own words. All of us who have studied literature in the academy understand that we can never fully understand our subjects – the very notion of understanding implies a fixity that does not suit complex human beings, filled with contradictory impulses and defined as much by their changing potential for actions as by anything they have done in the past. Priam and Achilles communicate in a single language and understand the cultural backgrounds from which each comes

70 In 1982, the Harvard Classics Department paid $34,000 for a 660 megabyte disk ($51 per megabyte). In October 2008, 1 terabyte drives – with more than 1,000 times the capacity of the 660 megabyte drive from 1982 – are available for under $500 (c. $.0005 per megabyte).
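The arithmetic behind this note can be checked with a short sketch. It uses only the prices and capacities quoted above; the variable and function names are illustrative, and treating 1 terabyte as 1,000,000 megabytes (the decimal convention) is an assumption, since the note does not say which convention it uses.

```python
# Figures as quoted in the note; names are illustrative only.
COST_1982_USD = 34_000        # price paid for the 1982 disk
CAPACITY_1982_MB = 660        # capacity of that disk in megabytes
COST_2008_USD = 500           # upper bound quoted for a 1 TB drive in 2008
CAPACITY_2008_MB = 1_000_000  # assumption: 1 TB = 1,000,000 MB (decimal)


def cost_per_megabyte(cost_usd, capacity_mb):
    """Return the price in dollars of one megabyte of storage."""
    return cost_usd / capacity_mb


per_mb_1982 = cost_per_megabyte(COST_1982_USD, CAPACITY_1982_MB)
per_mb_2008 = cost_per_megabyte(COST_2008_USD, CAPACITY_2008_MB)

# How many times cheaper a megabyte became, and how much larger the drive is.
price_ratio = per_mb_1982 / per_mb_2008
capacity_ratio = CAPACITY_2008_MB / CAPACITY_1982_MB

if __name__ == "__main__":
    print(f"1982: ${per_mb_1982:.2f} per MB")
    print(f"2008: ${per_mb_2008:.4f} per MB")
    print(f"price ratio: about {price_ratio:,.0f} to 1")
    print(f"capacity ratio: about {capacity_ratio:,.0f} to 1")
```

Under these assumptions the per-megabyte price falls by a factor of a little over 100,000, which is consistent with the figure given in the main text, and the capacity ratio is comfortably above the "more than 1,000 times" stated in the note.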


but each of them crosses a gulf as great as that which any mere quantity of time and space can pose. Their moment together has no material effect upon the great events around them. Each will soon suffer a violent death. Troy will fall and a massacre will follow. But the moment above has been powerful for many audiences over the course of almost three millennia, perhaps all the more powerful for the violence that surrounds it.

The future of the past has never been brighter. The digital medium offers new methods with which to make Greco-Roman culture and classical Greek and Latin physically and intellectually accessible to audiences vastly larger and more diverse than was ever feasible in print. The culture of the Greco-Roman world and the languages of classical Greek and Latin can play a fuller role in the cultural memory of all mankind than ever before. The ideas and actions of those who lived in the Greco-Roman world and expressed themselves in Greek and Latin can begin to quicken hearts and fire minds that dream in Chinese, Hindi, and, in the end, every language of humanity.

Each of us brings to bear the skills that we have acquired during the time that we have on this earth. Those skills and periods of time vary. Generations pass. Technologies change. Nations rise and fall. Languages fade away and transform themselves beyond recognition. But the memory of classical antiquity has endured over the millennia. All of us who have dedicated our lives to this field – whether we struggle with new technologies or contemplate the record of the past in more traditional ways – are privileged in the subject that we have chosen.

We composed these essays in sadness at the loss of our friend, Allen Ross Scaife, but we send them forth in hope as we contemplate the future that Ross helped to make possible.

BIBLIOGRAPHY

Atkins 2003 Atkins, Daniel E., et al. "Revolutionizing Science and Engineering Through Cyberinfrastructure: Report of the National Science Foundation Blue-Ribbon Advisory Panel on Cyberinfrastructure", 2003: http://www.nsf.gov/od/oci/reports/atkins.pdf.
Bamman 2006 Bamman, D. and G. Crane. "The Design and Use of a Latin Dependency Treebank." In TLT 2006: Proceedings of the Fifth International Treebanks and Linguistic Theories Conference: 67-78.
Bamman 2007 Bamman, D. and G. Crane. "The Latin Dependency Treebank in a Cultural Heritage Digital Library." In Proceedings of the Workshop on Language Technology for Cultural Heritage Data (LaTeCH 2007): 33-40. http://dl.tufts.edu/view_pdf.jsp?pid=tufts:PB.001.002.00002.
Bamman 2008 Bamman, D. and G. Crane. "Building a Dynamic Lexicon from a Digital Library." In Proceedings of the 8th ACM/IEEE-CS joint conference on Digital libraries, 2008: 11-20.
Bankier 2008 Bankier, J.-G., and I. Perciali. "The Institutional Repository Rediscovered: What can a University do for Open Access Publishing." Serials Review, 34 (2008): 21-26.
Berti 2007 Berti, M. and V. Costa. "Alexandria and the Mirage of a Million Book Library." Paper presented at conference "The World's Greatest Libraries: From Ancient Alexandria to the 21st Century." College of the Holy Cross, November 2007.
Byrne 2007 Byrne, K. "Named Entity Recognition in Historical Archive Text." In ICSC 2007: International Conference on Semantic Computing: 589-596.
Campbell 2006 Campbell, J. D. "Changing a Cultural Icon: The Academic Library as a Virtual Destination." Educause Review, 41:1 (2006): 16-31. http://connect.educause.edu/Library/EDUCAUSE+Review/ChangingaCulturalIconTheA/40602?time=1195059427.
Cardey 2006 Cardey, S. et al. "The Development of a Multilingual Collocation Dictionary." In Proceedings of the Workshop on Multilingual Language Resources and Interoperability, 2006: 32-39. http://www.aclweb.org/anthology-new/W/W06/W06-1005.pdf.
Carpuat 2005 Carpuat, M. and D. Wu. "Word Sense Disambiguation vs. Statistical Machine Translation." In ACL '05: Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics, 2005: 387-394.
Crane 1991 Crane, G. "Generating and Parsing Classical Greek." Literary & Linguistic Computing, 6:4 (1991): 243-5.
Crane 2005 Crane, G. and A. Jones. 2005. "The Perseus American Collection 1.0." http://dl.tufts.edu//view_pdf.jsp?urn=tufts:facpubs:gcrane-2006.00001.
Crane 2006 Crane, G. and A. Jones. "The Challenge of Virginia Banks: An Evaluation of Named Entity Analysis in a 19th Century Newspaper Collection." In Proceedings of JCDL 2006: 31-40. http://dl.tufts.edu/view_pdf.jsp?pid=tufts:PB.001.001.00007.
Dawkins 1976 Dawkins, R. The Selfish Gene. Oxford, Oxford University Press, 1976.
Dekhtyar 2006 Dekhtyar, A. et al. "Support for XML Markup of Image-Based Electronic Editions." International Journal on Digital Libraries, 6:1 (2006): 55-69.
Deng 2006 Deng, Y., et al. "Segmentation and Alignment of Parallel Text for Statistical Machine Translation." Natural Language Engineering, 12:4 (2006): 1-26.
Derrida 1981 Derrida, J. "Plato's Pharmacy." In: Dissemination (Chicago, University of Chicago Press, 1981): 61-84.
Eder 2007 Eder, M. "How Rhythmical is Hexameter: A Statistical Approach to Ancient Epic Poetry." Digital Humanities 2007. http://www.digitalhumanities.org/dh2007/abstracts/xhtml.xq?id=137.
Emerson 1971 Emerson, Ralph Waldo. "The American Scholar." In The Collected Works of Ralph Waldo Emerson, Volume 1: Nature, Addresses, and Lectures (Cambridge, MA: Harvard University Press, 1971). A digital version is, at the time of writing, available at http://www.apstudent.com/ushistory/docs1801/amrschol.htm (accessed October 5, 2008).
Ernst-Gerlach 2008 Ernst-Gerlach, A. and G. Crane. "Identifying Quotations in Reference Works and Primary Materials." In Proceedings of the 12th European Conference on Research and Advanced Technology for Digital Libraries (ECDL 2008): 78-87.
Feng 2006 Feng, S. and R. Manmatha. "A Hierarchical, HMM-based Automatic Evaluation of OCR Accuracy for a Digital Library of Books." In JCDL '06: Proceedings of the 6th ACM/IEEE-CS joint conference on Digital libraries, 2006: 109-118.
Foxe 1965 Foxe, John. Acts and Monuments, Vol. 5. Arno Press, 1965. http://en.wikisource.org/wiki/The_Book_of_Martyrs/Chapter_XII contains an abridged version of Foxe.
Giles 2005 Giles, J. "Internet Encyclopaedias go Head to Head." Nature, 438 (2005): 900-901.
Grafton 2007 Grafton, A. "Future Reading: Digitization and Its Discontents." New Yorker, November 5, 2007, http://www.newyorker.com/reporting/2007/11/05/071105fa_fact_grafton?currentPage=all.
Hahn 2008 Hahn, Karla L. 2008. "Talk About Talking About New Models of Scholarly Communication." Journal of Electronic Publishing, 11:1 (2008): http://hdl.handle.net/2027/spo.3336451.0011.108.
Hockx-Yu 2006 Hockx-Yu, H. "Digital Preservation in the Context of Institutional Repositories." Program: Electronic Library & Information Systems, 40:3 (2006): 232-243.
Ide 1998 Ide, N., and J. Veronis. "Introduction to the Special Issue on Word Sense Disambiguation: the State of the Art." Computational Linguistics, 24:1 (1998): 2-40.
Johnson 2007 Johnson, R. K. "In Google's Broad Wake: Taking Responsibility for Shaping the Global Digital Library." ARL: A Bimonthly Report on Research Library Issues and Actions from ARL, CNI, and SPARC, 2007: 1-17. http://www.arl.org/bm~doc/arlbr250digprinciples.pdf.
Jones 2007 Jones, G., et al. "Multilingual Search for Cultural Heritage Archives via Combining Multiple Translation Resources." In Proceedings of the Workshop on Language Technology for Cultural Heritage Data (LaTeCH 2007): 81-88. http://www.aclweb.org/anthology/W/W07/W07-0911.
Kaufman 2007 Kaufman, P. and J. Ubois. "Good Terms Improving Commercial-Noncommercial Partnerships for Mass Digitization." D-Lib Magazine, 13:11/12 (2007): http://www.dlib.org/dlib/november07/kaufman/11kaufman.html.
Liu 2007 Liu, Y., et al. "TableSeer: Automatic Table Metadata Extraction and Searching in Digital Libraries." In Proceedings of JCDL 2007: 91-100.
Lu 2007 Lu, X., et al. "Intelligent Parsing of Scanned Volumes for Web Based Archives." In ICSC 2007: International Conference on Semantic Computing: 559-568.
Marchionini 1994 Marchionini, G. and G. Crane. "Evaluating Hypermedia and Learning: Methods and Results from the Perseus Project." ACM Transactions on Information Systems, 12:1 (1994): 5-34.
Marchionini 2000 Marchionini, G., R. Scaife, et al. "1997-2000 Final Evaluation Report on the Perseus Project Publication Model," 2000. Retrieved 8/18/2008, from http://ils.unc.edu/~march/perseus/final_report.pdf.
Martin 2006 Martin, T. R. Ancient Greece: From Prehistoric to Hellenistic Times. New Haven, Yale University Press, 1996.
McCown 2005 McCown, F., S. Chan, et al. "The Availability and Persistence of Web References in D-Lib Magazine." Proceedings of the 5th International Web Archiving Workshop and Digital Preservation (IWAW'05).
Mobley 1998 Mobley, E. R. "Ruminations on the Sci-Tech Serials Crisis." Issues in Science and Technology. Retrieved on June 30, 2008 from http://www.library.ucsb.edu/istl/98fall/article4.html.
Monella 2008 Monella, P. "Towards a Digital Model to Edit the Different Paratextuality Levels Within a Textual Tradition." Digital Medievalist, 2008: http://www.digitalmedievalist.org/journal/4/monella/.
Nadeau 2007 Nadeau, D. and S. Sekine. "A Survey of Named Entity Recognition and Classification." Lingvisticae Investigationes, 30:1 (2007): 3-26.
Packard 1973 Packard, D. W. "Computer-Assisted Morphological Analysis of Ancient Greek." Proceedings of the 5th Conference on Computational Linguistics, Pisa, Italy, 1973: 343-55.
Panitch 2005 Panitch, J. M. and S. Michalak. "The Serials Crisis." UNC-Chapel Hill Scholarly Communications Convocation 2005, from http://www.unc.edu/scholcomdig/whitepapers/panitch-michalak.html.
Pantel 2002 Pantel, P. and D. Lin. "Discovering Word Senses from Text." In Proceedings of the eighth ACM SIGKDD International conference on Knowledge Discovery and Data Mining, 2002: 613-619.
Parrott 2004 Parrott, J. "The Crisis in Scholarly Publishing" (August 9, 2004). Retrieved June 30, 2008, from http://www.lib.uwaterloo.ca/society/crisis.html.
Petrelli 2006 Petrelli, D., et al. "Which User Interaction for Cross-Language Information Retrieval? Design Issues and Reflections." Journal of the American Society for Information Science and Technology, 57:5 (2006): 709-722.
Pomerantz 2007 Pomerantz, J. and G. Marchionini. "The Digital Library as Place." Journal of Documentation, 63:4 (2007): 505-533.
Porter 2006 Porter, D., et al. "Creating CTS Collections." In Digital Humanities 2006: 269-274.
Pouliquen 2003 Pouliquen, B. et al. "Automatic Identification of Document Translations in Large Multilingual Document Collections." In Proceedings of the International Conference Recent Advances in Natural Language Processing (RANLP 2003): 401-408.
Pritchard 2008 Pritchard, D. "Working Papers, Open Access and Cyber-Infrastructure in Classical Studies." Preprint of paper to appear in Literary & Linguistic Computing (2008), available at http://hdl.handle.net/2123/2226.
Raymond 1987 Raymond, D. R. and F. W. Tompa. "Hypertext and the New Oxford English Dictionary." In Proceedings of the ACM Conference on Hypertext (Chapel Hill, North Carolina, United States). HYPERTEXT '87 (ACM, New York, NY): 143-153.
Robinson 2005 Robinson, P. "Current Issues in Making Digital Editions of Medieval Texts, or, do Electronic Scholarly Editions Have a Future?" Digital Medievalist, 1:1 (2005): http://www.digitalmedievalist.org/journal/1.1/robinson/.
Romanello 2008 Romanello, M. "A Semantic Linking Framework to Provide Critical Value-Added Services for E-Journals on Classics." ELPUB 2008: Open Scholarship: Authority, Community, and Sustainability in the Age of Web 2.0 - Proceedings of the 12th International Conference on Electronic Publishing: http://elpub.scix.net/data/works/att/401_elpub2008.content.pdf.
Rosenzweig 2008 Rosenzweig, R. "Can History Be Open Source: Wikipedia and the Future of the Past?" Journal of American History, 93:1 (2006): 117-146. http://chnm.gmu.edu/resources/essays/d/42.
Sankar 2006 Sankar, P., et al. "Digitizing a Million Books: Challenges for Document Analysis." In Document Analysis Systems VII (2006): 425-436.
Schilit 2008 Schilit, B. N., and O. Kolak. "Exploring a Digital Library Through Key Ideas." In JCDL '08: Proceedings of the 8th ACM/IEEE-CS Joint Conference on Digital Libraries, 2008: 177-186.
Schreibman 2003 Schreibman, S. et al. "The Versioning Machine." Literary and Linguistic Computing, 18:1 (2003): 101-107.
Shirky 2008 Shirky, C. Here Comes Everybody: the Power of Organizing without Organizations. New York: Penguin Press, 2008.
Smith 2008 Smith, A. "The Research Library in the 21st Century: Collecting, Preserving, and Making Accessible Resources for Scholarship." In No Brief Candle: Reconceiving Research Libraries for the 21st Century, CLIR Publication No 142 (2008): 13-20. http://www.clir.org/pubs/reports/pub142/smith.html.
Toselli 2007 Toselli, A., et al. "Viterbi Based Alignment between Text Images and Their Transcripts." Proceedings of the Workshop on Language Technology for Cultural Heritage Data (LaTeCH 2007): 9-16.
Turing 1950 Turing, A. "Computing Machinery and Intelligence." Mind 59:236 (1950): 433-460.
von Humboldt 1821 Wilhelm von Humboldt, "Lecture to the Prussian Academy," 1821. From a lecture delivered to the Prussian Academy of Sciences in 1821, quoted in [Von Ranke 1973, 21].
Von Ranke 1973 Von Ranke, Leopold. The Theory and Practice of History. Edited and translated by Georg G. Iggers and Konrad von Moltke. New York: Irvington Publishers, 1973.
Zaslavsky 2001 Zaslavsky, A. B. et al. "Using Copy-Detection and Text Comparison Algorithms for Cross-Referencing Multiple Editions of Literary Works." In ECDL '01: Proceedings of the 5th European Conference on Research and Advanced Technology for Digital Libraries, 2001: 103-114.

AUTHOR BIOGRAPHIES Alison Babeu has served as the digital librarian and research coordinator for the Perseus Project since 2004. Before coming to Perseus, she worked as a librarian at both the Harvard Business School and the Boston Public Library. She has a BA in History from Mount Holyoke College and an MLS from Simmons College. Her current research projects at Perseus include the development of an open source library of classical texts and a FRBR-inspired catalog as part of the Mellon funded Cybereditions Project. David Bamman is a senior researcher in computational linguistics for the Perseus Project, focusing especially on natural language processing for Latin and Greek, including treebank construction, computational lexicography, morphological tagging and word sense disambiguation. David received a BA in Classics from the University of Wisconsin-Madison and an MA in Applied Linguistics from Boston University. He is currently leading the development of the Latin Dependency Treebank and the Dynamic Lexicon Project. Christopher Blackwell is an Associate Professor of Classics at Furman University in Greenville, South Carolina. He holds a B.A. from Marlboro College and a Ph.D. from Duke University. He has published on historical topics for scholarly audiences and general readers, and works on a variety of projects in digital humanities. Gabriel Bodard is a researcher at the Centre for Computing in the Humanities, King's College London, working principally on digitization of inscriptions and papyri. He has been a key contributor to the EpiDoc Collaborative, and is on the Technical Council of the Text Encoding Initiative Consortium and the Steering Committee of the British Epigraphy Society. He is a founder of the Digital Classicist, and has a particular interest in collaboration between ancient world scholars and computer scientists. 447

448

CHANGING THE CENTER OF GRAVITY

Thomas Breuel is professor of computer science at the Technical University of Kaiserslautern Computer Science Department, head of the Image Understanding and Pattern Recognition (IUPR) research group at the DFKI, and a consultant in Palo Alto, CA, USA. His research group works in the areas of image understanding, document imaging, computer vision, and pattern recognition. Hugh Cayless is the Head of the Research & Development group in the Carolina Digital Library and Archives at UNC Chapel Hill. He holds a Ph.D. in Classics and a Master's in Information Science from UNC and teaches XML in the School of Information and Library Science there. He has worked on standards and practices for encoding inscriptions and related ancient material as part of EpiDoc since its inception. His current research focuses on the linkage of image and text in online enviroments. Lisa Cerrato is managing editor of the Perseus Project, overseeing a variety of work. Lisa received a BA in Latin from Tufts University, and has been with the project since 1994. Her interests include furthering classical education, particularly Latin and Greek, teaching with technology, and user-driven content management. Gregory Crane is Professor of Classics and Winnick Family Chair of Technology and Entrepreneurship at Tufts University and the editor in chief of the Perseus Project. He has a broad interest in and has published extensively on the interaction between intellectual practice and technological infrastructure in the humanities. Daniel Deckers coordinates Teuchos. Zentrum für Handschriften- und Textforschung at Hamburg University, Germany, where he is also working on a PhD. He received an M.A. in Classics from Hamburg University and his research interests include the multispectral imaging of manuscripts, Greek codicology, and electronic edition. 
Casey Dué is Associate Professor and Director of Classical Studies at the University of Houston, as well Executive Editor for publications at Harvard’s Center for Hellenic Studies in Washington, D.C. She holds a B.A. in Classics from Brown University, and an M.A. and Ph.D in Classical Philology from Harvard University. Her teaching and research interests include ancient Greek oral traditions, Homeric poetry, Greek tragedy, and textual criticism. Publications include: Homeric Variations on a Lament by Briseis. (Lanham, Md.: Rowman and Littlefield Press, 2002), The Captive

AUTHOR BIOGRAPHIES

449

Woman’s Lament in Greek Tragedy (Austin: University of Texas Press, 2006) and the edited volume Recapturing a Homeric Legacy: Images and Insights from the Venetus A Manuscript of the Iliad (Cambridge, MA: Harvard University Press, forthcoming Fall 2008). She is also one of the co-editors of the Homer Multitext Project. She is currently working together with Mary Ebbott on a multitextual edition with essays and commentary on book 10 of the Iliad (Iliad 10 and the Poetics of Ambush (Cambridge, MA: Harvard University Press, forthcoming Fall 2009). Mary Ebbott is Associate Professor in the Classics Department at the College of the Holy Cross in Worcester, Massachusetts. She earned her B.A. in Classical Languages at Bryn Mawr College and her Ph.D. in Classical Philology from Harvard University. She is co-editor of the Homer Multitext project (http://chs.harvard.edu/chs/homer_multitext), and her publications include Imagining Illegitimacy in Classical Greek Literature (Lanham, Md.: Lexington Books, 2003) and the forthcoming book, Iliad 10 and the Poetics of Ambush: A Multitext Edition with Essays and Commentary, co-authored with Casey Dué. Tom Elliott is Associate Director for Digital Programs and Senior Research Scholar at the Institute for the Study of the Ancient World at New York University. Information about his current work — which spans digital approaches to epigraphy, papyrology, historical geography and other aspects of ancient studies — is provided on his home page at http://homepages.nyu.edu/~te20/ Raphael Finkel received a PhD from Stanford University in 1976 in the area of Robotics. He was a faculty member of the University of Wisconsin - Madison from 1976 to 1987. He has been a professor of computer science at the University of Kentucky in Lexington since 1987. His early research involves distributed data structures, distributed algorithms, and distributed operating systems. 
Recent projects include formalizing natural-language morphology with default inheritance hierarchies, designing and implementing a web-based scheme for students to work on organic chemistry homework, and using constraints to generate puzzles like Sudoku, to model an advice-giving scenario, and to build and solve logic puzzles. Dr. Finkel has published over 50 articles in refereed journals and conferences and has produced over 50 technical reports. He has written two textbooks: An Operating Systems Vade Mecum (Prentice-Hall, 1988) and Advanced Programming Language Design (Benjamin-Cummings, 1996). He is also a coauthor of The Hacker's Dictionary (Harper and Row, 1983).

Sean Gillies is a computer programmer and pioneer in the field of open source geographic information systems. He has been a member of the MapServer Project's Steering Committee and now leads the GIS-Python Laboratory, an international effort to develop excellent GIS tools for the Python programming language. His sometimes influential blog focuses on the geospatial industry, open source software, and the Web. He currently directs software development at New York University's Institute for the Study of the Ancient World.

Anke Lüdeling is Professor of Corpus Linguistics and Morphology at Humboldt University in Berlin. Her main interests lie in the architecture and analysis of small, deeply annotated corpora of non-standard varieties of language, such as historical corpora or learner corpora. Together with Merja Kytö, she has recently edited Corpus Linguistics. An International Handbook (published by de Gruyter).

Anne Mahoney received her PhD in Classics from Boston University. She teaches Greek, Latin, and occasionally Sanskrit at Tufts University, and works on meter and poetics. She is the author of a commentary on Plautus's Amphitryo (Focus: 2004) and of articles on Giovanni Pascoli's Latin poetry, saturnians, and pedagogy.

Thomas R. Martin is the Jeremiah W. O'Connor Jr. Professor in Classics at the College of the Holy Cross in Worcester, Massachusetts, USA. He was one of the original participants in the Perseus Project and a contributor to DEMOS, one of the projects under the aegis of STOA, both of which allowed him to collaborate with (and come to admire greatly) Ross Scaife. He is the author of Ancient Greece: From Prehistoric to Hellenistic Times and other works on the history of ancient Greece and Rome and of ancient western civilization.

David Mimno is a PhD student in the Department of Computer Science at the University of Massachusetts, Amherst.
He specializes in machine learning and text data mining. Previously he was Head Programmer at the Perseus Digital Library, where he led the development of Perseus 4.0, a completely new implementation of the library's document processing system.

Gregory Nagy is Francis Jones Professor of Classical Greek Literature and Professor of Comparative Literature at Harvard
University. His special research interests are archaic Greek literature and oral poetics, and he finds it rewarding to integrate these interests with information technology. He was Chair of Harvard's undergraduate Literature Concentration from 1989 to 1994, and of Harvard's Classics Department from 1994 to 2000. Currently he is Director of Harvard's Center for Hellenic Studies in Washington, D.C., while teaching half-time at Harvard's Cambridge campus.

James O'Donnell is Provost at Georgetown University. He has published widely on the history and culture of the late antique Mediterranean world and is a recognized innovator in the application of networked information technology in higher education. In 1990, he co-founded Bryn Mawr Classical Review, the second online scholarly journal in the humanities ever created. In 1994, he taught an Internet-based seminar on the work of Augustine of Hippo that reached 500 students. He has served as a Director and as President of the American Philological Association; he has also served as a Councillor of the Medieval Academy of America and has been elected a Fellow of the Medieval Academy. He serves as Delegate of the APA to the American Council of Learned Societies and serves as Chair of the Executive Committee of the Delegates and as a member of the Board of Trustees of the ACLS.

Dot Porter is the Metadata Manager at the Digital Humanities Observatory. She serves on the executive of Digital Medievalist. She is also the chair of the Medieval Academy of America's Committee on Electronic Resources (2006-2009) and has served on the technical council of the Text Encoding Initiative (2006 and 2007). She has worked on many digital projects in medieval studies and classics including the Electronic Boethius and the Homer Multitext project. Her particular research interests are the relationship between text and image in encoding and digital publication, and interdisciplinary collaboration.

Bruce Robertson is Head and Associate Professor of Classics at Mount Allison University in New Brunswick, Canada. His research is divided between the social history of ancient Greece and projects in humanities computing. He lives on a small farm outside of Sackville, New Brunswick with his family and six animals.

Charlotte Roueché is Professor of Late Antique and Byzantine Studies at King's College London. Since the late 1990s she has
been exploring the possibilities for digital publication of inscriptions: for the results see http://insaph.kcl.ac.uk.

Jeffrey A. Rydberg-Cox teaches at the University of Missouri-Kansas City where he is Professor and Chair in the Department of English, Director of the Classical and Ancient Studies Program, and a member of the faculty of the Religious Studies Program and the Computer Science Department. He is the author of two books and more than thirty articles, and he regularly teaches courses on classical mythology and representations of the ancient world in film.

Brent Seales is the Gill Professor of Computer Science and the Director of the Center for Visualization and Virtual Environments at the University of Kentucky. Dr. Seales earned his PhD in Computer Science at the University of Wisconsin. He has been a faculty member at the University of Kentucky since 1991. His central research interest is computer vision and image processing, with applications in digital libraries, medical visualization, and multimedia. He currently heads a research effort to develop visualization and scanning techniques with the goal of reading fragile three-dimensional texts, such as ancient papyrus scrolls that cannot be physically unwrapped.

Rashmi Singhal received a B.S. in Computer Science and Archaeology from Tufts University. She is currently the Lead Programmer at the Perseus Project.

David A. Smith is a Research Assistant Professor in the Department of Computer Science at the University of Massachusetts, Amherst. In between his B.A. in classics from Harvard and his Ph.D. in computer science from Johns Hopkins, he was head programmer for the Perseus Digital Library Project. His research interests lie in several areas of computational linguistics and natural language processing, including machine translation, syntactic parsing, semisupervised learning, and digital libraries.

Neel Smith is Associate Professor of Classics at the College of the Holy Cross, and leads a Technical Working Group at the Center for Hellenic Studies. With Thomas Martin, he co-hosted the meeting where the initial planning for the founding of the Stoa Consortium took place, and was a frequent collaborator with Ross Scaife. He is currently working on a project on the shared interests of Hellenistic literary and scientific scholarship.

Gregory Stump is Professor of English & Linguistics at the University of Kentucky. He earned his Ph.D. in Linguistics from the Ohio State University in 1981. His areas of research specialization include morphological theory, the Indo-Iranian languages, and the Breton language. In recent years, his work has focussed on the development of Paradigm Function Morphology, a realizational theory of inflection in which paradigms are taken to be central to the definition of a language’s morphology; on the use of principal-part analysis as a basis for morphological typology; and on the grammar of the Shughni language. He is the author of Inflectional Morphology: A Theory of Paradigm Structure (Cambridge, 2001) and of numerous articles in linguistics journals and edited volumes. He is currently serving as review editor of Language and as one of the main editors of Word Structure.

Melissa Terras is the Senior Lecturer in Electronic Communication in the School of Library, Archive and Information Studies. With a background in Classical Art History and English Literature, and Computing Science, her doctorate (University of Oxford) examined how to use advanced information engineering technologies to interpret and read the Vindolanda texts. Publications include Image to Interpretation: Intelligent Systems to Aid Historians in the Reading of the Vindolanda Texts (2006, Oxford Studies in Ancient Documents. Oxford University Press) and Digital Images for the Information Professional (2008, Ashgate). She is a general editor of DHQ, the Vice President of the Association for Computers and the Humanities, and an executive member of the Association for Literary and Linguistic Computing. Her research focuses on the use of computational techniques to enable research in the arts and humanities that would otherwise be impossible.

Amir Zeldes is a researcher working and teaching at the Institute for German Language and Linguistics at Humboldt University in Berlin.
He studied Linguistics and Cognitive Science in Jerusalem followed by German, Indo-European and Computational Linguistics in Berlin and Potsdam, before starting his doctorate on quantitative approaches to linguistic productivity. Amir is also working within Collaborative Research Centre 632 on Information Structure in a team developing ANNIS2, a web browser-based search and visualization architecture for richly annotated multilevel corpora.

INDEX

abbreviation, 137, 141, 145, 214 ABBYY FineReader, 138 Abydos, 99 academic publishing, 84 access, 5, 336 Achilles, 193, 381, 438 Adler, Ada, 92 Adobe, 130 Aeneid, 344, 402 Aeschines, 184, 191 Aeschylus, 19, 34, 404, 415 AfricaMap Project, 228 Akkadian, 41, 393 Alexander the Great, 335 Alexandria, 42, 80, 184, 344, 415 Alexandria Digital Library, 241 Gazetteer, 235 Allen, T. W., 61, 73 Altertumswissenschaft, 6 Amazon, 232 American Civil War, 327, 394 American Council of Learned Societies Commission on Cyberinfrastructure for the Humanities and Social Sciences, 253 American History, 327 American Philological Association, 196

Anabasis, 232 Ancient World Mapping Center, 229, 250 annotation, 29, 43, 102, 313, 410 Antenor, 180 Aphrodisias, 204, 207 Aphrodisias project, 161 apographeme, 329 apparatus, 182, 408 Aquinas, Thomas, 13 Arabic, 45, 382, 393, 394 Archaeological Settlements of Turkey Project, 228 archaeology, 205 Archimedes Digital Library Project, 138 Aristarchus, 154, 184 Aristophanes, 98 artifacts, 157 Arts and Humanities Data Service, 245 Arts and Humanities Research Council, 211 Association Internationale d’Épigraphie Grecque et Latine, 210 Association of Research Libraries, 424 Athenaeus, 70

Athenian democracy, 385 Athens, 2, 12 Atom Syndication Format, 244 Augustine, 349 Australia, 44 Austrian Academy, 205 authoritative performance, 188 Barrington Atlas of the Greek and Roman World, 250 Bartleby, 139 Bayesian classification, 311 Bekker, Immanuel, 92 Benedict, Jennifer, 94 Bernhardy, Gottfried, 92 Bible, 91 bibliography, 335 Biblioteca Nazionale Marciana, 75 Biodiversity Heritage Library, 417 blind reviewing, 95 blogs, 2, 3, 27, 30 Boer War, 113 Bona Dea, 62 born digital, 342 Bowman, Alan, 211 brevigraph, 137, 142 British Library, 75 Brunner, Theodore, xvii Bryn Mawr Classical Review, 20, 96 Bulgarian, 276 Busa, Father, 13 Byzantine, 89, 90, 94, 104, 248 Byzantium, 35 Caesar, 314 Calabria, 224 Calchas, 79 Canada, 44

Canonical Text Services, 34, 132, 197, 396 cartography, 228, 237 Cascading Style Sheet, 214 Catholic Encyclopedia, 101 Center for Hellenic Studies, 75, 152, 156, 196, 249, 420 Cervantes Project, 177 Chapel Hill, 210 Chavez, Robert, 247 China Historical GIS, 228 Chinantec, 292 Chinese, 45 chronology, 112, 113 Cicero, 106, 308, 314 CIDOC-CRM, 42, 112, 118, 127, 128, 413 citation, 34, 60, 72, 152, 155, 164, 224, 301, 315, 340, 348, 386, 401, 409 CiteSeer, 333 Citizendium, 100, 106 Classical Atlas Project, 250 COBUILD English Language Dictionary, 300 COBUILD Project, 300 Cocoon, 119, 130 codex, 153 collaboration, 37, 60, 90, 92, 94, 106 collocation analysis, 301 Comaltepec, 292 comedy, 99, 102 commentary, 343, 348, 413 composition-in-performance, 189 computational linguistics, 315, 318 computer science, 318, 327 concordance, 405

conjugation, 266, 279, 280, 282, 285, 286, 288, 290 Constantine, 303 Constantinople, 204 Coptic, 393 copyright, 195, 236 Cornell Greek Epigraphy Project, 209 corpora, 141, 299, 303, 315, 328, 334, 335, 380, 393 Corpus Inscriptionum Latinarum, 207 corpus linguistics, 327, 335 Crane, Gregory, 138, 330, 397 Creative Commons, 150, 196, 420, 424, 429 Crimea, 224 critical editions, 139, 140, 174, 182 Croatian, 42, 393 cross language information retrieval, 394, 401 Crowther, Charles, 211 CSV, 252 cuneiform, 100 customization, 383 Cybereditions Project, 331, 341, 346, 397, 402 cyberinfrastructure, 19, 30, 39, 42, 299, 303, 317, 354, 378, 379, 387, 391, 438, See Cyberinfrastructure, 40 Czech, 314 Dacia, 224 Daniels, Maria, 248 Danube, 224 data mining, 9 DATR, 265

David Rumsey Historical Map Collection, 237 Degree Confluence Project, 246 Demetrius of Phalerum, 192 democracy, 386 Dēmos: Classical Athenian Democracy, 63 Demosthenes, 64 Derrida, Jacques, 390 desinence, 274, 275 DHTML, 120 dictionary, 298, 315, 343, 387 Dictionary of Classical Bibliography, 329 Dictionary of Greek and Roman Antiquities, 101 digital community, 29 Digital Donne, 186 digital edition, 18, 343, 407, 415 Digital Humanities Quarterly, xv, 424, 426 digital image. See digital images, 140, 163, 195, 336, 349, 436 digital incunabula, 20 digital library, 26, 84, 152 digital repositories, 228, 244, 246, 426 digital surrogates, 336 digitization, 136, 206, 236, 326, 340, 354, 417 Diogenes, 137 Diogenes Laertius, 140 Diomedes, 165 Diotima, 37, 46, 96 diphthong, 316 diplomatic, 216 D-Lib Magazine, 229, 423 domain name system, 161 Don Quixote de la Mancha, 177

Donne, John, 177 drama, 91 DTD, 210 Dublin Core Metadata, 117, 118, 157, 158, 247 Dutch, 393 Dynamic Lexicon for Greek and Latin, 399 dynamic mapping, 250 eAqua Classics Text Mining Project, 350 early printed books, 136 ECAI, 247 eClassics, 30, 378, 379, 388 EDINA, 245 editorial mechanism, 95 Egypt, 224 Electronic Archive of Greek and Latin Epigraphy, 209 Electronic Cultural Atlas Initiative, 247 Elexiko project, 301 Encyclopaedia Britannica, 101, 106 encyclopedia, 112 Encyclopedia of Suidas, 91 English, 41, 44, 67, 299, 314, 348, 382, 393, 401 English Literature, 327 ePhilology, 30, 378, 379 Ephorus, 351 EpiDoc, 79, 210, 426 EpiDoc-Aphrodisias Pilot Project, 211 Epigraphic Database Bari, 209 Epigraphic Database Roma, 209, 217 Epigraphische Datenbank Heidelberg, 209, 217 epigraphy, 204

eResearch, 6 eScience, 6 essay, 61, 65 Euclid, 335 Euripides, 415, 428 European Union, 44 EuroWordNet, 409 Evans, Roger, 265 eXist, 115 Farsi, 394 FastiOnline, 229 Fellows, Charles, 204 Flex3, 130 fragment, 69 French, 41, 44, 282, 292, 348, 382, 393, 401 Functional Requirements for Bibliographic Records, 159 Fur, 292 Gainsford, Thomas, 92 GAMERA, 138 Gandy-Deering, John, 205 Gazdar, Gerald, 265 geography, 113 Geography of Slavery in Virginia, 115 GeoNames database, 235 GEONet, 239 geoparsing, 230, See Georgian, 276 GeoRSS, 229, 231, 235, 246, 251 geosearch, 230 German, 41, 44, 348, 382, 393, 401 Getty Thesaurus of Geographic Names, 235, 411 Gibson, Jeffrey, 92 GIS, 84, 242, 244, 251, 252, 436 Gladiator, 196

Glaukos, 165 Global Spatial Data Infrastructure Association, 245 glyphs, 136 Google, 8, 14, 15, 20, 31, 37, 113, 206, 229, 235, 237, 324, 325, 330, 334, 345, 346, 403, 419, 427, 436 Base Application, 168 Google Books, 14, 20, 22, 131, 136, 140, 147, 232, 234, 235, 325, 419, 436 Google Earth, 236, 244 Google Earth Community Bulletin Board, 235 Google StreetView, 238 GPS, 248 Waypoints, 248 grammar, 410 Greek, 105, 148, 152, 176, 204, 304, 340, 345, 379, 404, 435 Modern, 393 Greek Lexicon Project, 300 Greek New Testament, 177 grid, 243, 244, 427 Grosseteste, Robert, 91 Gutenberg, 11 Handbook to Classical Texts, 346 Harper's Dictionary of Classical Antiquities, 101 Harvard Classics Computing Project, 30 Harvard's Center for Geographic Analysis, 228 HBO, 196 Hebrew, 276, 292, 393 Helen, 180 HEML, 253, 412, 413

Herodotus, 2, 112, 325, 352 Hesiod, 379 Hidden Markov Models, 146 high performance computing, 31 Historian's Workstation Project, 114 History of Science, 395 History of the Peloponnesian War, 412 Hittite, 393 Homer, 10, 29, 61, 80, 349, 384, 395 Homer Multitext, 76, 80, 83, 175, 177 Homeric poems, 153 Homeric Question, 174, 193 HTML, 119, 145, 146, 206, 214, 229, 230, 251 Huffman encoding, 288 humanities, 6 Hungarian, 292 Hutton, William, 92 Hyman, Steven E., 419 hyphenation, 137 Iliad, 61, 154, 170, 174, 177, 181, 182, 184, 189, 381 incunabula, 23, 26, 136 index, 74, 101 Index Thomisticus, 299 inflection, 265 inflectional paradigm, 264 infrastructure, 10, 16, 31, 33, 36 inscription, 204, 394 Inscriptiones Latinae Selectae, 209 Institute for Museum and Library Services, 330 Institute for the Study of the Ancient World, 425

institutional repositories, 243 Internet, 15, 68, 112 Internet Archive, 324, 326 Internet Shakespeare Editions, 186, 188 interpretation, 59 Isidore of Seville, 139 Italian, 44, 345, 348, 382, 393, 401 Jacobus de Voragine, 140 Java, 119 Jerome, 309, 314 Johnson, William, xvii Jordan, 224 JSTOR, 20, 38, 67, 341 KATR, 265, 269, 290 keywords in context, 405 KML, 229, 231, 235, 244, 246, 248, 251, 252 knowledge bases, 43, 44 Kosman, Joshua, 397 Küster, Ludolf, 92 Kummer, Robert, 128 Larousse Dictionary, 282 Latin, 136, 148, 204, 265, 282, 283, 285, 299, 302, 304, 306, 309, 328, 335, 340, 353, 379, 382, 395, 404, 435 Legenda Aurea, 138, 149 Leiden System, 205, 206, 213 lemmata, 153, 302 lemmatization, 308 Lewis and Short Latin-English lexica, 351 lexeme, 264, 267 lexicography, 300, 317, 398, 409, See lexicon, 304, 315, 343, 390 libraries, 66, 76, 414

library and information science, 327 Library of Congress, 245 Library of St. Mark, 75 Libya, 224 Liddell-Scott-Jones, 31, 351 Lingala, 276, 292 linkrot, 423 lists, 74 Livy, 13 LOCKSS, 427 Loeb Classical Library, 332, 347 Lord, Albert, 178 Machiavelli, 344 machine learning, 303 machine processing, 326 machine translation, 44, 383, 435 macron, 136 manuscript, 176, 394 mapping, 215 MapQuest, 237 maps, 119 MapServer, 239 markup, 114, 115, 164, 403, 422 Marlowe, 394 mashups, 238 McCarty, Willard, 103 Meadows, David, 92 Mediterranean, 393 Mediterranean Archaeological GIS, 229 memography, 379, 381, 385 Menelaos, 180 metadata, 205, 243, 244, 338, 341, 342 metrical analysis, 400 Microsoft, 436 Microsoft Live Search, 237 Miliarium Lyciae, 225

modelling, 156, 238, 277, 328, 406 Modern Greek, 42 Modern Language Association, 176 monetizable, 242 monograph, 385 Morph, 397 Morpheus, 398 morphology, 35, 264, 265, 267, 276, 292, 340, 397 multiforms, 179, 181, 193 multilingualism, 393 Mycenaeans, 122, 123 Nagy, Gregory, 168, 184, 191 National Digital Information Infrastructure and Preservation Program, 245 National Endowment for the Humanities, 147, 330, 332, 399 National Geospatial Data Clearinghouse, 245 National Geospatial Digital Archive, 245 National Institutes of Health, 331, 418 National Research Council Mapping Science Committee, 241 National Science Foundation, xvii, xix, 40, 391 National Spatial Data Infrastructure, 245 natural language processing, 299, 306, 435 NAVSTAR Global Positioning System, 248 New Greek Lexicon, 300

New Variorum Shakespeare, 348 New Zealand, 44 newspapers, 380, 382, 394 Noah, 103 notation, 273 numismatics, 162 OCLC WorldCat, 232 Odysseus, 180, 193, 438 Odyssey, 174, 177, 184, 189 Old Persian, 393 Olympia, 234 Olympic Register, 63 ontologies, 43, 156, 339, 412 open access, 37, 195 Open Archive Initiative, 131 Open Content Alliance, 331, 346 Open Content Alliance, 22, 136, 140, 147, 209 Open Content Alliance, 429 open source, 37, 218, 326, 426 OpenContext, 229 OpenLayers, 130, 237 Optical Character Recognition, 15, 131, 138, 139, 140, 147, 328, 330, 332, 334, 341, 350, 396, 420 accuracy, 336, 341, 345 quality, 333 oral tradition, 177 ordered hierarchy of content objects, 164 orthography, 224, 316 overlapping hierarchies, 164 overlays, 224 Overview of Greek Civilization, 385 Ovid, 106, 314 Oxford Classical Dictionary, 91

Oxford English Dictionary, 100, 298, 300, 412 Oxford Latin Dictionary, 298 Oxford Reference on-line, 101 Packard Humanities Institute, 340 Packard Humanities Institute Database of Latin, 145 Packard Humanities Institute’s Greek Epigraphy project, 209 Packard, David, xvii, 13, 397 palaeography, 82, 187, 298 Pali, 292 Panciera, Silvio, 210 Pāṇini’s principle, 268 papyri, 183, 191, 394 papyrus, 329 Parry, Milman, 178 parsing, 314 Patara, 225 Pausanias, 204 PDF, 230, 336 peer review, 107, 422, 426 Peisistratidai, 192 Penn Treebank, 313 Pericles, 12, 30, 39, 434 Perseus Digital Library, 13, 24, 25, 26, 31, 39, 73, 96, 101, 128, 139, 145, 147, 239, 302, 306, 310, 340, 341, 351, 352, 353, 388, 397, 399, 401, 404, 420, 436 Perseus Atlas, 239, 249 Perseus Lookup Tool, 240 Perseus morphological analyzer, 149 Persian, 382 personalization, 383 Petrarch, 32

Petrarch, 139 Petronius, 314 Peutinger Map, 225 Phaedrus, 388 Philippus Pincius, 140 philology, 6, 11, 13, 36, 45, 60, 298, 327, 382 phonologics, 275 Pincius, 137 Pindar, 2, 4 pixel, 148 plagiarism, 59 Plato, 2, 8, 183, 184, 344, 350, 378, 382, 404, 421 Plato's Challenge, 388 Pleiades Project, 229, 250 Pliny the Elder, 140 Plone, 244 Plutarch, 72, 351, 408 Poetae Comici Graeci, 73 Polish, 276, 292 Polyclēs, 64 portal, 243 Portus, Aemilius, 92 PostGIS, 240 PostgreSQL, 240 Prague Dependency Treebank, 313 preprints, 34 preservation, 246 Priam, 438 primary source, 84, 131 primary sources, 68, 113, 382 principal-part analysis, 265 print, 13, 16, 19, 30, 31, 34, 35, 41, 68, 84, 174 print culture, 379 print edition, 101, 174, 206, 304, 315, 328, 329, 334, 415, 421 print on demand, 330

pronunciation, 293 Propertius, 314 prosopography, 112, 113, 157, 162, 343, 432 provisioning, 270 Ptolemy, 249 PubMed, 418 QDDB, 93 QuickTime VR, 238, 240 quotation, 383, 402 Ramminger, Johann, 32 reader, 97 reading environment, 5 Real-encyclopädie, 224 Red Sea, 223 Reddy, Sravana, 138 reference, 34, 78 Register of Geographic Entities, 249 Relational Database Management System, 217 Repertoire des bibliothèques et des catalogues de manuscrits grecs, 347 Resource Description Framework, 114, 122 RESTful, 243 Rexa System, 333 robotic scanners, 141 Roman, 152, 316 Romania, 224 Rome, 196 Rosenzweig, Roy, xvii Roth, Catherine, 93 Russian, 393 Rydberg-Cox, Jeff, 399 Sallust, 314, 350 Samnite Wars, 127 sandhi, 272, 281, 283 Sanskrit, 45, 292

Sappho, 415 Scaife Digital Library, 375, 378, 425 Scaife, Ross, xiii, xv, xvii, xxi–xxiv, 46, 57, 89, 98, 114, 152, 197, 211, 247, 253, 377, 378, 391, 425, 439 Scalable Named Entity Services for Classical Studies, 330 schema, 115, 157 scholarly edition, 139, 185 scholia, 154 search, 96, 101, 113 search providers, 242 Sebastian Brant, 140 secondary sources, 68 Semantic MediaWiki, 124, 126 Shakespeare, 176, 187, 327, 384, 394, 408 Sherard, William, 204 shovel ware, 153 sigla, 77 Simile Timeline Widget, 120, 130 Sketch Engine, 301 slavery, 316, 317 Slovak, 292 Smith, David, 240 Smith, Neel, 249, 397 Socrates, 2, 99, 389 Solon, 325 Sophocles, 61, 97, 335, 415 Sora, 292 South Africa, 44 spam filtering, 311 Spanish, 276, 292, 393 SPARQL, 124, 129, 130 spelling, 282, 316 SQL, 124, 218 squeezes, 205

Stoa Consortium for Electronic Publication in the Humanities, xxi, 27, 28, 151, 156, 161, 197, 377, 425 students, 59 Stultifera Navis, 138 Suda, 89, 90, 91 Suda Online, 28, 35, 65, 248 Suetonius, 140 Sumerian, 393 SVG, 120, 130 Swahili, 276 syntax, 43, 312 Syriac, 393, 404 Tacitus, 307 temporal logic, 131 text alignment services, 402 Text Encoding Initiative, 18, 34, 79, 118, 130, 139, 144, 157, 164, 196, 244, 340, 412, 420, 426, 434 text mining, 9, 227, 343, 383, 389, 392, 397, 435 The New Pauly, 91 The Princeton Encyclopedia of Classical Sites, 101 Themistocles, 425 Thesaurus Linguae Graecae, xviii, 19, 20, 21, 30, 32, 72, 79, 302, 329, 340, 411 Thesaurus Linguae Latinae, 31, 298 Thesmophorion, 62 Thetis, 381 Thucydides, 5, 8, 12, 14, 33, 46, 304, 343, 344, 352, 385, 401, 429 Timaeus, 71 timeline, 119 TimeMap, 247

TimeML, 131, 410 Topographical Dictionary of Ancient Rome, 101 Toup, Jonathan, 92 training data, 405 transcription, 78, 82, 139, 141, 148, 328, 337, 342, 397 translation, 92, 94, 95, 103, 106, 306, 348, 353, 387, 400, 402 treebanks, 410, 431, 433, 434 Trojan, 181 Troy, 196, 439 Tunisia, 224 Turing, Alan, 389 Turkish, 276, 394 typeface, 136 U.S. Epigraphy Project, 211 undergraduate research, 45, 58, 59, 83 Unicode, 96, 105, 142, 215 Uniform Resource Names, 166 United States, 44 Unsworth, John, 206 URLs, 158, 243, 423 Vanhoutte, Edward, 185 variant, 78, 194, 343, 413, 431 variation, 179, 188 Variorum, 176 Venetus A, 75, 81, 189, 349, 396, 433 Venetus B, 75 Vergil, 106, 314 version analysis services, 403 Versnel, H. S., 62 VGI, 246, 247, 250 Virtual Earth, 230 visualization, 115, 116, 119, 127, 237, 238, 343, 383, 392, 405 Vitae et Sententiae Philosophorum, 138

Web Ontology, 125 Whitehead, David, 93 wiki, 2, 27, 30 Wiki, 93, 95, 105 Wikimapia, 246 Wikipedia, xviii, 3, 28, 35, 65, 70, 90, 94, 100, 101, 106, 429 WikiWikiWeb, 93 Wissenschaft, 6 word forms, 277 word frequency, 301 word sense disambiguation, 310, 312, 340, 400 word sense discovery, 340, 398 WordNet, 132, 409 World Wide Web, 112 World Wide Web Consortium, 157

WYSIWYG, 338 Xenophon, 232, 351 Xerox PARC Map Viewer, 237 XHTML, 157 XML, 79, 81, 114, 115, 117, 118, 122, 129, 130, 145, 146, 157, 196, 210, 213, 217, 219, 251, 307, 342, 420, 422, 424, 426, 436 XML namespaces, 158 XQuery, 122 XSLT, 122 Yahoo, 237, 436 Yiddish, 292 Zenodotus, 80, 182 Zotero, 112