

INTERNATIONAL INFORMATICS SERIES 8

Editors in Chief

Evangelos Kranakis
Carleton University, School of Computer Science
Ottawa, ON, Canada K1S 5B6

Nicola Santoro
Carleton University, School of Computer Science
Ottawa, ON, Canada K1S 5B6

Consulting Editors

Frank Dehne
Carleton University, School of Computer Science
Ottawa, ON, Canada K1S 5B6

Danny Krizanc
Carleton University, School of Computer Science
Ottawa, ON, Canada K1S 5B6

Jörg-Rüdiger Sack
Carleton University, School of Computer Science
Ottawa, ON, Canada K1S 5B6

Jorge Urrutia
Carleton University, School of Computer Science
Ottawa, ON, Canada K1S 5B6

Series Editor

John Flood
Carleton University Press, Carleton University
Ottawa, ON, Canada K1S 5B6

International Informatics Series 8 Ricardo Baeza-Yates (Ed.)

FOURTH SOUTH AMERICAN WORKSHOP ON STRING PROCESSING WSP 1997

4th Workshop, WSP ’97 Valparaiso, Chile, November 12-13, 1997 Proceedings CARLETON UNIVERSITY PRESS

Copyright © Carleton University Press, Inc. 1997 Published by Carleton University Press. The publisher would like to thank the Vice-president (Academic), the Associate Vice-President (Research), the Dean of Science, and the School of Computer Science at Carleton University for their contribution to the development of the Carleton Informatics Series. Carleton University Press would also like to thank the Canada Council, the Ontario Arts Council, the Government of Canada through the Department of Canadian Heritage, and the Government of Ontario through the Ministry of Culture, Tourism and Recreation, and the Ontario Arts Council. Printed and bound in Canada.

Canadian Cataloguing in Publication Data

South American Workshop on String Processing (4th: 1997: Valparaiso, Chile)
Fourth South American Workshop on String Processing (WSP '97)
(International informatics series ; 8)
Conference held Nov. 12-13, 1997.
Includes bibliographical references.
ISBN 0-88629-338-3
1. Text processing (Computer science)—Congresses. I. Baeza-Yates, Ricardo A. II. Title. III. Series.
QA76.9.T48S68 1997    005    C97-901056-X

CONTENTS

Preface  vii

Generalized Pattern Matching: the Case of Swaps (abstract of invited talk)
Amihood Amir  1

Large Text Searching Allowing Errors
Marcio Drumond Araujo, Gonzalo Navarro and Nivio Ziviani  2

Proximity Queries in Metric Spaces
Edgar Chavez and Jose Luis Marroquin  21

Suffix Tree Constructions: New Techniques and Optimal Algorithms (abstract of invited talk)
Martin Farach  37

A General Technique to Improve Filter Algorithms for Approximate String Matching
Robert Giegerich, Stefan Kurtz, Frank Hischke and Enno Ohlebusch  38

Distributed Generation of Suffix Arrays: A Quicksort Based Approach
Joao Paulo Kitajima, Gonzalo Navarro, Berthier Ribeiro-Neto and Nivio Ziviani  53

Transposition distance between a permutation and its reverse
Joao Meidanis, Maria Emilia Walter and Zanoni Dias  70

Practical Use of the Warm-up Algorithm on Length-Restricted Coding
Ruy Luiz Milidiu, Artur Alves Pessoa and Eduardo Sany Laber  80

Indexing Compressed Text
Edleno S. de Moura, Gonzalo Navarro and Nivio Ziviani  95

A Partial Deterministic Automaton for Approximate String Matching
Gonzalo Navarro  112

Multiple Approximate String Matching by Counting
Gonzalo Navarro  125

Asymptotic estimation of the average number of terminal states in DAWGs
Mathieu Raffinot  140

On the multi backward DAWG matching algorithm (MultiBDM)
Mathieu Raffinot  149

Approaching the dictionary in the implementation of a natural language processing system: toward a distributed structure
Vera Lucia Strube de Lima, Paulo Ricardo Carneiro Abrahao and Ivandre Paraboni  166

String Databases and Finite Multitape Automata (abstract of invited talk)
Esko Ukkonen  179

An Algorithm for Graph Pattern-Matching
Gabriel Valiente and Conrado Martínez  180

PREFACE

The Fourth South American Workshop on String Processing (WSP '97) was held in Valparaiso, Chile, on November 12-13, 1997, as part of a bigger event, the XXIII Latin-American Conference in Informatics and the XV International Conference of the Chilean Computer Science Society, locally organized by the Technical University Federico Santa Maria. The First, Second, and Third Workshops were held in Belo Horizonte, Brazil, in September 1993, in Valparaiso, Chile, in April 1995, and in Recife, Brazil, in August 1996, respectively.

This volume contains all thirteen contributed papers presented at the workshop, together with two abstracts of invited speakers. The topics of the papers include Text Searching, Similarity Searching, Computational Biology, Graph Pattern Matching, and Natural Language Processing.

Four invited talks were also presented at the workshop by the following people: Amihood Amir (Georgia Tech, USA & Bar-Ilan University, Israel), Martin Farach (Bell Labs., USA), Robert Meersman (Vrije Universiteit Brussel, Belgium), and Esko Ukkonen (U. of Helsinki, Finland). We also had one tutorial on computational biology given by Joao Meidanis (UNICAMP, Brazil).

The program committee was composed of Amihood Amir (Georgia Tech, USA & Bar-Ilan University, Israel), Richard Arratia (USC, USA), Ricardo Baeza-Yates (Univ. de Chile, Chile, chair), Daniel Corach (Univ. de Buenos Aires, Argentina), Gaston Gonnet (ETH, Switzerland), Joao Meidanis (UNICAMP, Brazil), Esko Ukkonen (Univ. of Helsinki, Finland), and Nivio Ziviani (Univ. Fed. de Minas Gerais, Brazil). We would like to thank the members of the program committee, as well as A. Bassi, S.T. Klein, J. da Mata, E. de Moura and G. Navarro for the valuable help they offered in refereeing some of the submissions. The criteria for selection were based primarily on quality; we also considered relevance, clarity and the potential benefit to the community.

We would like to thank the CS Dept. of the Technical University Federico Santa Maria and the Chilean Computer Science Society, and in particular Gonzalo Navarro, for helping with the local organization. The workshop has been supported by the following organizations: Red Iberoamericana de Tecnologia del Software (RITOS/CYTED), Project CYTED VII.13 (AMYRI), and the Department of Computer Science of the University of Chile. Their support is gratefully acknowledged.

Finally, we wish to thank Carleton University Press for publishing this volume, in particular John Flood and Jennie Strickland.

Ricardo Baeza-Yates
Valparaiso, Chile, November 1997.

Generalized Pattern Matching: the Case of Swaps

Amihood Amir 1,2,3 (amir@cc.gatech.edu)

1 College of Computing, Georgia Institute of Technology, Atlanta, GA 30332-0280, USA; 2 Department of Computer Science, Bar-Ilan University, 52900 Ramat-Gan, ISRAEL; 3 Research partially supported by NSF grant CCR-96-101709, the Israel Ministry of Science and the Arts grant 6297 and a 1997 Bar-Ilan University Internal Research Grant.

Abstract. Many different meanings can be assigned to the words "generalization of pattern matching". One such possibility is that of allowing "local" errors. Intuitively, we group under the label "local" errors that take place in a bounded location, as opposed to changes that permeate the entire data (e.g. scaling, rotation). Specifically, consider the most limited local error, the mismatch. This error occurs in a single symbol and affects only its location. In contrast, insertions and deletions have a global effect, although the error itself is confined to a single location. In between lies the swap. It is an error that occurs locally and generally has only a local effect. However, it may cause a ripple with far-ranging effects. We discuss the known upper bounds for matching with various local errors. We describe several issues and techniques, with special emphasis on new results concerning swaps.

This article was processed using the LaTeX 2e macro package with CUP.CS class

Large Text Searching Allowing Errors

Marcio Drumond Araujo 1   Gonzalo Navarro 2,3   Nivio Ziviani 1,3

1 Depto. de Ciência da Computação, Universidade Federal de Minas Gerais, Brazil. 2 Depto. de Ciencias de la Computación, Universidad de Chile, Chile. 3 This work has been supported by Brazilian CNPq Project 520916/94-8, Project RITOS/CYTED and Chilean Fondecyt grants 1960881 and 1950622. E-mail: [email protected], [email protected], [email protected]

Abstract. We present a full inverted index for exact and approximate string matching in large texts. The index is composed of a table containing the vocabulary of words of the text and a list of positions in the text corresponding to each word. The size of the table of words is usually much less than 1% of the text size and hence can be kept in main memory, where most query processing takes place. The text, on the other hand, is not accessed at all. The algorithm permits a large number of variations of the exact and approximate string search problem, such as phrases, string matching with sets of characters (range and arbitrary set of characters, complements, wild cards), approximate search with nonuniform costs and arbitrary regular expressions. The whole index can be built in linear time, in a single sequential pass over the text, takes near 1/3 of the space of the text, and retrieval times are near O(√n) for typical cases. Experimental results show that the algorithm works well in practice: for a one-gigabyte text collection, all matches of a phrase of 3 words allowing up to 1 error can be found in approximately 6 seconds, and allowing no errors they can be found in under half a second. This index has been implemented in a software package called Igrep, which is publicly available. Experiments show that Igrep is much faster than Glimpse in typical queries.

1 Introduction

The full text model in information retrieval (IR) is gaining popularity. In this model, documents are represented by their complete full texts. The user expresses his information needs by providing strings to be matched and the information system retrieves those documents containing the user-specified strings. When the text collection is large it demands specialized index techniques for efficient text retrieval. A simple and popular indexing technique is the inverted list. It is especially adequate when the pattern to be searched for is formed by simple words. This is a common type of query, for instance when searching the World Wide Web, and therefore inverted lists have been widely used in that context.

One weakness of commercially available large text searching systems is the need for exact spelling, due to the use of hashing or tree structures in the index. However, in many situations the pattern and/or the text are not exact, due to optical character recognition, typing or misspelling errors, or because we are looking for approximate patterns. For example, a name we are looking for may be misspelled in the text or we may not remember its exact spelling. The approximate text searching problem is to find all substrings in a text database that are at a given "distance" k or less from a pattern p. The distance between two strings is the minimum number of insertions, deletions or substitutions of single characters in the strings that are needed to make them equal. The case k = 0 corresponds to the classical exact matching problem.

The classical solution for approximate searching takes O(mn) time, where m is the size of the pattern and n is the size of the text [Sel80]. Since the beginning of the eighties there is a long list of papers on the subject, of which [BYG92, WM92, CL92, ST95, BYP92, WMM96, BYN96a, Nav97] is a partial list of the most recent ones. From the practical point of view, an important new paradigm called bit-parallelism was developed by Baeza-Yates and Gonnet [BYG92]. In their algorithm the state of the search is represented as a number and only bitwise logical operations, shifts and additions are used. Wu and Manber [WM92] extended this numeric scheme to deal with the more general approximate string matching problem under some editing distance. They present an O(kn) algorithm (where k is the number of errors) that supports a large number of variations of the problem. Recently, this algorithm has been improved to O(n) for small patterns (e.g. up to 9 letters on a 32-bit architecture) [BYN96a].

On the other hand, the problem of finding good indexing schemes that allow approximate searching was considered in [WM92, BY92] the unresolved problem in this area. There are many different linear time approximate string matching algorithms, but only recently has some work been done for the case when the text is large and an index must be built to speed up the search. We can distinguish two different indexing models. The first is capable of retrieving any substring of the text whose edit distance to the pattern is sufficiently small. The second retrieves only complete words whose edit distance to the pattern is small enough. For instance, only the first model will find "shallow" with one error in the text "...sha llow...", although it will also find that pattern in the text "...hash allows...", which we probably do not want. Although the first model is more general, the second one may be better suited for IR purposes on natural language text. Moreover, most indices for the first model are still in a preliminary stage: indices are too large and no disk storage strategies have been devised yet. The implementations are in general very primitive prototypes. Examples of these indices are [Ukk93, Cob95, BYNST96, ST96, LST96, Mye94].

This work focuses on word-retrieving indices. One successful attempt to solve this problem was presented by Manber and Wu [MW93] in a system called Glimpse. They propose a two-level information retrieval structure that combines a partial inverted file with sequential searching. They divide the text into nearly 256 blocks of the same size and build an index of all different words plus a list of the blocks where each word appears. Approximate queries are handled by first using an on-line algorithm (Agrep [WM92]) on the vocabulary to find all words in the index that match approximately with the pattern, and then the corresponding blocks are searched, using Agrep again, to find the particular matches. In the worst case, it may be necessary to search all the blocks, which makes Glimpse adequate for use with intermediate large text collections (say up to 200 megabytes). Baeza-Yates and Navarro [BYN97] study an alternative scheme where the text is not searched for the approximate pattern but with a multipattern search of all the words in the vocabulary that matched the pattern. They also prove that it is possible to have an index which is sublinear in space and time simultaneously, and study the practical effect of the block size.

In this paper we present an efficient word-retrieving indexing scheme for large text searching, which is fast at indexing and querying time and has the capability of searching exactly or allowing errors in the pattern and/or in the text. The index can be built in O(n) time and takes O(n) space. Querying performance is near O(√n) time. The implementation of the algorithm has been tested successfully for files with more than 1 gigabyte of text. It supports a large number of variations of the approximate string search problem. In addition to single words and phrases, the system supports string matching with sets of characters (range and arbitrary set of characters, complements, wild cards), nonuniform costs and arbitrary regular expressions. The algorithms presented in this paper are being used in a software package called Igrep. Igrep is an approximate matching tool for very large text collections. The software package is a prototype in its version 1.0, which is available from ftp://dcc.ufmg.br/pub/research/~nivio/igrep.
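
To make the bit-parallel idea concrete, the following is a minimal sketch (our own illustration in Python, not the authors' code) of the Wu-Manber style extension of the Shift-And algorithm: one bit mask per error level is updated with shifts and bitwise operations for every text character, which is what yields O(n) behavior for patterns that fit in a machine word.

    def approx_search(pattern, text, k):
        # Bit-parallel NFA simulation: R[e] holds, as a bit mask, the pattern
        # prefixes that match with at most e errors ending at the current
        # text position.  Python integers stand in for machine words.
        m = len(pattern)
        B = {}                                   # per-character match masks
        for i, c in enumerate(pattern):
            B[c] = B.get(c, 0) | (1 << i)
        full, match_bit = (1 << m) - 1, 1 << (m - 1)
        R = [(1 << e) - 1 for e in range(k + 1)]  # e initial errors allowed
        hits = []
        for pos, c in enumerate(text):
            mask = B.get(c, 0)
            old = R[0]
            R[0] = ((R[0] << 1) | 1) & mask
            for e in range(1, k + 1):
                prev = R[e]
                # terms: match, insertion, substitution, deletion, initial state
                R[e] = (((prev << 1) & mask) | old | (old << 1)
                        | (R[e - 1] << 1) | 1) & full
                old = prev
            if R[k] & match_bit:
                hits.append(pos)                 # occurrence ends here
        return hits

    # e.g. approx_search("text", "a texst and a tent", 1) reports end positions

In C the masks would be machine words, which is why the O(n) guarantee holds only for patterns of up to about 9 letters on a 32-bit architecture.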


2 Structure of the Index

We present an index based on the traditional inverted list model. We view a text file as a sequence of words, separated by the usual delimiters (e.g. space, end-of-line, period, comma). We scan the whole text, word by word, build a table containing all the different words (the vocabulary) of the text and store every occurrence of each word on a list. The list of occurrences of each word is kept in order by position in the text. Figure 1 illustrates the structure of the index for an example of a text database with six words. Each entry of the table of words contains a word and a pointer to the end of its list of occurrences. A brief description of the index construction can be found in Section 3.

[Fig. 1. Structure of the index: the table of words (kept in main memory) lists the vocabulary of the example text — a, example, of, text — and each entry points into the list of occurrences (kept on disk); the six words of the example text occur at positions 1, 3, 8, 16, 19 and 21.]

To answer a query the searching procedure needs only the table of words and the list of occurrences, so the text itself is not necessary at all. The size of the vocabulary of any large literary text is very small compared to the size of the text, and so the table of words can be kept in main memory all the time (more about the size of the vocabulary can be found in Sections 3.2 and 5). For a single word pattern we just perform a search in the table of words for the list of occurrences that contains all the matches of the pattern. When the pattern is more than one word long (a phrase pattern) we first search the table for each word of the pattern and retrieve the corresponding lists of occurrences. Next, we obtain the intersection of the lists, looking for pointers that have the same relative positions they share in the pattern, thus obtaining the final answer.

To illustrate the searching procedure we present two examples. Exact searching for the pattern text in Figure 1 involves binary searching the table of words for the list interval (5,6). To search for the pattern text sample with editing distance k = 2 in Figure 1 we search with k ≤ 2 errors the first word text of the pattern and obtain one list interval (5,6) for k = 0. Next, we search with k ≤ 2 errors the second word sample of the pattern and obtain the word example, corresponding to the list interval (3,3) for k = 2. Now we end up with the two lists {8} and {3,21} corresponding to the list intervals (3,3) and (5,6). The final answer is the list {3}, the result of the intersection of the two lists, given that text and sample are at the proper distance in the pattern. In general we consider all lists related to each word of the query such that the total sum of errors is ≤ k. In the pattern text sample we had one list related to the first word text with k = 0 and one list related to the word sample with k = 2.
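
A minimal sketch of the structure just described (our own illustration, not the Igrep sources): the vocabulary maps each word to its ordered list of occurrences, and a single-word query never touches the text. For simplicity the sketch stores word positions (1, 2, 3, ...) rather than text positions.

    import re

    def build_index(text):
        # Vocabulary: word -> ordered list of word positions.
        # Positions are appended in text order, so each list is born sorted.
        index = {}
        for pos, word in enumerate(re.findall(r"[A-Za-z]+", text), start=1):
            index.setdefault(word, []).append(pos)
        return index

    index = build_index("example of a text ... text")
    print(index["text"])          # -> [4, 5]: answered from the table alone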

3 Index Construction

The procedure to build the index works as follows. We scan the text, word by word, find each word in a hash table and insert its text position at the end of the corresponding occurrence list. If a word is not present in the hash table, it is inserted and the corresponding occurrence list is initialized. The index is stored on disk in the format shown in Figure 1. However, the resulting index most probably will not fit in main memory. When the main memory is exhausted, we store the partial index as if it were the complete final index. This partial index is called a dump. We then continue the process starting from scratch with a new dump. Once we complete this process, we merge the dumps. Merging two dumps involves concatenating the lists of occurrences of each word, which takes linear time. Partial dumps are merged until the complete index is obtained. We can merge r dumps in a single process, in a fashion very similar to r-way list merging, in O(n log₂ r) time (e.g. using a heap). We tested different values of r and, although larger values produce better times in a reasonable range, the overall differences are too small to take into account. We perform an in-place merging as described in [MB95].
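
A sketch of the dump-and-merge step, assuming for simplicity that each dump is an in-memory list of (word, positions) pairs sorted by word (our own simplification; the real system streams the dumps from disk and merges the occurrence lists in place [MB95]):

    import heapq

    def merge_dumps(dumps):
        # r-way merge driven by a heap: O(n log2 r) for r dumps of n entries
        # in total.  Lists for the same word are simply concatenated, because
        # later dumps always hold later text positions.
        merged = []
        for word, positions in heapq.merge(*dumps):
            if merged and merged[-1][0] == word:
                merged[-1][1].extend(positions)
            else:
                merged.append((word, list(positions)))
        return merged

    # dumps produced each time memory fills up, each sorted by word:
    d1 = [("data", [3]), ("text", [1, 8])]
    d2 = [("index", [12]), ("text", [15])]
    print(merge_dumps([d1, d2]))
    # -> [('data', [3]), ('index', [12]), ('text', [1, 8, 15])]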

3.1 Time to Build the Index

The cost to search a word in the hash table is O(1) on average. As the text addresses always increase as the scanning goes on, the insertions in the lists of occurrences always happen at the end of the lists, at O(1) cost. Thus, the total CPU cost to build the dumps is O(n) on average. By keeping the words of the vocabulary in a trie instead of a hash table, the time cost can be made O(n) in the worst case. This is because, for each one of the O(n) characters of the text, we work O(1) in the trie.


We now analyze merging times. Let M be the amount of main memory available. Given that we can merge groups of r dumps in a single process, we can divide the n/M dumps into groups of r, merge each group and obtain n/(Mr) groups of larger dumps. This process is repeated until we have only one final index, as shown in Figure 2.

[Fig. 2. The process of merging three dumps each time: the original dumps, the groups obtained after a 3-way merge, and the final index.]

Since the time to merge r groups of size M each is O(Mr log₂ r), the total amount of work in the first level is O(n log₂ r), which is the same for each iteration. Since there are log_r(n/M) iterations, the total amount of time is O(n log₂(n/M)), which is independent of r. The value of r affects disk times, although the effect is barely noticeable. Therefore the algorithm is O(n log n) on average. However, it can be made O(n) in the worst case. If instead of dumping and merging we keep a separate file for each word in the vocabulary, for each word in the text we must add an occurrence to the end of its file, at O(n) total cost. However, except for huge texts, dumping and merging is more practical because it avoids random accesses to disk. The algorithm could decide which strategy to employ based on the text size, this way keeping O(n) all the time as well as choosing the fastest strategy for each case.
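
As a one-line check of this claim (our own restatement, not part of the original text), the number of levels times the work per level indeed cancels the dependence on r:

\[
\underbrace{\log_r \frac{n}{M}}_{\text{levels}} \cdot \underbrace{O\!\big(n \log_2 r\big)}_{\text{work per level}}
= O\!\left(n \,\frac{\log_2 (n/M)}{\log_2 r}\,\log_2 r\right)
= O\!\left(n \log_2 \frac{n}{M}\right).
\]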

3.2 Space for the Index

It is empirically known that the vocabulary of a text with n words grows sublinearly. Moreover, the following relation holds very accurately [Hea78]

    V = K n^β = O(n^β)    (1)

where V is the size of the vocabulary and 0 < β < 1 is a constant dependent on the particular text. We show later an experimental verification of this fact. Hence, the larger part of the index is the list of occurrences, which is O(n). Stop words represent approximately 30-40% of the text (see Section 5.2 for the definition of stop words). For each non-stop word, we store a pointer (4 bytes is enough in most cases), while the length of non-stop words is approximately 6-7 characters. This fact (which we later verify experimentally) yields 0.35n, i.e. a 35% overhead over the text.
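
The constants K and β of Eq. (1) can be estimated from any text sample by a least-squares fit in log-log scale; a small sketch of such an estimation (ours, not part of the original experiments) follows. It assumes a reasonably long text so that the fit is meaningful.

    import math, re

    def heaps_fit(text, samples=20):
        # Fit V = K * n^beta: regress log V on log n at increasing text prefixes.
        words = re.findall(r"[A-Za-z]+", text)
        xs, ys, seen = [], [], set()
        step = max(1, len(words) // samples)
        for i, w in enumerate(words, 1):
            seen.add(w)
            if i % step == 0:
                xs.append(math.log(i)); ys.append(math.log(len(seen)))
        n = len(xs)
        mx, my = sum(xs) / n, sum(ys) / n
        beta = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
               sum((x - mx) ** 2 for x in xs)
        K = math.exp(my - beta * mx)
        return K, beta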


4 Querying In our system there are basically two types of patterns: one word patterns and phrase patterns. In each case we can look for exact and approximate occurrences of the pattern in the text. Each of these four combinations involves quite different algorithms and tasks to be performed. Next, we describe the most representative combinations derived from the two basic types of patterns.

4.1 One Word Patterns

The most important characteristic of one word patterns is that only the vocabulary is consulted and the list or lists of occurrences are immediately retrieved. For example, simply searching a word retrieves its list of occurrences, while searching for a word allowing errors or for a regular expression may retrieve more than one list, as more than one word of the vocabulary may match the query. Searching on the vocabulary can be binary or sequential. Exact searching of a word involves a binary search on the vocabulary. Searching a regular expression or approximate searching of a word involves a sequential search on the vocabulary. For simple patterns allowing k errors we use the algorithm of [BYN96a], which is O(n) for small patterns and extremely fast in practice. The algorithm is based on an automaton whose behavior is simulated in O(1) per inspected character for short patterns. In a 32-bit architecture, words of length up to 9 can be searched in O(n) with any number of errors, and up to length 11 with one error. This is good for our purposes, because most words are not longer than 9 letters in practice. Our experiments show that approximate searching on the vocabulary always takes less than a second with this algorithm.

In [BYN96a] a number of techniques are developed to cope with longer patterns. However, we take a different approach here. Since the few words longer than 9 letters will have only a few more characters, we truncate them to the first 9 characters and use the algorithm as a filter. Each occurrence reported by the filter is checked with dynamic programming to verify whether it involves a real match of the complete word. As shown in [BYN96a], the number of verifications is extremely low if the error ratio is reasonably small. It is also shown that there is an abrupt division in the domain of error ratios: there is a point such that any query allowing more than that error ratio will retrieve a huge amount of information. Since this is of no use in terms of information retrieval (because of lack of precision) we focus only on the case of lower error ratios. It is possible to estimate beforehand the size of the result (to give the user early feedback on the precision of his query) at very low cost.

This automaton can have not only single letters in the pattern, but any set of characters at each position. This allows our system to support very efficiently the following extended queries (exactly or allowing errors):

— range of characters (e.g. t[a-z]xt, where [a-z] means any letter between a and z);
— arbitrary sets of characters (e.g. t[aei]xt, meaning the words taxt, text and tixt);
— complements (e.g. t[~ab]xt, where ~ab means any single character except a or b; t[~a-d]xt, where ~a-d means any single character except a, b, c or d);
— arbitrary characters (e.g. t.xt means any character as the second character of the word);
— case insensitive patterns (e.g. Text and text are considered as the same word).

For more complicated patterns, allowing k errors or not, we use the algorithm of [WM92], which is O(kn) (and O(n) with no errors). Processing the vocabulary with this algorithm typically takes 1-4 seconds. In addition to single strings of arbitrary size and the classes of characters described above, the system supports patterns combining exact matching of some of their parts and approximate matching of other parts, an unbounded number of wild cards, arbitrary regular expressions, and combinations, as follows:

— unions (e.g. t(e|ai)xt means the words text and taixt; the expression t(e|ai)*xt means the words beginning with t, followed by e or ai zero or more times, followed by xt). In this case the word is seen as a regular expression;
— arbitrary number of repetitions (e.g. t(ab)*xt means that ab will be considered zero or more times). In this case the word is seen as a regular expression;
— arbitrary number of characters in the middle of the pattern (e.g. t#xt, where # means any character considered zero or more times). Note that # is equivalent to .* (e.g. t#xt and t.*xt obtain the same matches, but the latter is considered as a regular expression). In this case the word is not considered as a regular expression, for efficiency, because the treatment of a regular expression generally demands more bitwise operations than the # case;
— combining exact matching of some of their parts and approximate matching of other parts (e.g. <te>xt, with k = 1, meaning exact occurrence of te followed by any occurrence of xt with 1 error);
— matching with nonuniform costs (e.g. the cost of insertions can be defined to be twice the cost of deletions).
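
A sketch of the filtering approach for long words described at the beginning of this section (our own illustration): the pattern is truncated to its first 9 letters, each vocabulary word is filtered with the bit-parallel routine sketched in the Introduction (approx_search), and surviving candidates are verified with the textbook dynamic programming computation of edit distance.

    def edit_distance(a, b):
        # Classical O(|a||b|) dynamic programming, in the spirit of [Sel80].
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            cur = [i]
            for j, cb in enumerate(b, 1):
                cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                               prev[j - 1] + (ca != cb)))
            prev = cur
        return prev[-1]

    def search_vocabulary(pattern, vocabulary, k, prefix_len=9):
        short = pattern[:prefix_len]
        hits = []
        for word in vocabulary:
            if approx_search(short, word, k):          # cheap bit-parallel filter
                if edit_distance(pattern, word) <= k:  # full verification
                    hits.append(word)
        return hits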

4.2 Phrase Patterns

For patterns containing more than one word we search each word separately on the vocabulary and then intersect the lists of occurrences. Each word of the phrase can be a simple word or a complex regular expression, and can allow errors as in Section 4.1. Exact searching of a phrase involves searching each word on the vocabulary and intersecting the lists of occurrences. The final answer contains the intersection of the lists, i.e. those positions in the text that keep the same relative positions the words have in the pattern. It is also possible to search a phrase allowing k errors in the whole phrase. This involves sequentially searching each word on the vocabulary with k errors and intersecting the lists of occurrences, taking care of the total number of errors. We keep a list of matches for each word and each number of errors and intersect each combination whose total number of errors is less than or equal to k. For each word of the pattern a different algorithm is chosen, according to the many possibilities described in the previous section. The intersection of many lists is carried out as follows: the shortest list is selected as a first version of the result. Then, it is intersected with each other list by binary searching the elements of the shorter list inside the other (taking care of the positions of the words in the text). This works well because, as shown in the Appendix, it is very probable that one of the lists is very short.
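
A sketch of the intersection step for exact phrase queries (ours, assuming the word-position index of the earlier sketch): the shortest list drives the search, and each candidate is confirmed by binary searching the shifted position in every other list.

    from bisect import bisect_left

    def phrase_query(index, words):
        # index: word -> sorted list of word positions; words: the phrase.
        lists = [index.get(w, []) for w in words]
        if not all(lists):
            return []
        short = min(range(len(words)), key=lambda i: len(lists[i]))
        result = []
        for pos in lists[short]:
            start = pos - short                    # position of the first word
            ok = True
            for offset, occ in enumerate(lists):
                if offset == short:
                    continue
                target = start + offset
                j = bisect_left(occ, target)       # binary search in the other list
                if j == len(occ) or occ[j] != target:
                    ok = False
                    break
            if ok:
                result.append(start)
        return result

    # e.g. phrase_query(build_index("of a text"), ["a", "text"]) -> [2]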

4.3 Time to Answer a Query

In the Appendix we analyze each type of query. We use α as a shorthand for 1 − β, and observe that 0 < α < 1. In natural language text β is between 0.4 and 0.6, hence α ≈ β (see Section 5). The results are approximate (since the text models are only approximations) and valid for queries that have a reasonable degree of precision (i.e. queries useful to the user). As explained in the Appendix, γ is related to the number of allowed errors and is typically in the range 0.1 to 0.2.

— Simple words: O(log n).
— Phrases of simple words: o(n^β) for two words, O(log n) for longer phrases.
— Extended patterns, regular expressions and approximate words: O(n^β + n^{α+γ} log n).
— Phrases of the above patterns: O(n^β + n^{α+γ} log n).
— Approximate phrase matching: O(n^β + n^{α+γ} log n).

Therefore, except for some types of exact searches, retrieval times are in the range O(n^{0.4}) to O(n^{0.8}), depending on the vocabulary size and the complexity of the search. In reasonable cases it is O(n^{0.6}), which is near O(√n). We also point out that the disk accesses to the index are sequential (except for buffering limitations).

5 Experimental Results

For the experimental results we used literary texts from the 2-gigabyte TREC collection [Har95]. We have chosen the following texts: AP Newswire (1989); DOE, short abstracts from DOE publications; FR, Federal Register (1989); WSJ, Wall Street Journal (1987, 1988, 1989); and ZIFF, articles from Computer Selected disks (Ziff-Davis Publishing). We also derived two other larger files by putting together the AP plus ZIFF texts (called the AZ text file) and the AP plus DOE plus FR plus WSJ plus ZIFF texts (called the ADFWZ text file). Our objective here is to obtain two large files containing 458.2 megabytes and 1.09 gigabytes, respectively. Table 1 presents some statistics about the seven text files. For the WSJ file the vocabulary size (in bytes) is 0.58% of the text size and the number of words of the vocabulary is 0.49% of the total number of words. For our experiments we considered a word as a contiguous string of characters in the set {A..Z, a..z} separated by other characters not in the set {A..Z, a..z}.

The performance evaluation of the algorithms presented in the previous sections was obtained by means of 500 trials to query different text files and 20 repetitions to build indices. This gives a confidence interval of 95% for our measures. The experiments show that our index is very efficient even for very large text files. All tests were run on a SUN SparcStation 4 with 128 megabytes of RAM running Solaris 2.5.1.

Files    Text size (bytes)  Text words   Vocab. size (bytes)  Vocab. words  Vocab./Text size  Vocab./Text words
AP       237,766,005        37,740,089   1,530,192            201,115       0.64%             0.53%
DOE      180,515,212        27,124,239   1,795,783            211,196       0.99%             0.78%
FR       219,987,476        32,000,223   1,043,869            132,129       0.47%             0.41%
WSJ      262,757,554        40,741,508   1,511,951            198,818       0.58%             0.49%
ZIFF     242,660,178        38,047,824   1,639,677            216,482       0.68%             0.57%
AZ       480,426,183        75,787,913   2,574,518            336,716       0.54%             0.44%
ADFWZ    1,143,686,425      175,653,883  4,629,371            573,661       0.40%             0.32%

Table 1. Text files from the TREC collection

5.1 Time to Build the Index

Table 2 presents the times to build the index for three different files containing 250.6, 458.2 and 1090.7 megabytes of text, respectively. The third column shows the time devoted to merging. In this case the times were obtained for a 2-way merge (i.e. r = 2). As can be seen, the indexing times are almost linear in the size of the text. In our machine, indexing performance is near 4 megabytes per minute.

File     Size (megabytes)  Total time (min)  Merge time (min)  Mb/min
WSJ      250.6             58.5              13.8              4.28
AZ       458.2             122.7             33.9              3.73
ADFWZ    1090.7            248.9             79.8              4.38

Table 2. Experimental results to build the index

5.2 Space for the Index

Table 3 presents the worst case and average case (n/V) for the sizes of the lists of occurrences for the texts AP, DOE, FR, WSJ, ZIFF, AZ and ADFWZ. Note that in all seven texts the largest list of occurrences corresponds to the word the. The majority of the most common words in natural languages are function words (also called stop words), whose purpose is mainly syntactical and which do not carry enough content to occur alone in a query. An interesting study of English texts by Miller, Newman and Friedman [MNF58] classifies the words into function words (articles, prepositions, pronouns, numbers, conjunctions and auxiliary verbs) and content words (nouns, verbs, adjectives and most adverbs).

Files    Words Text (n)  Words Voc. (V)  Most freq. word        n/V   Index size (bytes)  Index/Text
                                         Word    Occ.
AP       37,740,089      201,115         the     2,077,987      188   152,490,548         0.64
DOE      27,124,239      211,196         the     1,722,275      128   110,292,739         0.61
FR       32,000,223      132,129         the     2,066,443      242   129,044,761         0.59
WSJ      40,741,508      198,818         the     2,020,113      205   165,989,934         0.63
ZIFF     38,047,824      216,482         the     1,556,762      176   153,830,973         0.63
AZ       75,787,913      336,716         the     3,634,749      225   305,726,170         0.64
ADFWZ    175,653,883     573,661         the     9,443,580      306   707,244,903         0.62

Table 3. Size of the lists of occurrences, including stopwords

Table 4 presents the influence of a set of 361 function words obtained from [MNF58] in the five files. For the WSJ file, the 361 words, which are less than 0.18% of the vocabulary of 198,818 words, account for 44% of all 40,741,508 word occurrences (our software is case sensitive, so we considered each stop word twice, starting with a lower case and an upper case letter). By eliminating function words, the worst and average lengths of the lists of occurrences are much closer to what actually happens in practice, as we always try to use content words when retrieving information from text databases. Moreover, our index takes approximately 35% of the space of the text when the stop words are not indexed, which is the option in general for information retrieval systems.

Files    Words Text (n)  Words Voc. (V)  Most freq. word        n/V    Index size (bytes)  Index/Text
                                         Word      Occ.
AP       20,678,146      200,392         said      504,998      103    84,239,271          0.35
DOE      15,515,153      210,523         energy    61,748       73.7   63,853,146          0.35
FR       17,526,092      131,457         Section   104,490      133    71,145,018          0.32
WSJ      22,833,202      198,079         said      303,618      106    92,841,139          0.35
ZIFF     21,197,303      215,753         software  110,723      98.2   86,425,334          0.36
AZ       41,875,449      335,974         said      576,987      125    170,072,682         0.35
ADFWZ    97,749,896      572,903         said      885,374      171    395,625,209         0.35

Table 4. Size of the lists of occurrences, excluding stopwords

5.3 Time to Answer a Query

The experiments to measure query times considered exact and approximate queries (k = 0, 1, 2, 3), phrase patterns containing 1, 2, 3, 4, and 5 words, and the texts WSJ, AZ and ADFWZ. The patterns were randomly chosen from the texts, but avoiding patterns containing function words.

We tested our software against Glimpse version 3.0 [MW93] for the WSJ file, using the same set of queries used for our software package Igrep. For this experiment we used the option -b, with which Glimpse builds an index 16.9% of the size of the text (index size of 42.4 megabytes), allowing faster search. This option forces Glimpse to store an exact pointer to each occurrence of each word (i.e. a full inverted index), except for some very common words belonging to a stop list it always uses in this case. Results are shown in Table 5.

     1 word         2 words         3 words        4 words        5 words
k    t     r        t     r         t     r        t     r        t     r
0    0.08  0.3%     0.23  0.9%      0.24  1%       0.28  1%       0.34  1%
1    0.58  0.4%     1.99  1.5%      2.15  1.6%     2.59  1.9%     3.16  *
2    0.85  0.5%     8.27  5.1%      4.26  2.6%     4.65  2.9%     5.06  *
3    1.30  0.7%     34.1  17.9%     14.6  7.5%     11.2  *        8.97  *

* Glimpse does not accept queries allowing errors with more than 32 characters

Table 5. Igrep searching times in seconds (t) and ratio Igrep/Glimpse (r) for the WSJ text

Tables 6 and 7 show the results using Igrep for the larger files AZ (458.2 megabytes) and ADFWZ (1090.7 megabytes), respectively. We did not run Glimpse for these two files because its query times are too long on very large texts. Our approach, instead, works well with texts of 1 gigabyte and more.

k    1 word          2 words       3 words       4 words       5 words
0    0.087 ± 0.004   0.32 ± 0.03   0.33 ± 0.02   0.35 ± 0.02   0.40 ± 0.02
1    0.95 ± 0.01     3.3 ± 0.3     3.8 ± 0.1     4.3 ± 0.1     5.2 ± 0.1
2    1.4 ± 0.1       13 ± 3        7.0 ± 1.0     7.2 ± 0.5     8.2 ± 0.4
3    2.2 ± 0.1       70 ± 21       7.1 ± 0.6     12 ± 2        14 ± 2

Table 6. Searching times (in seconds) for the AZ text file using Igrep

The only case in which our index does not work well is for phrases of two words searched with 3 errors or more. This agrees with the analysis, in the sense that two words are not enough to guarantee that one of them has a sufficiently small list of occurrences. Three errors imply searching both words with three errors, and later intersecting the appropriate lists. A word searched with three errors generates a huge list of matches in the vocabulary.

k    1 word          2 words       3 words       4 words       5 words
0    0.095 ± 0.006   0.46 ± 0.05   0.41 ± 0.03   0.44 ± 0.03   0.48 ± 0.03
1    1.6 ± 0.1       6.2 ± 0.6     6.1 ± 0.3     7.1 ± 0.2     8.4 ± 0.2
2    2.3 ± 0.1       38 ± 10       18 ± 3        15 ± 2        15 ± 1
3    3.5 ± 0.1       108 ± 30      52 ± 12       34 ± 8        37 ± 11

Table 7. Searching times (in seconds) for the 1 gigabyte ADFWZ text file using Igrep

A possible solution is to forbid more than 2 errors in a single word of a phrase. Another one involves using the text at query time: instead of generating all the matches of a word with 3 errors, generate those of the other one with zero errors and check directly in the text whether the whole phrase appears with 3 errors. Thus, the huge list of matches is never generated.

The following test was for more complicated patterns, as follows:

1. <exe>cutive: meaning an exact occurrence of exe followed by any occurrence of cutive with k errors.
2. prob#atic sign#ance: where # means any character considered zero or more times (one possible answer is problematic significance).
3. <[LMN]ACM># received: meaning a word starting with L, M, or N, followed by ACM, followed by any character considered zero or more times, followed by the word received (one possible answer is LACM received). For this example, the search is case insensitive for both Igrep and Glimpse.
4. earl# retir[aeiou]#<ent> program: the #, [] and <> meaning as before (one possible answer is early retirement program).
5. acc[aeiou]*unt compri[ms](es|ent): the pattern is a regular expression (one possible answer is account comprises).

Table 8 presents searching times and the ratio against Glimpse using the WSJ file, for k = 0, 1, 2, 3, for the five patterns above. Table 9 presents experimental results for the values of K and β from Eq. (1) and θ from Eq. (2). From the values obtained for θ we can conclude that retrieval times are near O(√n) for typical texts.

         k = 0            k = 1            k = 2            k = 3
Pattern  t      R         t      R         t      R         t      R
1        0.031  0.004%    3.18   0.028     6.92   0.047     11.4   *
2        3.64   0.048     12.4   0.1       20.7   0.126     30.1   0.171
3        3.74   0.047     11.7   0.1       19.3   0.133     29.0   0.169
4        4.60   0.061     14.5   0.124     38.0   0.26      191    *
5        10.2   8.4%      22.2   8.8%      **     **        **     **

* For Glimpse k must be smaller than the number of characters between < and >
** The number of errors must be smaller than the number of characters of the smallest sequence between ( and ) in a regular expression

Table 8. Igrep searching times in seconds (t) and ratio against Glimpse (R) for the WSJ text

Text   AP     DOE    FR     WSJ    ZIFF   AZ     ADFWZ
K      26.8   10.8   13.2   43.5   11.3   9.2    4.8
β      0.46   0.52   0.48   0.43   0.51   0.52   0.56
θ      1.87   1.70   1.94   1.87   1.79   1.85   1.85

Table 9. Experimental results for the coefficients of the Heaps and Zipf equations

6 Conclusions and Future Extensions

We have presented an indexing scheme capable of retrieving words and phrases, searching exactly or approximately, using classes of characters and general regular expressions. It is based on full inverted lists, where the work is done on the vocabulary and the text is not accessed at all. This allows working with texts stored on remote, slow or removable devices, or even with no text at all. The index is implemented as a software package called Igrep, which is publicly available. Our analytical and experimental results show that the performance of the index is good even for text files of more than 1 gigabyte. The index can be built in linear time and a single pass over the text (in our machine the average indexing speed is 4 megabytes per minute), and it takes linear space (35% of the text size is typical). Querying performance is near O(√n) for queries that are useful in terms of precision. Typical times for one gigabyte of text are a few seconds for useful queries.

We are currently working on extensions of this index. It is easy to extend the index to handle collections instead of single files, and to restrict queries to some subcollections. Reindexing is also easy, since it is sufficient to index again the files that were added or updated and merge the original and the differential indices, which can be done efficiently and allows using the original index until the last minute. This lightweight reindexing capability is very valuable in the Web environment, where changes are continuous but not extensive.

We are also studying the best way to handle approximate phrase searching, to compare the current approach with the one of verifying directly in the text. Finally, we are working on integrating compression techniques, to make the whole index plus the compressed text nearly half of the original text [MNZ97].

Acknowledgements We wish to acknowledge the helpful comments of Ricardo Baeza-Yates.

References

[BY92] R. Baeza-Yates. Text retrieval: theory and practice. In Proc. of 12th IFIP World Computer Congress, volume I, pages 465-476, 1992. Elsevier Science.
[BYG92] R. Baeza-Yates and G. H. Gonnet. A new approach to text searching. Communications of the ACM, 35(10):74-82, 1992.
[BYN96a] R. Baeza-Yates and G. Navarro. A faster algorithm for approximate string searching. In Proc. CPM'96, Springer-Verlag LNCS, v. 1075, pages 1-13, 1996.
[BYN97] R. Baeza-Yates and G. Navarro. Block addressing indices for approximate text retrieval. Tech. Report TR/DCC-97-3, Dept. of CS, Univ. of Chile. Submitted.
[BYNST96] R. Baeza-Yates, G. Navarro, E. Sutinen and J. Tarhio. Indexing methods for approximate text retrieval. Tech. Report TR/DCC-97-2, Dept. of CS, Univ. of Chile.
[BYP92] R. Baeza-Yates and C. Perleberg. Fast and practical approximate pattern matching. In Proc. CPM'92, Springer-Verlag LNCS, v. 644, pages 185-192, 1992.
[CL92] W. Chang and J. Lampe. Theoretical and empirical comparisons of approximate string matching algorithms. In Proc. CPM'92, Springer-Verlag LNCS, v. 644, pages 172-181, 1992.
[Cob95] A. L. Cobbs. Fast approximate matching using suffix trees. In Proc. CPM'95, Springer-Verlag LNCS, v. 937, pages 41-54, 1995.
[GBY91] G. H. Gonnet and R. Baeza-Yates. Handbook of Algorithms and Data Structures. Addison-Wesley, 1991.
[Har95] D. K. Harman. Overview of the third text retrieval conference. In Proc. Third Text Retrieval Conference (TREC-3), pages 1-19, NIST Special Publication 500-207, Gaithersburg, Maryland, 1995.
[Hea78] J. Heaps. Information Retrieval - Computational and Theoretical Aspects. Academic Press, NY, 1978.
[LST96] O. Lehtinen, E. Sutinen and J. Tarhio. Experiments on block indexing. In Proc. Third South American Workshop on String Processing (WSP'96), Carleton University Press International Informatics Series, v. 4, pages 183-193, 1996.
[MB95] A. Moffat and T. Bell. In situ generation of compressed inverted files. Journal of the American Society for Information Science, 46(7):537-550, 1995.
[MNZ97] E. de Moura, G. Navarro and N. Ziviani. Indexing compressed text. In R. Baeza-Yates, editor, Proceedings Fourth South American Workshop on String Processing, Carleton University Press International Informatics Series, Valparaiso, Chile, 1997.
[MNF58] G. A. Miller, E. B. Newman and E. A. Friedman. Length-frequency statistics for written English. Information and Control, 1:370-380, 1958.
[MW93] U. Manber and S. Wu. GLIMPSE: A tool to search through entire file systems. Tech. Report 93-34, Dept. of CS, Univ. of Arizona, Oct 1993.
[Mye94] E. Myers. A sublinear algorithm for approximate keyword searching. Algorithmica, 12(4/5):345-374, 1994. Springer-Verlag.
[Nav97] G. Navarro. Approximate string matching by counting. Tech. Report TR/DCC-97-1, Dept. of CS, Univ. of Chile. Submitted.
[Sel80] P. Sellers. The theory and computation of evolutionary distances: pattern recognition. J. of Algorithms, 1:359-373, 1980.
[ST95] E. Sutinen and J. Tarhio. On using q-gram locations in approximate string matching. In P. Spirakis, editor, Proc. ESA'95, Springer-Verlag LNCS, v. 979, Corfu, Greece, pages 327-340, 1995.
[ST96] E. Sutinen and J. Tarhio. Filtration with q-samples in approximate string matching. In Proc. CPM'96, Springer-Verlag LNCS, v. 1075, pages 50-61, 1996.
[Ukk85] E. Ukkonen. Finding approximate patterns in strings. Journal of Algorithms, 6:132-137, 1985.
[Ukk93] E. Ukkonen. Approximate string-matching over suffix trees. In Proc. CPM'93, Springer-Verlag LNCS, v. 684, pages 228-242, 1993.
[WM92] S. Wu and U. Manber. Fast text searching allowing errors. Communications of the ACM, 35(10):83-91, 1992.
[WMM96] S. Wu, U. Manber, and E. Myers. A sub-quadratic algorithm for approximate limited expression matching. Algorithmica, 15(1):50-67, 1996.
[Zipf49] G. Zipf. Human Behaviour and the Principle of Least Effort. Addison-Wesley, 1949.

Appendix: Times for Different Querying Operations

There are a number of different types of query to analyze. Each type involves carrying out different tasks. For example, simply searching a word involves a binary search on the vocabulary; searching a phrase involves binary searching each word and then intersecting the lists; searching for a regular expression involves a sequential search on the vocabulary plus merging the resulting lists. We remark that this analysis is approximate, since it relies on rules such as the Heaps law or the Zipf law, which are only rough approximations to the statistical structure of texts. Moreover, the results are valid only for queries useful to the user (i.e. with a reasonable degree of precision). We first analyze the cost of each task, and use the results to deduce the cost of each type of query. The description of the tasks follows, together with their analysis. Recall that the size of the vocabulary is V = O(n^β) and that α = 1 − β, where α and β are normally in the range 0.4 to 0.6 [Hea78].

bin-search  Binary searching a word in the vocabulary and retrieving the list. Since the search is binary, we have O(log V) = O(log n) cost for this type of task.

seq-search  Sequentially searching a word in the vocabulary is O(n^β). This is the case of regular expressions and others. It is also the case of approximate simple word matching, since, as explained, we use a linear-time filter and the number of verifications is not significant in practice. Searching a complex expression with k errors, on the other hand, is O(k n^β). We take k as a constant.

lst-merge  List merging of j lists occurs in approximate search, non-standard patterns, etc. Since the average size of each list of occurrences is n/V = O(n^{1−β}) and we merge ordered lists to produce an ordered list, we work O(n^α j log j).

lst-inters  List intersection of j lists occurs in phrases. Those lists can come from searching simple words or complex expressions. In the latter case we use an algorithm similar to that of list merging, to achieve O(j l log j) for j lists of length l. In the case of simple words we select the smallest list and, for each element, binary search adjacent positions in the other lists. We show that in this case the length of the shortest list is O(1) on average, so we work O(j log n) on average. To show that, we assume valid the generalized Zipf law

[Zipf49, GBY91], which says that the number of occurrences of the i-th most frequent word is, for some θ dependent on the text,

    f(i) = n / (i^θ H_V(θ)),   where   H_V(θ) = Σ_{j=1}^{V} 1/j^θ,    (2)

which is constant for θ > 1. We experimentally validated this law in Section 5. It must hold f(V) = O(1), i.e. there exist words that appear once (there are a lot, in fact). Under the model V = O(n^β) we have θ = 1/β for sufficiently large texts (e.g. ADFWZ). If we consider X_1 .. X_j the ranks of the words present in a phrase (which are uniformly distributed over [1..V]), we have

    P(min_i f(X_i) > a) = (P(f(X_1) > a))^j.

Hence, the expectation of the length of the shortest list is

    E[min_i f(X_i)] = Σ_{a ≥ 1} P(min_i f(X_i) ≥ a) = Σ_{a ≥ 1} (P(f(X_1) ≥ a))^j,

which is O(1) for j > θ. This is typically out of the question for phrases of three words or more. However, for j = 2 that may not be the case. Bounding the summation with an integral, we get that the expectation is O(n^{1−2β}) for j = 2. In that case the total cost of the intersection is O(n^{1−2β} log n). (Observe that we took j, the phrase length, as a constant.)

We now point out the times for each type of query, as follows:

— Simple words: a bin-search taking O(log n).
— Phrases of j simple words: the time is O(j log n), which is both the time to search each word (j bin-searches) and to intersect the lists (lst-inters). For j = 2 the time can be O(n^{1−2β} log n), which is o(n^β).

Proximity Queries in Metric Spaces

Chávez, Edgar 1   Marroquín, J. L. 2
1 Universidad Michoacana, México
2 Centro de Investigación en Matemáticas, México

Abstract. In this work we present a novel approach to analyze the behavior of triangle-inequality-based (TIB) proximity matching algorithms. Any TIB algorithm eliminates points outside a certain "interesting" region using the triangle inequality. We propose a form of this elimination in what we call an inner query. The inner queries are range queries for dictionary points. We present two algorithms, based on inner queries, solving the nearest neighbor (NNQ) and range search (NQR) problems. The space complexity of both the NNQ and NQR algorithms is driven by the space complexity of the inner query algorithm. We present a formulation of the inner query algorithm using O(n²) space, though better formulations are likely to be found.

1 Introduction

In this work we will analyze the problem of satisfying proximity queries in general metric spaces. Seminal papers leading to, in a sense, optimal algorithms have been written since the very formulation of the problem. However, those algorithms use large amounts of memory (see below), and as new computer applications are developed, more competitive algorithms are demanded. Among the difficulties arising are the very large number of elements in the data set, the high dimensionality of the data and the absence of coordinates for indexing the data set. Our goal is to show that there is a simple, yet powerful, way to measure the tractability of a given data set and to present new simple algorithms solving proximity queries. In simple words, our aim is to show that the histogram of distances between data elements provides a way to predict how a competitive algorithm will behave on the average.

Let us begin with some formal definitions. The set X will denote the universe of fair or valid points, a finite subset U of them will be called the dictionary, x ∈ X stands for the query point and d is a measure of closeness or similarity between points in X. Similarity functions have the following properties:

— d(u, v) ≥ 0 (positiveness)
— d(u, v) = d(v, u) (symmetry)
— d(u, u) = 0 (reflexivity)

and in most cases

— d(u, v) = 0 iff u = v (strict positiveness)

There are interesting computer applications where no other property is asked of the similarity function; however, no sublinear solutions have appeared for this problem. This is the most general framework in which the proximity problem could be stated. A nearest neighbor query consists in finding the point u in U closest to x under the similarity function d. An r-near neighbor query, or a query of range r, consists in finding all the points u in U such that d(u, x) ≤ r. It is clear that in a single pass over the data set, i.e. in linear or O(n) operations, both queries could be answered. However, authors agree that a more realistic measure of the complexity involved is the number of similarity calculations used to answer the query, because the similarity function could be really expensive to compute.

The similarity properties enumerated above only ensure a consistent definition of the function, and cannot be used to discard impossible points in a proximity query. This could be the reason for the failure to design a sublinear algorithm for the similarity space model. If d is indeed a distance, i.e. if it satisfies the triangle inequality

— d(u, v) ≤ d(u, w) + d(w, v)

then the set X is called a metric space, and in this model several branch and bound techniques and data structures are known to solve proximity queries in a sublinear number of distance computations (see next section). A more restricted universe is when the elements of the metric space X are n-tuples of real numbers, i.e. when X is a vector space. In this case the metric in turn is usually an l_p metric, often with p = 1, the city-block distance, p = 2, the Euclidean distance, or p = ∞, the maximum distance. We will not discuss this model in this work; for a survey on proximity queries in the vector space model the reader should see [8, 2].

A final word before discussing related work: there is a really surprising result in both the vector space and metric space models; this result could be loosely stated in a single sentence: the average number of distance calculations necessary for answering proximity queries is constant, i.e. it does not depend on the number of elements of the data set. This result was first suspected from experimental evidence in Vidal's work [7]. Nevertheless there is a formal proof of the above assertion due to Faragó et al. [4] for the metric space model. Before them, for the vector space model, Bentley et al. developed the bucketing methods [1, 2] (which are also constant-time algorithms for the vector space model). Unfortunately the bucketing algorithm uses exponential (in the dimension) preprocessing time and space, while Vidal's algorithm uses quadratic (in the size of the dictionary) preprocessing time and space. Both solutions are unrealistic from the memory usage point of view if the number of data points and/or the dimension of the space are large.

22

Additionally the bucketing methods have an exponential dependence on the dimension of the vector space X. A similar limitation appears in Farag6’s al­ gorithm: the number of distance calculations depends exponentially on certain parameter used in the technical proof of the theorem. This parameter is the key to understand the behavior of this algorithm, however very little discussion is done about it in Farag6’s paper. We are interested in Vidal’s algorithm as an experimental evidence of con­ stant average complexity for proximity queries. In this paper we will analyze Vidal’s algorithm from a novel perspective. In this analysis the distribution of distances between elements of the data set will play a central role. We will also present and analyze our algorithms solving both nearest neighbor and range search problems. These algorithms are based on Vidal’s work, yet they are simpler and we will show experimental evidence of their good behavior. The plan of the presentation is as follow: Section 2 presents a short review in proximity queries in the metric space model, section 3 a discussion on our approach, section 4 experimental results and section 5 conclusions.

2 Related Work All the approaches discussed in this review are off-line, because a preprocessing step must be performed before any query can be answered; and as a matter of fact no efficient solutions have appeared for the on-line formulation of the problem. This review is divided in two parts, the first for range search and the second for the nearest neighbor problem.

2.1 Related Work On Range Search In this category we can find the metric trees [6], where the idea is to build a partition of the space selecting a point and then dividing the entire data set in two parts, the closest to the vantage point (half of them) and the farthest (the other half). This basic procedure is repeated recursively to build a binary tree supporting proximity queries. Average complexity is argued to be logarithmic on the size of the data set, this assertion is supported by experiments on binary vectors under hamming distance. This could be true if no backtracking where performed, but in experiments performed both branches need to be searched in most cases; specially when the searching range and the dimension increases. A similar argument is found in Brin’s paper [3] with the GNATS data struc­ ture, where k points are selected to build a partition of the space. Every element of the dictionary is assigned to the nearest of the split points, this procedure de­ fines domains of influence. Each domain is then recursively partitioned with the samfi procedure; yielding a k-Biy tree. To perform queries in this tree, at each level we search for the (possible many) domains where it is possible to find dictionary points satisfying the query range. Branches without intersection with the interesting region are safely pruned. Again, if the range of the search increase fewer branches can be pruned.

23

In both algorithms it is not clear what to do if the range of the search is unknown. In other words, there is not a direct way to extend the algorithms to perform a nearest neighbor search. Another viewpoint is given by the fixed queries trees (FQT) by Baeza-Yates et al [5] . Even if the algorithm works well in any metric space, a discrete one (the distance function takes only bounded integer values) is selected to explain the algorithm. Here random points (the keys) in the metric space are selected to build a tree. At the root node the entire data set is divided into to 4 - 1 subsets, each subset corresponding to points at distance 0, • *,m from the first key. In the next level all points are compared with the second key, and each subset is again subdivided into 771+1 groups and so on, until the smaller group (a bucket) has a desired number of elements, after this procedure we have k keys {s*}. The leaves are groups of points that share the same distances from the keys (and in the same order). Queries of range r are then performed searching at each level all branches with labels s such that e;- —r < s < + r,with e;- = d(x,Sj). Both analytical bounds and experimental performance are shown, and sublinear average number of distance calculations is obtained. In the continuous case each branch of the FQT is associated with a range of distances instead of a single number. In this data structure there is a simple way to build an algorithm for nearest neighbor search, and even if not treated in Baeza-Yates et al it is easy to deduct. This procedure is described in the next subsection. All the above algorithms perform a number of distance calculations increasing with the range of the search; this is a natural behavior since there is a range in which all the data set is included. In the revised papers, the role of dimensionality in the behavior of the algo­ rithms is not analyzed. Even more, the GNATS schema is explicitly tailored for "high, dimensional metric spaces” but there is not definition or discussion of what is meant by ’’high dimensional spaces”. In Brin’s paper the only reference to high dimensionality is (not textually) ” those spaces in where little increments in the range of the search produces large increments in the number of points inside the range1*. No empirical analysis is done to understand the role of dimensionality. This lack of results in this direction is perhaps due to the absence of a clear measurement of the dimension in a general metric space. In a vector space is clear that the number of coordinates gives a proper measurement of the Hinrn>n«inn (this is perhaps the definition of dimensionality) whereas in a metric space is not clear when it is high or low dimensional. A more detailed discussion of this ideas is given in section 3.

2.2 Related Work On Nearest Neighbor Search Farag6 et al [4] discusses the analysis of a family of algorithms for nearest neigh­ bor search. In its simpler form k points are selected, say 2%. This points are called a base at level (a, 0) if the triangle inequality is valid modulo a i.e. |d(utZ{) — d(v,Zi)\ < ad(u,v) and max {\d(u,pi) — d(v,Pi)|} > fid(u,v) i.e.

24

max !•) is a metric. In the preprocessing stage all distances d(u,Zi), u in

i m —k)< 1/m 3. Using Theorem 6, it is possible to show that the expected r u nn in g time of the checking phase of LET and DLET is 0(n). Since the preprocessing and filtering phase requires 0(n) time, it follows that the expected running time of LET is 0(n). The same holds for the improved version DLET, but the bound for k can be weakened; see [5]. 3.4 Experim ental Results In our experiments we verified that our dynamic filtering technique, when ap­ plied to the static filter algorithms LET and SET, leads to an improved critical threshold kmax (cf. the introduction) for all alphabet sizes and pattern lengths. In a first test series we used random text strings T of length n = 500,000 over alphabets of size 2,4,10 and 40. We chose patterns of a fixed length m = 64 over the same alphabets. Figures 2 and 3 show the effect of a varying threshold value k on the filtration efficiency f = (n —np)/n, where np is the number of positions in T left for dynamic programming. It can be seen that in order to achieve a particular filtration efficiency, algorithms DLET and DSET allow for a larger value of k than algorithms LET and SET, respectively. The advantage is independent of the alphabet size. Figure 4 shows the effect of the alphabet size on /cmax. All algorithms achieve a larger value of kmgx with growing alpha­ bet size. The dynamic filter algorithms are always superior to the static filter algorithms.

46

Fig. 2. Filtration efficiency for fixed pattern length m — 64 (LET and DLET)

rurbar of d tteroncu k

Fig. 3. Filtration efficiency for fixed pattern length m = 64 (SET and DSET)

In a second test series T was an English text of length n = 500,000. The alphabet size was |>t| = 80. We chose pattern of length m € {16,32,64,128} over the same alphabet. Figures 5 and 6 show how a varying pattern length effects the filtration efficiency. It can be seen that in order to achieve a particular filtration efficiency, algorithms DLET and DSET allow for a larger value of k than algorithms LET and SET, respectively. The larger the pattern, the larger the improvement. Figure 7 shows the effect of the pattern length on kmax. All algorithms achieve a larger value of kmax with m becoming larger. The dynamic filter algorithms are always superior to the static filter algorithms. There is virtually no time penalty for the complex dynamic filter algorithms, if the filtration efficiency is almost 100%. When the static filter looses its effect, while the dynamic is still filtering, the latter is much faster than the former. Fi­ nally, if k > kmaxi then the dynamic filter has a considerable overhead. However, in this case, pure dynamic programming is preferable anyway.

47

Fig. 5. Filtration efficiency for fixed alphabet size |i4| = 80 (LET and DLET)

Fig. 6. Filtration efficiency for fixed alphabet size |A| = 80 (SET and DSET)

4 Dynamic Filtering applied to LEQ Dynamic filtering is a general idea which can be applied to other known static filter techniques, e.g. [16, 14, 3,13, 11,12]. We exemplify this claim by showing how it can be applied to Sutinen and Tarhio’s algorithms LEQ and LAQ [11,12].2 We first briefly recall the basic idea of LEQ and LAQ. Assume that q and s are positive integers and define h = m~k+a \ • LEQ and LAQ take ^-samples

J.

Tsam(j) = T\jh - q + 1 .. .jh] for all j € j l , 2,..., |^ J from the text T. Suppose v = T[b... e] is an approximate match, that is, edist(P, v) < k. If for the sampling step h the inequality h > q holds, then v 2That is, it also applies to Takaoka’s method with a > 1 instead of s = 1 g-samples taken from the text.

48

contains at least k + s consecutive g-samples. Moreover, at least s of these must occur in P—in the same order as in v. This ordering is taken into account by dividing P into blocks Q 1,..., Qk+ai where Qi = P[(i —1)h+ 1.. .ih + q - 1 + k]. LEQ and LAQ are based on the following theorem which we cite from [12] (cf. also [11]): If Tsam(ja) is the leftmost g-sample of the approximate match v = T[b... e], then there is an integer t, 0 < t < k 4- 1, such that the k + s consecutive g-samples Tsam(Ja + 1 + 1),. . . , Tsam(ja + k + s + t) are contained in v and Tsam(ja + I + t) € Qi holds for at least s of the samples. (We write u € v, if string u is a subword of string v.) Defining = j a + 1, LEQ can be formulated as follows: A lgorithm LEQ. If, for k + s consecutive (jr-samples Tsam(ji, + 1)}Tsam(jb + 2),.. . 7Tsam(jb + k + s), we have Tsam(jb + I) € Qi for at least s indices I, 1 jh and derive a contradiction. There is a n i, 0 < i < m such that D(i,jh) < k and 6 = jh — \W(i,jh)\ 4 1. Obviously, W(i,jh) is a prefix of v, i.e., it contains the 9-samples Tsam(jb 4 1 ) ,..., Tsam(jb 4 k 4 s),..., Tsam(j) (2) Since j > jb 4 k 4 s, the sequence (2) is of length at least k + s + 1. Hence \W(i,jh)\ > (k 4 s)h 4 q. Let I — 4 l j . Then I > k 4 s 4 1 which

50

implies G D (j,l,k+ s-l) =■0. Thus we conclude D(i,jh) + GD(j,lxk + s -l) < k, which means that CPM(j) holds. This is a contradiction, i.e., e < jh is true. □ Like LEQ (see [11]), an efficient implementation of DLEQ utilizes the shiffcadd technique of [1]: for each j, 0 < j < |_§J a vector Mj is computed, where M (i\ — / Xw=o viTsamtf —/) € Qi-i) if i < j \ 0 otherwise One easily shows that Mj+i(i + 1) = Mj(i) 4- ip(Tsam(j 4-1) G Qi+i) and GD(j, l,r) = r —Mj+r(r) hold. As a consequence (i) Mj +1 can be obtained from Mj by some simple bit parallel operations (provided P is suitably preprocessed, see [11] for details), (ii) BPM(j) can be decided in constant time, and (Hi) CPM(j) can be decided in 0(m) time. Thus, the dynamic checking in DLEQ requires G(mnfh) time in the worst case. 4.1 Dynamic Filtering Applied to LAQ A dynamic version of algorithm LAQ [12] is easily obtained from the above algorithm. Instead of counting one difference whenever Tsam(j + 1) & Qi (like LEQ does), LAQ uses the asm distance introduced by Chang and Marr [3] in order to obtain a better lower bound for the guaranteed differences. Let asm(u, B) denote the edit distance between string u and its best match with a subword of string B. LAQ is obtained from LEQ by simply replacing GD(jtl,r) = E L i (p{Tsam(j + y) i Qy) with GD(j, I, r) = Y,y=i asm(Tsam(j + y), Qy). Since Tsam(j+y) 1, LAQ uses a stronger filter than LEQ. The price to be paid, however, is that either tables asm(u}Qi), 1 < i < k + s, have to be precomputed for every string u of length q, or the required entries of the tables have to be computed on demand.

5 Conclusion Although the technical details are different in each case, we have shown that our approach is a general technique for the improvement of filtering methods in approximate string matching. By analogy, we may plan a car route through a crowded city, based on advance information of traffic congestion at various points. There is always a chance to improve our routing decisions based on the traffic we have observed so far, still using advance information about the route ahead. This does not make much difference in very low traffic (practically no matches) or in times of an overfull traffic overload (matches almost everywhere). But there is always a certain level of traffic where the flexibility added by our method makes us reach the destination before our date has gone.

51

References 1. R.A. Baeza-Yates and G.H. Gonnet. A New Approach to Text Searching. Com­ munications of the ACM, 35(10):74-82, 1992. 2. W.I. Chang and E.L. Lawler. Sublinear Approximate String Matching and Bio­ logical Applications. Algorithmica, 12(4/5):327-344, 1994. 3. W.I. Chang and T.G. Marr. Approximate String Matching and Local Similarity. In [6], pages 259-273, 1994. 4. A. Ehrenfeucht and D. Haussler. A New Distance Metric on Strings Computable in Linear Time. Discrete Applied Mathematics, 20:191-203, 1988. 5. R. Giegerich, F. Hischke, S. Kurtz, and E. Ohlebusch. Static and Dynamic Filtering Methods for Approximate String Matching. Report 96-01, Technische Fakultat, Universitat Bielefeld, 1996. URL: http://www.techfak.unibielefeld.de/techfak/ags/pi/ Agpi/pu blications.html. 6. D. Gusfield, editor. Proceedings of the Fifth Annual Symposium on Combinatorial Pattern Matching, Asilomar, California, June 1994• Lecture Notes in Computer Science 807, Springer Verlag, 1994. 7. D.S. Hirschberg and E.W. Myers, editors. Proceedings of the 7th Annual Sympo­ sium on Combinatorial Pattern Matching, Laguna Beach, California, June 1996. Lecture Notes in Computer Science 1075, Springer Verlag, 1996. 8. S. Kurtz. Fundamental Algorithms for a Declarative Pattern Matching System. Dissertation, Technische Fakultat, Universitat Bielefeld, available as Report 95-03, July 1995. 9. E.M. McCreight. A Space-Economical Suffix Tree Construction Algorithm. Jour­ nal of the ACM, 23(2):262-272, 1976. 10. P.H. Sellers. The Theory and Computation of Evolutionary Distances: Pattern Recognition. Journal of Algorithms, 1:359-373, 1980. 11. E. Sutinen and J. Tarhio. On Using g-Gram Locations in Approximate String Matching. In Proceedings of the European Symposium on Algorithms, pages 327340. Lecture Notes in Computer Science 979, Springer Verlag, 1995. 12. E. Sutinen and J. Tarhio. Filtration with g-Samples in Approximate Matching. In [7], pages 50-63, 1996. 13. T. Takaoka, Approximate Pattern Matching with Samples. In Proceedings of ISAAC 1994, pages 234r-242. Lecture Notes in Computer Science 834, Springer Verlag, 1994. 14. J. Tarhio and E. Ukkonen. Approximate Boyer-Moore String Matching. SIAM Journal on Computing, 22(2):243-260, 1993. 15. E. Ukkonen. Finding Approximate Patterns in Strings. Journal of Algorithms, 6:132-137, 1985. 16. E. Ukkonen. Approximate String-Matching with g-Grams and Maximal Matches. Theoretical Computer Science, 92(1):191-211, 1992.

52

Distributed Generation of Suffix Arrays: a Quicksort-Based Approach Joao Paulo Kitajima13 Gonzalo Navarro24 Berthier A. Ribeiro-Neto15 Nivio Ziviani16 1 Dept, of Computer Science, Federal University of Minas Gerais, Brazil. 2 Dept, of Computer Science, University of Chile, Chile. 3 This author has been partially supported by CNPq Project 300815/94-8. 4 This author has been partially supported by Fondef grant 96-1064 (Chile). 5 This author has been partially supported by CNPq Project 300188/95-1. 6 This author has been partially supported by CNPq Project 520916/94-8 and Project Ritos /C yted .

Abstract. An algorithm for the distributed computation of suffix ar­ rays for large texts is presented. The parallelism model is that of a set of sequential tasks which execute in parallel and exchange messages between each other. The underlying architecture is that of a highbandwidth network of processors. In such a network, a remote mem­ ory access has a transfer time similar to the transfer time of magnetic disks (with no seek cost) which allows to use the aggregate memory distributed over the various processors as a giant cache for disks. Our algorithm takes advantage of this architectural feature to implement a quicksort-based distributed sorting procedure for building the suffix ar­ ray. We show that such algorithm has computation complexity given by 0(rlog(n/r) + n/r log r log n) in the worst case and 0(nfr logn) on av­ erage and communication complexity given by 0(n/r log2 r) in the worst case and 0(n/r logr) on average, where » is the text size and r is the number of processors. This is considerably faster than the best known sequential algorithm for building suffix arrays which has time complex­ ity given by 0(n2/m) where m is the size of the main memory. In the worst case this algorithm is the best among the parallel algorithms we are aware of. Furthermore, our algorithm scales up nicer in the worst case than the others.

1 Introduction We present a new algorithm for distributed parallel generation of large suffix arrays in the context of a high bandwidth network of processors. The motivation is three-fold. First, the high cost of the best known sequential algorithm for suffix array generation leads naturally to the exploration of parallel algorithms for solving the problem. Second, the use of a set of processors (for example, connected by a fast switch like ATM) as a parallel machine is an attractive alternative nowadays [1]. Third, the final index can be left distributed to reduce the query time overhead. The distributed algorithm we propose is based on a

53

parallel quicksort [7, 13]. We show that, among previous work, our algorithm is the fastest and the one that scales best, in the worst case. The problem of generating suffix arrays is equivalent to sorting a set of unbounded-length and overlapping strings. Because of those unique features and because our parallelism model is not a classical one, the problem cannot be solved directly with a classical parallel sorting algorithm (we review related work in Section 3). The proposed algorithm is based on the recursive parallel quicksort approach, where a suitable pivot is found for the whole distributed set of suffixes and the partition phase redistributes the pointers of the suffix array so that each processor has only suffixes smaller or larger than the pivot. A generalization of the parallel quicksort was presented in [12], whose central idea is as follows. Consider the global sorted suffix array which results of the sorting task. If we break this array in n /r similarly-sized portions, we can think that each processor holds exactly one such slice at the end. Thus, the idea is to quickly deliver to each processor the index pointers corresponding to its slice. In summary, the generalized parallel quicksort presented in [12] works with r-percentiles obtained in one step, instead of the binary recursive approach based on one pivot used here. It is also worth to mention a previous parallel mergesort based algorithm presented in [9], which is slower than the algorithm presented here on both average and worst cases. 1.1 Suffix Arrays To reduce the cost of searching in textual databases, specialized indexing struc­ tures are adopted. The most popular of these are inverted lists. Inverted lists are useful because their search strategy is based on the vocabulary (the set of distinct words in the text) which is usually much smaller than the text and thus, fits in main memory. For each word, the list of all its occurrences (positions) in the text is stored. Those lists are large and take space which is 30% to 100% of the text size. Suffix arrays [10] or PAT arrays [4, 5] are more sophisticated indexing struc­ tures with similar space overhead. Their main drawback is their costly construc­ tion and maintenance procedures. However, suffix arrays are superior to inverted lists for searching phrases or complex queries such as regular expressions [5, 10]. In this model, the entire text is viewed as one very long string. In this string, each position k is associated to a semi-infinite string or suffix, which initiates at position k in the text and extends to the right as far as needed to make it unique. Retrieving the “occurrences” of the user-provided patterns is equivalent to finding the positions of the suffixes that start with the given pattern. A suffix array is a linear structure composed of pointers (here called index pointers) to every suffix in the text (since the user normally bases his queries upon words and phrases, it is customary to index only word beginnings). These index pointers are sorted according to a lexicographical ordering of their respec­ tive suffixes and each index pointer can be viewed simply as the offset (counted from the beginning of the text) of its corresponding suffix in the text.

54

To find the user patterns, binary search is performed on the array at O(logn) cost (where n is the text size). The construction of a suffix array is simply an indirect sort of the index pointers. The difficult part is to do this sorting efficiently when large texts are involved (i.e., gigabytes of text). Large texts do not fit in main memory and an external sort procedure has to be used. The best known sequential procedure for generating large suffix arrays takes time 0 (n 2/m logm) where n is the text size and m is the size of the main memory [5]. 1.2 D istributed Parallel Com puters Parallel machines with distributed memory (multicomputers or message pass­ ing parallel computers) are a good cost-performance tradeoff. The emergent fast switching technology has allowed the dissemination of high-speed networks of processors at relatively low cost. The underlying high-speed network could be, for instance, an ATM network running at a guaranteed rate of hundreds of megabits per second. In an ATM network, all processors are connected to a cen­ tral ATM switch which runs internally at a rate much higher than the external rate. Any pair of processing nodes can communicate simultaneously at the guar­ anteed rate without contention and broadcasting can be done efficiently. Other possible implementations are the IBM SP based on the High Performance Switch (HPS), or a Myrinet switch cluster. Our idea is to use the aggregate distributed memory of the parallel machine to hold the text. Accessing remote memories takes time similar to that of transferring data from a local disk, although with no seek costs [9].

2 Preliminaries Our parallelism model is that of a parallel machine with distributed memory. Assume that we have a number r of processors, each one storing 6 text positions, composing a total distributed text of size n = rb. Our final suffix array will also be distributed, and a query solved with only O(logn) remote accesses. We assume that the parallelism is coarse-grained, with a few processors, each one with a large main memory. Typical values are r in the tenths or hundreds and b in the millions. The fact that sorting is indirect poses the following problem when working with distributed memory. A processor which receives a suffix array cell (sent by another processor) is not able to directly compare this cell because it has no local access to the suffix pointed to by the cell (such suffix is stored in the original processor). Performing a communication to get (part of) this suffix from the original processor each time a comparison is to be done is expensive. To deal with this problem we use a technique called •pruned suffixes. Each time a suffix array cell is sent to a processor, the first I characters of the corresponding suffix (which we call a pruned suffix) are also sent together. This allows the remote processor to perform comparisons locally if they can be decided looking at the first t characters only. Otherwise, the remote processor requests more characters

55

to the processor owning the text suffix cell7. We try to select I large enough to ensure that most comparisons can be decided without extra communication and small enough to avoid very expensive exchanges and high memory requirements. We define now what we understand by a “worst-on-average-text” (w at) case analysis. If we consider a pathological text such as "a a a a a a the classical suffix array building algorithm will not be able to handle it well. This is because each comparison among two positions in the text will need to reach the end of the text to be decided, thus costing O(n). Since we find such worst-case texts unrealistic, our analysis deal with average random or natural language text. In such text the comparisons among random positions take 0(1) time (because the probability of having to look at more than i characters is 1/ 1). Also, the number of index points (e.g., words) at each processor (and hence the size of its suffix array) is roughly the same. A WAT-case analysis is therefore a worst-case analysis on average text. We perform WAT-case and average-case analysis.

3 Related Work For the PRAM model, there are several studies on parallel sorting. For instance, Jaja et al. [8] describe two optimal-work parallel algorithms for sorting a list of strings over an arbitrary alphabet. Apostolico et al. [2] build the suffix tree of a text of n characters using n processors in 0(log n) time, in the CRCW PRAM model. Retrieval of strings in both cases is performed directly. In a suffix array, strings are pointed to and the pointers are the ones which are sorted. If a distributed memory is used, such indirection makes the sorting problem more complex and requires a more careful algorithm design. The parallelism model we adopt is that of parallel machines with distributed memory. In such context, different approaches for sorting can be employed. For instance, Quinn [13] presents a quicksort for a hypercube architecture. That algorithm does not take into account the variable size and overlapping in the elements to be sorted, as in our problem. Furthermore, the behavior of the com­ munication network in Quinn’s work is different (processors are not equidistant) from the one we adopt here.

4 The Quicksort-Based Distributed Algorithm Our algorithm also utilizes the aggregate memory as a giant cache for disks. Unlike mergesort, the hardest work occurs at the point of higher parallelism. It also improves over the generalized quicksort, because the partitioning is binary and therefore bad biased cases are handled better. Our algorithm starts by determining the beginning of each suffix in the text (i.e., the beginning of each word) and by generating the corresponding index 7 As we will see, in some cases this is not necessary and one might assume that the suffixes are equal if the comparison cannot be locally decided.

56

pointers. Once this is done, the pointers are sorted lexicographically by the suffixes they point to (i.e. the local suffix arrays are built). This task is done in parallel for each of the r blocks of text. Since computation of the whole suffix array requires moving index pointers among processors without losing sight of the suffixes they point to, index pointers are computed relative to the whole text. The processors then engage in a recursive process which has three parts: (1) find a suitable pivot for the whole distributed set of suffixes; (2) partition the array: redistribute the pointers so that each processor has only suffixes smaller or larger than the pivot (keep local arrays sorted), and (3) continue the process separately inside each group of processors. This recursion ends when a partition is completely inside a local processor. Since all the time the suffixes at each processor are sorted up to pruning, the process is completed with (4): a final sorting of equal pruned suffixes inside each processor. We now describe the algorithm more in detail. Let E(i) be the set of index pointers stored in the processor i. Further, let p be a reference to an index pointer and let S(p) be the pruned suffix pointed to by p. 4.1 Finding a Pivot The goal of this stage is to find a suffix which is reasonably close to the median of the whole set, at a low cost. To achieve this, all processors (a) take the middle element m(*) of their local suffix array; (b) broadcast that (pruned) median m(i); (c) knowing all the other medians, do m = median{m( 1),..., m(r)}; (d) binary search the median of medians m in their suffix array, therefore partitioning their index pointers in two sets L(i) and R(i): L{t) = {P E E(t) | S(p) < m}; R(t) = {p € E(i) \ S(p) > m} (1) (e) broadcast the sizes \L(i)\ and \R(i)\ of the computed partitions. Observe that in part (e) a pruned suffix which is found to be equal to the (pruned) pivot m is put at the left partition. This works well and avoids at all requesting full suffixes to other processors. However, as the algorithm progresses, this pivoting process can worsen the randomness of the partition. Such effect tends to get worse at the final stages of the sorting process. We proved in [12] that this median of medians is very close to the exact median, and we show in Section 6 that this is the case in practice, even using pruned suffixes. Notice that it is possible to find the exact pruned median by using the 0 (r log 6) process described in [12]. However this would add a complication to the algorithm and does not change the complexities, as we see later.

57

4.2 R edistributing Pointers The processors engage in a redistribution process in which they exchange index pointers until each processor contains all of its index pointers in either L or R, where

L= |Jl(i); R= U r(«)

(2)

We say that the processor becomes homogeneous when this happens. There can be left at most one processor whose index pointers lie in both L and R (we ex­ plain later how this is accomplished). This processor is called non-homogeneous. The process of redistributing index pointers is carried out in a number of steps which are completely planned inside each processor (simulating comple­ tion times for exchanges) and later followed independently. To accomplish such effect, the processors are paired in a fixed fashion (for instance, pair the proces­ sor (2i) with the processor (2i + 1) for all i). Each pair manages to exchange a minimum number of index pointers such that one of them is left homogeneous. The homogeneous processor in each pair is left outside of the redistribution process. The remaining half processors engage in a new redistribution process in which the processor (4i) or (4i + 1) is paired with the processor (4i + 2) or (4z + 3) (depending on which one is still non-homogeneous). Notice that, since all processors have the information needed to predict the redistribution process, they know which processor to pair with at each iteration, and no syn­ chronization messages have to be exchanged. This ends when there is only one non-homogeneous processor. Let us focus in the task of making one of the processors in a pair homo­ geneous. Consider the pair composed of processors Pa and Pj. By comparing its suffixes with the computed median m, the processor Pa separates its index pointers according to the internal partition (La, Ra). Analogously, the processor Pb separates its index pointers according to the internal partition (Lb, Rb)- Let |La |, |/Ja|, |L&|, and |ify| be the number of index pointers in each of these par­ titions. Without loss of generality, let min(|La|, |J?a|, \Lb\, l-Rftl) = \La\. Then, processor Pa can make himself homogeneous by sending all the index pointers in its partition La to processor Pb while retrieving (from processor Pb) \La\ index pointers of partition Rb. After this exchange, processor Pa is left with all its index pointers belonging to R (and thus, homogeneous) while processor Pb is left with a partition (Lb{jL a,R!h\ where R'b C Rb and = |i26| - \La\. The other cases are analogous. See Figure 1. Notice that instead of pairing the processors in an arbitrary fashion, we should try to pair processors Pa and Pb such that \La\ is as close as possible to |/2&|, therefore minimizing the amount to transfer and the number of redistrib­ ution steps on average (since it is more probable that both processors are left homogeneous or close to). An easy way to do this is to sort the processors by their |La| value and then pair the first and last processors, the second and the next-to-last, and so on. This needs not exchange of synchronization messages, because all processors have the necessary information to plan the same exchange sequence.

58

Fig. 1. Illustration of the exchange process. Processor Pa is made homogeneous since it owns the smaller partition. This partition is exchanged for a similarly sized portion of Pb. Once a processor receives a portion of another suffix array, it merges the new portion with the one it already had. This ensures that the suffixes are lexicographically sorted inside each processor all the time. This is of course true only up to pruning, since equal pruned suffixes are stored in any order. However, those pruned suffixes coming from the same processor are known to be originally in the correct order, and therefore this merging process does not modify the ordering between equal suffixes of the same processor. 4.3 Recursive Step This redistribution of index pointers splits the processors in two groups: those whose index pointers belong to L and those whose index pointers belong to Ft. The two groups of processors proceed independently and apply the algorithm recursively. The non-homogeneous processor could potentially slow down the process, since it has to act in two (parallel) groups. Although it does not affect the total complexity (since a processor belongs at most to two groups), it can affect the constants. To alleviate the problem, we can mark it so that in the next redistribution process it is made homogeneous in the first exchange iteration. It may take longer, but the processor is free for the rest of the iterations. The recursion ends whenever an L or R set of index pointers lies entirely in the local array of a processor. In this case, all that remains to be done is to sort L or R locally. 4.4 Final Local Sorting Throughout the process, the suffixes at each processor are sorted up to prun­ ing. Moreover, we guarantee that equal pruned suffixes coming from the same processor are correctly sorted already. We must, therefore, correctly sort all equal pruned suffixes coming from different processors. To decide those compar­ isons, more characters of the suffixes must be requested to the remote processors owning the suffixes. The number of such remote accesses depends on the text and on the size of the pruned suffixes. Refer to Section 6 for further details.

59

Therefore, this step proceeds as follows, for each processor: the suffix ar­ ray is sequentially traversed. Each time a sequence of equal pruned suffixes is found, they are put in r queues, one per originating processor. Inside each queue, the original order of the suffixes is respected. Then, the first heads of all queues are collected and arranged into a heap data structure (each comparison involves requesting remotely more suffix characters). Once the head of the heap is removed, it is replaced by the next element of the appropriate queue, until we sort all elements. With this ad-hoc heapsort we make only the necessary comparisons.

5 Analysis 5.1 WAT Case We consider the cost T(r) of our distributed algorithm described in Section 4. Since the size of the problem is reduced at each recursion step, the number of processors in the newly generated L and R groups decreases. We consider the cost of a recursive step with r processors initially. The final cost of the recursion is that of solving the subproblems it generates. Note also that there is an initial part outside the recursion, namely the initial local sorting. The initial cost of sorting locally the suffix arrays is 0(6 log 6) I, since it is done in parallel at each processor. Apart from this, the cost T(r) of our algorithm for r processors is as follows: 1. Costs for finding the pivot (costs are parallel for all processors i): (a) selecting the middle element m(t) is 0(1) I; (b) broadcasting the median m(i) is 0(r) C; (c) computation of the median m is 0(r) I; (d) searching m in the local suffix to determine L{i) and J2(i) is 0(log6) I; (e) broadcasting the sizes |L(i)| and |i?(i)| is 0(r) C. 2. Cost of redistributing index pointers in subproblems L and R is as follows. There are at most log r steps because at least half of the processors is made homogeneous at each redistribution step. Since at most 6 index pointers are exchanged in each step (because min(|La|, |/2a|, |£6|, \Rb\) < 6/2), the total cost is 0(6 log r)( I + C ) (we also count the factor • I because of the merging between the old and new pointers). 3. Cost for the recursive calls (processing of groups L and R) depends on the worst-case partition. Let r/, be the number of processors in group L and tr be the number of processors in R. We show that rf 4 < r^, < 3r/4 in the worst case: observe that the esti­ mated median m is larger than rf 2 local medians, each one in turn larger than 6/2 elements of the corresponding processor. Hence, m is larger than n/4 elements which implies that rL is larger than r/4. The proof for the upper bound is analogous.

60

Hence, there are at most log^g r levels in the recursion in the worst case. The number of processors in the larger partition is at most 3/4r (the smaller partition works in parallel and does not affect completion times). Therefore, T(3/4r) must be added to T(r). 4. Cost of sorting the index pointers locally. In the worst case the suffixes are all equal and the same number originated at each processor. In this case the heapsort is O(61ogr)( I + C ). Note that this r is the original one, independent of the recursion (we call it ro). Notice also that this worst case analysis does not improve significantly if instead of a long run of equal pruned suffixes there are many short runs (except when the runs are so short that logr becomes very pessimistic). The complexity of the total execution time is given by the recurrence T( 1) = 0(6 log 6) 1+0(6 log ro)( I + C ) = 0(6 log n) 1+0(6 log ro) C T(r) = 0 (r + 6 log r) I + 0 (r + 6 log r) C + T(3/4 r) which gives T(r) = 0 (r + 6 log r log n) I + 0 (r + 6log2 r) C where we can assume r < 6 to obtain T(r) = 0(6 logr log n) 1 + 0(6log2 r) C. The communication complexity is better than all previous work. This part of the complexity is the most important in practice (as the remote accesses cost much more than CPU operations). Hence, we concentrate in com­ munication costs. The exact constants for the main part of the cost are given by 61og2rlog 4/ 3 r. If we replace the estimated median algorithm by the one given in [12] that obtains the exact median, we have a cost of 0 (r log 6) instead of 0 (r + log 6) in Step (1). As a compensation, the partition is exact and therefore there are exactly r/2 processors on each side. Redoing the analysis for this case we get T(r) = 0 (r log 6 + 6 log r log n) I + 0 (r log 6 + 6 log2 r) C which is the same as before when we consider r < 6. However, the constants of the main part of the cost improve, namely the communication cost becomes 6 log2 r. We consider scalability now. If we double n and r, the new cost T(2n,2r) becomes T(2n, 2r) = T (n,r)x ( j r + t(l + log4/ 3 r+ lo g 2 n) ; r/ln ( 2) + 6(2 log2 r + 1) \ = 1 + o(1) y r + 61og4/3rlog2n r + 6 log2 r log4/3 r J which as long as r < 6 is

T{2n, 2r) = T(», r) (l + O 61

( I + C ))

(the ideal scalability condition is T(2n, 2r) = T(n, r)). While our algorithm does not scale ideally, it does scale much better than previous algorithms (whose scal­ ing factor is 2 in the WAT case). Further, as the number of processors increase, the additional computational time (given by the fraction 1/logr) drops consid­ erably. For instance, if the number of processors doubles from 256 to 512 the execution time goes up by a factor of 25% (instead of also doubling). 5.2 Average Case We show in this section that the algorithm works almost optimally in the average case. The most involved part of the proof is to show that, for large n, the estimated median is almost the exact median. We have proved it in [12] (i.e. the local median is off the global median by a factor of 0 (n -1/2)). The proof is obtained by considering only one processor, as a process where the global median is estimated by sampling 6 elements out of n. When the median of r so computed medians is used, the estimation is even better. Therefore, the distance between the real median and the middle of the local array is 0(y/bjr). Once we prove that, we have that each redistribution session exchanges al­ most all the data in a single step (since \La\ ~ |I»&| « i*. i \Rb\), being the remaining steps so small (in terms of communication amounts) that can be ignored. The first iteration exchanges 0(6) elements, and the rest exchange por­ tions of the array of size 0(y/b/r). Therefore, the cost of the O(logr) exchanges is 0(6 + y/b/r log2 r) = 0(6). It is also possible to perform the merges between old and new arrays at the same cost. Moreover, since the partition is almost perfect, \L\ « \R\} and the next subproblems are almost half the size of the original one, the logarithm previously in base 4/3 is now base 2 and the network is used all the time. To see this, observe that instead of adding T(3/4 r) to T(r), we add T((6/2 + y/b/r) r/b) = T(rf2 + y/r/b) = T(r/2 + o(l)), which makes the final cost 61og2/(1+0(1)) r = 61og2r (1 + o(l)>. Therefore, the average time cost of our algorithm is T(r) = 0 (r + 6 log n) I -f- 0 (r + 6 logr) C = 0(6 logn) I + 0(6 log r) C (the simplification being valid for r < 6). The scalability factor for communica­ tion becomes 1 + (r -I- 6)/(r + 61og2r), i.e. of the same order but about a half of that of the WAT case, while for CPU costs it is 1 -f- 0 (1 /log n). Despite this improvement, the algorithm [12] has better average complexity. The non-homogeneous processor does not add too much to the cost, since it has « 6/2 elements in each partition, and hence exchanges « 6/4 on each group. This takes the same as exchanging 6/2, which is the normal case in the other processors. The final sorting can be cheaper on average than 0(6 log r). However, this analysis is much more difficult and highly dependent on the nature of the text and the length of the pruned suffixes. We can make it cheaper by using longer pruned suffixes (and pay more communication cost) or vice versa. Moreover, the big-0 analysis does not change because there are already other 0(6 log r) costs involved. We leave this point for the experiments that follow.

62

6 Experim ental Results Although we have not implemented yet the parallel algorithm here presented, we performed a preliminary analysis of its behavior taking into account some real texts (Wall Street Journal extracts from the TIPSTER collection [6]). We study the critical aspects of the behavior of the algorithm. 6.1 P runed Suffixes One phenomenon of interest is the effect of pruned suffixes in the algorithm. Suffixes are pruned at £ characters in order to reduce interprocessor communica­ tion of processes asking remote suffixes for local comparison. Pruning influences the whole algorithm because, after the first step of recursion, pointers may point to remote suffixes. We evaluate here the implications of pruning on interprocess communication. We begin by considering the last local sorting. We implemented an external merge of r queues using a heap. We obtain the fraction of comparisons that generated remote accesses for more characters of the suffixes (these correspond indeed to a tie between pruned suffixes). In Table 1 we present average and standard deviation (stdev) for different block sizes and t values, considering an input file of 10 megabytes. In turn, each tie implies two to four remote accesses (depending on just one or both are remote pruned suffixes). This is because there is a request and an answer for each suffix retrieved. However, suffixes already brought from a remote processor can be buffered locally in order to solve eventual posterior ties, with no need to ask them again remotely. We also counted the number of messages really exchanged among processors, if this local buffering is used. Let ties be the total number of ties occurring on a given processor. In the same table we present, in the sixth column, the fraction of the messages exchanged when compared with the worst case (that is, 4* ties). This gives a measure of the effectiveness of the local buffering scheme. We present also the maximum number of messages sent in each case (i.e., the number of messages of the most communicant processor) normalized in per­ centage to the number of total suffixes on the corresponding processor. Since all processors work in parallel, this is related to the global completion time for Step 4 of the algorithm. Finally, we estimate the time in seconds to transfer this maximum number using the model of [11] for smaller messages (see Section 6.3: a = 47 and r = 0.0254) and considering a message of 8 bytes to request (suffix pointer plus processor address) and 54 to answer (processor address plus 50 bytes of the suffix). The results show that a pruned suffix of 30 characters is already a good trade­ off (< 5% remote requests in almost all cases). We observe that the variation between the percentage of ties among processors is rather high. As a matter of fact, the larger the pruned suffix, the larger the variation of the percentage of

63

T 10 Mb 10 Mb 10 Mb 10 Mb 10 Mb 10 Mb 10 Mb 10 Mb

P

t

8 10 8 • 20 8 30 8 40 16 10 16 20 16 30 16 40

% ties 27.84% 7.61% 4.10% 2.54% 24.74& 6.55% 3.67% 2.22%

messages / 4 * ties 20.53%) 9.25% 29.209& 28.00% 26.50% 35.23% 37.14?^ 28.61% 19.78% 13.13% 20.76% 51.49% 62.17% 20.23% 22.47% 76.54% stdev

stdev 5.319& 15.93% 15.89% 15.10% 16.30% 46.92% 51.16% 43.79%

max mess. / # suffix 2.50% 0.88% 0.44% 0.29% 2.97% 1.29% 0.66% 0.49%

estimated time (s) 8.28 2.90 1.46 0.94 4.91 2.12 1.09 0.80

Table 1. Amount of exchanged messages due to pruning (stdev is a percentage over the average). T is the text size and P the number of processors. ties. For example, for £ equal to 10 and 8 processors, we obtained a standard de­ viation of 9.25% over the average. For £ equal to 40, this percentage increases to 37.14%. This means that larger pruned suffixes imply few ties (and few remote accesses), but more text is stocked locally and text characteristics (distribution of words and phrase composition) start to influence the occurrence of identical suffixes. For example, “Wall Street Journal” (19 characters) occur frequently in the text database we use. The processor containing suffixes starting with "W" may ask more remote suffixes than other processors. Another interesting point is compression. To reduce communication overhead when exchanging suffix arrays, we use a compression scheme based on similarity of pruned suffixes. Since the processor that sends a slice will send all the pruned suffixes in ascending order, most suffixes will share a common prefix with their neighbors. This can be used to reduce the amount of communication. This technique has been previously applied to compress suffix array indices [3], and works as follows: the first pruned suffix is sent complete. The next ones are coded in two parts: the length of the prefix shared with the previous pruned suffix; and the remaining characters. For example, to send "core", "court" and "custom", we sent "core", (2 ,"u rt") and (l,"ustom "). Compression rates (i.e. compressed size divided by uncompressed size) aver­ ages and standard deviation are presented in Table 2. With an £ of 30 characters, a 25% of reduction is achieved. As expected for lower £, compression may reduce the size of the pruned suffixes to almost the half of the size. However, as pre­ sented in Table 1, a small £ implies more communication during the local sort of pointers. We also verify that compression rates are also sensible to the text size. The larger this size, the better the compression, due to the higher degree of similarity between contiguous suffixes in the sorted array. Note that we measure compression rates in the first exchange. This should improve in further steps of the recursion, since the suffixes become more and more sorted and therefore longer prefixes are shared among contiguous suffixes.

64

text size # P^c.

10 Mb 10 Mb 10 Mb 10 Mb 10 Mb 10 Mb 10 Mb 10 Mb

8 8 8 8 16 16 16 16

I 10 20 30 40 10 20 30 40

stdev mean compression rate compression rate

56.06% 65.99% 73.46% 78.19% 59.17% 69.05% 75.90% 80.23%

4.26% 4.41% 3.93% 3.40% 6.90% 6.54% 5.85% 5.11%

Table 2. Percentage of compression (average and percentage of stdev over the average). 6.2 Estim ated M edians We have generated the suffix arrays and the corresponding file of sorted suffixes for two extracts of the Wall Street Journal [6]. These two extracts have 32 and 100 megabytes. We partitioned these files in 8 and 16 blocks, and used t = 30. Then, we obtained the medians of the blocks (m(*)) and computed m, the median of the medians. Next we compared: — m with the real median of the whole extract (called D\)\ — m with each local m(i) (called Dz(i)). We present the distance (in percentage) from the real median. If it is the exact median, the percentage is 0%. If it corresponds to the first or last element of the local suffix array, the deviation is of 100%. The results for the maximum deviations are presented in Table 3. 100Mb-8P 100Mb-16P 32Mb-4P 32Mb-8P 32Mb-16P 0.06% 0.16% 0.42% 0.35% 0.11% Di 1.51% 0.75% 0.29% 0.98% max (D 2(*)) 1.49% Table 3. Deviation among real and estimated medians. Text sizes of 32 and 100 megabytes and number of processors of 4, 8, and 16. According to the numbers presented in Table 3, the text presents a charac­ teristic of auto-similarity, that is, the text blocks on each processor have similar medians, which are in turn similar to the exact global median. Approximate medians (those considering the median of medians, that is, m) are taken on pruned suffixes. Therefore, even using pruned suffixes with a reasonable £, we obtain good approximations (< 2%).

65

We did not go on with the partitioning process, but we also estimated what would happen in the last steps of the recursion process. In these last levels, the compared suffixes are much more similar and the median approximation is based on few samples (but on a smaller text space). For this approximation, we took the global sorted suffix file called GS. We divided GS in 8 and 16 blocks (GSi, where 1 < i < 8 or 1 < i < 16) and took the two first blocks GSi and GS2 (for example, comprising suffixes starting with "A" until "C"). Next we took each suffix of these two initial blocks and chose randomly a processor to hold it (keeping the lexicographical order - since GS has sorted suffixes). Finally, we took the median on each processor and compared with the real median (the last suffix of GSi or the first of G52). Results are presented in the Table 4 for a 100 megabytes file. file size 100 Mb 100 Mb block size fraction 1/8 1/16 distance block 1 0.06% 0.10% distance block 2 0.07% 0.10% Table 4. Deviation among real and estimated medians in part of the last step of the recursion. Simulation is used. The estimated medians on pruned suffixes are very close to the real median. This shows that the estimation is even better in the last steps of the recursion, even considering the effects of pruning. It is important to remark that in both cases (Tables 3 and 4) the approximations are very good for i —30 and different number of processors. This is expected considering that the number of samples is proportionally the same when compared to the size of the text being sam­ pled: e.g., for 16 processors, we sample 16 medians for the whole text. With 2 processors in the last step, we sample 2 medians, but from a text 8 times smaller. 6.3 P artitio n Exchange We know that if the partitions are exact (i.e., m is always identical to the real median of the (sub)set), the partition (L or R) exchanges are performed in one step and without interference among pairs (using a no contention switch). In general, communication in parallel machines can be modeled by a linear equa­ tion [11]: t-com — O C -\- T S p

where 32 kilobytes).

67

proves that the algorithm has the best communication complexity and scaling factor in the WAT case. A comparative table follows. Algorithm Complexity WAT Average Mergesort n ( I + C) n (I + C) [9] Generalized blog n I 6log n I Quicksort [12] + nC + 6C Quicksort blog r log n I blog n I (present work) + blog2 r C + 6logr C

Scaling Factor (1 + ...) WAT Average I+ c I+ c 1/logn I 1/logn I +C 1/logr ( I + C ) 1/log n 1+ 1/ log r C

We are currently working on the implementation of the quicksort based algo­ rithm in order to have real experimental times instead of simulations. We also plan to repeat the experiments with larger texts for the final version.

References 1. T. Anderson, D. Culler, and D. Patterson. A case for NOW (Network of Worksta­ tions). IEEE Micro, 15(l):54-64, February 1995. 2. A. Apostolico, C. Iliopoulos, G. Landau, B. Schieber, and U. Vishkin. Parallel construction of a suffix tree with applications. Algorithmica, 3:347-365, 1988. 3. E. Barbosa and N. Ziviani. From partial to full inverted lists for text searching. In R. Baeza-Yates and U. Manber, editors, Proc. of the Second South American Workshop on String Processing (WSP’95), pages 1-10, April 1995. 4. G. Gonnet. PAT 3.1: An Efficient Text Searching System - User’s Manual. Centre of the New Oxford English Dictionary, University of Waterloo, Canada, 1987. 5. G. Gonnet, R. A. Baeza-Yates, and T. Snider. New indices for text: PAT trees and PAT arrays. In Information Retrieval - Data Structures & Algorithms, pages 66-82. Prentice-Hall, 1992. 6. D. Harman. Overview of the third text retrieval conference. In Proceedings of the Third Text Retrieval Conference - TREC-3, pages 1-19. National Institute of Standards and Technology. NIST Special Publication 500-225, Gaithersburg, Maryland, 1995. 7. J. Jaja. An Introduction to Parallel Algorithms. Addison-Wesley, 1992. 8. J. Jaja, K. W. Ryu, and U. Vishkin. Sorting strings and constructing digital search trees in parallel. Theoretical Computer Science, 154(2):225-245, 1996. 9. J. P. Kitajima, B. Ribeiro, and N. Ziviani. Network and memory analysis in distributed parallel generation of PAT arrays. In 14th Brazilian Symposium on Computer Architecture, pages 192-202, Recife, August 1996. 10. U. Manber and G. Myers. Suffix arrays: A new method for on-line string searches. SIAM Journal on Computing, 22, 1993.

68

11. J. Miguel, A. Arruabarrena, R. Beivide, and J. A. Gregorio. Assessing the perfor­ mance of the new IBM SP2 communication subsystem. IEEE Parallel & Distrib­ uted Technology, 4(4):12-22, Winter 1996. 12. G. Navarro, J. P. Kitajima, B. Ribeiro, and N. Ziviani. Distributed generation of suffix arrays. In A. Apostolico and J. Hein, editors, Proc. of the Eighth Symposium on Combinatorial Pattern Matching (CPM97), Springer-Verlag Lecture Notes in Computer Science v. 1264, pages 102-115, Arhus, Denmark, June 1997. 13. M. J. Quinn. Parallel Computing: Theory and Practice. McGraw-Hill, second edition, 1994.

This article was processed using the I^I^jX2e macro package with CUP-CS class

69

Transposition distance between a permutation and its reverse Joao Meidanis15 Maria Emilia M. T. Walter213 Zanoni Dias14 1 University of Campinas, Institute of Computing, Brazil. 2 University of Brasilia, Computer Science Department, Brazil. 3 Partially supported by Brazilian agency CAPES. 4 Supported by Brazilian agency CNPq. 5 Partially supported by Brazilian agencies PAPESP and CNPq. Abstract. In this note we solve an open question posed by Bafna and Pevzner [1], regarding chromosome distance with respect to transposi­ tions: we show that the distance between a permutation and its reverse (without complementation) is |n/2j +1, where n is the size of the per­ mutations. We also present an algorithm to compute an optimal series of transpositions.

1 Introduction

The huge amount of data resulting from genome sequencing in Molecular Biology is giving rise to an increasing interest in the development of algorithms for comparing genomes of related species. In particular, these data prompted research on mutational events acting on large portions of the chromosomes. Such events can be used to compare genomes for which the traditional methods of comparing DNA sequences are not conclusive. The field originated by the study of large mutations on chromosomes is known as genome rearrangements.

There are several mutational events affecting large fragments of genomes of organisms, including duplication, reversal, transposition (acting on a single chromosome), translocation, fusion, and fission (involving more than one chromosome). Each such event or combination of events gives rise to a theoretical problem of finding, given two genomes, the shortest series of events that transforms one genome into the other. We seek the shortest series because it has the largest likelihood of occurrence under a general principle of parsimony. Notice that in general more than one shortest series exists. The length of the shortest series is called the distance between the two genomes.

Chromosomes are usually represented as permutations of integers in a given range, each integer representing a gene. Sometimes the integers are signed to indicate the orientation of the gene. However, when the gene orientations are unknown or not relevant (as in the case of transpositions), the integers are unsigned.

In the last few years we have witnessed formidable advances in our understanding of genome rearrangements. A partial list of known results follows. With respect to the reversal event, Hannenhalli and Pevzner [6] presented the first


polynomial time algorithm to find the distance, later improved in running time by Berman and Hannenhalli [2], and Kaplan, Shamir, and Tarjan [7]. These results concern signed permutations. For the unsigned case, also involving reversals, Caprara [3] showed that finding the distance is NP-hard. Hannenhalli and Pevzner [5] studied a multichromosomal distance problem for signed genomes involving reversals, fusion, fission, and a specific form of translocation, producing a polynomial time algorithm in this case as well. Bafna and Pevzner [1] analyzed the problem with respect to transpositions, presenting several approximation algorithms, and leaving a number of open questions, among them the complexity of the problem and the diameter (largest possible distance between two permutations of size n). Gu, Peng, and Sudborough [4] gave approximation algorithms for the combination of events of reversal and transposition.

In this note we solve an open question posed by Bafna and Pevzner [1], regarding chromosome distance with respect to transpositions: we show that the distance between a permutation and its reverse (without complementation) is ⌊n/2⌋ + 1, where n is the size of the permutations. Besides, we present an algorithm to compute an optimal series of transpositions.

2 Definitions

Chromosomes are represented by permutations of integers in the range 1..n, where n is the number of genes of interest in the chromosome. For instance, (3 4 2 6 1 5) represents a chromosome with six genes.

A transposition is an operation that transforms a permutation into another one, "cutting" a certain portion of the permutation and "pasting" it elsewhere in the same permutation. A transposition ρ(i, j, k) is defined by three integers i, j, and k such that 1 ≤ i < j ≤ n + 1, 1 ≤ k ≤ n + 1, and k ∉ [i, j], in the following way. It "cuts" the portion between positions i and j − 1, including the extremes, and "pastes" it just before position k. Thus, we can write

ρ(i, j, k) · (π_1 π_2 ... π_{i-1} π_i ... π_{j-1} π_j ... π_{k-1} π_k ... π_n) = (π_1 π_2 ... π_{i-1} π_j ... π_{k-1} π_i ... π_{j-1} π_k ... π_n),

if i < j < k, and

ρ(i, j, k) · (π_1 π_2 ... π_{k-1} π_k ... π_{i-1} π_i ... π_{j-1} π_j ... π_n) = (π_1 π_2 ... π_{k-1} π_i ... π_{j-1} π_k ... π_{i-1} π_j ... π_n),

if k < i < j. Notice that ρ(i, j, k) = ρ(j, k, i) when i < j < k.

Given two permutations π and σ, the problem is to find a shortest series of transpositions that transforms π into σ; the length of such a series is called the transposition distance between π and σ.
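The cut-and-paste behaviour of ρ(i, j, k) is easy to mirror in code. The sketch below is only an illustration of the definition above (it is not from the paper); it uses 1-based positions, as in the text, and assumes the validity conditions on i, j and k have already been checked.

def transposition(perm, i, j, k):
    """Apply rho(i, j, k) to a permutation given as a Python list.

    Positions are 1-based, as in the definition: the block perm[i..j-1] is
    cut out and pasted immediately before the element at position k.
    Assumes 1 <= i < j <= n+1, 1 <= k <= n+1 and k not in [i, j].
    """
    block = perm[i - 1:j - 1]           # the segment to be moved
    rest = perm[:i - 1] + perm[j - 1:]  # the permutation with the block removed
    # Position k refers to the original permutation; when k lies to the right
    # of the removed block, it shifts left by the block's length.
    insert_at = (k - 1) if k < i else (k - 1) - len(block)
    return rest[:insert_at] + block + rest[insert_at:]

# Example with the chromosome (3 4 2 6 1 5) used in the text:
print(transposition([3, 4, 2, 6, 1, 5], 2, 4, 6))   # -> [3, 6, 1, 4, 2, 5]

For small n, one can check the ⌊n/2⌋ + 1 value for the reverse permutation by a brute-force breadth-first search over the permutations reachable with such moves, although this quickly becomes infeasible as n grows.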


Given an alphabet Σ = {a1, ..., an} with corresponding positive weights w1, ..., wn, the problem is to find a prefix code for Σ that minimizes the weighted length of a code string, defined to be Σ_{i=1}^{n} wi li, where li is the length of the codeword assigned to ai. When a maximum codeword length L ≥ ⌈log n⌉ is imposed, we must minimize Σ_{i=1}^{n} wi li constrained to li ≤ L for i = 1, ..., n. We also assume the weights w1, ..., wn are sorted, with w1 ≤ ... ≤ wn. If L is large enough, this problem can be solved with O(n) complexity for time and space by one of the efficient implementations of Huffman's Algorithm [Huf52, Lee76]. Katajainen and Moffat presented an implementation of this algorithm that requires O(1) space and the same O(n) time [KM95].

Gilbert [Gil71] recommends formulating the length-restricted problem when the weights wi are inaccurately known. Choueka, Klein and Perl [Cho85] suggest the use of length-restricted codes to reduce the external path length Σ_{i=1}^{n} li. The objective is to allow space-efficient decoding of optimal prefix codes without bit manipulation. Zobel and Moffat [Zob95] describe the use of word-based Huffman codes for compression of large textual databases. The application allows a maximum of 32 bits for each codeword. For the cases that exceed this limitation, it is recommended to use codes with a length restriction.

Some methods can be found in the literature to solve the length-restricted problem [HuT72, Voo74, Gar74]. Larmore and Hirschberg [Lar90] present the Package-Merge algorithm. This algorithm constructs optimal restricted codes in O(nL) time and requires O(n) space. Another implementation of Package-Merge was presented by Katajainen, Moffat and Turpin [KMT95]. It requires the same O(nL) for time and O(L²) for space. This implementation is much more space-economical in most practical cases, where L is close to ⌈log n⌉. Fraenkel and Klein [Fra93] show a heuristic that produces suboptimal codes in O(n log n) time, with O(n) space requirement. Milidiu and Laber [Mil97, Lab97] present the WARM-UP algorithm. This algorithm produces suboptimal length-restricted codes spending O(n log n + n log wn) time in the worst case, where wn is the highest presented weight.

In this paper, we propose a practical and simple implementation of the WARM-UP algorithm. This implementation has O(n log n + n log wn) worst-case time complexity, and requires only O(1) additional space. We also report some experimental results that confirm its efficiency.

This paper is organized as follows. In section 2, we describe the WARM-UP algorithm and its main aspects. In section 3, we detail the new implementation. In section 4, we report the results provided by some experiments performed with the WARM-UP algorithm, the ROT heuristics and the LazyPM algorithm. These experiments apply the three algorithms to compress the 14 files of the Calgary Corpus collection. The measured values on the experiments were the elapsed CPU time, the required additional space and the compression obtained by each method. Finally, in section 5 we summarize our conclusions.

¹ In a full binary tree each internal node has exactly two sons.

2 WARM-UP ALGORITHM

The WARM-UP algorithm proposed by Milidiu and Laber [Mil97, Lab97] introduces a novel approach to the construction of length-restricted prefix codes. It is based on some fundamental properties of Lagrangean relaxation, a well known technique of widespread use in the solution of combinatorial optimization problems.

Next, in order to describe the WARM-UP algorithm, we introduce the necessary definitions and notation.

2.1 Definitions and Notation

Let W denote a set of positive integer weights {w1, ..., wn}. For a given real value x, let us define the associated set of weights Wx by

Wx = {max{w1, x}, ..., max{wn, x}}.


We use Tx to denote any Huffman tree for Wx, and h(T) to denote the height of a tree T. We also use Tx⁻ (Tx⁺) to denote a minimum (maximum) height Huffman tree [Sch64] for Wx. If the original weight of a given leaf of T is smaller than x, then we say that this leaf is warmed to the value of x in a tree Tx.
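Since any Tx is just a Huffman tree built over the warmed weights Wx, its height can be computed with any standard Huffman construction. The following sketch (not from the paper, and not the space-economical routine used later in section 3) is a plain heap-based stand-in for experimenting with these definitions; ties are broken arbitrarily, so it returns the height of some Huffman tree for Wx rather than specifically Tx⁻ or Tx⁺.

import heapq

def warm(weights, x):
    """Return the warmed weight set Wx = {max(w, x) : w in W}."""
    return [max(w, x) for w in weights]

def huffman_height(weights):
    """Height of a Huffman tree for the given weights (arbitrary tie-breaking)."""
    # Each heap entry carries (total weight, height of the corresponding subtree).
    heap = [(w, 0) for w in weights]
    heapq.heapify(heap)
    while len(heap) > 1:
        w1, h1 = heapq.heappop(heap)
        w2, h2 = heapq.heappop(heap)
        heapq.heappush(heap, (w1 + w2, max(h1, h2) + 1))
    return heap[0][1]

W = [1, 1, 2, 3, 5, 8, 13]
print(huffman_height(W))             # 6, as in figure 1.(a)
print(huffman_height(warm(W, 1.5)))  # 4, as in figure 1.(b)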

Fig. 1. (a) A Huffman tree for the set of weights W = {1,1,2,3,5,8,13}. (b) A tree Tx for the set of weights W^{1.5}. (c) An optimal tree for W = {1,1,2,3,5,8,13} with height restricted to 4.

Figure 1.(a) shows a Huffman tree for W = {1,1,2,3,5,8,13}, whereas figure 1.(b) illustrates a T_{1.5} tree for the corresponding set of weights W^{1.5} = {1.5, 1.5, 2, 3, 5, 8, 13}. Observe that the tree in figure 1.(a) has height equal to 6, and the one in figure 1.(b) has height equal to 4.

2.2 Description

The basic approach of the WARM-UP algorithm is to look for a value of x such that Tx has height L. To find an adequate x, a binary search is performed in the range from w1 to wn. For each selected value of x, the Huffman trees Tx⁻ and Tx⁺


are constructed. If either h(Tx⁻) = L or h(Tx⁺) = L then the algorithm halts. If that does not happen, then one of the following three actions is taken:

1. if h(Tx⁻) > L then increase the value of x and continue with the binary search;
2. if h(Tx⁺) < L then decrease the value of x and continue with the binary search;
3. if h(Tx⁺) > L and h(Tx⁻) < L then call the procedure Ties(x).

The procedure Ties(x) [Mil97, Lab97] called in item 3 builds a Huffman tree with height L for the set Wx. Whenever h(Tx⁺) > L and h(Tx⁻) < L, the procedure Ties obtains a Huffman tree with height L by performing only O(n log n) time effort. The binary search ends either when one finds a tree with height L, or when the search interval is smaller than 1/n². If the algorithm stops due to the second condition, the algorithm finds the unique rational number p/q, with q < n, in the last search interval. After that, the algorithm builds a Tx tree, with x = p/q. It is proved in [Mil97, Lab97] that such a tree has height equal to L.

2.3 Approach

The correctness of the binary search approach relies on the following inequality proved in [Mil97, Lab97]. If x < y, then

h(Ty⁺) ≤ h(Tx⁻).    (1)

This inequality implies that whenever we increase x then the tree height corresponding to this new value of x cannot be greater than the one corresponding to the old value of x. Similarly, if we reduce x then we get the opposite situation.

The second stopping condition of the WARM-UP algorithm relies on the proven existence of a rational number p/q, with q < n, such that h(T_{p/q}) = L. Since on an interval of length no greater than 1/n² there exists exactly one such rational number, the WARM-UP algorithm searches for this number.

2.4 Approximation and Complexity

Milidiu and Laber [Mil97, Lab97] considered the difference between the average length of the code obtained through Tx, by replacing the leaves with weight x by their original weights (see figure 1.(c)), and the average length of the optimal prefix code with restricted maximal length L. They proved that these two averages differ by at most 1 if ⌈log n⌉ + 1 ≤ L ≤ ⌈log n⌉ + 5, and by at most 4/(ψ^{L−⌈log n⌉−2} − 4) if L > ⌈log n⌉ + 5, where ψ is the golden ratio 1.618. They also proved that if all the warmed leaves are at level L in Tx, then these two averages are equal. By checking the two leaves with weight 1.5 in figure 1.(b) we


observe that they both have height 4. Therefore, one can conclude that the tree in figure 1.(c) is optimal for L = 4 and W = {1,1,2,3,5,8,13}.

The worst case time complexity of the WARM-UP algorithm is given by O(n log n + n log wn). The n log n term is due to the time effort necessary to do the ordering of the weights. The n log wn term is due to the maximum of log(n² wn) values of x to be tested, times the O(n) effort to build the two trees to be tested.
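To get a feel for these two terms, take some illustrative figures (not from the paper): n = 10^5 weights with largest weight wn = 10^6. The binary search then tests at most

log(n² wn) = 2 log 10^5 + log 10^6 ≈ 33.2 + 19.9 ≈ 53

values of x (logarithms to base 2), and each test builds two Huffman trees in O(n) time, so the search costs only a few million elementary operations beyond the initial O(n log n) sorting of the weights.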

3 WARM-UP Implementation

In this section we describe a new implementation of the WARM-UP algorithm. This implementation introduces some refinements and practical modifications in the original algorithm. The pseudo-code for this implementation is presented in figure 2. The implementation is divided into two phases. In the first one, we check if the height of the minimum height Huffman tree for the given set of weights is smaller than or equal to L. If that happens, the algorithm returns this tree. In the second phase, the algorithm initializes the values of inf and sup and calls the procedure GetFirstGuess to assign an initial value to x. Next, it executes a binary search to look for the smallest integer x such that h(Tx) ≤ L. The algorithm stops either if it finds such a value or if all the warmed leaves are at level L in Tx. In the second case, the WARM-UP algorithm produces the optimal restricted code for W. In the following subsections we consider the main aspects of this implementation.

3.1 Binary Search

We restrict the search of the value x to the set of integer numbers between w1 and wn. We have chosen to check just the integers because we believe that such a restriction would not deteriorate the compression ratio. This assumption was confirmed by the experimental results reported in section 4. Moreover, the utilization of integers allows for a small additional space usage and a faster execution time.

In order to construct the Huffman trees Tx⁻ and Tx⁺ and determine their heights, we use a sophisticated implementation of Huffman's algorithm proposed by Katajainen and Moffat [KM95]. This implementation requires only O(1) additional space to calculate a vector of optimal codeword lengths for an input vector of weights. In this case, the total memory required is the input and output buffer size plus a few words. The original implementation was slightly modified to allow the construction of both minimum and maximum height Huffman trees.

Since the value of x was restricted to the set of integers, in some cases the new implementation does not find a value of x such that h(Tx) = L. In those cases, the tree Tx with the smallest value of x such that h(Tx) < L is returned. For example, if W = {1,2,2,8,8,16,24,40} and L = 6, then it is not difficult to show that just for x = 8/3 we have h(Tx) = 6. In the experiments of section 4 such a situation was not observed.
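As an illustration of this phase (a toy sketch, not the authors' implementation, which relies on the in-place Katajainen-Moffat routine): the function below bisects on integer values of x and returns the smallest integer whose warmed Huffman tree respects the height restriction. It assumes the warm and huffman_height helpers from the sketch in section 2.1 and, since those break ties arbitrarily, it does not distinguish Tx⁻ from Tx⁺; it is only the skeleton of the real procedure.

def smallest_feasible_x(weights, L):
    """Smallest integer x in [w1, wn] with huffman_height(warm(weights, x)) <= L.

    weights must be sorted in non-decreasing order, and L >= ceil(log2 n) is
    assumed, so that warming every weight up to wn (a balanced tree) is feasible.
    """
    lo, hi = weights[0], weights[-1]
    if huffman_height(weights) <= L:      # phase 0: the plain Huffman tree already fits
        return lo
    while lo < hi:
        mid = (lo + hi) // 2
        if huffman_height(warm(weights, mid)) <= L:
            hi = mid                      # feasible: try a smaller warm-up value
        else:
            lo = mid + 1                  # infeasible: x has to grow
    return lo

# With W = {1,1,2,3,5,8,13} and L = 4 this returns x = 2.
print(smallest_feasible_x([1, 1, 2, 3, 5, 8, 13], 4))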


3.2 Stopping Condition

For a given set of weights, different values of x often generate Tx trees with the same height L. The original WARM-UP algorithm stops whenever it finds the first value x such that h(Tx) = L. The new implementation looks for the smallest integer x such that h(Tx) ≤ L. It also stops if a tree Tx is assured to be optimal. Therefore, the binary search goes on until one of the following conditions occurs:

1. Tx⁺ is optimal or Tx⁻ is optimal;
2. h(Tx⁺) > L and h(Tx⁻) < L;
3. h(Tx⁺) > L and h(Tx⁻) = L;
4. the size of the search interval is equal to 1.

"Phase 0: Trying a Huffman Tree"
    If h(T_{w1}⁻) ≤ L Then Return(T_{w1}⁻)
"Phase 1: Binary search"
    inf ← w1
    sup ← wn
    x ← GetFirstGuess(L, w1, ..., wn)
    Repeat
        h⁻ ← h(Tx⁻)
        If Tx⁻ Is Optimal Then Return(Tx⁻)
        If h⁻ > L Then
            If (sup − x) = 1 Then Return(T_sup)
            inf ← x
        Else
            h⁺ ← h(Tx⁺)
            If h⁺ > L and h⁻ < L Then Return(Ties(x))
            If h⁺ > L and h⁻ = L Then Return(Tx⁻)
            If (x − inf) = 1 Then Return(Tx)
            sup ← x
        End If
        x ← (inf + sup)/2
    End Repeat

Fig. 2. The pseudo-code for the new WARM-UP implementation.

The condition of item 1 is verified by checking if Tx⁺ or Tx⁻ have all the warmed leaves at level L. Under the conditions of either item 2 or item 3, the inequality (1) assures that by reducing the value of x we get trees with height greater than L. Hence,


the binary search must stop. The fourth condition forces the search to stop because only integer values are allowed.

The search for the smallest value of x may increase the number of values that are examined. Hence, more Huffman trees may be constructed. Nevertheless, we have chosen this implementation because it often produces a tree that generates a significantly better code. In addition, our experiments show that in most cases the algorithm constructs an optimal tree for the first value assigned to x.

3.3 An initial value for x

In order to improve the time spent by the algorithm we include a new refinement: instead of using (w1 + wn)/2 as the first value tried for x, we have implemented a heuristic to determine this initial value. This heuristic is implemented by the procedure GetFirstGuess.

This heuristic is based on the following fact: if we relax the integrality constraint on the codeword lengths, then the i-th codeword length of an optimal code would be given by −log pi, where pi is defined by pi = wi / Σ_{j=1}^{n} wj. Based on this fact, this heuristic looks for a value x such that

−log( x / (kx + Σ_{j=k+1}^{n} wj) ) = L,    (2)

where k is the number of weights in the set W that are smaller than x. The procedure GetFirstGuess obtains the initial value for x through a single scan on the vector of weights. For each position i in the vector, the procedure executes two steps. First, it assigns to x the value of the expression Σ_{j=
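The text breaks off here, so the exact expression is not recoverable from this copy. Purely as an illustration of one natural reading of the heuristic (solving equation (2) for x when exactly k weights are warmed, which gives x = (Σ_{j=k+1}^{n} wj) / (2^L − k)), the following sketch shows how such a first guess could be computed in a single scan. The consistency test in the second step is a guess and is not taken from the paper.

def get_first_guess(L, weights):
    """Hypothetical reconstruction of a first guess for x (not the paper's code).

    weights must be sorted in non-decreasing order. For each candidate k we
    assume the k smallest weights are warmed to x and solve equation (2),
        -log2( x / (k*x + S_k) ) = L  with  S_k = sum(weights[k:]),
    giving x = S_k / (2**L - k). The candidate is kept when it is consistent,
    i.e. exactly k weights are smaller than x.
    """
    suffix = sum(weights)                 # S_0: no weight warmed yet
    for k in range(len(weights)):
        x = suffix / (2 ** L - k) if 2 ** L > k else float('inf')
        lower = weights[k - 1] if k > 0 else 0
        upper = weights[k]                # smallest weight that stays unwarmed
        if lower < x <= upper:            # exactly k weights are smaller than x
            return x
        suffix -= weights[k]              # S_{k+1} for the next candidate
    return float(weights[-1])             # fall back to the largest weight

print(get_first_guess(4, [1, 1, 2, 3, 5, 8, 13]))   # about 2.23 for this toy case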