Genetic mapping, physical mapping and DNA sequencing are the three key components of the human and other genome projects
The IMA Volumes in Mathematics and its Applications Volume 81 Series Editors Avner Friedman Robert Gulliver
Springer Science+Business Media, LLC
Institute for Mathematics and its Applications (IMA)

The Institute for Mathematics and its Applications was established by a grant from the National Science Foundation to the University of Minnesota in 1982. The IMA seeks to encourage the development and study of fresh mathematical concepts and questions of concern to the other sciences by bringing together mathematicians and scientists from diverse fields in an atmosphere that will stimulate discussion and collaboration. The IMA Volumes are intended to involve the broader scientific community in this process.

Avner Friedman, Director
Robert Gulliver, Associate Director
IMA ANNUAL PROGRAMS
1982-1983  Statistical and Continuum Approaches to Phase Transition
1983-1984  Mathematical Models for the Economics of Decentralized Resource Allocation
1984-1985  Continuum Physics and Partial Differential Equations
1985-1986  Stochastic Differential Equations and Their Applications
1986-1987  Scientific Computation
1987-1988  Applied Combinatorics
1988-1989  Nonlinear Waves
1989-1990  Dynamical Systems and Their Applications
1990-1991  Phase Transitions and Free Boundaries
1991-1992  Applied Linear Algebra
1992-1993  Control Theory and its Applications
1993-1994  Emerging Applications of Probability
1994-1995  Waves and Scattering
1995-1996  Mathematical Methods in Material Science
1996-1997  High Performance Computing
1997-1998  Emerging Applications of Dynamical Systems
Continued at the back
Terry Speed Michael S. Waterman Editors
Genetic Mapping and DNA Sequencing With 35 Illustrations
Springer
Terry Speed Department of Statistics University of California at Berkeley Evans Hall 367 Berkeley, CA 94720-3860 USA
Michael S. Waterman Department of Mathematics and Molecular Biology University of Southern California 1042 W. 36th Place, DRB 155 Los Angeles, CA 90089-1113 USA
Series Editors: Avner Friedman Robert Gulliver Institute for Mathematics and its Applications University of Minnesota Minneapolis, MN 55455 USA
Mathematics Subject Classifications (1991): 62F05, 62F10, 62F12, 62F99, 62H05, 62H99, 62K99, 62M99, 62P10 Library of Congress Cataloging-in-Publication Data Genetic mapping and DNA sequencing/[edited by] Terry Speed, Michael S. Waterman. p. cm. - (IMA volumes in mathematics and its applications; v. 81) Includes bibliographical references. ISBN 978-1-4612-6890-1 ISBN 978-1-4612-0751-1 (eBook) DOI 10.1007/978-1-4612-0751-1 1. Gene mapping-Mathematics. 2. Nucleotide sequence-Mathematics. I. Speed, T.P. II. Waterman, Michael S. III. Series. QH445.2.G448 1996 574.87'322'0151-dc20 96-18414 Printed on acid-free paper.
© 1996 Springer Science+Business Media New York. Originally published by Springer-Verlag New York, Inc. in 1996. Softcover reprint of the hardcover 1st edition 1996. All rights reserved. This work may not be translated or copied in whole or in part without the written permission of the publisher Springer Science+Business Media, LLC, except for brief excerpts in connection with reviews or scholarly analysis. Use in connection with any form of information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed is forbidden. The use of general descriptive names, trade names, trademarks, etc., in this publication, even if the former are not especially identified, is not to be taken as a sign that such names, as understood by the Trade Marks and Merchandise Marks Act, may accordingly be used freely by anyone. Authorization to photocopy items for internal or personal use, or the internal or personal use of specific clients, is granted by Springer Science+Business Media, LLC, provided that the appropriate fee is paid directly to Copyright Clearance Center, 222 Rosewood Drive, Danvers, MA 01923, USA (Telephone: (508) 750-8400), stating the ISBN and title of the book and the first and last page numbers of each article copied. The copyright owner's consent does not include copying for general distribution, promotion, new works, or resale. In these cases, specific written permission must first be obtained from the publisher. Production managed by Hal Henglein; manufacturing supervised by Jacqui Ashri. Camera-ready copy prepared by the IMA. 9 8 7 6 5 4 3 2 1 ISBN 978-1-4612-6890-1
SPIN 10524747
FOREWORD

This IMA Volume in Mathematics and its Applications
GENETIC MAPPING AND DNA SEQUENCING
is one of the two volumes based on the proceedings of the 1994 IMA Summer Program on "Molecular Biology" and comprises Weeks 1 and 2 of the four-week program. Weeks 3 and 4 will appear as Volume 82: Mathematical Approaches to Biomolecular Structure and Dynamics. We thank Terry Speed and Michael S. Waterman for organizing Weeks 1 and 2 of the workshop and for editing the proceedings. We also take this opportunity to thank the National Institutes of Health (NIH) (National Center for Human Genome Research), the National Science Foundation (NSF) (Biological Instrumentation and Resources), and the Department of Energy (DOE), whose financial support made the summer program possible.
Avner Friedman Robert Gulliver
PREFACE

Today's genome projects are providing vast amounts of information that will be essential for biology and medical science in the 21st century. The worldwide Human Genome Initiative has as its primary objective the characterization of the human genome. Of immediate interest and importance are the locations and sequences of the 50,000 to 100,000 genes in the human genome. Many other organisms, from bacteria to mice, have their own genome projects. The genomes of these model organisms are of interest in their own right, but in many cases they provide valuable insight into the human genome as well. High-resolution linkage maps of genetic markers will play an important role in completing the human genome project. Genetic maps describe the location of genetic markers along chromosomes in relation to one another and to other landmarks such as centromeres. Genetic markers in humans include thousands of genetic variants that have been described by clinicians (and that in other organisms are called mutants), as well as more recent molecular markers, which are based on heritable differences in DNA sequences that may not result in externally observable differences among individuals. Such molecular genetic markers are being identified at an increasing rate, and so the need for fast and accurate linkage and mapping algorithms of ever-increasing scope is also growing. In addition to playing an important role in long-term genome projects, genetic maps have many more immediate applications. Given data from suitably designed crosses with experimental organisms, or from pedigrees with humans and other animals, new mutations, genes, or other markers can frequently be mapped into close proximity to a well-characterized genetic marker. This can then become the starting point for cloning and sequencing the new mutation or gene.
Approaches like this have given detailed information about many disease genes and have led to success in determining genes causing cystic fibrosis and Huntington's disease. During meiosis prior to the formation of gametes, a random process known as crossing over takes place one or more times on the average on each chromosome. Crossovers cannot be observed directly, but they can leave evidence of having occurred by causing recombination among nearby genetic markers. When two (or more) markers are inherited independently, recombinants and non-recombinants are expected in equal proportions among offspring. When the markers appear to be co-inherited more frequently than would be expected under independence, a phenomenon called genetic linkage, this is taken as evidence that they are located together on a single chromosome. The first paper in this volume, by McPeek, explains this process in greater detail than can be done here. The genetic distance between
two markers is defined to be the expected number of crossovers per meiosis occurring between the two markers on a single chromosome strand. Since crossovers cannot be observed, only recombination patterns between markers can be counted. Thus, the quantities that can be estimated from cross or pedigree data are recombination fractions, and these need to be connected to genetic distances using a statistical model. Most workers use a model based on the Poisson distribution, which is known not to be entirely satisfactory, and some current research addresses the question of just what is a suitable model in this context. The appropriateness of the Poisson model is considered in the papers by Keats and Ott, and alternatives to it are discussed by Speed. Given a statistical model for the crossover-recombination process, there remain formidable problems in ordering and mapping a number of markers from a single experiment or set of pedigrees, as well as difficulties of incorporating new data into existing maps. Most of the problems of the first kind stem from the many forms of incompleteness that arise with genetic data. At the lowest level, data may simply be missing. However, we may have data, e.g. on disease status, that can change over time, so that even a disease phenotype is not unambiguously observed. Many genetic diseases exhibit this so-called incomplete penetrance. At the next level, we may have certain knowledge of phenotypes but, because the trait is dominant or recessive, not know the genotype. Finally, to carry out linkage or mapping studies, calculations need to be based on the haplotypes of a set of markers; that is, we need to know which alleles go together on each chromosome. A special class of missing data problems arises when we attempt to locate genes that contribute to quantitative traits, which are not simply observable.
Standard statistical methods such as maximum likelihood remain appropriate for these problems, but their computational burden grows quickly with the number of markers and the size and complexity of pedigrees. Similar difficulties arise with other organisms, and each presents its own problems, for cross or pedigree data from, say, maize, fruit flies, mice, cattle, pigs and humans all have their own unique features. There are likely to be many challenging statistical and computational problems in this area for some time to come. For an indication of some of these challenges, the reader is referred to the papers in this volume by Dupuis, Lin and Sobel et al. Together they survey many of the problems in this area of current interest. The next level of DNA mapping is physical mapping, consisting of overlapping clones spanning the genome. These maps, which can cover the entire genome of an organism, are extremely useful for genetic analysis. They provide the material for less redundant sequencing and for detailed searches for a gene, among other things. Complete or nearly complete physical maps have been constructed for the genomes of Escherichia coli, Saccharomyces cerevisiae, and Caenorhabditis elegans. Many efforts are under
way to construct physical maps of other organisms, including man, mouse and rice. Just as in DNA sequencing, to be mentioned below, most mapping experiments proceed by overlapping randomly chosen clones based on experimental information derived from the clones. In sequencing, the available information consists of a sequence of the clone fragment. In physical mapping, the information is a less detailed "fingerprint" of the clone. The fingerprinting scheme is dependent on the nature of the clones, the organisms under study, and the experimental techniques available. Clones with fingerprints that have sufficient features in common are declared to overlap. These overlapping clones are assembled into islands of clones that cover large portions of the genome. Physical mapping projects are expensive in both labor and materials, and they involve many choices as to experimental technique. Clone sizes alone vary from about 15,000 bases (lambda clones) up to several hundred thousand bases (yeast artificial chromosomes, or YACs). In addition, the fingerprint itself can range from a simple list of selected restriction fragment sizes to a set of sites unique in the genome. Different costs, in material and labor, as well as different amounts of information, will result from these choices. Statistics and computer science are critical in providing important information for making these decisions. The paper of Balding et al. develops strategies using pools of clones to find those clones possessing particular markers (small pieces of DNA called sequence tagged sites or STSs). Their work involves some interesting statistics. The most detailed mapping of DNA is the reading of the sequence of nucleotides. One classic method is called shotgun sequencing. Here a clone of perhaps 15,000 letters is randomly broken up into fragments that are read by one run of a sequencing machine. These reads are about 300-500 letters in length.
The sequence is assembled by determining overlap between the fragments by sequence matching. The sequence is not perfectly read at the fragment level, and this is one source of sequencing errors. Another source of errors comes from the repetitive nature of higher genomes such as human. Repeated sequences make it very difficult to find the true overlap between the fragments and therefore to assemble the sequence. Statistical problems arise in estimating the correct sequence from assembled fragments and in estimating the significance of the pairwise and multiple overlaps. The paper of Huang is an update of the original "greedy" approach of Staden. This paper takes the fragment sequences as input. Of particular note is the use of large deviation statistics and computer science to very rapidly make all pairwise comparisons of fragments and their reverse complements. Scientists are working to make the existing sequencing methods more efficient and to find new methods that allow more rapid sequence determination. For example, in multiplex sequencing, the information of several gel runs is produced in a single experiment. In another direction, automated machines such as the Applied Biosystems 373A sequencer produce
machine-readable data for several gel runs in parallel. Two of the papers in this volume, Nelson and Tibbetts et al., are about the inference of sequence from raw data produced by these machines. Modern molecular genetics contains many challenging problems for mathematicians and statisticians, most deriving from technological advances in the field. We hope that the topics discussed in this volume give you a feel for the range of possibilities in this exciting and rapidly developing area of applied mathematics.

Terry Speed
Michael S. Waterman
CONTENTS

Foreword ...................................................... v
Preface ..................................................... vii
An introduction to recombination and linkage analysis ......... 1
    Mary Sara McPeek
Monte Carlo methods in genetic analysis ...................... 15
    Shili Lin
Interference, heterogeneity and disease gene mapping ......... 39
    Bronya Keats
Estimating crossover frequencies and testing for numerical
interference with highly polymorphic markers ................. 49
    Jurg Ott
What is a genetic map function? .............................. 65
    T.P. Speed
Haplotyping algorithms ....................................... 89
    Eric Sobel, Kenneth Lange, Jeffrey R. O'Connell,
    and Daniel E. Weeks
Statistical aspects of trait mapping using a dense set of
markers: a partial review ................................... 111
    Josee Dupuis
A comparative survey of non-adaptive pooling designs ........ 133
    D.J. Balding, W.J. Bruno, E. Knill, and D.C. Torney
Parsing of genomic graffiti ................................. 155
    Clark Tibbetts, James Golden, III, and Deborah Torgersen
Improving DNA sequencing accuracy and throughput ............ 183
    David O. Nelson
Assembly of shotgun sequencing data ......................... 207
    Xiaoqiu Huang
AN INTRODUCTION TO RECOMBINATION AND LINKAGE ANALYSIS MARY SARA McPEEK* Abstract. With a garden as his laboratory, Mendel (1866) was able to discern basic probabilistic laws of heredity. Although it first appeared as a baffling exception to one of Mendel's principles, the phenomenon of variable linkage between characters was soon recognized to be a powerful tool in the process of chromosome mapping and location of genes of interest. In this introduction, we first describe Mendel's work and the subsequent discovery of linkage. Next we describe the apparent cause of variable linkage, namely recombination, and we introduce linkage analysis. Key words. genetic mapping, linkage, recombination, Mendel.
1. Mendel. Mendel's (1866) idea of enumerating the offspring types of a hybrid cross and his model for the result provided the basis for profound insight into the mechanisms of heredity. Carried out over a period of eight years, his artificial fertilization experiments involved the study of seven characters associated with the garden pea (various species of genus Pisum), with each character having two phenotypes, or observed states. The characters included the color of the petals, with purple and white phenotypes, the form of the ripe seeds, with round and wrinkled phenotypes, and the color of the seed albumen, i.e. endosperm, with yellow and green phenotypes. Mendel first considered the characters separately. For each character, he grew two true-breeding parental lines, or strains, of pea, one for each phenotype. For instance, in one parental line, all of the plants had purple petals, and furthermore, over a period of several years, the offspring from all self-fertilizations within that line also had purple petals. Similarly, he grew a true-breeding parental line of white-flowered peas. When he crossed one line with the other by artificial fertilization, all the resulting offspring, called the first filial or F1 generation, had purple petals. Therefore, the purple petal phenotype was called dominant and the white petal phenotype recessive. After self-fertilization within the F1 generation, among the offspring, known as the second filial or F2 generation, 705 plants had purple and 224 plants had white petals out of a total of 929 F2 plants. This approximate 3:1 ratio (p-value .88) of the dominant phenotype to the recessive held for the other six characters as well. Mendel found that when F2 plants with the recessive phenotype were self-fertilized, the resulting offspring were all of the recessive type.
However, when the F2 plants with the dominant phenotype were self-fertilized, 1/3 of them bred true, while the other 2/3 produced offspring of both phenotypes, in a dominant to recessive ratio of approximately 3:1. For instance, among

* Department of Statistics, University of Chicago, Chicago, IL 60637.
100 F2 plants with purple petals, 36 bred true, while 64 had both purple and white-flowered offspring (the numbers of these were not reported). Mendel concluded that among the plants with the dominant phenotype, there were actually two types, one type which bred true and another hybrid type which bred in a 3:1 ratio of dominant to recessive. Mendel's explanation for these observations is that each plant has two units of heredity, now known as genes, for a given character, and each of these may be one of two (or more) types now known as alleles. Furthermore, in reproduction, each parent plant forms a reproductive seed or gamete containing, for each character, one of its two alleles, each with equal chance, which is passed on to a given offspring. For instance, in the case of petal color, the alleles may be represented by P for purple and p for white. (In this nomenclature, the dominant allele determines the letter of the alphabet to be used, and the dominant allele is uppercase while the recessive allele is lowercase.) Each plant would have one of the following three genotypes: pp, pP or PP, where types pp and PP are known as homozygous and type pP is known as heterozygous. Plants with genotype pp would have white petals, while those with genotype pP or PP would have purple petals. The two parental lines would be of genotypes pp and PP, respectively, and would pass on gametes of type p and P, respectively. The F1 generation, each having one pp parent and one PP parent, would then all be of genotype pP. A given F1 plant would pass on a gamete of type p or of type P to a given offspring, each with chance 1/2, independent from offspring to offspring. Then assuming that maternal and paternal gametes are passed on independently, each plant in the F2 generation would have chance 1/4 to be of genotype pp, 1/2 to be of genotype pP, and 1/4 to be of genotype PP, independently from plant to plant.
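This model is simple enough to check by direct enumeration. The sketch below (in Python, our choice of tool, not the chapter's) enumerates the four equally likely maternal/paternal gamete pairs and recovers the 1/4, 1/2, 1/4 genotype probabilities and the 3:1 phenotype ratio described above:

```python
from itertools import product
from fractions import Fraction

# Each F1 parent (heterozygous for petal color) passes allele 'P' or 'p'
# to a given offspring with probability 1/2.
gametes = ['P', 'p']

# Enumerate the four equally likely maternal/paternal gamete combinations.
counts = {}
for mom, dad in product(gametes, repeat=2):
    genotype = ''.join(sorted(mom + dad))  # 'Pp' and 'pP' are the same genotype
    counts[genotype] = counts.get(genotype, 0) + 1

probs = {g: Fraction(n, 4) for g, n in counts.items()}
# Genotype probabilities: PP 1/4, Pp 1/2, pp 1/4

# Phenotype: purple unless homozygous recessive, giving the 3:1 ratio.
purple = sum(p for g, p in probs.items() if 'P' in g)
print(probs, purple)
```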
In a large sample of plants, this multinomial model would result in an approximate 3:1 ratio of purple to white plants with all of the white plants and approximately 1/3 of the purple plants breeding true and the other approximately 2/3 of the purple plants breeding as in the Fl generation. Mendel's (1866) observations are consistent with this multinomial hypothesis. Mendel's model for the inheritance of a single character, in which the particles of inheritance from different gametes come together in an organism and then are passed on unchanged in future gametes has become known as Mendel's First Law. Mendel (1866) also considered the characters two at a time. For instance, he considered the form of the ripe seeds, with round (R) and wrinkled (r) alleles, and the color of the seed albumen, with yellow (Y) and green (y) alleles. Mendel crossed a true-breeding parental line in which the form of the ripe seeds was round and the color of the seed albumen was green (genotype RRyy) with another true-breeding parental line in which the form of the ripe seeds was wrinkled and the color of the seed albumen was yellow (genotype rrYY). When these characters were considered singly, round seeds were dominant to wrinkled and yellow albumen
TABLE 1.1
The sixteen equally-likely genotypes among the F2 generation (top margin represents gamete contributed by father, left margin represents gamete contributed by mother).
        RY      Ry      rY      ry
RY    RRYY    RRYy    RrYY    RrYy
Ry    RRYy    RRyy    RrYy    Rryy
rY    RrYY    RrYy    rrYY    rrYy
ry    RrYy    Rryy    rrYy    rryy
was dominant to green. All of the F1 offspring had the yellow and round phenotypes, with genotype RrYy. In the F2 generation, according to the results of the previous experiments, 1/4 of the plants would have the green phenotype and the other 3/4 the yellow phenotype, and 1/4 would have the wrinkled phenotype and the other 3/4 the round phenotype. Thus, if these characters were assumed to segregate independently, we would expect to see 1/16 green and wrinkled, 3/16 yellow and wrinkled, 3/16 green and round, and 9/16 yellow and round, i.e. these phenotypes would occur in a ratio of 1:3:3:9. The experimental numbers corresponding to these categories were 32, 101, 108, and 315, respectively, which is consistent with the 1:3:3:9 ratio (p-value .93). Mendel further experimented with these F2 plants to verify that each possible combination of gametes from the F1 generation was, in fact, equally likely (see Table 1.1). From these and other similar experiments in which characters were considered two or three at a time, Mendel concluded that the characters did segregate independently. The hypothesis of independent segregation has become known as Mendel's Second Law. The above example provides an opportunity to introduce the concept of recombination. When two characters are considered, a gamete is said to be parental, or nonrecombinant, if the genes it contains for the two characters were both inherited from the same parent. It is said to be recombinant if the genes it contains for the two characters were inherited from different parents. For instance, in the previous example, an F1 individual may pass on to an offspring one of the four gametes, RY, Ry, rY, or ry. Ry and rY are the parental gametes, because they are each directly descended from parental lines. RY and ry are recombinant gametes because they represent a mixing of genetic material which had been inherited separately. Mendel's Second Law specifies that a given gamete has chance 1/2 to be a recombinant.
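The quoted p-value of .93 can be reproduced with a chi-square goodness-of-fit test of the observed counts 32, 101, 108, 315 against the 1:3:3:9 ratio. A sketch in Python (the choice of test is ours; the chapter reports only the p-value), using the closed-form chi-square survival function available for 3 degrees of freedom:

```python
import math

# Observed F2 counts: green wrinkled, yellow wrinkled, green round, yellow round
observed = [32, 101, 108, 315]
n = sum(observed)  # 556 plants

# Expected counts under independent segregation, ratio 1:3:3:9
expected = [n * k / 16 for k in (1, 3, 3, 9)]

stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

def chi2_sf_df3(x):
    """Survival function of the chi-square distribution with 3 degrees of
    freedom, via the closed form available for odd df."""
    phi = 0.5 * (1.0 + math.erf(math.sqrt(x / 2.0)))  # standard normal CDF at sqrt(x)
    return 2.0 * (1.0 - phi) + math.sqrt(2.0 * x / math.pi) * math.exp(-x / 2.0)

p = chi2_sf_df3(stat)
print(round(stat, 2), round(p, 2))  # 0.47 0.93
```

The tiny chi-square statistic (about 0.47 on 3 degrees of freedom) is why the fit is so comfortable.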
Fisher (1936) provides an interesting statistical footnote to Mendel's work. His analysis of Mendel's data shows that the observed numbers of plants in different classes actually fit too well to the expected numbers, given that the plant genotypes are supposed to follow a multinomial model (overall p-value .99993). That Mendel's data fit the theoretical ratios too well suggests some selection or adjustment of the data by Mendel. Of course, this in no way detracts from the brilliance and importance of Mendel's discovery.

2. Linkage and recombination. Mendel's work appeared in 1866, but languished in obscurity until it was rediscovered by Correns (1900), Tschermak (1900) and de Vries (1900). These three had independently conducted experiments similar to Mendel's, verifying his results. This began a flurry of research activity. Correns (1900) drew attention to the phenomenon of complete gametic coupling or complete linkage, in which alleles of two or more different characters appeared to be always inherited together rather than independently, i.e. no recombination was observed between them. Although this seems to violate Mendel's Second Law, an obvious extension of his theory would be to assume that the genes for these characters are physically attached. Sutton (1903) formulated the chromosome theory of heredity, a major development. He pointed out the similarities between experimental observations on chromosomes and the properties which must be obeyed by the hereditary material under Mendel's Laws. In various organisms, chromosomes appeared to occur in homologous pairs, each pair sharing very similar physical characteristics, with one member of each pair inherited from the mother and the other from the father. Furthermore, during meiosis, i.e. the creation of gametes, the two chromosomes within each homologous pair line up next to each other, with apparently random orientation, and then are pulled apart into separate cells in the first meiotic division, so that each cell receives one chromosome at random from each homologous pair. In fact, the chromosomes each duplicate when they are lined up before the first meiotic division, so after that division, each cell actually contains two copies of each of the selected chromosomes. During the second meiotic division, these cells divide again, forming gametes, with each resulting gamete getting one copy of each chromosome from the cell. Still, the net result is that each gamete inherits from its parent one chromosome at random from each homologous pair.
In fact, the chromosomes each duplicate when they are lined up before the first meiotic division, so after that division, each cell actually contains two copies of each of the selected chromosomes. During the second meiotic division, these cells divide again, forming gametes, with each resulting gamete getting one copy of each chromosome from the cell. Still, the net result is that each gamete inherits from its parent one chromosome at random from each homologous pair. The chromosome theory of heredity provided a physical mechanism for Mendel's Laws if it were assumed that the independent Mendelian characters lay on different chromosomes, and that those which were completely linked lay on the same chromosome. An interesting complication to this simple story was first reported by Bateson, Saunders and Punnett (1905; 1906). In experiments on the sweet pea (Lathyrus odoratus), they studied two characters: flower color, with purple (dominant) and red (recessive) phenotypes, and form of pollen, with long (dominant) and round (recessive) phenotypes. They found that the two characters did not segregate independently, nor were they completely linked (see Table 2.1). When crosses were performed between a
TABLE 2.1
The counts of observed and expected genotypes in Bateson, Saunders and Punnett's (1906) data. In each of the three subtables, the top margin represents form of pollen, and the left margin represents flower color.
expected, no linkage:
         L         l
P    1199.25    399.75
p     399.75    133.25

observed data:
         L         l
P       1528       106
p        117       381

expected, complete linkage:
         L         l
P       1599         0
p          0       533
true-breeding parental line with purple flowers and long pollen (genotype PPLL) and one with red flowers and round pollen (genotype ppll), in the F2 generation, there were long and round pollen types and purple and red flowers, both in ratios of 3 to 1 of dominant to recessive types, following Mendel's First Law. However, among the purple flowered plants, there was a preponderance of long-type pollen over round in a ratio of 12 to 1, whereas among the red flowered plants, the round-type pollen was favored, with a ratio of long to round type pollen of 1 to 3. The authors were baffled as to the explanation for this phenomenon, which is now known as linkage or partial coupling, of which complete linkage or complete coupling is a special case. It was Thomas Hunt Morgan who was able to provide an explanation for Bateson, Saunders and Punnett's observations of linkage and similar observations of his own on Drosophila melanogaster. Morgan (1911), building on a suggestion of de Vries (1903), postulated that exchanges of material, called crossovers, occurred between homologous chromosomes when they were paired during meiosis (see Figure 2.1). In the example of Bateson, Saunders, and Punnett (1905; 1906), if a parental line with purple flowers and long pollen were crossed with another having red flowers and round pollen, then the members of the F1 generation would each have, among their pairs of homologous chromosomes, a pair in which one of the chromosomes had genes for purple flowers and long pollen (PL) and the other had genes for red flowers and round pollen (pl). During meiosis, when these homologous chromosomes paired, if no crossovers occurred between the chromosomes in the interval between the genes for flower color and pollen form, then the resulting gamete would be of parental type, i.e. PL or pl. If crossing-over occurred between the chromosomes in the interval between the genes, the resulting gamete could instead be recombinant, Pl or pL (see Figure 2.1).
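Morgan's crossover model makes it possible to quantify the linkage in Table 2.1 by maximum likelihood, machinery that was of course not available in 1906. In a coupling-phase F2 cross with recombination fraction r, parental gametes PL and pl each have probability (1-r)/2 and recombinant gametes Pl and pL each have probability r/2, which determines the probabilities of the four phenotype classes. The grid-search estimate below is a sketch of our own, not an analysis from this chapter:

```python
import math

# Observed F2 counts from Table 2.1 (coupling phase):
# purple long, purple round, red long, red round
counts = {'P_L_': 1528, 'P_ll': 106, 'ppL_': 117, 'ppll': 381}

def log_likelihood(r):
    """Multinomial log-likelihood of the phenotype classes when parental
    gametes PL, pl each have probability (1-r)/2 and recombinant gametes
    Pl, pL each have probability r/2."""
    q = (1.0 - r) / 2.0                # probability of one parental gamete
    probs = {'P_L_': 0.5 + q * q,      # purple, long
             'P_ll': 0.25 - q * q,     # purple, round
             'ppL_': 0.25 - q * q,     # red, long
             'ppll': q * q}            # red, round (both gametes pl)
    return sum(n * math.log(probs[k]) for k, n in counts.items())

# Grid-search maximum-likelihood estimate of the recombination fraction
r_hat = max((i / 10000 for i in range(1, 5000)), key=log_likelihood)
print(round(r_hat, 3))  # about 0.11
```

An estimate near 11-12% recombination, far below the 50% expected under independent segregation, shows how strongly linked these two sweet-pea loci are.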
Without the crossover process, genes on the same chromosome would be completely linked with no recombination allowed, but they typically exhibit an amount of recombination somewhere
FIG. 2.1. (a) During meiosis, each chromosome duplicates to form a pair of sister chromatids that are attached to one another at the centromere. The sister chromatids from one chromosome are positioned near those from the homologous chromosome, and those four chromatid strands become aligned so that homologous regions are near to one another. (b) At this stage, crossovers may occur, with each crossover involving a nonsister pair of chromatids. (c) At the first meiotic division, the chromatids are separated again into two pairs that are each joined by a centromere. (d) The resulting chromatids will be mixtures of the original two chromosome types due to crossovers. (e) In the second meiotic division, each product of meiosis receives one of the four chromatids. (f) depicts the same stage of meiosis represented by (b), but here only a portion of the length of the four chromatids is shown. Suppose that the interval depicted is flanked by two genetic loci. Consider the chromatid whose lower end is leftmost. That chromatid was involved in one crossover in the interval, thus its lower portion is dark and its upper portion is light, showing that it is a recombinant for the flanking loci. On the other hand, consider the chromatid whose lower edge is second from the left. That chromatid was involved in two crossovers in the interval, thus its lowermost and uppermost portions are both dark, showing that it is non-recombinant for loci at the ends of the depicted interval. In general, a resulting chromatid will be recombinant for an interval if it was involved in an odd number of crossovers in that interval.
AN INTRODUCTION TO RECOMBINATION AND LINKAGE ANALYSIS
between perfect linkage (0% recombination) and independence (50% recombination). That the chance of recombination between genes on the same chromosome should be between 0 and 1/2 is a mathematical consequence of a rather general assumption about the crossover process, no chromatid interference, described later. Although we now know that crossing-over takes place among four chromosome strands, rather than just two, the essence of Morgan's hypothesis is correct. In diploid eukaryotes, during the pachytene phase of meiosis, the two chromosomes in each homologous pair have lined up next to each other in a very precise way, so that homologous regions are adjacent. Both chromosomes in each pair duplicate, and the four resulting chromosome strands, called chromatids, are lined up together, forming a very tight bundle. The two copies of one chromosome are called sister chromatids. Crossing-over occurs among the four chromatids during this phase, with each crossover involving a non-sister pair of chromatids. After crossing-over has occurred, the four resulting chromatids are mixtures of the original parental types. Following the two meiotic divisions, each gamete receives one chromatid. For genes on the same chromosome, a recombination occurs whenever the chromatid which is passed on to the gamete and which contains the two genes was involved in an odd number of crossovers between the genes (see Figure 2.1). 3. Linkage Analysis. A consequence of the crossover process, Morgan (1911) suggested, would be that characters whose genes lay closer together on a chromosome would be less likely to recombine, because there would be a smaller chance of crossovers occurring between them. This is the key to linkage analysis: the smaller the amount of recombination observed between genes, i.e. the more tightly linked they are, the closer we could infer that they lie on a chromosome.
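As a small illustration (a hypothetical sketch in Python, written under the Poisson and no-chromatid-interference assumptions that are formalized later in this chapter, with function and parameter names of our own choosing), one can simulate crossovers in an interval and check both the odd-crossover rule and the monotone relation between distance and recombination:

```python
import math
import random

def simulate_recombination_fraction(d, n_meioses=100_000, seed=0):
    """Illustrative simulation: crossovers along the four-strand bundle form a
    Poisson process with mean 2d in an interval of genetic length d Morgans;
    each crossover involves a given chromatid with chance 1/2; the chromatid
    is recombinant for the flanking loci iff it is involved in an odd number
    of crossovers in the interval."""
    rng = random.Random(seed)
    threshold = math.exp(-2.0 * d)
    recombinants = 0
    for _ in range(n_meioses):
        # Draw N ~ Poisson(2d) by the standard product-of-uniforms method.
        n, p = 0, rng.random()
        while p > threshold:
            n += 1
            p *= rng.random()
        # Thin: count the crossovers that involve the sampled chromatid.
        hits = sum(rng.random() < 0.5 for _ in range(n))
        recombinants += hits % 2  # recombinant iff an odd number of hits
    return recombinants / n_meioses
```

Running this for d = 0.05, 0.2 and 1.0 gives recombination fractions that increase with d yet remain below 1/2, in line with the bounds discussed above.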
This provides a way of locating genes relative to one another by observing the pattern of inheritance of the traits which they cause. It is remarkable that a comparison of various traits among family members may yield information on the microscopic structure of chromosomes. Despite many important advances in molecular biology since Morgan's suggestion in 1911, linkage analysis is still a very powerful tool for localizing a gene of interest to a chromosome region, particularly because it may be used in cases where one has no idea where the gene is or how it acts on a biochemical level. Modern linkage analysis uses not only genes that code for proteins that produce observable traits, but also neutral markers. These are regions of DNA that are polymorphic, that is, they tend to differ from individual to individual, but unlike genes, the differences between alleles of neutral markers may have no known effect on the individual, although they can be detected by biologists. While these markers may not be of interest themselves, they can be mapped relative to one another on chromosomes and used as signposts against which to map genes of interest. Genes and
markers are both referred to as genetic loci. As an undergraduate student of Thomas Hunt Morgan, Sturtevant (1913) applied the principle of linkage to make the first genetic map. This consisted of a linear ordering of six genes on the X-chromosome of Drosophila, along with genetic distances between them, where he defined the genetic distance between two loci to be the expected number of crossovers per meiosis between the two loci on a single chromatid strand. He called this unit of distance one Morgan, with one one-hundredth of a Morgan, called a centiMorgan (cM), being the unit actually used in practice. Sturtevant (1913) remarked that genetic distance need not have any particular correspondence with physical distance, since as we now know, the crossover process varies in intensity along a chromosome. The crossover process generally cannot be observed directly, but only through recombination between the loci. For nearby loci, Sturtevant (1913) took the genetic distance to be approximately equal to the recombination fraction, i.e. proportion of recombinants, between them. Once he had a set of pairwise distances between the loci, he could order them. Of course, it is possible to have a set of pairwise distances which are compatible with no ordering, but in practice, with the large amount of recombination data typically obtained in Drosophila experiments, this does not occur. Sturtevant realized that the recombination fraction would underestimate the genetic distance between more distant loci, because of the occurrence of multiple crossovers. There are several obvious ways in which Sturtevant's (1913) method could be improved. First, the recombination fraction is not the best estimate of genetic distance, even for relatively close loci. Second, it is desirable to have some idea of the variability in the maps. 
Also, depending on what is known or assumed about the crossover process, it may be more informative to consider recombination events among several loci simultaneously. In order to address these issues properly it is necessary to have a statistical model relating observed recombinations to the unobserved underlying crossovers. We proceed to outline some of the issues involved. Haldane (1919) addressed the relationship between recombination and crossing-over through the notion of a map function, that is, a function M connecting a recombination probability r across an interval with the interval's genetic length d by the relation r = M(d). Haldane's best-known contribution is the map function he introduced, which is now known by his name, M(d) = [1 − exp(−2d)]/2. The Haldane map function arises under some very simple assumptions about the crossover process. Recall that crossing-over occurs among four chromatid strands, and that each gamete receives only one of the four resulting strands. We refer to the occurrence of crossovers along the bundle of four chromatid strands as the chiasma process. Each crossover involves exactly two of the four chromatids, so any given chromatid will be involved in some subset of the crossovers of the full chiasma process. The occurrence of crossovers along a given chromatid will be referred to as the crossover process. To obtain the Haldane map function, assume first that the chiasma process is a (possibly inhomogeneous) Poisson process. Violation of this assumption is known as chiasma interference or crossover position interference. Second, assume that each pair of non-sister chromatids is equally likely to be involved in a crossover, independent of which were involved in other crossovers. This assumption is equivalent to specifying that the crossover process is obtained from the chiasma process by independently thinning (deleting) each point with chance 1/2. Violation of this assumption is known as chromatid interference, and the assumption itself is referred to as no chromatid interference (NCI). This pair of assumptions specifies a model for the occurrence of crossovers which is known as the No-Interference (NI) model. Deviation from this model is known as interference, which encompasses both chiasma interference and chromatid interference. Since genetic distance is the expected number of crossovers d in an interval on a single chromatid strand, the assumption of NCI implies that the expected number of crossovers of the full chiasma process in the interval is 2d. Under the assumption of no chiasma interference, the chiasma process is then a Poisson process with intensity 2 per unit of genetic distance. To obtain the Haldane map function, we apply Mather's Formula (1935), which says that under the assumption of NCI, r = [1 − P(N = 0)]/2, where r is the recombination probability across an interval, and N is the random variable corresponding to the number of crossovers in the chiasma process in that interval. Under the NI model, P(N = 0) = exp(−2d), giving the Haldane map function. Following is a well-known derivation of Mather's Formula (see e.g. Karlin and Liberman 1983): If we assume NCI, then each crossover has chance 1/2 to involve a given chromatid, independent of which chromatids are involved in other crossovers.
In that case, if there are N crossovers in the chiasma process on an interval, with N > 0, then the chance of having i crossovers in the crossover process on a given chromatid is
(N choose i) (1/2)^i (1/2)^{N−i} = (N choose i) / 2^N,
for 0 ≤ i ≤ N. On a given chromatid, a recombination will occur in the interval if the chromatid is involved in an odd number of crossovers in the interval. Thus, the chance of a recombination given that N > 0 crossovers have occurred in the chiasma process is

∑_{i odd} (N choose i) / 2^N = 2^{N−1} / 2^N = 1/2,

and the chance is 0 if N = 0, so the chance of a recombination is Pr(N > 0)/2. One consequence of Mather's Formula is that under NCI, the chance of recombination across an interval increases, or, at least, does not decrease,
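The conditional probability of 1/2 for every N > 0 can be checked numerically; the following sketch (ours, not from the original text) computes the exact conditional chance of recombination for each value of N:

```python
from math import comb

def recombination_chance_given_n(n):
    """Exact chance of recombination given N = n > 0 crossovers in the chiasma
    process, under no chromatid interference: the chromatid is involved in i
    of the n crossovers with probability C(n, i) / 2^n, and recombination
    occurs iff i is odd."""
    return sum(comb(n, i) for i in range(1, n + 1, 2)) / 2 ** n
```

The value is exactly 1/2 for every n ≥ 1 (the odd-index binomial coefficients sum to 2^{n−1}), which is the content of Mather's Formula.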
as the interval is widened. Another is that the chance of recombination across any interval has upper bound 1/2 under NCI. These two observations appear to be compatible with virtually all published experimental results. Haldane's map function provides a better estimate of genetic distance than the recombination fraction used by Sturtevant (1913). Instead of estimating d by the observed value of r, one could instead plug the observed value of r into the formula d = −(1/2) ln(1 − 2r). One could perform separate experiments for the different pairs of loci to estimate the genetic distances and hence obtain a map. Standard deviations could easily be attached to the estimates, since the number of recombinants in each experiment is binomial. One could also look at a number of loci simultaneously in a single experiment. Assuming that the experiment was set up so that all recombination among the loci could be observed, the data would be in the form of 2^m counts, where m is the number of loci considered. This is because for each locus, it would be recorded whether the given chromosome contained the maternal or paternal allele at that locus. If we number the loci arbitrarily and assume that, for instance, the probability of maternal alleles at loci 1, 3, 4 and 5 and paternal alleles at loci 2 and 6 is equal to the probability of paternal alleles at loci 1, 3, 4 and 5 and maternal alleles at loci 2 and 6, then we could combine all such dual events and summarize the data in 2^{m−1} counts. We index these counts by i, where i = (i_1, i_2, ..., i_{m−1}) ∈ {0, 1}^{m−1}, and i_j = 0 indicates that the alleles at loci j and j+1 are from the same parent, i.e. there is no recombination between them, while i_j = 1 indicates that the alleles at loci j and j+1 are from different parents, i.e. they have recombined. Fisher (1922) proposed using the method of maximum likelihood for linkage analysis, and this is the method largely used today.
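This estimation recipe, inverting the Haldane map function and attaching a binomial standard deviation, can be coded directly (an illustrative sketch; function names and the delta-method standard error are ours, not the chapter's):

```python
import math

def haldane(d):
    """Haldane map function: r = M(d) = [1 - exp(-2d)] / 2."""
    return (1.0 - math.exp(-2.0 * d)) / 2.0

def haldane_inverse(r):
    """Genetic distance from a recombination fraction r < 1/2:
    d = -(1/2) ln(1 - 2r)."""
    return -0.5 * math.log(1.0 - 2.0 * r)

def distance_estimate(recombinants, n):
    """Estimate genetic distance from a binomial count of recombinants out of
    n informative meioses, with a delta-method standard error:
    SE(d) = SE(r) / (1 - 2r), since dd/dr = 1 / (1 - 2r)."""
    r = recombinants / n
    d = haldane_inverse(r)
    se = math.sqrt(r * (1.0 - r) / n) / (1.0 - 2.0 * r)
    return d, se
```

For example, 10 recombinants among 100 meioses give d = −(1/2) ln 0.8 ≈ 0.112 Morgans (about 11 cM), with a standard error of about 0.0375.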
We now describe the application, to the type of data described above, of the method of maximum likelihood using Haldane's NI model. This is the simplest form of what is known as multilocus linkage analysis. In a given meiosis, the NI probability of the event indexed by i is simply
p_i = ∏_{j=1}^{m−1} θ_j^{i_j} (1 − θ_j)^{1−i_j} = (1/2^{m−1}) ∏_{j=1}^{m−1} (1 − e^{−2d_j})^{i_j} (1 + e^{−2d_j})^{1−i_j},
where θ_j is the probability of recombination between loci j and j+1 and d_j is the genetic distance between them. The formula reflects the fact that under NI, recombination in disjoint intervals is independent. Note that the formulation depends crucially on the presumptive order of the markers. The same recombination event will have a different index i if the order of the markers is changed, and a different set of recombination probabilities or genetic distances will be involved in the above formula. For a given order,
one can write down the likelihood of the data as

L = ∏_i p_i^{n_i},

where n_i is the number of observations of type i. The likelihood is maximized by
θ_j = ∑_{i: i_j = 1} n_i / ∑_{i'} n_{i'},
for all j, that is, just the observed proportion of recombinants between loci j and j+1. Since the assumption of NCI implies θ_j ≤ 1/2, one usually takes the constrained maximum likelihood estimate, θ_j = min(∑_{i: i_j = 1} n_i / ∑_{i'} n_{i'}, 1/2). All other recombination fractions between non-adjacent pairs of loci can be estimated by using the fact that under NI, if loci A, B, and C are in order ABC, then the chance of recombination between A and C, θ_AC, is related to the chance of recombination between A and B, θ_AB, and that between B and C, θ_BC, by the formula θ_AC = θ_AB(1 − θ_BC) + (1 − θ_AB)θ_BC. The variance of the estimate θ_j is θ_j(1 − θ_j)/n, and the estimates θ_j and θ_k are independent for j ≠ k. Thus, under the assumption of NI, the multilocus linkage analysis reduces to a pairwise analysis of recombination between adjacent markers when the data are in the form given above. To estimate order, one may consider several candidate orders and maximize the appropriate likelihood under each of them. The maximum likelihood estimate of order is that order whose maximized likelihood is highest. When one wants to map a new locus onto a previously existing map, one can follow this procedure, considering as candidate orders those orders in which the previously mapped loci are in their mapped positions and the new locus is moved to different positions between them. Outside of the world of experimental organisms, the reality of multilocus linkage analysis is quite different from what has been portrayed so far. Humans cannot be experimentally crossed, and therefore human linkage data do not fit neatly into 2^{m−1} observed counts. In some individuals, maternal and paternal alleles may be identical at some loci, so that recombination involving those loci cannot be observed in their offspring. Ancestors may not be available for the analysis, so it may not be possible to definitively determine whether particular alleles are maternally or paternally inherited.
When some information is missing, the information that is available may be in the form of complicated pedigrees representing interrelationships among individuals. In these cases, multilocus linkage analysis under NI does not reduce to a pairwise analysis. Maximization of the NI likelihood is an extremely complex undertaking and is the subject of considerable current research. For an introduction to linkage analysis in humans, see Ott (1991).
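In the fully informative experimental setting described earlier, by contrast, the pairwise reduction under NI can be sketched in a few lines (an illustrative toy; the pattern counts used below are invented):

```python
def combine_thetas(theta_ab, theta_bc):
    """Under NI, for loci in order A-B-C, recombination across A-C occurs iff
    there is recombination in exactly one of the two intervals:
    theta_AC = theta_AB (1 - theta_BC) + (1 - theta_AB) theta_BC."""
    return theta_ab * (1.0 - theta_bc) + (1.0 - theta_ab) * theta_bc

def adjacent_theta_mles(counts):
    """Constrained MLEs of the adjacent recombination fractions: `counts` maps
    each pattern i (a tuple over {0, 1} of length m - 1, where i_j = 1 means
    recombination between loci j and j+1) to its observed count n_i.  The
    estimate is the observed proportion with i_j = 1, capped at 1/2."""
    total = sum(counts.values())
    width = len(next(iter(counts)))
    return [
        min(sum(n for i, n in counts.items() if i[j] == 1) / total, 0.5)
        for j in range(width)
    ]
```

For instance, with m = 3 loci and hypothetical counts {(0,0): 70, (1,0): 10, (0,1): 15, (1,1): 5}, the adjacent estimates are 0.15 and 0.20, and the outer pair's recombination fraction is estimated as roughly 0.29.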
Most linkage analyses, whether in humans or in experimental organisms, are today still performed using the NI model. In fact, the phenomenon of interference is well-documented in a wide range of organisms. In their experiments on Drosophila, Sturtevant (1915) and Muller (1916) noticed that crossovers did not seem to occur independently, but rather the presence of one seemed to inhibit the formation of another nearby. From recombination data, it may be impossible to distinguish whether observed interference is due to chromatid interference, chiasma interference, or both, because of a lack of identifiability. If the chiasma and crossover processes themselves could be observed, this would eliminate the difficulty. In certain fungi such as Saccharomyces cerevisiae, Neurospora crassa, and Aspergillus nidulans, the problem is made less acute for two reasons. First of all, these genomes are very well mapped, with many closely spaced loci, and for certain very near loci, the observation of a recombination or not between them is nearly equivalent to the observation of a crossover or not between them. Secondly, in these organisms, all four of the products of meiosis can be recovered together and tested for recombination. This type of data is known as tetrad data, as opposed to single spore data, in which only one of the products of meiosis is recovered. As a result of these features, some tetrad data give approximate discretized versions of the chiasma and crossover processes. From this sort of data, it is clear that chiasma or position interference is present, and that the occurrence of one crossover inhibits the formation of another nearby (Mortimer and Fogel 1974). The existence and nature of chromatid interference has proved more difficult to detect than position interference.
Statistical tests of chromatid interference based on generalizations of Mather's formula demonstrate some degree of chromatid interference, but the results are not consistent from experiment to experiment (Zhao, McPeek, and Speed 1995). Various crossover models that allow for interference of one or both types have been put forward and examined. These include Fisher, Lyon and Owen (1947), Owen (1949, 1950), Carter and Robertson (1952), Karlin and Liberman (1979), Risch and Lange (1979), Goldgar and Fain (1988), King and Mortimer (1990), Foss, Lande, Stahl, and Steinberg (1993), McPeek and Speed (1995), and Zhao, Speed, and McPeek (1995). The model used overwhelmingly today in linkage analysis is still the no interference model, due to its mathematical tractability. However, the chi-square model of Foss, Lande, Stahl, and Steinberg (1993), McPeek and Speed (1995), and Zhao, Speed, and McPeek (1995) may now be a viable contender. 4. Conclusion. Mendel showed that through careful quantitative observation of related individuals, the mechanism of heredity of traits could be studied. Linkage analysis, proposed by Morgan in 1911 and still used today, is equally startling in that it is based on the principle that careful quantitative observation of related individuals can actually illuminate the positions of genes on chromosomes. While the phenomenon of linkage
between traits allows one to infer that their genes are on the same chromosome, it is the phenomenon of recombination, which has the effect of varying the degree of linkage, that allows these traits to be mapped relative to one another on the chromosome. One of the most useful characteristics of linkage analysis is the fact that it can be used to map genes that are identified only through their phenotypes, and about which one may have no other information. 5. Recommended reading. Whitehouse (1973) gives a thorough historical introduction to genetics. Bailey (1961) is a detailed mathematical treatment of genetic recombination and linkage analysis, while Ott (1991) is an introductory reference for genetic linkage analysis in humans. Acknowledgements. I am greatly indebted to Terry Speed for much of the material in this manuscript. This work was supported in part by NSF Grant DMS 90-05833 and NIH Grant R01-HG01093-01.
REFERENCES
Bailey, N. T. J. (1961) Introduction to the Mathematical Theory of Genetic Linkage, Oxford University Press, London.
Bateson, W., Saunders, E. R., and Punnett, R. C. (1905) Experimental studies in the physiology of heredity, Rep. Evol. Comm. R. Soc., 2: 1-55, 80-99.
Bateson, W., Saunders, E. R., and Punnett, R. C. (1906) Experimental studies in the physiology of heredity, Rep. Evol. Comm. R. Soc., 3: 2-11.
Carter, T. C., and Robertson, A. (1952) A mathematical treatment of genetical recombination using a four-strand model, Proc. Roy. Soc. B, 139: 410-426.
Correns, C. (1900) G. Mendels Regel über das Verhalten der Nachkommenschaft der Rassenbastarde, Ber. dt. bot. Ges., 18: 158-168. (Reprinted in 1950 as "G. Mendel's law concerning the behavior of progeny of varietal hybrids" in Genetics, Princeton, 35: suppl. pp. 33-41.)
de Vries, H. (1900) Das Spaltungsgesetz der Bastarde, Ber. dt. bot. Ges., 18: 83-90. (Reprinted in 1901 as "The law of separation of characters in crosses", J. R. Hort. Soc., 25: 243-248.)
de Vries, H.
(1903) Befruchtung und Bastardierung, Leipzig. (Reprinted as "Fertilization and hybridization" in C. S. Gager (1910) Intracellular pangenesis including a paper on fertilization and hybridization, Open Court Publ. Co., Chicago, pp. 217-263.)
Fisher, R. A. (1922) The systematic location of genes by means of crossover observations, American Naturalist, 56: 406-411.
Fisher, R. A. (1936) Has Mendel's work been rediscovered? Ann. Sci., 1: 115-137.
Fisher, R. A., Lyon, M. F., and Owen, A. R. G. (1947) The sex chromosome in the house mouse, Heredity, 1: 335-365.
Foss, E., Lande, R., Stahl, F. W., and Steinberg, C. M. (1993) Chiasma interference as a function of genetic distance, Genetics, 133: 681-691.
Goldgar, D. E., and Fain, P. R. (1988) Models of multilocus recombination: nonrandomness in chiasma number and crossover positions, Am. J. Hum. Genet., 43: 38-45.
Haldane, J. B. S. (1919) The combination of linkage values, and the calculation of distances between the loci of linked factors, J. Genetics, 8: 299-309.
Karlin, S. and Liberman, U. (1979) A natural class of multilocus recombination processes and related measures of crossover interference, Adv. Appl. Prob., 11: 479-501.
Karlin, S. and Liberman, U. (1983) Measuring interference in the chiasma renewal formation process, Adv. Appl. Prob., 15: 471-487.
King, J. S., and Mortimer, R. K. (1990) A polymerization model of chiasma interference and corresponding computer simulation, Genetics, 126: 1127-1138.
Mather, K. (1935) Reduction and equational separation of the chromosomes in bivalents and multivalents, J. Genet., 30: 53-78.
McPeek, M. S., and Speed, T. P. (1995) Modeling interference in genetic recombination, Genetics, 139: 1031-1044.
Mendel, G. (1866) Versuche über Pflanzen-Hybriden, Verh. naturf. Ver. Brünn, 4: 3-44. (Reprinted as "Experiments in plant-hybridisation" in Bateson, W. (1909) Mendel's principles of heredity, Cambridge Univ. Press, Cambridge, pp. 317-361.)
Morgan, T. H. (1911) Random segregation versus coupling in Mendelian inheritance, Science, 34: 384.
Mortimer, R. K. and Fogel, S. (1974) Genetical interference and gene conversion, in R. F. Grell, ed., Mechanisms in Recombination, Plenum Publishing Corp., New York, pp. 263-275.
Muller, H. J. (1916) The mechanism of crossing-over, American Naturalist, 50: 193-221, 284-305, 350-366, 421-434.
Ott, J. (1991) Analysis of human genetic linkage, rev. ed., The Johns Hopkins University Press, Baltimore.
Owen, A. R. G. (1949) The theory of genetical recombination, I. Long-chromosome arms, Proc. R. Soc. B, 136: 67-94.
Owen, A. R. G. (1950) The theory of genetical recombination, Adv. Genet., 3: 117-157.
Risch, N. and Lange, K. (1979) An alternative model of recombination and interference, Ann. Hum. Genet. Lond., 43: 61-70.
Sturtevant, A. H. (1913) The linear arrangement of six sex-linked factors in Drosophila, as shown by their mode of association, J. Exp. Zool., 14: 43-59.
Sturtevant, A. H. (1915) The behavior of the chromosomes as studied through linkage, Zeit. f. ind. Abst. u. Vererb., 13: 234-287.
Sutton, W. S. (1903) The chromosomes in heredity, Biol. Bull. mar. biol. Lab., Woods Hole, 4: 231-248.
Tschermak, E. von (1900) Über künstliche Kreuzung bei Pisum sativum, Ber. dt. bot. Ges., 18: 232-239. (Reprinted in 1950 as "Concerning artificial crossing in Pisum sativum" in Genetics, Princeton, 26: 125-135.)
Whitehouse, H. L. K. (1973) Towards an understanding of the mechanism of heredity, St. Martin's Press, New York.
Zhao, H., McPeek, M. S., and Speed, T. P. (1995) A statistical analysis of chromatid interference, Genetics, 139: 1057-1065.
Zhao, H., Speed, T. P., and McPeek, M. S. (1995) A statistical analysis of crossover interference using the chi-square model, Genetics, 139: 1045-1056.
MONTE CARLO METHODS IN GENETIC ANALYSIS
SHILI LIN*
Abstract. Many genetic analyses require computation of probabilities and likelihoods of pedigree data. With more and more genetic marker data deriving from new DNA technologies becoming available to researchers, exact computations are often formidable with standard statistical methods and computational algorithms. The desire to utilize as much available data as possible, coupled with the complexities of realistic genetic models, pushes traditional approaches to their limits. These methods encounter severe methodological and computational challenges, even with the aid of advanced computing technology. Monte Carlo methods are therefore increasingly being explored as practical techniques for estimating these probabilities and likelihoods. This paper reviews the basic elements of the Markov chain Monte Carlo method and the method of sequential imputation, with an emphasis upon their applicability to genetic analysis. Three areas of application are presented to demonstrate the versatility of Markov chain Monte Carlo for different types of genetic problems. A multilocus linkage analysis example is also presented to illustrate the sequential imputation method. Finally, important statistical issues of Markov chain Monte Carlo and sequential imputation, some of which are unique to genetic data, are discussed, and current solutions are outlined.
* Department of Statistics, University of California, Berkeley, CA 94720.
1. Introduction. Most human genetic analyses require computation of probabilities and likelihoods of genetic data from pedigrees. Statistical methods and computational algorithms have been developed to accomplish this task. The most efficient ones have been based on a recursive algorithm, the simplest case of which was developed by Elston and Stewart (1971). Successive algorithms for more complex cases were given in Lange and Elston (1975), Cannings, Thompson and Skolnick (1978), Lange and Boehnke (1983), and Lathrop et al. (1984). Unfortunately, these methods are sometimes incapable of handling the data that geneticists and genetic epidemiologists are facing today. The past decade has seen an explosive growth of molecular genetic technology which has led to a massive amount of DNA data becoming available to researchers; see Murray et al. (1994) for a comprehensive human linkage map. It is imperative that these data be utilized as much as possible to maximize the power of, for example, constructing genetic maps, mapping disease genes and finding plausible genetic models. Practical and theoretical bounds on the computational feasibility of probabilities and likelihoods become a major limitation of genetic analysis and a great challenge in statistical genetics. A routine multipoint linkage analysis using LINKAGE, developed by Lathrop et al. (1984), may take a few weeks or even months for certain problems, such as those encountered by Schellenberg et al. (1990, 1992) and Easton et al. (1993). This is impractical and expensive, and hence unacceptable for regular screening processes. Advanced computing technology and good computer programming practice have been used to overcome some of the computational difficulties with multipoint analysis. Cottingham et al. (1993) and Schaffer et al. (1994) demonstrated that basic computer science techniques, such as "common sub-expression elimination" by factoring expressions to reduce arithmetic evaluations, can be used to improve the performance of algorithms. These techniques have proved to be quite effective in exploiting basic biological features such as the "sparsity" of the joint genotype array and the "similarity" of genotypes. Furthermore, Miller et al. (1991), Goradia et al. (1992), and Dwarkadas et al. (1994) investigated the use of parallel computers, which have become less and less expensive with the advance of computer technology, as another way to achieve speedup. However, as pointed out by Cottingham et al. (1993), although the improvements are substantial, there will always be more difficult problems that geneticists want to solve and that will demand yet more computer power. Therefore, good computer programming practice should be combined with advances in statistical methods to achieve even greater improvements. In the last few years, a completely different approach involving the estimation of probabilities and likelihoods via the Monte Carlo method has emerged. We include under this heading the sequential imputation approach of Kong et al. (1994) and Irwin et al. (1994), and the Markov chain Monte Carlo (MCMC) approaches of Lange and Matthysse (1989), Sheehan (1990), Lange and Sobel (1991), Thompson and Guo (1991), Guo and Thompson (1992), Thomas and Cortessis (1992), Sheehan and Thomas (1993), Lin et al. (1993), and Thompson (1994a,b). These methods have been successfully applied to various problems, some of which will be demonstrated later as examples. 2. Monte Carlo methods in genetic analysis. Although Monte Carlo simulation methods have been proposed for some time in human pedigree analysis, they have only recently emerged as a practical alternative to analytical statistical methods.
Traditionally, simulation methods have been used to study some unknown properties of analysis methods, or to compare the performances of alternative methods. The use of Monte Carlo simulation methods as tools to provide solutions to problems for which analytical solutions are impractical was not pursued until quite recently. Preliminary investigations have revealed that these methods are of particular relevance to genetic analysis problems in which complex traits, complex genealogical structures or large numbers of polymorphic loci are involved. Simulation methods in genetics can be traced back to the 1920's, when Wright and McPhee (1925) estimated inbreeding by making random choices in tracing ancestral paths for livestock. Ott (1974, 1979) advocated the use of simulation methods as a tool for human pedigree analysis, but this did not receive much attention at the time. More recently, a straightforward Monte Carlo method known as gene-dropping was proposed by MacCluer et al. (1986). First, genotypes for founders are generated according to
the relevant population probabilities. Next, gene flow down the pedigree is simulated according to the rules of inheritance postulated by Mendel (1865). Finally, outcomes which are inconsistent with the observed phenotypes are discarded. This results in a random sample from the genotypic configuration space. Approximations to any desired probabilities can thus be obtained by Monte Carlo methods. In small pedigrees, the method will successfully produce realizations of genotypes consistent with phenotypes. However, this method does not work well in pedigrees of even moderate size, for in such cases it is extremely unlikely to give samples which are compatible with the observed phenotypes. Ploughman and Boehnke (1989) described a Monte Carlo method to estimate the power of a study to detect linkage for a complex genetic trait, given a hypothesized genetic model for the trait. They proposed to calculate conditional probabilities recursively and then sample from the posterior genotype distribution conditional on the observed phenotypes at the trait locus. These conditional probabilities are generated in the process of calculating the likelihood of a pedigree by using the procedure of Lange and Boehnke (1983), a generalization of Elston and Stewart (1971). Marker genotypes are then simulated conditional on the simulated trait genotypes (Boehnke, 1986). This method reduces exact computations on two loci jointly to exact computations on the trait locus only. However, it is necessary to store a large amount of intermediate data, especially when the method is extended to complex pedigrees with inbreeding loops. The limitations of this method are the same as those of other methods based on the Elston-Stewart algorithm. Ott (1989) also described a simulation method for randomly generating genotypes at one or more marker loci, given observed phenotypes at loci linked among themselves and with the marker.
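The gene-dropping scheme described above can be sketched for a toy case: a single diallelic locus, two founder parents, and one child observed to show the recessive trait (the allele frequency and the pedigree are invented purely for illustration):

```python
import random

def gene_drop(p_a=0.5, n_samples=20_000, seed=0):
    """Minimal gene-dropping sketch: draw founder genotypes from Hardy-Weinberg
    proportions, segregate alleles to the child by Mendel's rules, and discard
    outcomes inconsistent with the observed phenotype (child is 'aa')."""
    rng = random.Random(seed)

    def founder():
        # Two independent allele draws: 'A' with frequency p_a, else 'a'.
        return tuple(sorted('A' if rng.random() < p_a else 'a' for _ in range(2)))

    kept = []
    for _ in range(n_samples):
        mother, father = founder(), founder()
        # Mendelian segregation: one allele chosen at random from each parent.
        child = tuple(sorted((rng.choice(mother), rng.choice(father))))
        if child == ('a', 'a'):  # consistency with the observed phenotype
            kept.append((mother, father, child))
    return kept, n_samples
```

With p_a = 1/2, about a quarter of the simulated outcomes survive the consistency check (the allele transmitted by each parent is 'a' with chance 1/2), and every retained parental genotype necessarily carries at least one 'a' allele; the surviving fraction shrinks rapidly as more observed individuals are conditioned on, which is exactly the weakness of gene-dropping noted in the text.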
In the past decade, statisticians have realized that many problems previously thought intractable can be solved fairly straightforwardly by Markov chain Monte Carlo (MCMC) methods. The method was proposed long ago and has been widely used in statistical physics; see Metropolis et al. (1953) for the original work, and Rikvold and Gorman (1994) and references therein for a review of recent work. Since the work of Geman and Geman (1984), MCMC has received a great deal of attention in the statistical community, especially in Bayesian computation. The papers of Tanner and Wong (1987), Gelfand and Smith (1990), Smith and Roberts (1993) and Gilks et al. (1993) are a few examples of recent research in this area. Following its entry into statistics, MCMC was quickly adapted to genetic analysis. The basic idea is to obtain dependent samples (essentially realizations of Markov chains) of underlying genotypes consistent with the observed phenotypes. Probabilities and likelihoods can then be estimated from these dependent samples. Lange and Matthysse (1989) investigated the feasibility of one MCMC method, the Metropolis algorithm, to simulate genotypes for traits conditional upon observed data. Independent
of the work of Lange and Matthysse, Sheehan, in her 1990 PhD thesis, investigated the use of the Gibbs sampler of Geman and Geman (1984) to sample genotypes underlying simple discrete genetic traits observed on large pedigrees. She demonstrated that, for a trait at a single diallelic locus, the Gibbs sampler provided quite accurate estimates of the ancestral probabilities of interest in a complex pedigree of Greenland Eskimos. Guo and Thompson (1992) showed that the Gibbs sampler can also be applied to quantitative traits. Monte Carlo EM algorithms were developed, in conjunction with Monte Carlo likelihood ratio evaluation by Thompson and Guo (1991), to estimate parameters of complex genetic models. Lange and Sobel (1991) and Thomas and Cortessis (1992) developed MCMC methodologies relevant to two-point linkage analysis. The validity of these methods rests on the crucial assumption that every locus involved is diallelic. This is undesirable, particularly in linkage analysis, because multiallelic markers are in general much more informative, and thus highly preferred. The research of Sheehan and Thomas (1993), Lin et al. (1993, 1994b) and Lin (1995) has addressed this issue, so that MCMC methods can be applied to more realistic genetic data where other methods fail.

The sequential imputation method of Kong et al. (1994) is another Monte Carlo method that has recently been implemented for multilocus linkage problems by Irwin et al. (1994) and Irwin (1995). It is essentially an importance sampling technique (see e.g. Hammersley and Handscomb (1964)) in which missing data on genetic loci are imputed conditional on the observed data. Genetic loci are ordered and processed one at a time, with previously imputed values treated as observed for later conditioning. By repeating the process many times, a collection of complete data sets is obtained, with associated weights to assure appropriate representation of the probability distribution.
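As a toy illustration of the sequential imputation scheme just described, the sketch below imputes a chain of binary "genotypes" locus by locus and accumulates the predictive weights, whose average estimates the likelihood of the observed data. The Markov transition and penetrance probabilities are hypothetical stand-ins for a real genetic model.

```python
import random

# Toy sequential imputation on a chain of L loci.  Hidden "genotypes" g_l
# are 0/1 with Markov dependence (a stand-in for linkage between adjacent
# loci); observations d_l are noisy copies (a stand-in for penetrance).

TRANS = 0.9   # P(g_l == g_{l-1}): hypothetical linkage model
EMIT = 0.8    # P(d_l == g_l): hypothetical penetrance model

def p_emit(d, g):
    return EMIT if d == g else 1.0 - EMIT

def p_trans(g, g_prev):
    return TRANS if g == g_prev else 1.0 - TRANS

def impute_once(data):
    """One pass of sequential imputation; returns (imputed g, weight)."""
    g, w = [], 1.0
    for l, d in enumerate(data):
        # Conditional distribution of g_l given d_l and earlier imputations.
        probs = {}
        for cand in (0, 1):
            prior = 0.5 if l == 0 else p_trans(cand, g[-1])
            probs[cand] = prior * p_emit(d, cand)
        norm = probs[0] + probs[1]   # predictive P(d_l | earlier data, g*'s)
        w *= norm                    # accumulate the predictive weight
        u = random.random() * norm
        g.append(0 if u < probs[0] else 1)
    return g, w

def likelihood(data, n):
    """Estimate P(data) by averaging the weights of n independent passes."""
    return sum(impute_once(data)[1] for _ in range(n)) / n
```

Because each locus is imputed from its one-step conditional, the accumulated weight is an unbiased estimator of the likelihood, mirroring the weighting scheme of Kong et al. (1994) in this simplified setting.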
This method has been demonstrated to be a computationally efficient approach for problems with a large number of loci and simple pedigrees, i.e. pedigrees without loops. For pedigrees with many loops, it has the same limitations as other methods based on the Elston-Stewart algorithm. The rest of this paper is devoted to the methodology and applications of MCMC and sequential imputation in genetic problems. We first review the basic MCMC algorithm and how it can be applied to genetic analysis. We then present three applications of MCMC to genetic problems: inference of ancestral probabilities on complex pedigrees, estimation of likelihoods in multipoint linkage analysis, and inference with complex traits. The method of sequential imputation and its application to a multilocus linkage problem follow. Finally, we discuss several specific statistical issues associated with the application of MCMC and sequential imputation to genetic problems.
MONTE CARLO METHODS IN GENETIC ANALYSIS
SHILI LIN
3. Markov chain Monte Carlo methods. Whether one is interested in computing the probability that a certain individual carries a gene for a recessive trait, or the multilocus likelihood function in a linkage analysis, the problem can almost always be viewed as estimating an expectation with respect to the conditional genotype distribution $P_\theta(g \mid d)$. Here, $g$ is the configuration of genotypes (single-locus or multilocus, depending on the context of the application), $d$ is the observed phenotypic data and $\theta$ is a vector of parameters. Thus, the objective is to simulate from the distribution $P_\theta(g \mid d)$, so that the relevant expectation can be estimated by a sample average. Note that although

$$ P_\theta(g \mid d) \propto P_\theta(d \mid g)\,P_\theta(g), $$

computation of the normalizing constant

$$ P_\theta(d) = \sum_g P_\theta(d \mid g)\,P_\theta(g) $$

is usually formidable. Since the distribution of interest $P_\theta(g \mid d)$ is therefore known only up to a normalizing constant, direct simulation from it is impossible. Note that $P_\theta(d)$ is the likelihood and is sometimes of interest in itself. The Metropolis-Hastings family of algorithms are MCMC methods which provide ways of simulating dependent realizations that are approximately from a distribution known only up to a constant of proportionality (Hastings, 1970). In other words, Metropolis-Hastings algorithms are methods of constructing Markov chains with the distribution of interest as a stationary distribution. In the genetic analysis setting discussed in the current paper, the distribution of interest is discrete and the state space is finite. The general Hastings algorithm employs an auxiliary function $q(g^*, g)$ such that $q(\cdot, g)$ is a probability distribution for each $g$. The following algorithm defines the required Markov chain (Hastings, 1970). Let $g^{(1)}$ be the starting state of the Markov chain. Successive states are then generated iteratively. Given that the current state is $g^{(t)}$, $t = 1, 2, \ldots$, generation of the next state $g^{(t+1)}$ follows these steps:
1. Simulate a candidate state $g^*$ from the proposal distribution $q(\cdot, g^{(t)})$ as specified above;
2. Compute the Hastings acceptance probability
$$ r = r(g^*, g^{(t)}) = \min\left\{ \frac{P_\theta(g^* \mid d)\, q(g^{(t)}, g^*)}{P_\theta(g^{(t)} \mid d)\, q(g^*, g^{(t)})},\; 1 \right\}, $$
which is so designed that the Markov chain will indeed have $P = P_\theta(\cdot \mid d)$ as a stationary distribution;
3. Accept $g^*$ with probability $r$. That is, with probability $r$ the Markov chain moves to $g^{(t+1)} = g^*$; otherwise, the chain remains at $g^{(t+1)} = g^{(t)}$.
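The three steps above can be sketched on an arbitrary finite state space. The unnormalized target and the symmetric proposal below are hypothetical toy choices, standing in for $P_\theta(g \mid d)$ and $q$; note that only ratios of the target appear, so its normalizing constant is never needed.

```python
import random

# Minimal Metropolis-Hastings sampler on a finite state space, following the
# three steps in the text.  The target is specified only up to its
# normalizing constant, exactly as with P_theta(g | d).

UNNORM = {0: 1.0, 1: 3.0, 2: 6.0}            # unnormalized target

def proposal(current):
    """q(., g): propose one of the other states uniformly (symmetric)."""
    return random.choice([s for s in UNNORM if s != current])

def mh_chain(start, n_steps):
    g = start
    path = []
    for _ in range(n_steps):
        cand = proposal(g)
        # Symmetric q, so the Hastings ratio reduces to the Metropolis ratio.
        r = min(UNNORM[cand] / UNNORM[g], 1.0)
        if random.random() < r:
            g = cand                          # accept the candidate
        path.append(g)                        # otherwise remain at g
    return path
```

A long run of `mh_chain` visits the three states with frequencies approaching the normalized target probabilities 0.1, 0.3 and 0.6.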
It can be verified easily that the distribution of interest $P_\theta(g \mid d)$ is indeed a stationary distribution of the Markov chain just defined (Lin, 1993). Note that $P$ enters the algorithm only through the ratio in the Hastings acceptance probability, which is why $P$ need only be known up to a constant. Provided that the auxiliary function is chosen so that the chain is ergodic, that is, aperiodic and irreducible, realizations of the chain (after a sufficient number of steps for convergence) can be regarded as draws from $P_\theta(g \mid d)$. These realizations can then be used to estimate the required expectation. Performance of the estimate depends on the choice of the auxiliary function $q$. A special case of the Hastings algorithm is the Metropolis algorithm (Metropolis et al., 1953). If the auxiliary function is symmetric, that is, $q(g^*, g) = q(g, g^*)$, then the acceptance probability is $\min\{P_\theta(g^* \mid d)/P_\theta(g \mid d),\, 1\}$. Therefore, if the candidate state is at least as probable as the current state, the process moves to the new state; otherwise, it moves to the new state with probability equal to the ratio of the probabilities of the proposed and current states. Another special case of the Hastings algorithm is the Gibbs sampler (Geman and Geman, 1984). For the Gibbs sampler, each coordinate of $g = (g_1, g_2, \ldots, g_n)$ is updated in turn, where $g_i$ is the genotype (again, single-locus or multilocus) of the $i$th individual in the pedigree and $n$ is the size of the pedigree. When updating the $i$th coordinate $g_i$, the proposal distribution $q$ is chosen to be $P_\theta^{(i)}(g_i^* \mid g_{-i}, d)$, where $g_{-i} = (g_1, \ldots, g_{i-1}, g_{i+1}, \ldots, g_n)$ is the configuration of genotypes of all individuals in the pedigree except the $i$th. Denote $g^* = (g_1, \ldots, g_{i-1}, g_i^*, g_{i+1}, \ldots, g_n)$. Since

$$ P_\theta(g^* \mid d)\,P_\theta^{(i)}(g_i \mid g_{-i}^*, d) = P_\theta(g \mid d)\,P_\theta^{(i)}(g_i^* \mid g_{-i}, d) $$

for any $i \in \{1, \ldots, n\}$, any proposed candidate $g^*$ is accepted with probability 1.
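A minimal sketch of this coordinate-wise updating, using a hypothetical two-coordinate joint distribution in place of a pedigree; in a real pedigree each full conditional would involve only an individual's phenotype and the genotypes of its neighbors.

```python
import random

# Coordinate-wise Gibbs sampling on a toy configuration g = (g1, g2), each
# coordinate 0/1.  The joint table is hypothetical; each coordinate is
# resampled from its full conditional given the other, so every "proposal"
# is accepted with probability 1.

JOINT = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}

def gibbs_update(g, i):
    """Sample coordinate i from its full conditional given the other one."""
    other = g[1 - i]
    weights = []
    for v in (0, 1):
        state = (v, other) if i == 0 else (other, v)
        weights.append(JOINT[state])
    u = random.random() * (weights[0] + weights[1])
    v = 0 if u < weights[0] else 1
    return (v, g[1]) if i == 0 else (g[0], v)

def gibbs_scan(g):
    """One scan updates every coordinate once."""
    for i in (0, 1):
        g = gibbs_update(g, i)
    return g
```

Iterating `gibbs_scan` yields a Markov chain whose stationary distribution is the joint table above; the empirical frequencies of the four states converge to 0.4, 0.1, 0.2 and 0.3.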
When all the coordinates have been updated once, that constitutes a scan. Assuming Mendelian segregation, the conditional genotype distribution $P_\theta^{(i)}(g_i \mid g_{-i}, d)$ of an individual for Gibbs updating depends only on the phenotype of that individual and the current genotypes of the neighbors, namely the parents (if not a founder), spouses and offspring. Hence the Gibbs sampler is easy to implement, owing to this local dependence structure. However, the absence of rejection is not necessarily advantageous: the Gibbs sampler can make only small changes in $g$. Nevertheless, the Gibbs sampler has been used extensively in genetic analysis, not only because it is easy to sample from the conditional distributions, but also because other proposal distributions may result in rejecting almost all the proposed candidate states.

Standard errors are frequently employed to assess the estimates. If a Markov chain is aperiodic and irreducible with a finite state space, then the following central limit theorem holds. In estimating an expectation $\mu = E_P(f(g))$ by

$$ \hat{\mu} = \frac{1}{N} \sum_{t=1}^{N} f(g^{(t)}), $$

we may assert that $\sqrt{N}(\hat{\mu} - \mu)$ converges in distribution to $N(0, \sigma_f^2)$, where $f$ is $P$-integrable and $\sigma_f^2$ can be estimated. Following Hastings (1970), we divide the realization $\{g^{(t)};\ 1 \le t \le N\}$ into $L$ batches, each of which consists of $K$ consecutive observations ($KL = N$) of the genotypic configuration $g$. Let $\bar{\mu}_l$ denote the $l$th batch mean; then

$$ s_P^2 = \frac{\sum_{l=1}^{L} (\bar{\mu}_l - \hat{\mu})^2}{L(L-1)} $$

provides a satisfactory estimate of $\sigma_f^2/N$, provided the batch means are not significantly autocorrelated. Hence $s_P$ is the estimated Monte Carlo standard error of $\hat{\mu}$. In theory, MCMC methods can easily be applied to estimate probabilities and likelihoods of interest in many areas of application. Many technical problems exist in practice, however. Specifically, the following are some of the main problems associated with the application of MCMC to genetic analysis. First of all, finding a starting configuration of genotypes consistent with the observed data is a non-trivial problem. Furthermore, a Markov chain constructed from the Gibbs sampler may not be irreducible, a necessary requirement for the inference to be valid. The distribution of interest $P_\theta(g \mid d)$ usually has multiple modes, which is another difficulty facing MCMC exploration of the probability surface. These problems will be addressed in detail in section 6.

4. Applications of MCMC to three genetic problems. Three specific types of problems using MCMC methods are discussed and possible solutions are described in this section. Genetic pedigree analysis involves three components: the genealogical structure (pedigree), the mode of inheritance (genetic model) for the trait of interest, and the observed data (phenotypes). Our first application assumes all three components are known, and that one is primarily interested in the probability that a certain individual carries a specific gene. This type of problem usually arises with large and complex genealogical structures. The second application is to map a locus onto a known map of markers using multipoint linkage analysis, where the number of markers and the number of alleles per marker are too large to be treated by analytical methods using standard packages. The third application involves inference concerning the mode of inheritance of a complex trait, assuming that the other two components are known.
Complex models are usually needed to describe this type of genetic data adequately. These three examples demonstrate that MCMC methods can be applied to a large class of problems that are not amenable to treatment by standard exact methods and pedigree analysis packages.

4.1. Inference of ancestral probabilities on complex pedigrees. MCMC methods are applied here to estimate the probabilities that specific founder individuals carry a gene, given the phenotypic data on pedigrees that are large and also very complex, i.e. with many inbreeding loops. These probabilities may be of interest in population genetics or genetic counseling. One such example concerns the estimation of the allele frequency of the B gene among Greenland Eskimos (Sheehan, 1990). Another is the estimation of founder carrier probabilities for a very rare recessive lethal trait in a Hutterite genealogy (Lin et al., 1994a). Genetic models for this type of problem are usually quite simple. However, these populations are often isolated for geographic or religious reasons. The pedigrees are thus very complex, with many loops, which makes exact computation using standard methods of pedigree analysis impossible, due to insufficient computer memory. Figure 4.1 depicts the complexity of the Hutterite genealogy studied by Lin et al. (1994a). Two Hutterite families were observed to segregate the very rare recessive lethal trait infantile hypophosphatasia. The ancestors of the two affected individuals were traced back 11 generations to 48 founders, giving a 221-member pedigree. The genealogy of the Greenland Eskimos studied by Sheehan (1990) is even more complex and is not shown here. By employing an MCMC algorithm with an appropriately chosen auxiliary function $q$, one obtains $N$ Monte Carlo realizations $g^{(t)}$, $t = 1, \ldots, N$.
These realizations can be regarded (approximately) as draws from $P(g \mid d)$, the joint posterior distribution of genotypes on the pedigree, conditional on the phenotypic data. From these realizations, any expectation under the conditional distribution can be estimated. To be specific, consider a recessive lethal trait with $A$ denoting the normal allele and $a$ the disease allele. Then the estimate of the probability that individual $j$ is a carrier is

$$ \hat{P}(Y_j = Aa) = \frac{1}{N} \sum_{t=1}^{N} I\bigl(Y_j^{(t)} = Aa\bigr), $$
where $I$ is the indicator function. That is, the estimated probability is simply the proportion of realizations in which $j$ has genotype $Aa$. Lin et al. (1994a) used a modified Gibbs sampler with $N = 1{,}000{,}000$ realizations to obtain their results. They were mainly interested in which one of the 48 founders was most likely to have introduced the mutant gene into the population. The estimated probabilities show that founders 1, 2, 3, 4, 6 and 7 (shaded grey in Figure 4.1) were all much more probable carriers than the other founders. Founder 1 (with probability 0.197) was
FIG. 4.1. Marriage node graph of a Hutterite pedigree, with the two individuals affected by HOPS shaded black. The six founders of main interest, shaded grey, are 1, 2, 3, 4, 6 and 7.
by far the most probable carrier, as is expected simply by observing the relationships of individuals in the pedigree. The carrier probabilities of these six founders and their estimated standard errors are shown in Table 4.1. Founders 17, 18, 56, 57 and 58 (also shaded grey in Figure 4.1) were the only additional founders whose probabilities of being carriers were higher than 5%. See Lin et al. (1994a) for more details.

TABLE 4.1
Estimated posterior carrier probabilities, conditional on the data, obtained by Lin et al. (1994a), for the Hutterite pedigree and data in Figure 4.1. Listed are the six founders with relatively higher probabilities of being carriers.

founder label    carrier probability    standard error
1                0.197                  0.012
2                0.099                  0.005
3                0.109                  0.006
4                0.109                  0.006
6                0.105                  0.010
7                0.113                  0.010

4.2. Estimation of likelihoods in multipoint linkage analysis. Computing multilocus likelihoods is an essential part of multipoint linkage analysis. However, due to the large amounts of data now available, standard methods and algorithms, such as LINKAGE (Lathrop et al., 1984), are sometimes impractical. Ott (1991) provides a detailed account of the basic genetic elements pertinent to linkage analysis. The computation required for likelihood analysis using LINKAGE grows exponentially. The factors contributing to the computational demand are mostly the following: the number of markers, the number of alleles per marker, the number of unobserved individuals, and the degree of complexity of the pedigree (Cottingham et al., 1993). The lod score of multipoint linkage analysis is the common logarithm of the likelihood ratio $L_1/L_0$, where $L_1$ is the likelihood under linkage and $L_0$ is the likelihood in the absence of linkage. In the context of mapping a new locus onto a known map of markers, the multipoint lod score can be expressed as $\mathrm{lod}(\theta) = \log\bigl(L(\theta)/L(\theta_0)\bigr)$, where $\theta$ specifies the map position of the locus in question relative to the known marker map, and $\theta_0$ is the special case in which the new locus is unlinked to any of the markers. Note that
$$ L(\theta) = P_\theta(d) = \sum_g P_\theta(d \mid g)\,P_\theta(g), $$
where $g = (g_1, \ldots, g_n)$ is a configuration of multilocus genotypes. A straightforward approximation of $L(\theta)$ would use the method of gene-dropping described in section 2: outcomes incompatible with the observed phenotypic data are discarded and the likelihood is approximated by averaging over the remaining ones. As pointed out earlier, this method does not work in pedigrees of even moderate size, because in such cases it is extremely unlikely to produce samples compatible with the observed phenotypes. Note that

$$ \mathrm{lod}(\theta) = \log \frac{L(\theta)}{L(\theta_0)} = \log \sum_g \frac{P_\theta(g, d)}{P_{\theta_0}(g, d)}\, P_{\theta_0}(g \mid d) = \log E_{\theta_0}\!\left[ \frac{P_\theta(g, d)}{P_{\theta_0}(g, d)} \,\Bigm|\, d \right]. $$

The expectation in the last expression is taken with respect to the distribution $P_{\theta_0}(g \mid d)$.
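A toy numerical sketch of this conditional-expectation identity, with a single 0/1 "genotype" and a hypothetical penetrance standing in for a full pedigree model; exact posterior draws stand in for an MCMC run.

```python
import math
import random

# Toy version of the single-theta0 reweighting estimator: g is a single 0/1
# "genotype", theta is P(g = 1), and the penetrance P(d = 1 | g) is a fixed
# hypothetical choice.  In real use g would be a multilocus configuration
# sampled from P_theta0(g | d) by MCMC.

PEN = {0: 0.2, 1: 0.7}                       # assumed P(d = 1 | g)

def joint(theta, g):
    """P_theta(g, d) with the observation fixed at d = 1."""
    prior = theta if g == 1 else 1.0 - theta
    return prior * PEN[g]

def sample_posterior(theta0, n):
    """Exact draws from P_theta0(g | d = 1), standing in for an MCMC run."""
    p1 = joint(theta0, 1) / (joint(theta0, 0) + joint(theta0, 1))
    return [1 if random.random() < p1 else 0 for _ in range(n)]

def lod(theta, theta0, samples):
    """log10 of the average likelihood ratio over the posterior samples."""
    ratios = [joint(theta, g) / joint(theta0, g) for g in samples]
    return math.log10(sum(ratios) / len(ratios))
```

Averaging the ratio $P_\theta(g,d)/P_{\theta_0}(g,d)$ over draws from $P_{\theta_0}(g \mid d)$ gives an unbiased estimate of $L(\theta)/L(\theta_0)$, so the whole curve near $\theta_0$ can be estimated from a single set of samples.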
Estimation of the whole lod score curve as a function of $\theta$ can therefore be done by simulation at a single $\theta_0$. Specifically, let $g^{(t)}$, $t = 1, 2, \ldots, N$, be $N$ realizations of an ergodic Markov chain with $P_{\theta_0}(g \mid d)$ as its equilibrium distribution. Then

$$ \log \frac{1}{N} \sum_{t=1}^{N} \frac{P_\theta(g^{(t)}, d)}{P_{\theta_0}(g^{(t)}, d)} $$

provides an estimate of $\mathrm{lod}(\theta)$. For $\theta$ close enough to $\theta_0$ the estimate will be good, as the sampling distribution $P_{\theta_0}$ is then not far from the target distribution $P_\theta$. It is therefore desirable to sample at several $\theta$ values spread throughout the range and to perform likelihood ratio evaluations at nearby values only. The following example illustrates the effectiveness of the Monte Carlo multipoint linkage analysis method described above. The data come from a set of pedigrees studied by Palmer et al. (1994). The objective is to map CSF1R relative to a map spanned by the markers D5S58, D5S72, D5S61 and D5S211, in that order, on chromosome 5. The recombination frequencies between successive pairs of adjacent markers are 0.22, 0.09 and 0.36. The number of alleles for these loci ranges from 3 to 8. The multilocus genotypic configurations $g^{(t)}$, $t = 1, \ldots, N$, were generated using a modified Gibbs sampler in which multilocus genotypes are updated individual-by-individual and locus-by-locus (Lin and Wijsman, 1994). Figure 4.2 shows the lod score curve estimated by the method described above, with genetic distance in centimorgans on the x-axis and the lod score on the y-axis. For this example exact computation is still feasible, so the exact solutions can be compared with the MCMC estimates, as shown in Figure 4.2. It is clear from the figure that MCMC produces a satisfactory estimate of the exact lod score curve, and it required only 1/15 of the CPU time needed for computation using LINKAGE (Lin and Wijsman, 1994). With an additional marker, exact computation would no longer be practical, so that MCMC approximation becomes an essential tool.

4.3. Inference of the mode of inheritance for complex traits. Many common genetic diseases exhibit both genetic and non-genetic components. These components may interact with one another, leading to the manifestation of the disease. Such traits are not simple Mendelian traits.
In order to describe them adequately, complex models are usually needed. This is especially important for localizing disease genes, because linkage analysis is sensitive to misspecification of the model. Furthermore, using larger pedigrees is usually more powerful than using smaller pedigrees, such as nuclear families. The complexity of the model and the use of large complex pedigrees prevent the usual methods from being feasible. Approximation methods exist, such as PAP (Hasstedt and Cartwright, 1979). However, it has been almost impossible to evaluate the performance
FIG. 4.2. Five-point lod score curve obtained by MCMC using the method of Lin (1995). Exact values from LINKAGE (Lathrop et al., 1984) are also shown for comparison.
of these methods. Therefore, MCMC has been explored as an alternative technique to utilize fully the available genetic information. The role of MCMC is two-fold. On the one hand, MCMC can itself be used as a method to estimate the parameters of the model. On the other hand, MCMC can be used to check the validity of other approximation methods, because MCMC can achieve any desired degree of accuracy as long as the process is run for sufficient time. The latter may be of greater value, because other approximation methods are usually less computationally intensive and hence preferred if they yield satisfactory results. Guo and Thompson (1992) proposed a Monte Carlo method for estimating the parameters of a complex model by utilizing realizations from the Gibbs sampler. The method was, however, restricted to data from diallelic genetic systems. Further work was undertaken by Lin (1993) to extend these methods to data from multiallelic loci. We consider a mixed model, which is commonly used for investigating the mode of inheritance of complex traits. The observed quantitative trait data, $d$, are modeled as influenced additively by covariates (e.g. sex, age), a major gene, an additional polygenic heritable component, and the environment. Let $\beta$ denote the vector of fixed effects, including the major gene effects for a given configuration of genotypes. Let $a$ denote the vector of polygenic effects, assumed jointly distributed as $N(0, \sigma_a^2 A)$, where $A$ is the known numerator relationship matrix (Henderson, 1976). Let $e$ denote the vector of residuals (thought of as the environmental effects)
with joint distribution $N(0, \sigma_e^2 I)$. Then, for a given configuration of major genotypes and polygenic effects, the mixed model can be specified as

$$ d = X\beta + a + e, $$

where $X$ is the design matrix for the fixed effects. We are mainly interested in estimating the vector $\beta$ and the variances $\sigma_a^2$ and $\sigma_e^2$. Data from an informative genetic marker are incorporated into the estimation process so that the parameters of the model can be estimated more accurately. Therefore, if we let $m$ denote the observed marker data and $\theta$ denote the vector of parameters, including $\beta$ and the recombination frequency $r$ between the marker and the major gene locus, then the likelihood can be written as

$$ L(\theta) = P_\theta(d, m) = \sum_g f_\theta(d \mid g)\,P_\theta(m \mid g)\,P_\theta(g), $$
since $d$ and $m$ are conditionally independent given the 2-locus joint genotype $g$. The sum in the above formula is over all 2-locus genotypic configurations in the pedigree. Since the joint genotypes and the polygenic values are independent, the likelihood can also be written as

$$ L(\theta) = \sum_g \int_a f_\theta(d \mid g, a)\,P_\theta(m \mid g)\,P_\theta(g)\,f_\theta(a)\,da, $$
which is an explicit formula for evaluating the likelihood. The EM algorithm (Dempster et al., 1977) is employed to obtain estimates of the parameters, since this is essentially a missing data problem in that both $g$ and $a$ are unobserved (Guo and Thompson, 1992). For example, the EM equation for the recombination frequency $r$ between the trait and marker loci is

$$ r^* = \frac{E_\theta(R \mid d, m)}{E_\theta(H \mid d, m)}, $$
where $H = \sum_i H_i$ and $R = \sum_i R_i$ are the sufficient statistics for the recombination frequency $r$ (Thomas and Cortessis, 1992). The sums are over all parent-offspring triples of the pedigree, where $H_i$ is the number (0, 1, or 2) of doubly heterozygous parents in the $i$th parent-offspring triple, and $R_i$ is the number of recombination events in segregation from parents to offspring. Despite the simplicity of the EM framework, it is very difficult to evaluate these conditional expectations explicitly. The joint distribution $P_\theta(g, a \mid d, m)$ of genotypes and polygenic values given the observed data, which is the centerpiece for evaluating the conditional expectations, is intractable. Therefore, Monte Carlo estimates of these conditional expectations are obtained instead, using realizations from a Markov chain with the joint conditional distribution as its equilibrium distribution.
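The resulting Monte Carlo E-step is then just a ratio of averages over the realizations. The sketch below assumes hypothetical lists of $(H_i, R_i)$ counts per parent-offspring triple, as would be produced by such a Markov chain.

```python
# Monte Carlo E-step for the recombination frequency, following the EM
# equation in the text: r* = E(R | d, m) / E(H | d, m), with both conditional
# expectations replaced by averages over MCMC realizations.  The realizations
# below are hypothetical (H_i, R_i) counts per parent-offspring triple.

def em_update_r(realizations):
    """Each realization is a list of (H_i, R_i) pairs over the triples."""
    total_h = sum(h for real in realizations for h, _ in real)
    total_r = sum(r for real in realizations for _, r in real)
    if total_h == 0:
        raise ValueError("no informative (doubly heterozygous) parents")
    return total_r / total_h

# Example: two hypothetical realizations over three triples each.
reals = [
    [(2, 1), (1, 0), (0, 0)],
    [(2, 0), (1, 1), (1, 0)],
]
```

For the two realizations above, the updated estimate is $r^* = 2/7 \approx 0.286$; in a full Monte Carlo EM run this ratio would be recomputed at each iteration from fresh realizations drawn at the current parameter values.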
Thompson et al. (1993) applied these methods to a large family with elevated cholesterol levels. See Elston et al. (1975) for more about the pedigree and data. Estimates from MCMC were very similar to those from a different approximation method (Hasstedt and Cartwright, 1979) that is currently used routinely in the pedigree analysis of mixed models.

5. Sequential imputation and the MODY example. For multilocus linkage analysis, the sequential imputation method of Kong et al. (1994) has been implemented by Irwin et al. (1994) and Irwin (1995). Suppose that there are $L$ loci under consideration. Let $d_l$ and $g_l$ denote the data and the underlying genotypes at locus $l$, respectively, for $l = 1, 2, \ldots, L$. For a given parameter value $\theta$, the multilocus linkage likelihood $L(\theta) = P_\theta(d)$, where $d = (d_1, d_2, \ldots, d_L)$, can be estimated by the method of sequential imputation. The basic idea of sequential imputation is to generate independent samples of the genotypes $g = (g_1, \ldots, g_L)$ from a distribution $P_\theta^*(g \mid d)$ whose relationship to $P_\theta(g \mid d)$ will be specified below. These samples can also be used to estimate the likelihoods at other parameter values by an appropriately specified weighting scheme. To obtain a realization of $g$, the method draws the genotypes locus by locus from the appropriate sampling distributions. First $g_1^*$ is drawn from $P_\theta(g_1 \mid d_1)$ and the predictive weight $w_1 = P_\theta(d_1)$ is computed. Then, for each successive locus $l = 2, 3, \ldots, L$, $g_l^*$ is drawn from $P_\theta(g_l \mid d_1, \ldots, d_l, g_1^*, \ldots, g_{l-1}^*)$ and the accumulated predictive weight $w_l = w_{l-1} P_\theta(d_l \mid d_1, \ldots, d_{l-1}, g_1^*, \ldots, g_{l-1}^*)$ is computed. Note that the joint sampling distribution for $g^* = (g_1^*, \ldots, g_L^*)$ is
$$ P_\theta^*(g \mid d) = P_\theta(g_1 \mid d_1) \prod_{l=2}^{L} P_\theta(g_l \mid d_1, \ldots, d_l, g_1, \ldots, g_{l-1}) = w^{-1} P_\theta(d)\,P_\theta(g \mid d), $$

where $w = w_L = P_\theta(d_1) \prod_{l=2}^{L} P_\theta(d_l \mid d_1, \ldots, d_{l-1}, g_1, \ldots, g_{l-1})$. Consequently, averaging over $g$ using $P_\theta^*(g \mid d)$ we obtain
$$ E_{P_\theta^*}(w) = E_{P_\theta^*}\!\left[ \frac{P_\theta(g \mid d)}{P_\theta^*(g \mid d)}\, P_\theta(d) \right] = P_\theta(d). $$

It follows that $L(\theta)$
$= P_\theta(d)$ can be estimated by

$$ \hat{L}(\theta) = \frac{1}{N} \sum_{i=1}^{N} w^{(i)}, $$
where $w^{(1)}, \ldots, w^{(N)}$ are the accumulated weights of $N$ independent realizations $g^{(1)}, \ldots, g^{(N)}$ of $g$. In fact, the whole likelihood curve can be estimated via importance sampling from a set of such realizations based on a single parameter $\theta_0$. For instance, letting $\theta_1$ be any parameter value other than $\theta_0$,

$$ \hat{L}(\theta_1) = \frac{1}{N} \sum_{i=1}^{N} w^{(i)}\, \frac{P_{\theta_1}(g^{(i)}, d)}{P_{\theta_0}(g^{(i)}, d)} $$
provides an unbiased estimate of $L(\theta_1)$. One should note, however, that $\hat{L}(\theta_1)$ will be a good estimate only if $\theta_1$ is close to $\theta_0$.

The MODY example. A pedigree diagnosed to segregate Maturity Onset Diabetes of the Young (MODY) was used as an example by Irwin (1995) to demonstrate the method. See Irwin (1995) for a diagram of the 155-member simple pedigree and Bell et al. (1991) for a detailed description of the data. A multipoint linkage analysis was performed to study the location of the MODY gene relative to eight markers on chromosome 20. An estimated lod score curve was obtained by Irwin (1995) and is shown as Figure 5.1. The x-axis plots the distance in centimorgans, while the y-axis
FIG. 5.1. Nine-point lod score curve obtained by the method of sequential imputation for the MODY trait. (Figure 4.3 from Irwin (1995))
plots the lod scores. Exact computation of the likelihoods would have been impossible due to the large number of loci involved. The method of sequential imputation is feasible because one never processes more than one locus at a time. In some cases, however, even the sequential imputation computations are impossible. The computations required for drawing realizations from $P_\theta^*$ are performed by the recursive algorithm of Elston and Stewart (1971), which, as discussed in earlier sections, has computational difficulties if the pedigree is complex with many loops. Therefore, although the sequential imputation method has been demonstrated to be feasible and successful
for this large simple pedigree, it may fail to provide a practicable solution when data come from more complex pedigrees.

6. Some specific technical issues. 6.1. Finding a starting configuration. The convergence and ergodic theorems guarantee that appropriate probability estimates from the Markov chain realizations converge to the true probabilities, regardless of the starting state, as long as the Markov chain is aperiodic and irreducible. However, convergence can be very slow unless the starting point is chosen appropriately. Thompson (1994a) and Gelman and Rubin (1992) provided examples illustrating that a Markov chain can "get stuck" at a local mode which has negligible support from the data. Since good estimates depend on thorough exploration of the state space, a Markov chain started from a poor initial state may provide poor probability estimates within a given amount of computing time. Therefore, for applications of MCMC methods, it is of practical importance that the Markov chain start from a "good" state, not just any state with positive probability. Ideally, one would want to start from a state with high probability under the equilibrium distribution. For pedigree data, however, even finding a "legal" configuration of genotypes, i.e. genotypes consistent with the observed phenotypic data, is difficult for a multiallelic genetic system. This is because of the constraints imposed by the first law of Mendelian inheritance (Mendel, 1865), and the fact that phenotypic data are usually missing for several of the upper generations. One approach to finding an initial starting genotypic configuration is the method of gene-dropping described in section 2 above, with the gene-dropping process repeated until an outcome consistent with the observed phenotypes results.
However, the process might have to be repeated millions of times, even for pedigrees of moderate size, because in all but very small pedigrees it is virtually impossible to obtain samples compatible with the observed phenotypes. The method of Sheehan and Thomas (1993) offers another approach. With modified penetrances, the Markov chain is guaranteed eventually to find a legal state. In practice, this method may fail to find a legal state for quite a large number of scans, especially when the pedigree is large and the genetic system is highly polymorphic. Therefore, Wang and Thomas (1994) proposed a modification to the method. Instead of beginning with an arbitrary configuration of genotypes, they described a method for finding a more "likely" genotypic configuration from which to start the search for a legal one. They first assigned founder genotypes by sampling only from the set of genes that were present among the founders' descendants but had not been assigned to their spouses. They then assigned genotypes to non-founders conditional on the parents' genotypes and on the genotypes among their descendants. The following describes a deterministic algorithm for finding a probable starting configuration of genotypes. Individuals in the pedigree whose genotypes can be determined unequivocally from the phenotypes are assigned first. Then genotypes are assigned to the rest of the individuals in the pedigree backward in time, with the last generation processed first and the founders last. When assigning a genotype to an individual, it is ensured that the genotype is consistent with the genotypes of his/her spouses and children (including other children of his/her spouses), and with those of his/her parents and siblings (including half-sibs). This algorithm produces valid genotype assignments for the pedigrees that we have encountered in medical genetics studies. However, artificial counterexamples exist. When an illegal genotypic configuration does result, the algorithm needs to be fine-tuned and more care must be taken to reassign genotypes. Several examples have demonstrated that starting configurations found using this algorithm can be much more probable. Such a state is usually a better place from which to start a Markov chain, in order to avoid being trapped in a low-probability region.
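The consistency check underlying any such search for a legal starting configuration can be sketched as follows. The pedigree encoding and the recessive penetrance rule are simplifying assumptions for illustration; a real implementation would handle general penetrance models.

```python
# Checking that a candidate starting configuration is "legal": every
# non-founder's genotype must be obtainable from the parents under Mendelian
# segregation, and every genotype must be compatible with the observed
# phenotype.  Genotypes are unordered allele pairs stored as 2-tuples.

def mendel_ok(child, mother, father):
    """Can the child's two alleles come one from each parent?"""
    a, b = child
    return (a in mother and b in father) or (b in mother and a in father)

def phenotype_ok(genotype, phenotype):
    """Recessive trait: affected iff genotype ('a', 'a'); None = unobserved."""
    if phenotype is None:
        return True
    return (genotype == ('a', 'a')) == phenotype

def legal(genotypes, parents, phenotypes):
    """genotypes: {id: (allele, allele)}; parents: {id: (mother_id,
    father_id)} for non-founders; phenotypes: {id: True/False/None}."""
    for person, g in genotypes.items():
        if not phenotype_ok(g, phenotypes.get(person)):
            return False
        if person in parents:
            m, f = parents[person]
            if not mendel_ok(g, genotypes[m], genotypes[f]):
                return False
    return True
```

Any of the starting-configuration strategies above (gene-dropping, modified penetrances, or the deterministic backward assignment) can use such a check as its acceptance criterion.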
6.2. Multiallelic locus and irreducibility. General Hastings-Metropolis algorithms do not guarantee that the constructed Markov chains are ergodic, a necessary condition for drawing inferences from the realizations. Ergodicity needs to be checked for each specific problem. In many areas of MCMC application ergodicity is not a problem, but it can be in genetic applications. It has been proved that Markov chains constructed from the Gibbs sampler are irreducible for most traits associated with two alleles (Sheehan and Thomas, 1993). However, for a locus with at least three alleles, examples exist where the Markov chains associated with the Gibbs sampler are not irreducible (Lin et al., 1993). The limitation to diallelic loci is a major problem, especially in linkage analysis, because multiallelic marker loci are much more informative than diallelic loci and hence preferred. For MCMC methods to be useful for linkage analysis, irreducibility for multiple alleles must be achieved to ensure validity of results. Reducibility of the Gibbs sampler applied to pedigree data results from the strong constraints on the joint genotypes of neighboring individuals in a pedigree: many components of segregation and penetrance are 0. By updating only one individual at a time, part of the genotypic configuration space may never be sampled. The state space is then divided into several communicating classes, and states in different classes do not communicate. As a consequence, the ergodic theorem does not hold, and any inference made from the samples is invalid. Several methods have been proposed to solve this problem. Sheehan and Thomas (1993) proposed an importance sampling method. A small positive probability p is assigned to all zero penetrance probabilities or to all zero transmission probabilities, so that transitions between states in different classes can be realized via "illegal" states introduced by the
relaxation parameter p. Although in principle this circumvents the problem of reducibility, the practicality of the method raises some questions: there is an obvious trade-off between the size of p and the efficiency of the algorithm (Sheehan and Thomas, 1993; Gilks et al., 1993). Lin et al. (1993) showed that irreducibility of the Gibbs sampling Markov chain is achieved by assigning a small positive probability to the zero penetrances of heterozygous genotypes only. They further proved, without identifying all the communicating classes, that these penetrances are the minimum set of probabilities that need to be modified to ensure that states in different classes communicate. The irreducible chain so constructed is then coupled with the original Gibbs sampling chain to form a new integrated process. By switching between chains after every scan with a suitable probability, the correct limiting distribution is preserved. Estimates of the desired probabilities and expectations are obtained using realizations from the distribution of interest, whereas the auxiliary chain serves only to facilitate simulation from the "right" distribution. This is in contrast to importance sampling methods, in which realizations are simulated from the "wrong" distribution and then reweighted. Although the method of Lin et al. (1993) was shown to work well for a triallelic data set from a large complex pedigree, it is unlikely that good results would still be obtained with highly polymorphic loci. From an example in Lin et al. (1993), it becomes clear that, in order to have a more efficient algorithm, one needs to identify the communicating classes explicitly. This task was undertaken by Lin et al. (1994b), who noted that it was observed data on children that were responsible for creating noncommunicating alternatives for unobserved parents.
Hence, it was possible to search for communicating classes by looking at each nuclear family in turn, tracing up from the bottom of the pedigree. This lays the basis for the work of Lin (1995), who proposes a new scheme for constructing an irreducible chain by "jumping" from one communicating class to another directly, without the need to step through illegal configurations. Every realization can be used for making inferences. Furthermore, switching from one communicating class to another is much more frequent. This leads to better sampling of the space of genotypic configurations and hence provides much more accurate probability estimates than other methods for the same amount of computing time. For the pedigree considered in Lin (1995), it took only 1/30 of the time needed by the method of Sheehan and Thomas (1993) to achieve the same degree of accuracy. For larger pedigrees, such as the Alzheimer pedigree considered in Lin et al. (1993) and the hypercholesterolemia pedigree considered in Thompson et al. (1993), the method achieved even better results.

6.3. Multimodality and more efficient samplers. The Gibbs sampler is often chosen as an MCMC algorithm for sampling the space of genotypes because of its simplicity: the conditional genotype distribution of an individual depends only on the phenotype and the genotypes of the neighbors. More importantly, the Gibbs sampler avoids problems caused by sparsity of the genotypic configuration space. MCMC algorithms that make changes to several individuals simultaneously are much harder to implement, owing to the zeros imposed by Mendelian segregation and the difficulty of computing the requisite ratios. However, the Gibbs sampler can be very slow to sample the space of genotypes. If the equilibrium distribution is multimodal, the sampler may remain near a local mode for a long time. It is often quite informative to run a few chains from different starting points, but no formal conclusion can be drawn, as there is no framework for combining results from multiple runs. Even if it were possible to identify all the local modes and start a chain from each, we still would not know how to combine the results, since we would not know the weight for each mode (Geyer, 1992; Raftery and Lewis, 1992). We therefore need algorithms more efficient than the Gibbs sampler to adequately sample the space. Although multimodality is one of the major general problems facing MCMC exploration of a probability surface, algorithms that are efficient for one particular application may not be advantageous for others; see, e.g., Besag and Green (1993). Hence it is clear that more efficient algorithms specifically tailored to genetic applications should be designed. We need an algorithm that facilitates movement from one local mode to another. Unless one can design an algorithm that jumps between modes, such transitions can only be realized by stepping through low-probability states between modes. Therefore any such algorithm must allow the Markov chain to stay at low-probability states long enough to move to another mode, rather than moving back to the original mode. This idea leads to the construction of the heated-Metropolis algorithm proposed by Lin et al. (1994a).
The easily computed local conditional distributions of the Gibbs sampler are raised to the power 1/T, where T ≥ 1 is a parameter known as the "temperature". This modified local conditional distribution is used as the proposal distribution of a Metropolis-Hastings algorithm. The method has been successfully applied to estimate carrier probabilities on the Hutterite pedigree described earlier.
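A minimal sketch of one such heated update (the representation of the local conditional distribution as a dict is a hypothetical convenience; the pedigree-specific computation of that distribution is omitted): the proposal is the local conditional raised to the power 1/T, with a standard Metropolis-Hastings correction so that the original distribution remains the target.

```python
import random

def heated_update(p, x, T=2.0):
    """One heated-Metropolis update of a single individual's genotype.

    p: dict genotype -> local conditional probability (the Gibbs full
       conditional; must be positive at the current state x).
    T: temperature >= 1; T = 1 recovers the ordinary Gibbs update."""
    # Proposal weights: local conditional raised to the power 1/T.
    q = {g: pr ** (1.0 / T) for g, pr in p.items() if pr > 0}
    z = sum(q.values())
    # Sample a candidate y by inverse CDF over the finite support.
    r, cum, y = random.random() * z, 0.0, None
    for g, w in q.items():
        cum += w
        if r <= cum:
            y = g
            break
    if y is None:  # guard against floating-point round-off
        y = g
    # Metropolis-Hastings acceptance for this state-independent proposal;
    # the normalizing constant z cancels in the ratio q[x] / q[y].
    a = min(1.0, (p[y] * q[x]) / (p[x] * q[y]))
    return y if random.random() < a else x
```

With T = 1 the proposal equals the full conditional and every candidate is accepted; larger T flattens the proposal, so low-probability states lying between modes are proposed more often.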
6.4. Order of loci and other issues in sequential imputation. The efficiency of the estimates obtained from the method of sequential imputation depends on the order in which the loci are processed in the imputation procedure. Since the Monte Carlo estimate of the multilocus likelihood is the average of the accumulated weights over a collection of imputations, the best order of loci is the one that minimizes the variance of the accumulated weight. Note that, at each step of imputation, the sampling distribution is conditional not only on the observed data, but also on any previously imputed values. Therefore, intuitively, one would like to order the marker
loci according to the amount of data available at each locus: the locus with the most individuals typed is processed first, and the least typed locus is processed last. For two loci with about the same number of individuals typed, the more informative one, i.e., the one with more alleles, should be processed ahead of the other. The goal of this simple rule is to use the available information as fully as possible to reduce the variance of the estimate. It is, however, only a rule of thumb, and therefore does not guarantee that the best ordering will result. It also ignores which individuals are typed, as opposed to just the number of individuals typed. For mapping a disease gene against a set of known genetic markers, the disease locus can be processed either first or last in the sequential imputation procedure. For the MODY example in section 5, the disease gene was processed last. This allows calculation of likelihoods at various locations with a single collection of marker imputations. However, as we point out in section 5, the likelihood estimate is unlikely to be accurate unless the sampling distribution is close to the target distribution. The alternative strategy of processing the disease gene first may work better when the disease status is known for many individuals in the upper generations of the pedigree while their marker genotypes are unknown. Details can be found in Irwin et al. (1994). For the algorithm described in section 5, genotypes are generated one locus at a time. In particular, g1 is sampled from the distribution P0(g1 | d1), where d1 is the observed data at the corresponding locus. However, as long as it is possible to sample from the distribution, d1 should include observed data from as many loci as possible to achieve more efficient estimates (Irwin et al., 1994).
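The ordering rule of thumb described above amounts to a simple sort key (the locus records below are hypothetical; real software would also weigh which individuals are typed, not just how many):

```python
def order_loci(loci):
    """Order loci for sequential imputation by the rule of thumb:
    most individuals typed first; ties broken in favor of the locus
    with more alleles (the more informative one)."""
    return sorted(loci, key=lambda l: (-l["n_typed"], -l["n_alleles"]))

markers = [
    {"name": "M1", "n_typed": 40, "n_alleles": 2},
    {"name": "M2", "n_typed": 55, "n_alleles": 4},
    {"name": "M3", "n_typed": 55, "n_alleles": 8},
]
ordered = [l["name"] for l in order_loci(markers)]
# M2 and M3 have equal typing counts, so the more polymorphic M3 leads.
assert ordered == ["M3", "M2", "M1"]
```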
It should be pointed out again that sampling from P0(g1 | d1) requires computations using the recursive algorithm of Elston and Stewart (1971), which may be impractical when data from more than one locus are involved.

7. Concluding remarks. Markov chain Monte Carlo has been shown to be a powerful technique for estimating probabilities and likelihoods in genetic analysis when exact computations are not feasible. It is applicable to many different types of problems, illustrated in this paper through three such applications. Although the fundamental theory of MCMC is simple, finding a suitable algorithm that ensures efficient results can be very difficult. Some of the technical problems associated with MCMC are common to many areas of application; some, however, are unique to problems from genetic analysis with complex pedigree structures and data. The foremost issue is to ensure irreducibility of the Markov chain. Although this is almost always satisfied in many applications, it is often not the case with data arising from pedigrees. It should be emphasized that, if irreducibility is violated, then any inference from such realizations is invalid, no matter how long the process is run. This problem
is not solved by running multiple processes from several starting points either. Among the various solutions proposed, the method of Lin (1995), which jumps directly between communicating classes, seems quite promising. Efficient results have been obtained for several problems considered. However, there will always be more difficult problems that defeat the method, and new solutions will have to be invented to meet new challenges. The method of sequential imputation has been shown to be a successful technique for estimating likelihoods for multilocus linkage analysis. However, the method may not be applicable to other genetic pedigree analysis problems where other factors of complexity are involved, such as complex traits and complex pedigrees. MCMC and sequential imputation may be viewed as complementary techniques: whereas sequential imputation may be more efficient in multipoint computations with simple traits and simple pedigrees, MCMC is more suitable for complex traits and pedigrees with many loops.

Acknowledgment. I am grateful to Professor Terry Speed for helpful comments on earlier versions of this manuscript, to Dr. Mark Irwin for permission to use Figure 5.1 and comments on the manuscript, and to Dr. Ellen Wijsman for computing the exact lod scores for Figure 4.2. This work is supported in part by NIH grant R01 HG01093-01.

REFERENCES

Bell, G. I., Xiang, K. S., Newman, M. V., Wu, S. H., Wright, L. G., Fajans, S. S., Spielman, R. S., and Cox, N. J. (1991) Gene for the non-insulin-dependent diabetes mellitus (Maturity Onset Diabetes of the Young) is linked to DNA polymorphism on human chromosome 20q. Proc. Natl. Acad. Sci. USA 88, 1484-1488.
Besag, J. and Green, P. J. (1993) Spatial statistics and Bayesian computation (with discussion). J. Roy. Statist. Soc. B 55, 25-37.
Boehnke, M. (1986) Estimating the power of a proposed linkage study: a practical computer simulation approach. Am. J. Hum. Genet. 39, 513-527.
Cannings, C., Thompson, E. A., and Skolnick, M. H. (1978) Probability functions on complex pedigrees. Adv. Appl. Prob. 10, 26-61.
Cottingham, R. W. Jr., Idury, R. M., and Schaffer, A. A. (1993) Faster sequential genetic linkage computations. Am. J. Hum. Genet. 53, 252-263.
Dempster, A. P., Laird, N. M., and Rubin, D. B. (1977) Maximum likelihood from incomplete data via the EM algorithm (with discussion). J. Roy. Statist. Soc. B 39, 1-38.
Dwarkadas, S., Schaffer, A. A., Cottingham, R. W. Jr., Cox, A. L., Keleher, P., and Zwaenepoel, W. (1994) Parallelization of general-linkage analysis problems. Hum. Hered. 44, 127-141.
Easton, D. F., Bishop, D. T., Ford, D., Crockford, G. P., and the Breast Cancer Linkage Consortium (1993) Genetic linkage analysis in familial breast and ovarian cancer: results from 214 families. Am. J. Hum. Genet. 52, 678-701.
Elston, R. C. and Stewart, J. (1971) A general model for the genetic analysis of pedigree data. Hum. Hered. 21, 523-542.
Elston, R. C., Namboodiri, K. K., Glueck, C. J., Fallat, R., Tsang, R., and Leuba, V. (1975) Study of the genetic transmission of hypercholesterolemia and hypertriglyceridemia in a 195 member kindred. Am. J. Hum. Genet. 39, 67-83.
Gelfand, A. E. and Smith, A. F. M. (1990) Sampling-based approaches to calculating marginal densities. J. Am. Statist. Assoc. 85, 398-409.
Gelman, A. and Rubin, D. (1992) Inference from iterative simulation using multiple sequences. Statist. Sci. 7, 457-472.
Geman, S. and Geman, D. (1984) Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Trans. Pattern Anal. Machine Intell. 6, 721-741.
Geyer, C. J. (1992) A practical guide to Markov chain Monte Carlo. Statist. Sci. 7, 473-483.
Gilks, W. R., Clayton, D. G., Spiegelhalter, D. J., Best, N. G., McNeil, A. J., Sharples, L. D., and Kirby, A. J. (1993) Modelling complexity: applications of the Gibbs sampler in medicine (with discussion). J. Roy. Statist. Soc. B 55, 39-52.
Goradia, T. M., Lange, K., Miller, P. L., and Nadkarni, P. M. (1992) Fast computation of genetic likelihoods on human pedigree data. Hum. Hered. 42, 42-62.
Guo, S. and Thompson, E. (1992) A Monte Carlo method for combined segregation and linkage analysis. Am. J. Hum. Genet. 51, 1111-1126.
Hammersley, J. M. and Handscomb, D. C. (1964) Monte Carlo Methods. John Wiley & Sons, New York.
Hasstedt, S. J. and Cartwright, P. (1979) PAP - Pedigree Analysis Package. Technical Report 13, Department of Medical Biophysics and Computing, University of Utah, Salt Lake City, Utah.
Hastings, W. K. (1970) Monte Carlo sampling methods using Markov chains and their applications. Biometrika 57, 97-109.
Henderson, C. R. (1976) A simple method for computing the inverse of a numerator relationship matrix used in prediction of breeding values. Biometrics 32, 69-83.
Irwin, M., Cox, N., and Kong, A. (1994) Sequential imputation for multilocus linkage analysis. Proc. Natl. Acad. Sci. USA 91, 11684-11688.
Irwin, M. (1995) Sequential imputation and multilocus linkage analysis. Ph.D. Thesis, Department of Statistics, University of Chicago, Chicago, IL.
Kong, A., Liu, J., and Wong, W. H. (1994) Sequential imputations and Bayesian missing data problems. J. Am. Statist. Assoc. 89, 278-288.
Lange, K., and Elston, R. C.
(1975) Extensions to pedigree analysis: likelihood computations for simple and complex pedigrees. Hum. Hered. 25, 95-105.
Lange, K., and Boehnke, M. (1983) Extensions to pedigree analysis. V. Optimal calculation of Mendelian likelihoods. Hum. Hered. 33, 291-301.
Lange, K., and Matthysse, S. (1989) Simulation of pedigree genotypes by random walks. Am. J. Hum. Genet. 45, 959-970.
Lange, K., and Sobel, E. (1991) A random walk method for computing genetic location scores. Am. J. Hum. Genet. 49, 1320-1334.
Lathrop, G. M., Lalouel, J. M., Julier, C., and Ott, J. (1984) Strategies for multilocus linkage analysis in humans. Proc. Natl. Acad. Sci. USA 81, 3443-3446.
Lin, S. (1993) Markov chain Monte Carlo estimates of probabilities on complex structures. Ph.D. Thesis, Department of Statistics, University of Washington, Seattle, WA.
Lin, S., Thompson, E., and Wijsman, E. (1993) Achieving irreducibility of the Markov chain Monte Carlo method applied to pedigree data. IMA J. Math. Appl. Med. Biol. 10, 1-17.
Lin, S., Thompson, E., and Wijsman, E. (1994a) An algorithm for Monte Carlo estimation of genotype probabilities on complex pedigrees. Ann. Hum. Genet. 58, 343-357.
Lin, S., Thompson, E., and Wijsman, E. (1994b) Finding noncommunicating sets for Markov chain Monte Carlo estimations on pedigrees. Am. J. Hum. Genet. 54, 695-704.
Lin, S., and Wijsman, E. (1994) Monte Carlo multipoint linkage analysis. Am. J. Hum. Genet. 55, A40.
Lin, S. (1995) A scheme for constructing an irreducible Markov chain for pedigree data. Biometrics 51, 318-322.
MacCluer, J. W., Vandeburg, J. L., Read, B., and Ryder, O. A. (1986) Pedigree analysis
by computer simulation. Zoo Biol. 5, 149-160.
Mendel, G. (1865) Experiments in Plant Hybridisation. Mendel's original paper in English translation, with a commentary by R. A. Fisher. Oliver and Boyd, Edinburgh, 1965.
Metropolis, N., Rosenbluth, A. W., Rosenbluth, M. N., Teller, A. H., and Teller, E. (1953) Equation of state calculations by fast computing machines. J. Chem. Phys. 21, 1087-1092.
Miller, P. L., Nadkarni, P., Gelernter, J. E., Carriero, N., Pakstis, A. J., and Kidd, K. K. (1991) Parallelizing genetic linkage analysis: a case study for applying parallel computation in molecular biology. Comp. Biomed. Res. 24, 234-248.
Murray, J. C., Buetow, K. H., Weber, J. L., Ludwigson, S., Scherpier-Heddema, T., Manion, F., Quillen, J., Sheffield, V. C., Sunden, S., Duyk, G. M., Weissenbach, J., Gyapay, G., Dib, C., Morrissette, J., Lathrop, G. M., Vignal, A., White, R., Matsunami, N., Gerken, S., Melis, R., Albertsen, H., Plaetke, R., Odelberg, S., Ward, D., Dausset, J., Cohen, D., and Cann, H. (1994) A comprehensive human linkage map with centimorgan density. Science 265, 2049-2064.
Ott, J. (1974) Computer simulation in human linkage analysis. Am. J. Hum. Genet. 26, 64A.
Ott, J. (1979) Maximum likelihood estimation by counting methods under polygenic and mixed models in human pedigrees. Am. J. Hum. Genet. 31, 161-175.
Ott, J. (1989) Computer-simulation methods in human linkage analysis. Proc. Natl. Acad. Sci. USA 86, 4175-4178.
Ott, J. (1991) Analysis of Human Genetic Linkage. The Johns Hopkins University Press, Baltimore, MD.
Palmer, S. E., Dale, D. C., Livingston, R. J., Wijsman, E. M., and Stephens, K. (1994) Autosomal dominant hematopoiesis: exclusion of linkage to the major hematopoietic regulatory gene cluster on chromosome 5. Hum. Genet. 93, 195-197.
Ploughman, L. M. and Boehnke, M. (1989) Estimation of the power of a proposed linkage study for a complex genetic trait. Am. J. Hum. Genet. 44, 543-551.
Raftery, A. and Lewis, S.
(1992) How many iterations in the Gibbs sampler? In Bayesian Statistics 4 (eds. J. M. Bernardo, J. Berger, A. P. Dawid and A. F. M. Smith), 765-776.
Rikvold, P. A., and Gorman, B. M. (1994) Recent results on the decay of metastable phases. Technical Report 64, Supercomputer Computations Research Institute, Florida State University, Tallahassee, Florida.
Schaffer, A. A., Gupta, S. K., Shriram, K., and Cottingham, R. W. Jr. (1994) Avoiding recomputation in linkage analysis. Hum. Hered. 44, 225-237.
Schellenberg, G. D., Pericak-Vance, M. A., Wijsman, E. M., Boehnke, M., Moore, D. K., Gaskell, P. C. Jr., Yamaoka, L. A., et al. (1990) Genetic analysis of familial Alzheimer's disease using chromosome 21 markers. Neurobiol. Aging 11, 320.
Schellenberg, G. D., Bird, T. D., Wijsman, E. M., Orr, H. T., Anderson, L., Nemens, E., White, J. A., Bonnycastle, L., Weber, J. L., Alonso, M. E., Potter, H., Heston, L. L., and Martin, G. M. (1992) Genetic linkage evidence for a familial Alzheimer's disease locus on chromosome 14. Science 258, 668-671.
Sheehan, N. (1990) Genetic reconstruction on pedigrees. Ph.D. Thesis, Department of Statistics, University of Washington, Seattle, WA.
Sheehan, N. and Thomas, A. (1993) On the irreducibility of a Markov chain defined on a space of genotype configurations by a sampling scheme. Biometrics 49, 163-175.
Smith, A. F. M. and Roberts, G. O. (1993) Bayesian computation via the Gibbs sampler and related Markov chain Monte Carlo methods. J. Roy. Statist. Soc. B 55, 3-23.
Tanner, M. A. and Wong, W. H. (1987) The calculation of posterior distributions by data augmentation (with discussion). J. Am. Statist. Assoc. 82, 528-550.
Thomas, D. C. and Cortessis, V. (1992) A Gibbs sampling approach to linkage analysis. Hum. Hered. 42, 63-76.
Thompson, E. A. and Guo, S.-W. (1991) Evaluation of likelihood ratios for complex genetic models. IMA J. Math. Appl. Med. Biol. 8, 149-169.
Thompson, E., Lin, S., Olshen, A., and Wijsman, E. (1993) Monte Carlo analysis of a large hypercholesterolemia pedigree. Genet. Epidemiol. 10, 677-682.
Thompson, E. A. (1994a) Monte Carlo likelihood in the genetic analysis of complex traits. Phil. Trans. Roy. Soc. London Ser. B 344, 345-351.
Thompson, E. A. (1994b) Monte Carlo likelihood in genetic mapping. Statist. Sci. 9, 355-366.
Wang, S. J., and Thomas, D. (1994) A Gibbs sampling approach to linkage analysis with multiple polymorphic markers. Technical Report 85, Department of Preventive Medicine, University of Southern California, Los Angeles.
Wright, S. and McPhee, H. C. (1925) An approximate method of calculating coefficients of inbreeding and relationship from livestock pedigrees. J. Agricul. Res. 31, 377-383.
INTERFERENCE, HETEROGENEITY AND DISEASE GENE MAPPING

BRONYA KEATS*
The Human Genome Project has had a major impact on genetic research over the past five years. The number of mapped genes is now over 3,000 compared with approximately 1,600 in 1989 (Human Gene Mapping 10, [5]) and only about 260 ten years before that (Human Gene Mapping 5, [4]). The realization that extensive variation could be detected in anonymous DNA segments (Botstein et al. [1]) greatly enhanced the potential for mapping by linkage analysis. Previously, linkage studies had depended on polymorphisms that could be detected in red blood cell antigens, proteins (revealed by electrophoresis and isoelectric focusing), and cytogenetic heteromorphisms. The identification of thousands of polymorphic DNA markers throughout the human genome has led to the construction of high density genetic linkage maps. These maps provide the data necessary to test hypotheses concerning differences in recombination rates and levels of interference. They are also important for disease gene mapping because the existence of these genes must be inferred from the phenotype. Showing linkage of a disease gene to a DNA marker is the first step towards isolating the disease gene, determining its protein product, and developing effective therapies. However, interpretation of results is not always straightforward. Factors such as etiological heterogeneity and undetected irregular segregation can lead to confusing linkage results and incorrect conclusions about the locations of disease genes. This paper will discuss these phenomena and present examples that illustrate the problems, as well as approaches to dealing with them.
Genetic markers. Any detectable variation provides a potential marker for linkage analysis. Several different types of DNA polymorphisms have been developed. Those that are easy to detect and have high heterozygosity (1 - Σp_i^2, where p_i is the frequency of the i-th allele) are preferred, and many such markers have been placed on genetic linkage maps. This endeavor has been helped by the Centre d'Etude du Polymorphisme Humain (CEPH) collaboration, in which many markers have been typed using the same set of families in different laboratories (Dausset et al. [3]). The majority of DNA markers used for linkage studies are short tandem repeat polymorphisms (STRPs), or microsatellites (Weber and May [21]). These are very short repeated sequences, usually 2-4 base pairs. The variation in the number of repeats is easily detected by first using the polymerase chain reaction (PCR) with appropriate primers to amplify the relevant piece of DNA and then separating the fragments by electrophoresis on polyacrylamide sequencing gels. Bands are generally visualized by autoradiography or fluorescence. Most STRPs on linkage maps have much higher heterozygosities than another type of DNA marker, the restriction fragment length polymorphism (RFLP). Detection of an RFLP requires Southern blotting and hybridization to a cloned DNA probe after digestion of genomic DNA with a restriction endonuclease. In addition to having higher heterozygosities than RFLPs, STRPs are far less time consuming to genotype and are much more abundant in the genome. Variable number of tandem repeat (VNTR) markers, or minisatellites, are detected in the same way as RFLPs, but the variation is a result of differences in the number of times a sequence is repeated between two restriction sites. They have high heterozygosities but are found far less often than STRPs and tend to congregate near the telomeres.

Genetic linkage map. Both the physical map and the genetic linkage map must have the same order. Distances on the two maps, however, are not closely proportional, and male genetic distance differs from that in females. Distance on the physical map is measured in base pairs, while genetic distance is a function of the meiotic recombination rate. Genetic map distances are additive; recombination fractions are not. The genetic map distance between markers is measured in terms of the average number of crossovers per chromatid that occur between them. The unit of genetic distance is the Morgan, one Morgan being the interval that yields an average of one crossover per chromatid. As each crossover event involves two chromatids and there are four chromatids present during meiosis when crossing over occurs, an average of one crossover per chromatid is equivalent to an average of two chiasmata. Thus, genetic distance is equal to half the mean number of chiasmata occurring between two markers.

* Department of Biometry and Genetics, and Center for Molecular and Human Genetics, Louisiana State University Medical Center, New Orleans, Louisiana 70112.
Genetic distance may also be given in centiMorgans (cM): 1 Morgan = 100 cM. If the genetic length of a chromosome is 2 Morgans, then an average of two crossovers per chromatid, or four chiasmata, occur on this chromosome. In males approximately 53 chiasmata per cell are observed cytogenetically; male genetic length is therefore about 26.5 Morgans. Although genetic distance is not proportional to physical distance, in general the longer the physical length, the longer the genetic length. The total human haploid genome is approximately 3 × 10^9 base pairs, and the total sex-averaged genetic length is estimated to be about 33 Morgans. Thus, on average, one centiMorgan is equivalent to about a million base pairs, although this correspondence varies throughout the genome; there are both short physical segments with high recombination rates and long segments with low recombination rates. For example, chromosome 19 is one of the shortest chromosomes, with a physical length of only about 62 megabases, while its male genetic length is 114 cM and its female genetic length is 128 cM (Weber et al. [22]). Thus, for this chromosome, one centiMorgan is equivalent to about 500,000 base pairs.
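The arithmetic behind these figures can be checked directly; the sketch below simply re-derives the chapter's numbers.

```python
# Genome-wide: ~3e9 bp over ~33 Morgans (3300 cM) -> about 0.9 Mb per cM,
# i.e. roughly "a million base pairs per centiMorgan" on average.
bp_per_cm_avg = 3e9 / (33 * 100)
assert round(bp_per_cm_avg / 1e5) == 9

# Chromosome 19: 62 Mb over a sex-averaged (114 + 128) / 2 = 121 cM map
# -> roughly 500,000 bp per cM, about half the genome-wide average.
bp_per_cm_chr19 = 62e6 / ((114 + 128) / 2)
assert 4.5e5 < bp_per_cm_chr19 < 5.5e5
```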
Keats et al. [9] presented guidelines for genetic linkage maps. The linkage map is constructed by statistical analysis, and the logarithm of the likelihood ratio, log10(L1/L2), is generally used to measure support. A map consisting of markers for which the order is well supported is called a framework map. At least three measures of support are relevant in building a linkage map. Global support is the evidence that a marker belongs to a linkage group; it is calculated by setting L1 as the maximum likelihood when the marker is placed in the linkage group and L2 as the likelihood when the marker has free recombination with the linkage group. Interval support provides the evidence that a marker is in a specified order relative to a set of framework markers; in this case, L1 is the likelihood under the given order and L2 is the highest likelihood obtained by placing the marker in any other interval on the framework map. Support for the order of a set of markers is calculated by taking L1 as the likelihood under the favored order and L2 as the likelihood under a competing order. For each of these measures of support a value of at least 3 is recommended. Accurate genotyping is essential for the construction of linkage maps. Even a low error rate can substantially inflate map length (Buetow [2]), and typing errors may sometimes lead to incorrect orders.

Interference. The phenomenon of interference needs to be considered in the construction of linkage maps. Recombination frequencies are not additive, because multiple crossovers may occur between markers: an offspring is a nonrecombinant if an even number of crossovers occurs between two markers, and a recombinant if an odd number occurs. In addition, crossing over in one region interferes with crossing over in a neighboring region. Two types of interference may be differentiated: chiasma interference and chromatid interference.
Chiasma interference is the influence of an already formed chiasma on the formation of a new one. If the interference is positive, a second chiasma is less likely to occur; if it is negative, a second chiasma is more likely to occur than would be expected by chance. Chromatid interference is a departure from the random involvement of the four strands in the formation of chiasmata. It is difficult to detect, and good evidence that it exists has not yet been found. Under complete interference the genetic map distance is equal to the recombination fraction, and the distance between two markers is at most 50 centiMorgans. Assuming no interference simplifies calculations, but it leads to considerable overestimation of map distances. Weber et al. [22] obtained strong evidence for chiasma interference on chromosome 19: although they made a number of simplifying assumptions, the observed number of double recombinants was significantly lower than that expected under no interference.

Sex heterogeneity. Male and female estimates of the recombination fraction are different for many regions of the genome. Thus, male and female linkage maps need to be estimated separately, constrained to a common order.
BRONYA KEATS

TABLE 1
Male and female genetic distances in telomeric (short arm and long arm) and centromeric regions of chromosome 19.

Markers             Location       Distance (cM)
                                   Male    Female
D19S247-D19S20      short arm      28.8     7.1
D19S199-D19S49      centromeric     2.4    11.3
D19S180-D19S254     long arm       27.4     6.8
Overall, female genetic length is longer than male genetic length, but the ratio varies with position on the chromosome. For some chromosomes there appears to be an excess of female recombination near the centromere and an excess of male recombination near the telomeres, but these relationships are not yet known precisely. Table 1 shows male and female map distances for regions of chromosome 19 near the centromere and near the telomeres of the short arm and the long arm.
Etiological heterogeneity. Linkage studies to map disease genes show that identical clinical phenotypes do not necessarily mean that the disease is caused by a mutation in the same gene in all affected individuals. Morton [14] analysed families with elliptocytosis and showed that the gene causing the disease was linked to the Rhesus blood group on the short arm of chromosome 1 in some families but not in others. This conclusion was based on his finding that there was significant heterogeneity of the recombination fraction among families. Thus, variation in the recombination fraction suggests that genes at more than one chromosomal location may cause the same clinical phenotype. Another example of this heterogeneity is the neuropathy Charcot-Marie-Tooth type I, in which patients have very slow nerve conduction velocities. Initial studies suggested linkage to the Duffy blood group on chromosome 1 in a few families but not in others. Additional studies showed that in many of the unlinked families the disease gene was linked to markers on the short arm of chromosome 17 (Vance et al. [20]). Thus heterogeneity of the recombination fraction first indicated that more than one gene may cause this neuropathy, and proof of this was obtained when the location of a second gene for the disease was found. Two further diseases for which several genes cause the same clinical phenotype are discussed below. They are Usher syndrome type I and spinocerebellar ataxia.

INTERFERENCE, HETEROGENEITY AND DISEASE GENE MAPPING

FIG. 1. Haplotypes for family showing recombination between D11S1397 and D11S921.

Usher Syndrome. Usher syndrome is characterized by hearing impairment, retinitis pigmentosa, and recessive inheritance. Three types are distinguished clinically based on severity and progression of the hearing impairment. Family studies of the three types of Usher syndrome have demonstrated genetic as well as clinical heterogeneity. Three genes for type I have been localized to the short arm of chromosome 11 (Smith et al. [18]), the long arm of chromosome 11 (Kimberling et al. [10]), and the long arm of chromosome 14 (Kaplan et al. [6]). Kimberling et al. [11] and Lewis et al. [12] assigned a gene for type II to chromosome 1, and a gene for type III was recently assigned to chromosome 3 (Sankila et al. [17]). One strategy to reduce the chance that different genes are responsible for Usher syndrome type I in a set of families is to select families from an isolated population such as the Acadians of southwestern Louisiana. According to Rushton [16], about 4,000 Acadians made their way to Louisiana during the second half of the 18th century when the English ordered their expulsion from Acadia (now Nova Scotia and surrounding areas). They settled on the plains among the bayous of southwestern Louisiana and remained relatively isolated because of linguistic, religious, and cultural cohesiveness, as well as geographic isolation. The gene for Usher syndrome type I (USH1C) on the short arm of chromosome 11 has been found only in the Acadian population, and the region containing the disease gene was refined by Keats et al. [7]. Figure 1
FIG. 2. Map showing the location of the Acadian Usher syndrome type I gene (USH1C). Marker order and intervals (cM): D11S861 - 1 - D11S419 - 1 - D11S1397 - 0.5 - D11S921 - 0.5 - D11S1310 - 1 - D11S899.
shows a family in which recombination between the markers D11S1397 and D11S921 is observed in one of the affected offspring. This result provides strong evidence that USH1C is flanked on one side by the marker D11S1397. In order to find a flanking marker on the other side of USH1C, we examined the marker alleles that were inherited with the disease alleles in each affected individual. Table 2 shows that the same D11S921 allele was found on all 54 chromosomes with the disease allele, but four of these chromosomes had a different allele for D11S1310. Thus, USH1C is likely to lie between D11S1397 and D11S1310. Figure 2 shows the map giving the order of the markers and the distances between them measured in centiMorgans. The region to which we have mapped the gene for Acadian Usher syndrome type I is about 1.2 centiMorgans, which is probably less than 1.5 megabases of DNA, and we are continuing our efforts to isolate and characterize this disease gene.

Spinocerebellar Ataxia. The spinocerebellar ataxias are a heterogeneous group of disorders characterized by lack of coordination of movements due to progressive neurodegeneration in the cerebellum. The age of onset of symptoms is usually between the third and fifth decades, and death occurs 10 to 15 years later. Several different genes that cause dominantly inherited spinocerebellar ataxia have now been localized. Genetic heterogeneity complicates the search for disease genes. Finding a recombination event is critical to defining flanking markers, but the possibility that the disease gene is elsewhere cannot be ignored, especially
TABLE 2
Marker alleles associated with the Acadian Usher chromosome.

D11S1397   D11S921   D11S1310   D11S899    Usher   Non-Usher
3          4         3          2          40      1
1          4         3          2          1       0
3          4         3          9          5       1
3          4         3          6          1       0
3          4         3          4          1       0
3          4         3          8          2       2
3          4         4          7          1       0
3          4         5          6          1       0
3          4         4          2          1       1
1          4         4          9          1       1
Other                                      0       44
Total                                      54      50
TABLE 3
Lod scores for SCA1.

           Recombination fraction
Marker     0.0    .01    .05    .1     .2     .3     .4
HLA        -∞     -3.5   -2.2   -1.5   -0.8   -0.4   -0.2
D6S89      4.9    4.8    4.4    3.9    2.8    1.7    0.7
if the family is small. On the other hand, results that suggest exclusion of a gene from a region may be misleading. Originally the location of SCA1 (spinocerebellar ataxia type I) on chromosome 6 had been demonstrated through linkage to HLA. Keats et al. [8] reported a family in which evidence of linkage to HLA was not obtained, and the initial conclusion was that a different gene was responsible for the disease in this family. However, a more tightly linked marker, D6S89, was found (Zoghbi et al. [23]), and Keats et al. [8] showed that there was no recombination between this marker and the disease gene in their family. Table 3 gives the lod scores with HLA and D6S89; these two markers are about 15 cM apart on the short arm of chromosome 6.

Unusual segregation. Etiological heterogeneity complicates the interpretation of linkage results and is of major concern because it is relatively common. Unusual segregation patterns appear to be less common, but when they occur linkage results can be confusing and misleading.

Charcot-Marie-Tooth Disease. Charcot-Marie-Tooth neuropathy is a heterogeneous disease characterized by slowly progressive muscle weakness and atrophy. The most common mode of inheritance is autosomal dominant, and a gene on the short arm of chromosome 17 accounts for the majority of these cases. Vance et al. [20] reported linkage of the disease gene (CMT1A) to markers on chromosome 17. However, the marker D17S122 gave discrepant results. Based on known map distances this marker should have been tightly linked to CMT1A, but many recombination events were observed. This inconsistency was resolved when Lupski et al. [13] demonstrated the presence of a duplication. In a large family reported by Nicholson et al. [15], the maximum lod score increased from 0.5 at a recombination fraction of 0.3 to 34.3 at zero recombination after taking the duplication into account.

FIG. 3. Genotypes for the marker D17S122. Individuals with Charcot-Marie-Tooth disease (solid squares and circles) have three alleles.

The effect of the duplication on recombination is seen in Figure 3, where the father and all of the offspring would be assigned the genotype 1/2 if the duplication were ignored. In this case at least two of the offspring must be recombinants. When the presence of the third allele is recognized, the genotypes are consistent with no recombination.

Uniparental Disomy. The phenomenon of uniparental disomy, in which both copies of a chromosome are inherited from one parent, also leads to inconsistent linkage results. This event is relatively rare, but it has been documented in several clinical disorders. For example, Spence et al. [19] showed that it was the cause of a case of the recessively inherited disorder cystic fibrosis. Rather than inheriting one copy of the defective gene from each parent, both copies came from the mother. Genotyping of chromosome 7 markers showed that the child had two maternal copies of this chromosome and no paternal chromosome 7. Although inconsistencies between father and offspring are almost certain to be found in this situation, some markers are likely to give compatible genotypes and recombination would be assumed to have occurred. For example, if the parental genotypes
at a marker tightly linked to the disease gene are 1/2 and both an affected and an unaffected offspring have the genotype 1/1, then one of the offspring would be assumed to be a recombinant. In fact, however, uniparental disomy may explain the affected individual.

Conclusions. The discovery of thousands of highly polymorphic microsatellite markers that span the genome at small intervals has had a huge impact on our understanding of the genetic linkage map. As well as leading to the localization of disease genes, it has provided the tools necessary to study variation in recombination among groups and to examine the phenomenon of interference. Several unexpected results have changed our way of thinking about the transmission of alleles from one generation to the next. The research resulting from the goals of the Human Genome Project is truly revolutionary and will benefit mankind in many ways.
REFERENCES

[1] Botstein, D., White, R. L., Skolnick, M., Davis, R. W. Construction of a genetic linkage map in man using restriction fragment length polymorphisms. Am. J. Hum. Genet., 32:314-331, 1980.
[2] Buetow, K. H. Influence of aberrant observations on high-resolution linkage analysis outcomes. Am. J. Hum. Genet., 49:985-994, 1991.
[3] Dausset, J., Cann, H., Cohen, D., et al. Centre d'Etude du Polymorphisme Humain (CEPH): Collaborative genetic mapping of the human genome. Genomics, 6:575-577, 1990.
[4] Human Gene Mapping 5: Fifth International Workshop on Human Gene Mapping. Cytogenet. Cell Genet., 25:1-236, 1979.
[5] Human Gene Mapping 10: Tenth International Workshop on Human Gene Mapping. Cytogenet. Cell Genet., 51:1-1148, 1989.
[6] Kaplan, J., Gerber, S., Bonneau, D., Rozet, J., Delrieu, O., Briard, M., Dollfus, H., Ghazi, I., Dufier, J., Frezal, J., Munnich, A. A gene for Usher syndrome type I (USH1) maps to chromosome 14q. Genomics, 14:979-988, 1992.
[7] Keats, B. J. B., Nouri, N., Pelias, M. Z., Deininger, P. L., Litt, M. Tightly linked flanking microsatellite markers for the Usher syndrome type I locus on the short arm of chromosome 11. Am. J. Hum. Genet., 54:681-686, 1994.
[8] Keats, B. J. B., Pollack, M. S., McCall, A., Wilensky, M. A., Ward, L. J., Lu, M., Zoghbi, H. Y. Tight linkage of the gene for spinocerebellar ataxia to D6S89 on the short arm of chromosome 6 in a kindred for which close linkage to both HLA and F13A1 is excluded. Am. J. Hum. Genet., 49:972-977, 1991.
[9] Keats, B. J. B., Sherman, S. L., Morton, N. E., Robson, E. B., Buetow, K. H., Cartwright, P. E., Chakravarti, A., Francke, U., Green, P. P., Ott, J. Guidelines for human linkage maps: An international system for human linkage maps (ISLM 1990). Genomics, 9:557-560, 1991.
[10] Kimberling, W. J., Moller, C. G., Davenport, S., Priluck, I. A., Beighton, P. H., Greenberg, J., Reardon, W., Weston, M. D., Kenyon, J. B., Grunkmeyer, J. A., Pieke-Dahl, S., Overbeck, L. D., Blackwood, D. J., Brower, A. M., Hoover, D. M., Rowland, P., Smith, R. J. H. Linkage of Usher syndrome type I gene (USH1B) to the long arm of chromosome 11. Genomics, 14:988-994, 1992.
[11] Kimberling, W. J., Weston, M. D., Moller, C. G., Davenport, S. L. H., Shugart, Y. Y., Priluck, I. A., Martini, A., Smith, R. J. H. Localization of Usher syndrome type II to chromosome 1q. Genomics, 7:245-249, 1990.
[12] Lewis, R. A., Otterud, B., Stauffer, D., Lalouel, J. M., Leppert, M. Mapping recessive ophthalmic diseases: Linkage of the locus for Usher syndrome type II to a DNA marker on chromosome 1q. Genomics, 7:250-256, 1990.
[13] Lupski, J. R., Montes de Oca-Luna, R., Slaugenhaupt, S., Pentao, L., Guzzetta, V., Trask, B. J., Saucedo-Cardenas, O., Barker, D. F., Killian, J. M., Garcia, C. A., Chakravarti, A., Patel, P. I. DNA duplication associated with Charcot-Marie-Tooth disease type 1A. Cell, 66:219-232, 1991.
[14] Morton, N. E. The detection and estimation of linkage between the genes for elliptocytosis and the Rh blood type. Am. J. Hum. Genet., 8:80-96, 1956.
[15] Nicholson, G. A., Kennerson, M. L., Keats, B. J. B., Mesterovic, N., Churcher, W., Barker, D., Ross, D. A. Charcot-Marie-Tooth neuropathy type 1A mutation: Apparent crossovers with D17S122 are due to a duplication. Am. J. Med. Genet., 44:455-460, 1992.
[16] Rushton, W. F. The Cajuns: From Acadia to Louisiana. New York: Farrar Straus Giroux, 1979.
[17] Sankila, E. M., Pakarinen, L., Sistonen, P., Aittomaki, K., Kaariainen, H., Karjalainen, S., De la Chapelle, A. The existence of Usher syndrome type III proven by assignment of its locus to chromosome 3q by linkage. Am. J. Hum. Genet. (supplement), 55:A15, 1994.
[18] Smith, R. J. H., Lee, E. C., Kimberling, W. J., Daiger, S. P., Pelias, M. Z., Keats, B. J. B., Jay, M., Bird, A., Reardon, W., Guest, M., Ayyagari, R., Hejtmancik, J. F. Localization of two genes for Usher syndrome type I to chromosome 11. Genomics, 14:995-1002, 1992.
[19] Spence, J. E., Perciaccante, R. G., Greig, G. M., Willard, H. F., Ledbetter, D. H., Hejtmancik, J. F., Pollack, M. S., O'Brien, W. E., Beaudet, A. L. Uniparental disomy as a mechanism for human genetic disease. Am. J. Hum. Genet., 42:217-226, 1988.
[20] Vance, J. M., Nicholson, G. A., Yamaoka, L. S., Stajich, J., Stewart, C. S., Speer, M. C., Hung, W., Roses, A. D., Barker, D., Pericak-Vance, M. A. Linkage of Charcot-Marie-Tooth neuropathy type 1a to chromosome 17. Exp. Neurol., 104:186-189, 1989.
[21] Weber, J. L., May, P. E. Abundant class of human DNA polymorphisms which can be typed using the polymerase chain reaction. Am. J. Hum. Genet., 44:388-396, 1989.
[22] Weber, J. L., Wang, Z., Hansen, K., Stephenson, M., Kappel, C., Salzman, S., Wilkie, P. J., Keats, B. J., Dracopoli, N. C., Brandriff, B. F., Olsen, A. S. Evidence for human meiotic recombination interference obtained through construction of a short tandem repeat polymorphism linkage map of chromosome 19. Am. J. Hum. Genet., 53:1079-1095, 1993.
[23] Zoghbi, H. Y., Jodice, C., Sandkuijl, L. A., Kwiatkowski, T. J., McCall, A. E., Huntoon, S. A., Lulli, P., Spadaro, M., Litt, M., Cann, H. M., Frontali, M., Luciano, T. The gene for autosomal dominant spinocerebellar ataxia (SCA1) maps telomeric to the HLA complex and is closely linked to the D6S89 locus in three large kindreds. Am. J. Hum. Genet., 49:23-30, 1991.
ESTIMATING CROSSOVER FREQUENCIES AND TESTING FOR NUMERICAL INTERFERENCE WITH HIGHLY POLYMORPHIC MARKERS

JURG OTT*

Abstract. Interference may be viewed as having two aspects: numerical interference, referring to the numbers of crossovers occurring, and positional interference, referring to the positions of crossovers. Here, the focus is on numerical interference and on methods of testing for its presence. A dense map of highly polymorphic markers is assumed, so that each crossover can be observed. General relationships are worked out between crossover distributions and underlying chiasma distributions. It is shown that observed crossover distributions may be invalid, and methods are developed to estimate valid crossover distributions from observed counts of crossovers. Based on valid estimates of crossover distributions, tests for interference and the development of empirical map functions are outlined. The methods are applied to published data on human chromosomes 9 and 19.
* Department of Genetics and Development, Columbia University, Unit 58, 722 West 168 Street, New York, NY 10032. E-mail: [email protected]

1. Introduction. Below, standard genetic terminology is used. To avoid confusion, the following definitions are provided. Chiasma refers to the cytologically observable phenomenon that in meiosis the two homologous chromosomes establish close contact at some point(s) along their lengths. Several such chiasmata per chromosome may occur. Crossing-over (or crossover) is the process of reciprocal exchange between homologous chromosomes in meiosis (Nilsson et al. 1993). On a chromosome received by an individual from one of his parents, blocks of loci originating in one grandparent alternate with blocks of loci from the other grandparent. The switch of grandparental origins is caused by the occurrence of a crossover, which is known to involve one strand (chromatid) from each of the two homologous chromosomes (Mather 1938). In a gamete, the point on a chromosome separating two blocks of loci from different grandparents is called a crossover point or point of exchange. Occurrence of a crossover is believed to be the result of the formation of a chiasma, but doubts have been raised whether this 1:1 relationship holds universally (Nilsson et al. 1993). In particular, in plant species, map distance estimates based on chiasma counts were compared with those based on RFLP maps, and the former turned out to be far lower than the latter (Nilsson et al. 1993). On the other hand, as is well known in experimental genetics, crossing-over leads to the formation of the so-called Holliday structure; it may be resolved by a cut of strands in one of two ways, with one cut leading to strands containing a crossover point between two markers on either side of the cut, while the other cut does not result in
a crossover point (Ayala and Kiger 1984). Thus, chiasma frequencies would be expected to be higher than predicted on genetic grounds. At any rate, the material in this chapter addresses only those chiasmata with genetic consequences, that is, chiasmata of which each results in a crossover point on two of the four gametes. Consider two alleles, one at each of two loci, received by an offspring from one of his parents. A recombination is said to have occurred between the two loci if the two alleles originated in different grandparents, whereas a nonrecombination corresponds to allelic origin in the same grandparent. When the two loci reside on different chromosomes, recombination is a random event (occurring with probability 1/2) due to random inclusion of either of the two chromatids in a gamete. For loci on the same chromosome, occurrence of recombination depends on the number of crossover points occurring between the two loci in a gamete. An odd number of crossover points between two loci in a gamete is seen as a recombination and an even number as a nonrecombination. The average number, d, of crossover points (per gamete) between two loci on a chromosome is defined as the genetic distance (d Morgans, M, or 100d centimorgans, cM) between them. It is equal to one half the average number of chiasmata occurring between the two loci. For example, on chromosome 1 (male map length ≈ 2 M; Morton 1991), in male meiosis an average of approximately four chiasmata are formed, so that a gamete resulting from such a meiosis carries an average of two crossover points. If an interval is small enough that at most one crossover occurs in it, each recombination corresponds to a crossover and the recombination fraction coincides with the map length of the interval. Interference is defined as dependence in the occurrence of crossovers. Two types of interference are generally distinguished (Mather 1938).
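Before the two types of interference are taken up, the parity rule just described can be checked numerically. The sketch below (an illustration under an assumed no-interference model, not part of the chapter) draws a Poisson number of crossover points for an interval of length d Morgans and scores a recombination when the count is odd; the resulting frequency approaches Haldane's map function θ = (1 − e^{−2d})/2:

```python
import math
import random

def haldane(d):
    """Map length d (Morgans) -> recombination fraction, no interference."""
    return 0.5 * (1.0 - math.exp(-2.0 * d))

def simulate_recombination(d, n_gametes=200_000, seed=1):
    """Fraction of gametes with an odd number of crossover points when
    the number of points in the interval is Poisson with mean d."""
    rng = random.Random(seed)
    threshold = math.exp(-d)
    recombinants = 0
    for _ in range(n_gametes):
        # Knuth's Poisson sampler: multiply uniforms until below e^{-d}
        k, p = 0, rng.random()
        while p > threshold:
            k += 1
            p *= rng.random()
        recombinants += k % 2  # odd number of points -> recombinant
    return recombinants / n_gametes
```

With d = 0.3, for instance, the simulated fraction is close to haldane(0.3) ≈ 0.226, noticeably less than the 0.3 that additivity of recombination frequencies would suggest.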
Chiasma interference (henceforth simply called interference) refers to the number or position of crossovers, and chromatid interference refers to which chromatids are involved in chiasma formation. The latter is assumed to be absent in most species. In current human genetics papers, chiasma interference has been referred to under various new names, for example, meiotic recombination interference (Weber et al. 1993) and meiotic crossover interference (Kwiatkowski et al. 1993). When crossovers occur according to a Poisson process, interference is absent. Deviations from the Poisson process can be reflected in the numbers of single and multiple crossovers occurring (here called numerical interference) or in the positions where they occur (positional interference). Interference has been thought to be due to some steric chromosomal property such as stiffness (Haldane 1919). Further, simply restricting the number of crossovers to some minimum or maximum also implies (numerical) interference. For example, the assumption of an obligatory chiasma with otherwise random occurrence of chiasmata implies interference, which is reflected in the Sturt map function (Sturt 1976). This type of interference is sometimes considered not to be "real" in a biochemical sense, as its nature is more statistical than due to interaction among crossover events, which would be reflected in positional interference. Below, criteria will be established for estimating valid crossover distributions. Based on these, tests for detecting numerical interference will be discussed. Much of this chapter is devoted to theory. Applications to published data for chromosomes 9 and 19 are presented in a section towards the end of the chapter. For all derivations it is assumed that each crossover can be observed unless it occurs outside the map of markers considered. This assumption is realistic when a large number of highly polymorphic markers exist on a chromosome, such that intervals are so short that the possibility of multiple crossovers in an interval is negligible.

2. Estimating distributions of chiasmata and crossovers. In this section, the statistical relationships between crossover distributions (the proportion of gametes carrying a certain number of crossovers on a given chromosome) and chiasma distributions (the proportion of meioses showing a certain number of chiasmata on a given chromosome) are explored. Without chromatid interference, as is assumed here throughout, when a chiasma is formed at some location on a chromosome, the probability is 1/2 that a gamete resulting from the given meiosis will carry a crossover point at the location of the chiasma. Thus, for a given number, c, of chiasmata on a chromosome, the number, k, of crossover points on a gamete follows a Binomial(c, 1/2) distribution. The distribution of k can be obtained from the distribution of c by

(2.1)    $P(K = k) = \sum_{c=0}^{N} P(k \mid c)\, P(C = c),$
where N is the maximum number of crossovers occurring. For finite N, the values of P(k | c) form a triangular matrix,

(2.2)    $P(k \mid c) = \begin{cases} \binom{c}{k} (1/2)^c & \text{if } k \le c \\ 0 & \text{if } k > c. \end{cases}$

This matrix is of full rank and provides for a 1:1 mapping between P(k) and P(c). Each P(c) defines a valid unique P(k). The inverse operation, while numerically unique, may lead from a given P(k) to a set of numbers some of which are negative or larger than 1. In other words, there are crossover distributions that do not correspond to a valid chiasma distribution. Such crossover distributions are biologically meaningless and are, thus, invalid. The direct inverse of (2.1) is easily obtained as

(2.3)    $P(c = i) = 2^i \left[ P(k = i) - \sum_{j=i+1}^{N} \binom{j}{i} (1/2)^j \, P(c = j) \right],$
which requires an order of evaluation from the top down; that is, P(c = N) must be calculated first, then P(c = N − 1), etc. Direct estimates of crossover distributions are typically obtained as multinomial proportions of numbers of crossovers. For example, if n(k) is the observed number of gametes carrying k crossovers, the crossover distribution P(k) is estimated directly by the proportions n(k)/Σ_i n(i), k = 0, 1, ..., N. However, the estimated class proportions may correspond to an invalid associated chiasma distribution, in which case these proportions are not maximum likelihood estimates (MLEs). Then, the MLE of a crossover distribution must be obtained by a different procedure. The procedure proposed in the next paragraph first carries out transformation (2.3) on the direct crossover frequency estimates. If necessary, the resulting values of P(c) are then transformed into a valid chiasma distribution, which, in turn, leads to the MLE of the crossover distribution; because of the 1:1 nature of transformation (2.1), the MLE of a chiasma distribution also defines the MLE of the crossover distribution derived from it. A convenient iterative method for obtaining MLEs of crossover distributions works via MLEs of associated valid chiasma distributions. It is based on the following representation of the log likelihood:

(2.4)    $\log L = \sum_{k=0}^{N} n(k) \log \left[ \sum_{c=k}^{M} \binom{c}{k} (1/2)^c q_c \right],$

where N is the maximum number of crossovers observed, M (> N) is a suitable upper limit, such as 20, for the number of chiasmata, and the q_i = P(c = i), i = 1, ..., M, are the chiasma class probability parameters to be estimated, with q_0 = 1 − q_1 − q_2 − ... (the estimates of p_i, i = N + 1, ..., M, are all equal to zero). Taking partial derivatives of (2.4) and setting them equal to zero leads to

(2.5)    $\hat{q}_j = \frac{\sum_{k=0}^{N} n(k)\, P(c = j \mid k)}{\sum_{k=0}^{N} n(k)}, \qquad j = 0, 1, \ldots, M.$
Based on expression (2.5), the algorithm starts with an initial chiasma distribution, for example, q_i = 1/(M + 1) for all i. Then, for a given class k of the crossover distribution, the conditional chiasma distribution, P(c | k), is computed and the observations n(k) are probabilistically assigned to the chiasma classes, that is, proportionally to the P(c | k). Once this is done for all k, those portions of the crossover observations assigned to a class, P(c = j), are added and the result divided by the total number of observations, thus obtaining an updated estimate of the chiasma distribution, which completes one iteration. Once MLEs of valid chiasma class probabilities have been obtained, they are transformed by (2.1) into the corresponding crossover class frequencies, which are then valid MLEs. This method has been implemented in a program, CROSSOVR, which is now one of the Linkage Utility Programs (Ott 1991). While the approach
FIG. 1. The admissible region of values (p_1, p_2) corresponding to a valid chiasma distribution.
implemented in CROSSOVR works well and is generally fast, convergence may occasionally be slow, so that several thousand iterations are necessary to reach an accuracy of, say, 10^-6 for the chiasma class probabilities. For small values of M, it is easy to demonstrate analytically the invalidity of crossover distributions. Let p_i = probability of i crossovers, and q_i = probability of i chiasmata. Assume, for example, a maximum of M = 1 chiasma (complete interference) on a chromosome. Then, by (2.1), p_1 = (1/2) q_1, and q_1 = 2 p_1. Because of q_1 ≤ 1, one must have p_1 ≤ 1/2. Whenever an estimate, p_1, exceeds the value 1/2, the associated chiasma probability q_1 exceeds 1 and is, thus, invalid. Of course, in this case, p_1 coincides with the recombination fraction, which is known to be restricted to values up to 1/2 only. The reason that invalid crossover distributions occur is that gametes produced by a parent are sampled at random. With M = 1, when a chiasma has occurred, half of the gametes will carry a crossover and half of them will not. Thus, one might by chance observe too many gametes carrying a crossover. For M = 2, the chiasma distribution parameters are given by q_1 = 2(p_1 − 2 p_2) and q_2 = 4 p_2. Restricting each of the q_i to the range (0, 1) leads to the conditions 2 p_2 ≤ p_1 ≤ 1/2. In the (p_1, p_2)-plane, as shown in Figure 1, the admissible range of values is contained within a triangle covering 1/8 the surface of the whole parameter space. With small numbers of observations, due to random fluctuations, it will happen relatively frequently that an observed crossover distribution is invalid. The probability that it is valid increases with the number of gametes investigated and with decreasing values of p_1 and p_2.
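The transforms (2.1) and (2.3), and the validity check they imply, are straightforward to code. The sketch below (my implementation of the formulas, with classes indexed 0..N) reproduces the M = 1 and M = 2 conditions just discussed; for example, p = (0.3, 0.6, 0.1) has p_1 > 1/2 and is flagged invalid:

```python
from math import comb

def crossover_from_chiasma(q):
    """Forward transform (2.1): p_k = sum over c >= k of C(c,k) (1/2)^c q_c."""
    N = len(q) - 1
    return [sum(comb(c, k) * 0.5 ** c * q[c] for c in range(k, N + 1))
            for k in range(N + 1)]

def chiasma_from_crossover(p):
    """Inverse transform (2.3), evaluated from the top class downwards."""
    N = len(p) - 1
    q = [0.0] * (N + 1)
    for i in range(N, -1, -1):
        tail = sum(comb(j, i) * 0.5 ** j * q[j] for j in range(i + 1, N + 1))
        q[i] = 2 ** i * (p[i] - tail)
    return q

def is_valid_crossover(p, tol=1e-9):
    """Valid iff every implied chiasma class probability lies in [0, 1]."""
    return all(-tol <= qi <= 1.0 + tol for qi in chiasma_from_crossover(p))
```

A valid example is p = (0.6, 0.3, 0.1), whose implied chiasma distribution is (0.4, 0.2, 0.4).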
3. Obligatory chiasma per chromosome. It is generally assumed that crossing-over is required for proper segregation of the homologous chromosomes in meiosis (Kaback et al. 1992). In all organisms in which recombination normally occurs there seems to be at least one chiasma on each chromosome per meiosis (Baker et al. 1976). As mentioned in the introduction, this obligatory chiasma is assumed to be resolved such that it has genetic consequences. Presence of an obligatory chiasma is formulated as P(c = 0) = 0, that is, the zero class in the chiasma distribution is missing. In the iterative algorithm described in connection with (2.5) above, the c = 0 class frequency was estimated along with all other class frequencies. It is easy to implement the requirement P(c = 0) = 0 in this algorithm: at the end of each iteration cycle, the estimate for P(c = 0) is set equal to zero, and all other class frequencies are adjusted to again sum to 1.
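The iterative scheme built on (2.5), including the obligatory-chiasma adjustment of this section, can be sketched as follows (a reimplementation of the described algorithm, not the CROSSOVR code itself):

```python
from math import comb

def em_chiasma(n, M, obligatory=False, iters=5000):
    """MLE of chiasma class probabilities q[0..M] from counts n[k] of
    gametes carrying k crossovers; P(k | c) is Binomial(c, 1/2)."""
    N = len(n) - 1
    total = sum(n)
    q = [1.0 / (M + 1)] * (M + 1)  # flat initial distribution
    for _ in range(iters):
        new = [0.0] * (M + 1)
        for k in range(N + 1):
            if n[k] == 0:
                continue
            # assign n[k] to chiasma classes proportionally to P(c | k)
            w = [comb(c, k) * 0.5 ** c * q[c] if c >= k else 0.0
                 for c in range(M + 1)]
            s = sum(w)
            if s == 0.0:
                continue
            for c in range(M + 1):
                new[c] += n[k] * w[c] / s
        q = [x / total for x in new]
        if obligatory:  # force P(c = 0) = 0 and renormalize
            q[0] = 0.0
            z = sum(q)
            q = [x / z for x in q]
    return q

def chiasma_to_crossover(q):
    """Transform the chiasma MLE to crossover class frequencies via (2.1)."""
    M = len(q) - 1
    return [sum(comb(c, k) * 0.5 ** c * q[c] for c in range(k, M + 1))
            for k in range(M + 1)]
```

When the directly estimated crossover proportions are already valid, the procedure converges to those proportions; otherwise it yields the nearest valid MLE.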
4. Incomplete chromosome coverage. Thus far it has been assumed that a chromosome is densely covered by markers and that a marker resides at each of the two chromosome ends. In reality, the two flanking markers may not extend all the way to the ends of the chromosome, so that only a proportion, f < 1, of the chromosome will be covered by the marker map. Some of the genetic models discussed below allow for such incomplete chromosome coverage. In the context of chiasma frequency estimation discussed above, incomplete chromosome coverage can only be allowed for with assumptions on the chromosomal positions of chiasmata. For example, assume occurrence of at least one chiasma per meiosis. For the case that this is the only chiasma occurring, and under the assumption that it is equally likely to occur anywhere on the chromosome, the probability is f that it will be formed on the marker map, and it will lead to a crossover with probability f/2. Then, the proportion of zero chiasmata in the (valid) chiasma distribution is an estimate of 1 − f, and the proportion of gametes without a crossover is an estimate of 1 − f/2. With multiple chiasmata occurring and some regularity assumptions on where they occur, one finds (details not shown here) that f is approximately estimated by (1 − q_0)/E, where q_0 is the proportion of zero chiasmata on the marker map and E is the mean of the numbers of chiasmata occurring on the entire chromosome. Thus, on longer chromosomes (E > 1), the estimate f = 1 − q_0 is likely to overestimate chromosome coverage. As this chapter is on numerical rather than positional interference, these thoughts are not pursued further.
NUMERICAL INTERFERENCE IN GENETIC MAPS

It will be seen that restricting observed crossover distributions to valid estimates tends to reduce evidence for interference. Absence of interference implies that the number of chiasmata occurring on a chromosome follows a Poisson distribution with parameter a, its mean. The crossover distribution corresponding to this chiasma distribution, by virtue of (2.1), is also Poisson but with mean b = a/2, which is the genetic length of the chromosome. The numbers of chiasmata or crossovers occurring on a portion of a chromosome also follow Poisson distributions, with means corresponding to the length of the interval considered. With an obligatory chiasma, under no interference, the number of chiasmata on a chromosome follows a truncated Poisson distribution (c ≥ 1) but, as shown below, the corresponding number of crossovers is no longer Poisson. Sturt (1976) developed a map function based on the assumption of an obligatory chiasma. Here, frequency distributions of chiasmata and crossovers are given under this assumption. Two cases will be considered: 1) full coverage of a chromosome by the marker map, and 2) incomplete chromosome coverage. First, the crossover distribution over a whole chromosome (here called the Sturt crossover distribution) is discussed, given that an obligatory chiasma occurs on each chromosome. Based on the truncated Poisson distribution (zero chiasma class missing), this crossover distribution can be derived by elementary statistical techniques as follows:

(5.1)    P(K = 0) = (e^{-b} - e^{-2b}) / (1 - e^{-2b}),
         P(K = k) = e^{-b} b^k / [k! (1 - e^{-2b})],    k = 1, 2, ...
The mean of (5.1) is obtained as

(5.2)    m = b / (1 - e^{-2b}),

where b, the single parameter of the Sturt crossover distribution (5.1), has no simple direct interpretation except that it is a monotonic function of the mean. To obtain the value of the parameter b corresponding to a given mean, the following equation may be executed recursively: b = m(1 - e^{-2b}), where initially b is set equal to m in the right-hand side of this equation. The MLE of b cannot be obtained in closed form, but rearranging the likelihood equation leads to the following iterative solution:

(5.3)    b = u / [ (1 + e^{-2b})/(1 - e^{-2b}) - u_0 e^{-b}/(1 - e^{-b}) ],

where u = Σ_k k n(k) / Σ_k n(k) is the sample mean and u_0 is the sample proportion of gametes with zero crossovers; u/2 is a suitable initial value for b in the right side of (5.3). Note that u is not the maximum likelihood estimate of m (5.2). Now, extend this approach to the situation that the marker map only incompletely covers the chromosome. Consider the crossover distribution
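The recursion for recovering b from a target mean m can be sketched in a few lines of Python (the function name and stopping rule are our own choices; note that (5.2) implies m > 1/2, so smaller means have no solution):

```python
import math

def sturt_b_from_mean(m, tol=1e-12, max_iter=500):
    """Recover the Sturt parameter b from a target mean m > 1/2 by the
    fixed-point iteration b <- m * (1 - exp(-2b)), starting at b = m."""
    b = m
    for _ in range(max_iter):
        b_new = m * (1.0 - math.exp(-2.0 * b))
        if abs(b_new - b) < tol:
            break
        b = b_new
    return b

b = sturt_b_from_mean(1.2)
# The recovered b reproduces the target mean under (5.2): m = b/(1 - e^{-2b}).
assert abs(b / (1.0 - math.exp(-2.0 * b)) - 1.2) < 1e-9
```

The iteration is a contraction for m > 1/2, since the derivative of m(1 - e^{-2b}) with respect to b is 2m e^{-2b} < 1 near the fixed point.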
Σ_{j ≥ i'} (-1)^{|i'| + |j|} [ 1 - 2M( Σ_{k: j_k = 1} d_k ) ]  ≥  0,
where j ≥ i' means j_k ≥ i'_k = 1 - i_k, k = 1, ..., m. Putting d_1 = x, d_2 = ... = d_m = h, doing some simple manipulation and letting h ↓ 0, yields the condition
(-1)^{m-1} G^{(m-1)}(x) ≥ 0,

where G = 1 - 2M and G^{(r)} is the rth derivative of G. Thus G is completely monotone on (0, ∞), and one can also show that G(0) = 1, G'(0) = -2. It should be clear that equation (4.3) and its generalizations do indeed facilitate further mathematical development, but are they necessary constraints on a map function? The answer here must be no, and we offer three reasons why. While one must agree with Karlin and Liberman (1994, p. 212) that "it is essential and natural to operate with a general genomic region composed of a union from among the segments ...", it is neither essential nor natural that this be done via (4.3) or its generalizations. Indeed (4.3) requires that the chance of having an odd number of crossover points in the first and third of three consecutive intervals on a meiotic product is simply a function of the total map length of these two intervals, and is independent of the map distance between them. This is inconsistent with most data on interference, which indicate that the extent of the interference between two intervals decreases from its highest level when they are adjacent to a negligible level when they are well separated. Furthermore, using adjectives such as "illegitimate", "not valid" or "unrealistic" to describe map functions, whatever their motivation, for failing to be multilocus feasible must be premature, unless it has been shown that such map functions cannot arise in any probability model for recombination. As we shall see shortly, essentially all map functions currently in the literature can arise in association with stationary renewal chiasma processes and the assumption of NCI. Finally, it is still possible to derive non-trivial constraints on M from the incomplete set of equations relating values of M to multilocus recombination probabilities, without completing the set of equations in what now seems to be a somewhat arbitrary manner.
The following argument is meant to be illustrative, for a systematic study along the lines sketched below has yet to be carried out. Let us go back to the six equations involving (p_{i_1 i_2 i_3}) and values of M discussed above. A simple calculation yields the following equation:

(4.6)    M(d_1 + d_2 + d_3) - M(d_1 + d_2) - M(d_2 + d_3) + M(d_2) = 2(p_{111} - p_{101}).
T.P. SPEED

What can we learn from this? In general, perhaps not much, but under NCI it is easy to check that p_{111} ≤ p_{101}. This is a simple consequence of the generalized Mather formulae: p_{111} = (1/8) q_{111} and p_{101} = (1/8) q_{111} + (1/4) q_{101}. Thus we have shown that under NCI the left-hand side of (4.6) above is non-positive. Put d_1 = d_3 = h and d_2 = d, and divide by h²; if M is twice differentiable, we deduce that M''(d) ≤ 0. Thus map functions for processes which satisfy NCI must be bounded between 0 and 1/2, have non-negative first derivatives and non-positive second derivatives.
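These three properties are easy to verify numerically for particular map functions; here is a quick check of our own for the Kosambi map function M(d) = (1/2) tanh(2d):

```python
import math

def kosambi(d):
    # Kosambi map function M(d) = (1/2) * tanh(2d)
    return 0.5 * math.tanh(2.0 * d)

h = 1e-4
for i in range(1, 200):
    d = 0.01 * i
    M = kosambi(d)
    M1 = (kosambi(d + h) - kosambi(d - h)) / (2 * h)          # ~ M'(d)
    M2 = (kosambi(d + h) - 2 * M + kosambi(d - h)) / h ** 2   # ~ M''(d)
    assert 0.0 < M < 0.5      # bounded between 0 and 1/2
    assert M1 >= 0.0          # non-negative first derivative
    assert M2 <= 1e-6         # non-positive second derivative (up to rounding)
```

The small positive tolerance on the second difference only absorbs floating-point noise; analytically M''(d) = -4 sech²(2d) tanh(2d) ≤ 0.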
5. Connexions between map functions and chiasma processes. In the previous section we saw that a map function M which satisfies not only the constraints defining multilocus feasibility involving unions of not necessarily contiguous intervals, but also the stronger constraints corresponding to NCI, is representable as M = (1/2)(1 - G), where G is completely monotone, G(0) = 1 and G'(0) = -2. We now show that in a sense such M only arise in the context of count-location chiasma processes.

PROPOSITION 5.1. Suppose X to be a count-location chiasma process satisfying NCI. Then X has a map function M such that for any union A of intervals in [0,1) with total map length d_A, we have

(5.1)    M(d_A) = (1/2)[1 - Z_X(A)].
Conversely, suppose X to be a chiasma process satisfying NCI, with a map function M. If M satisfies (5.1) for every union A of intervals with total map length d_A, then there is a discrete distribution c and a diffuse measure F on [0,1) such that X has the same distribution as the count-location process with count distribution c and location distribution F.
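As a concrete check of the count-location connection (our own, not from the text): when the count distribution is Poisson with mean 2L, the formula M(d) = (1/2)[1 - c(1 - d/L)] recalled in Remark (a) below reduces to Haldane's map function (1/2)(1 - e^{-2d}), whatever the value of L:

```python
import math

def count_location_map_fn(pgf, L, d):
    """Map function of a count-location process (Karlin-Liberman form):
    M(d) = (1/2) * (1 - c(1 - d/L)), with c the probability generating
    function of the count distribution and L the total map length."""
    return 0.5 * (1.0 - pgf(1.0 - d / L))

L = 3.0
poisson_pgf = lambda s: math.exp(2.0 * L * (s - 1.0))  # Poisson count, mean 2L

for d in (0.1, 0.5, 1.0, 2.5):
    haldane = 0.5 * (1.0 - math.exp(-2.0 * d))
    assert abs(count_location_map_fn(poisson_pgf, L, d) - haldane) < 1e-12
```

The cancellation is exact: c(1 - d/L) = exp(2L(-d/L)) = e^{-2d}, so L drops out, consistent with the Poisson (no-interference) chiasma process having the Haldane map function.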
Remarks. (a) The first half of this proposition is in the work of Karlin and Liberman (1978, 1979); we simply recall it to set our notation in place. They showed that if X is a count-location chiasma process with count distribution c = (c_k), and we assume NCI, then X has the map function M given by

(5.2)    M(d) = (1/2)[1 - c(1 - d/L)],
where c(s) = Σ_{k≥0} c_k s^k is the probability generating function of c and 2L = Σ_{k≥0} k c_k is the mean number of crossover events on the bivalent. The same calculation that proves (5.2) also proves (5.1). (b) The second half of the proposition has also been proved previously, see Evans et al. (1993). We offer here a more analytic although less direct proof, making use of the facts concerning G = 1 - 2M listed before the statement of the proposition.

Proof of the second half. Suppose X and M to be as postulated, and define the measure F = μ_X/(2L), the normalized intensity measure of X, and the sequence c_k = (-L)^k G^{(k)}(L)/k!, k = 0, 1, ..., where G = 1 - 2M. We assert that F is a probability measure on [0,1), that c = (c_k) is a probability distribution on 0, 1, 2, ..., and that X has the same distribution as the count-location chiasma process with count distribution (c_k) and location measure F. The first two assertions are easily checked.

WHAT IS A GENETIC MAP FUNCTION?

F is clearly a probability measure on [0,1). As for the numbers c_k, they are clearly non-negative, since G must be completely monotone by the argument of the previous section. Here we make our first use of (5.1), not just for intervals A, but for unions of intervals. Furthermore, Σ_{k≥0} c_k = Σ_{k≥0} (-L)^k G^{(k)}(L)/k! = G(L - L) = G(0) = 1. We now see that the probability generating function of this discrete distribution is just G(L(1 - s)):
c(s) = Σ_{k≥0} s^k c_k = Σ_{k≥0} (-sL)^k G^{(k)}(L)/k! = G(L(1 - s)),
as stated. It follows from Remark (a) above that this count-location process with count distribution c and location density F has map function M = M_{c,F} given by

M_{c,F}(d) = (1/2)[1 - G(L(1 - (1 - d/L)))] = (1/2)[1 - G(d)] = M(d).
Since both X_{c,F} and X have the same map function, and these map functions satisfy (5.1) for unions of intervals, they have the same avoidance functions, and hence the same distribution. This completes the proof.

6. Interference, map distance and differential equations. Crossover interference was described by Sturtevant (1915) and by Muller (1916); see Foss et al (1993) for a summary of the history of this topic. The traditional measure of interference is the coincidence c, which is the ratio of the chance of simultaneous recombination across both of two disjoint intervals I_1 and I_2 on a chromosome, to the product of the marginal probabilities of recombination across the intervals:

(6.1)    c = r_{11} / [(r_{10} + r_{11})(r_{01} + r_{11})].
In this formula r_{ij} is the chance of i recombinations across interval I_1 and j recombinations across I_2, i, j = 0, 1. If there were no crossover position interference, and no chromatid interference, the coincidence would equal one. Observed coincidences tend to be near zero for small, closely linked intervals, increasing to one for more distant intervals. A number of forms of c have been used in the literature to describe the dependence of coincidence on map distance, and we refer to two such here. Haldane (1919) introduced what we call the semi-infinitesimal 3-point coincidence function (Liberman and Karlin (1984) call it the marginal coincidence function) c_3(d) = lim_{h→0} c(d, h), where c(d, h) is the coincidence between an interval I_1 of map length d and a contiguous interval I_2 of map length h. Here and in what follows we suppose that all limits exist, and are independent of the locations of the defining intervals, assumptions that are valid when chiasma processes are simple stationary point processes and NCI holds. Haldane (1919) used c_3 to obtain the following differential equation for a map function: M(0) = 0, and

(6.2)    M'(d) = 1 - 2c_3(d)M(d).
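For illustration (ours): with Kosambi's choice c_3(d) = 2M(d), equation (6.2) becomes M' = 1 - 4M², whose solution with M(0) = 0 is M(d) = (1/2) tanh(2d). A short Runge-Kutta integration recovers this:

```python
import math

def integrate_map_fn(c3_of_M, d_max, n=10000):
    """Integrate M'(d) = 1 - 2*c3*M, M(0) = 0, by classical RK4.
    Here c3 is supplied as a function of M (Kosambi: c3 = 2M), which is
    enough because this particular ODE is autonomous in M."""
    h = d_max / n
    M = 0.0
    f = lambda m: 1.0 - 2.0 * c3_of_M(m) * m
    for _ in range(n):
        k1 = f(M)
        k2 = f(M + 0.5 * h * k1)
        k3 = f(M + 0.5 * h * k2)
        k4 = f(M + h * k3)
        M += (h / 6.0) * (k1 + 2 * k2 + 2 * k3 + k4)
    return M

d = 0.8
M = integrate_map_fn(lambda m: 2.0 * m, d)
assert abs(M - 0.5 * math.tanh(2.0 * d)) < 1e-8
```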
We refer to Liberman and Karlin (1984) for more details concerning this approach to map functions, and for a variety of examples obtained by this method. Karlin (1984) lists two difficulties with the construction of map functions using (6.2), the major one being that we do not know in advance which functions c_3(d) will lead to map functions which can arise in practice. As we will see, c_3(d) = 2M(d) and c_3(d) = (2M(d))³ do lead to map functions which can arise, but there is no obvious way in which this could have been known in advance. Just as we saw in section 4 that a map function can define three-locus but not four-locus recombination probabilities, so we can see that the coincidence function c_3 can only capture aspects of the chiasma or crossover process involving three but no more loci. An alternative form of c, which we term the infinitesimal 4-point coincidence function c_4(d), is defined as lim_{h→0} lim_{k→0} c(d, h, k), where c(d, h, k) is the coincidence between intervals I_1 and I_2 of map lengths h and k respectively, separated by map distance d. This measure is called S4 by Foss et al (1993), and seems to capture a more important aspect of crossover position interference than does c_3. For example, by their construction, non-Poisson count-location processes manifest no crossover position interference. However, while c_4(d) is constant for such processes, as one might expect, c_3(d) is not constant. The latter results from the fact that the definition of c_3(d) involves a non-infinitesimal interval of length d, and so c_3(d) reflects features of the marginal probability of recombinations occurring in an interval more than the interference of recombination events.

7. Stationary renewal chiasma processes. In this section we show that stationary renewal chiasma processes, i.e. renewal chiasma processes that are stationary with respect to their intensity measure, when combined with the assumption of NCI, give rise to a large class of map functions which are not multilocus feasible in the sense of Liberman and Karlin (1984). Indeed we will see in the next section that all of the map functions proposed to date can be associated with stationary renewal chiasma processes. It follows that there are many chiasma processes with map functions M for which (5.1) holds for all intervals A, but not all unions A of intervals. We will also find that it is possible for two stationary chiasma processes to have different distributions but the same map function; indeed one can satisfy (5.1) for all unions A of intervals, implying that the map function is multilocus feasible (and more), while the other process does not satisfy (5.1) for all such A. The realism or otherwise of stationary renewal chiasma processes is discussed in section 9 below. We begin by listing a set of conditions (A) on a function M from [0, L) to [0,1), where L may be finite or infinite. These conditions and the proposition which follows are from Zhao (1995).
(A0) M(0) = 0;
(A1) lim_{d↑L} M(d) = 1/2;
(A2) M'(d) ≥ 0 for all d;
(A3) M'(0) = 1;
(A4) lim_{d↑L} M'(d) = 0;
(A5) M''(d) ≤ 0 for all d.
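As a sanity check (ours), Haldane's map function M(d) = (1/2)(1 - e^{-2d}) satisfies conditions (A) with L = ∞, and its renewal density -M'' = 2e^{-2d} is the exponential density with mean 1/2, i.e. the Poisson chiasma process:

```python
import math

# Haldane map function and its first two derivatives.
M = lambda d: 0.5 * (1.0 - math.exp(-2.0 * d))
M1 = lambda d: math.exp(-2.0 * d)          # M'
M2 = lambda d: -2.0 * math.exp(-2.0 * d)   # M''

assert M(0.0) == 0.0                                   # (A0)
assert abs(M(50.0) - 0.5) < 1e-12                      # (A1), large d
assert all(M1(0.1 * i) >= 0.0 for i in range(200))     # (A2)
assert M1(0.0) == 1.0                                  # (A3)
assert M1(50.0) < 1e-12                                # (A4), large d
assert all(M2(0.1 * i) <= 0.0 for i in range(200))     # (A5)

# The renewal density -M'' = 2*exp(-2d) should have mean 1/2.
mean = sum((0.01 * i) * 2.0 * math.exp(-0.02 * i) * 0.01 for i in range(5000))
assert abs(mean - 0.5) < 1e-2
```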
We note in passing that if L = ∞, then (A4) follows easily from the other conditions. However the (Morgan) map function M(d) = d, 0 ≤ d ≤ 1/2, shows that (A4) is needed in the following proposition.

PROPOSITION 7.1. Let M be the map function for a stationary renewal chiasma process satisfying NCI on a chromosome arm of infinite map length. Then M satisfies conditions (A). Conversely, suppose that a function M : [0, L) → [0,1) satisfies conditions (A), where L may be finite or infinite. Then there is a stationary renewal chiasma process satisfying NCI whose map function is M. In both cases, the renewal density is -M''.

Proof. Suppose that X is such a stationary renewal chiasma process with renewal density f. Without loss of generality we may suppose that the mean inter-arrival time is 1/2, so that the metric with respect to which the process is stationary is that defining map distance. If F is the cumulative distribution function of f, then the residual lifetime density of the process is 2(1 - F) and the avoidance function for an interval I of map length d is thus

Z_X(I) = ∫_d^∞ 2(1 - F(y)) dy.
By Mather's formula (3.2), we have:

(7.1)    M(d) = (1/2)[1 - Z_X(I)] = ∫_0^d (1 - F(y)) dy.

Conditions (A) are now easily checked. Conversely, suppose that we have a function M satisfying conditions (A). We can see that -M''(y) ≥ 0 by (A5). Further, by (A3) and (A4),

∫_0^L -M''(y) dy = M'(0) - M'(L) = 1,

and

-∫_0^L y M''(y) dy = [-y M'(y)]_0^L + ∫_0^L M'(y) dy = 1/2,

by (A4), (A0), and (A1). Finally, since by (A1) ∫_d^L 2M'(y) dy = 1 - 2M(d), we obtain

M(d) = (1/2)[1 - ∫_d^L 2M'(y) dy].
Thus M is the map function associated with the stationary renewal chiasma process with renewal density -M'' having mean 1/2 and residual lifetime density 2M'. This completes our proof.

As indicated in the introduction to this section, this proposition allows a very wide range of functions to arise as map functions; we will give examples in the next section. It is interesting to note that map functions M = (1/2)[1 - G], where G is completely monotone, G(0) = 1 and G'(0) = -2, also satisfy conditions (A) when we permit L = ∞. It is immediate that such M satisfy (A0), (A2), (A3) and (A5). To see that they also satisfy (A1) and (A4), it is easiest to use the representation of such a G as the Laplace transform of a positive measure, i.e. to represent M in the form

M(d) = (1/2)[1 - ∫_0^∞ e^{-sd} ν(ds)],

where ν is a probability measure on [0, ∞) with mean 2.
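The converse construction can also be checked by simulation (our sketch; the Gamma(2, rate 4) inter-arrival density, which has mean 1/2, is an arbitrary choice). Under NCI, Mather's formula gives M(d) = (1/2)[1 - Z_X(I)], which we compare with the closed form from (7.1):

```python
import math
import random

random.seed(1)

def residual_draw():
    """First point of the stationary renewal process with Gamma(2, rate 4)
    inter-arrival density (mean 1/2): the residual lifetime density is
    2*exp(-4y)*(1+4y), a 50/50 mixture of Exp(4) and Gamma(2, 4)."""
    if random.random() < 0.5:
        return random.expovariate(4.0)
    return random.expovariate(4.0) + random.expovariate(4.0)

d = 0.3
n = 200_000
Z = sum(residual_draw() > d for _ in range(n)) / n   # avoidance probability
M_sim = 0.5 * (1.0 - Z)                              # Mather's formula

# Closed form from (7.1): M(d) = int_0^d (1 - F(y)) dy with
# 1 - F(y) = exp(-4y)*(1 + 4y) for the Gamma(2, 4) distribution.
M_exact = 0.5 * (1.0 - math.exp(-4.0 * d)) - d * math.exp(-4.0 * d)
assert abs(M_sim - M_exact) < 0.008
```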
where l is the length of the chromosome. If the search includes the entire genome, by the independent assortment of chromosomes at meiosis, the overall p-value is approximated by

(2.6)    1 - Π_{i=1}^C exp{ -l_i λ (2b - N) P(X_t = b) } = 1 - exp{ -λ (2b - N) P(X_t = b) [Σ_{i=1}^C l_i] },

where the product runs over the C chromosomes, l_i is the length of chromosome i, and P(X_t = b) is the binomial(N, p_0) point probability at the threshold b.
Similar approximations are available for other types of relative pairs. For half siblings, the process is still Markovian. However, for first cousins and avuncular pairs, the process is a function of Markov processes, but simple approximations for the p-value are still available. Feingold (1993) treated the cases of siblings separately since they can be IBD on one or two chromosomes and are somewhat more complicated to handle. The reader is referred to the original paper for more details. Problems such as providing confidence intervals and combining various types of relatives are more complicated. Gaussian approximations to the Markov processes were introduced by Feingold et al. (1993) to get some insight into these more complex problems. They require larger sample sizes but are easier to work with as will be seen shortly. Again, we outline the
JOSEE DUPUIS
grandparent-grandchild case in detail and comment briefly on the generalization to other types of relatives. Let

Z_t = (X_t - Np_0)/√N,

which is just the scaled version of the test used in the Markov chain framework. As the number of pairs (N) increases, Z_t converges in distribution to an Ornstein-Uhlenbeck process with covariance function R(t) = σ² e^{-β|t|}, where β = 2λ and σ² = p_0(1 - p_0) = 1/4. The mean function is derived as follows. On chromosomes not containing the disease susceptibility gene locus r, Z_t has mean 0. On the chromosome with the locus r,

E(Z_t) = (1/√N)[E(X_t) - Np_0] = √N a p_0 e^{-β|t-r|} = ξ e^{-β|t-r|},

where ξ = √N a p_0. Feingold et al. (1993) showed that the likelihood ratio statistic for testing for the presence of a gene locus is max_t Z_t/σ, which is equivalent to the test for the Markov process version. One can find p-values (and hence define a threshold for the test) and power approximations easily using Gaussian theory. The authors suggested the following two approximations, which were shown to be quite accurate through simulations:
(2.7)    P_0{ max_{0≤t≤l} Z_t/σ > b } ≈ 1 - Φ(b) + β l b φ(b),
and a corresponding approximation, (2.8), for the power of the test, where l is the length of the chromosome in centimorgans (cM) and φ(x) and Φ(x) are the standard normal density and distribution functions. The covariance parameters appropriate for each type of relative pair are collected in Table 2.4.

TABLE 2.4
Covariance function R(t), exponential rate β and variance σ² for grandparent-grandchild (G), half-sibling (HS), avuncular (N), first-cousin (C) and sibling (S) pairs.

Pair    R(t)                                      β        σ²
G       (1/4) e^{-2λ|t|}                          2λ       1/4
HS      (1/4) e^{-4λ|t|}                          4λ       1/4
N       (1/8)[e^{-4λ|t|} + e^{-6λ|t|}]            5λ       1/4
C       (1/16)[e^{-4λ|t|} + 2 e^{-6λ|t|}]         16λ/3    3/16
S       (1/8) e^{-4λ|t|}                          4λ       1/8
For the case of siblings, let

Z_{1t} = (X_{1t} - N/2)/√N,    Z_{2t} = (X_{2t} - N/4)/√N,

where X_{1t} and X_{2t} are the number of pairs IBD on one or two chromosomes, respectively. The test statistic becomes

max_t (Z_{1t}/2 + Z_{2t})/σ,
and the approximations can be used with the values provided in Table 2.4. If the sample consists of different types of relative pairs, the Z statistic can be calculated for each relative type separately and combined with appropriate weights. Feingold et al. (1993) discussed the issue of optimal weights and how to modify slightly equations (2.7) and (2.8) to find the threshold and power of the test combining relative pairs. For a numerical example, let λ_0 = 5 and the type-I error be 0.05 for a genome-wide search. When using thresholds of 4.15, 3.9, 4.08, 4.14 and 4.08 for C, G, HS, N and S respectively, we find that 31, 43, 47, 49 and 66 pairs are required to ensure a power of 80%. Feingold reached the same conclusion as Elston as to the efficiency of the various relative pairs, namely that G > HS > N > S; however, siblings are most efficient for small values of λ_0, and cousins could be most or least efficient depending on the value of λ_0.
When a set of discrete markers is used, as opposed to a continuous specification of identity by descent, the power to detect a trait locus located mid-way between markers can be greatly reduced. To remedy this situation, Lander and Botstein (1986) proposed a method they called interval mapping to exploit the information provided by flanking markers in calculating the likelihood at any point on the genome. See Dupuis (1994) for more details on how to implement the interval mapping procedure in the present context and for a simulation study of the power of interval mapping. Feingold's methods are most appropriate for qualitative phenotypes such as affected/non-affected. For continuous traits, such as blood pressure or severity of a particular disease, the phenotype is better modeled as a continuous variable. Statistical methods for genome-wide searches for quantitative trait loci have been developed and applied. In the next section, we review QTL methods for experimental organisms.

3. QTL in experimental organisms. Lander and Botstein (1989) provided a method of searching the whole genome for QTLs. Their method relies on being able to arrange a cross (i.e. intercross or backcross) between two inbred strains with a large difference in the mean value of the phenotype of interest. The organisms are assumed to be homozygous at each locus. A regression model is used to express the phenotype as a function of the genotype as
(3.1)    y_i = c + a g_i(r) + d I_{(g_i(r)=1)} + e_i,

where y_i is the phenotype of individual i, g_i(r) = 0, 1 or 2 is the number of alleles at locus r coming from a predetermined original strain, and e_i is the error term. The parameters c, a, d and r are unknown. One important assumption of the model is that g_i(r) and e_i are independent, i.e. the variance in phenotype is the sum of the environmental and genetic variances. In the case of a backcross design or in the absence of a dominance effect, the model reduces to
(3.2)    y_i = c + a g_i(r) + e_i.
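A minimal simulation of the backcross model (3.2) (ours; parameter values are arbitrary, and the marker is placed at the trait locus so that g_i is observed) shows the least squares estimate of a, which for a 0/1 regressor is just a difference of group means:

```python
import random

random.seed(0)

# Backcross: g = 0 or 1 with probability 1/2 each; y = c + a*g + e.
c_true, a_true, sigma_e = 10.0, 2.0, 1.0
n = 5000
g = [random.randint(0, 1) for _ in range(n)]
y = [c_true + a_true * gi + random.gauss(0.0, sigma_e) for gi in g]

# Least squares with a 0/1 regressor reduces to a difference of group means.
y1 = [yi for yi, gi in zip(y, g) if gi == 1]
y0 = [yi for yi, gi in zip(y, g) if gi == 0]
a_hat = sum(y1) / len(y1) - sum(y0) / len(y0)
assert abs(a_hat - a_true) < 0.2
```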
The model generalizes easily to include more than one locus. Moreover, when the model does not contain all contributing loci, the variance of the error terms is inflated by the genetic variation of the excluded loci. Lander and Botstein developed a statistical test for the backcross design, which is equivalent to testing the hypothesis a = 0. Assuming that the e_i's are normally distributed with known variance σ_e², the log likelihood ratio statistic is

max_t n a(t)² / (8σ_e²),
TRAIT MAPPING USING A DENSE SET OF MARKERS
where a(t) is the least squares estimate of the additive effect at r = t. If one wants to work with the lod score instead of the likelihood ratio, the following conversion applies:

lod(t) = (n/(8σ_e²)) (log_10 e) a(t)².
Note that a(t) can be estimated via least squares only when g_i(t) is known, i.e. when there is a marker at t. The E-M algorithm (Dempster et al. (1977)) can be used to calculate lod(t) between markers, with the genotype treated as the "missing" observations. This method of calculating the lod score between markers is called interval mapping by the authors and can be quite computer intensive. Haley and Knott (1992) suggested a simple approximation to the interval mapping step using linear regression. Either method can be used to find the most likely location of the QTLs. The usual threshold of 3.0 applied to the lod score is not appropriate when searching the entire genome for QTLs. If an overall significance level of γ is desired, an appropriate threshold to use would be the value b that satisfies

P( max_{1≤c≤C} max_t lod_c(t) > b ) = γ,
where C is the number of chromosomes. Lander and Botstein (1989) noticed that √n a(t)/2 tends in distribution to an Ornstein-Uhlenbeck process (as n → ∞). By the central limit theorem, this holds even when the e_i's are not normally distributed. Using this asymptotic distribution, they calculated the thresholds for the test with continuous markers using the approximation

P( max_{1≤c≤C} max_t lod_c(t) > b ) ≈ (C + 2Gb') χ²(b'),

where G is the genetic length in Morgans, b' = (2 log 10) b and χ²(x) is the tail probability of a chi-square distribution with 1 degree of freedom. For discrete maps with markers equispaced every Δ centimorgans, Lander and Botstein provided approximate threshold values based on an extensive simulation. However, approximation (2.7) can be used with a discreteness correction factor to obtain thresholds that are very similar to the ones provided by Lander and Botstein, without the need for computer simulations. For example, for a genome with 10 chromosomes of length 100 centimorgans, Lander and Botstein's simulations gave thresholds of 2.8, 2.7, 2.5 and 2.4 for discrete markers with Δ = 1, 5, 10 and 20 centimorgans, while approximation (2.7) gives thresholds of 2.9, 2.7, 2.6 and 2.4. Lander and Botstein's method was applied successfully to find a QTL for blood pressure in rats (Jacob et al. (1991)) and QTLs for soluble-solids concentration, mass per fruit and pH in tomatoes (Paterson et al. (1991)). The tomato data were obtained through an intercross design, which allowed
for the estimation of the dominance effects. However, only the tests for additive effects were performed, since the thresholds for the full model with both additive and dominance effects were not provided by Lander and Botstein (1989). For the full model with dominance effect, the lod score is

max_t (n/(4σ_e²)) (log_10 e) [a(t)² + d(t)²/2],
where a(t) and d(t) are the least square estimators of the additive and dominance effects of model (3.1). By virtue of the intercross design, g(t) and I_{(g(t)=1)} are orthogonal vectors, so that a(t) and d(t) are independent. Dupuis (1994) gave a good approximation, involving the chromosome length l, for both the case of continuous data and that of discrete data. The approximation is based on the fact that as n → ∞, the scaled processes a(t) and d(t) tend in distribution to independent Ornstein-Uhlenbeck processes with covariance functions e^{-2λ|t|} and e^{-4λ|t|}, respectively. For the tomato genome (12 chromosomes of approximate length 100 centimorgans), the approximation gives thresholds of 3.14, 3.35, 3.56 and 3.87 for discrete maps of markers at 20, 10, 5 and 1 centimorgan apart, respectively. As expected, these are greater than the thresholds for the test for an additive effect alone, which were 2.48, 2.64, 2.78 and 2.99.

Once it has been established that a QTL influences the trait, a confidence interval for the locus would provide a chromosomal region in which to concentrate the search for the exact location of the QTL. Confidence regions for the gene locus are usually constructed using lod support intervals. In traditional linkage analysis, a 1-lod support interval corresponds approximately to a 95% confidence interval (Ott, p. 67). Unfortunately, no similar relation exists between the lod support interval in the present context and fixed coverage confidence regions. Dupuis (1994) studied the use of lod support intervals, likelihood methods and Bayesian credible sets to construct confidence regions. We discuss each method briefly. An x-lod support interval includes all locations v on a chromosome such that

lod(v) ≥ max_t lod(t) - x.

For models with or without dominance effects, Dupuis (1994) found, in a simulation study, that a 1.5-lod support interval corresponds approximately to a 95% confidence interval when the map of markers is very dense (~ 1 cM). For more sparse maps (~ 20 cM), a 1-lod support interval usually provides
95% coverage. The lod support intervals are easy to compute. However, care must be exercised when looking at 1-lod support intervals, as they do not always correspond to 95% confidence regions. A set which has posterior probability 1 - γ can also be used as a confidence region. Such sets, called Bayesian credible sets, have been shown to have good frequentist properties by Zhang (1991) in a context similar to QTL mapping. A Bayesian credible set is constructed by choosing c_γ such that

B_γ = {r : π(r|y, g) > c_γ}    and    ∫_{B_γ} π(r|y, g) dr = 1 - γ.
Here y = {y_1, ..., y_n}, g = {g_1, ..., g_n} and g_i is the set of all marker genotypes for individual i. The posterior probability π(r|y, g) is often easy to compute and depends on the prior distribution on the location r and the additive and dominance effects a and d. If one takes uninformative priors on all parameters,

π(r|y, g) = e^{||Z_r||²/2} / ∫_0^l e^{||Z_s||²/2} ds,

where

Z_t = (x_t, y_t)' = ( √n a(t)/(√2 σ_e), √n d(t)/(2σ_e) )'
and a(t) and d(t) are the least square estimates or the interval mapping equivalent when the QTL is located at r. If one takes a bivariate normal prior on the effect sizes with means θ_1 and θ_2, variances η_1² and η_2² and null correlation, then

π(r|y, g) = e^{(x_r + θ_1/η_1²)²/(2(1 + 1/η_1²))} e^{(y_r + θ_2/η_2²)²/(2(1 + 1/η_2²))} / ∫_0^l e^{(x_s + θ_1/η_1²)²/(2(1 + 1/η_1²))} e^{(y_s + θ_2/η_2²)²/(2(1 + 1/η_2²))} ds.

Other priors give similar results. As one might expect, the more appropriate the prior is to the data, the tighter the confidence region is for a given confidence level.
Yet another way of providing a confidence interval for a QTL relies on using likelihood methods for change points (Siegmund 1988). The derivative of the likelihood ratio function is discontinuous at the locus r, making it a change point by definition. We show how to use likelihood methods by establishing a correspondence between the test for the presence of a QTL and a confidence region. The acceptance region for the test of a gene locus at v has the form

A_v = { max_t ||Z_t||² ≤ k + ||Z_v||² }.

Since the conditional probability of A_v given Z_v does not depend on the additive effect a nor the dominance effect d, we can choose k such that

P( max_t ||Z_t||² > k + ||Z_v||² | Z_v ) = γ.

The set of values v that are not rejected by the likelihood ratio test forms a (1 - γ)100% confidence region for the gene locus. It is not necessary to solve for k, since v belongs to the region exactly when

(3.3)    P( max_t ||Z_t||² > (max_t ||Z_t||²)_{obs} | Z_v ) ≥ P( max_t ||Z_t||² > k + ||Z_v||² | Z_v ) = γ.

If we use the approximation
(3.4)    P{ max_{0≤t≤l} ||Z_t|| > b | Z_0 = x = (x_1, x_2)' }

in conjunction with equation (3.3), we can construct a confidence interval (3.5) by including all points v that satisfy the resulting inequality. Equation (3.4) depends on the assumption that the process ||Z(t)||² is the sum of the squares of two Ornstein-Uhlenbeck processes, which is not satisfied by the interval mapping process, so that the likelihood procedure is most helpful for dense maps of markers. One advantage of the likelihood method is that it can easily be modified to provide a joint confidence region for the location of the QTL and its additive and dominance effects. A simulation study (Dupuis 1994) showed that there is not much difference between the Bayes credible set and the lod support interval (modified
to have the correct probability coverage) in terms of size of the confidence region and the probability of including the true locus. However, as one expects, the Bayes credible method performs better than the lod support when the prior is appropriate to the data and not so well when the prior is less appropriate. The likelihood method gave much wider intervals than both the lod support and the Bayesian sets.

In a more general setting, Kong and Wright (1994) present a series of interesting results, which we summarize. The main focus of their paper is to identify a small region containing a gene locus. They assume that the locus has been located on a chromosome of length 1 and they concentrate their efforts on the chromosome of interest. Their data consist of n backcross individuals. At any location d on the genome of individual i, g_i(d) = 0 or 1. Kong and Wright (1994) define the distribution of the phenotypes to be
f(y_i | g_i(d) = 0) = f_0(y_i)    and    f(y_i | g_i(d) = 1) = f_1(y_i).

Note that this formulation can include qualitative phenotypes such as affected or not, and is a generalization of Lander and Botstein's additive model, where f_0(y_i) is the normal density with mean μ and variance σ² and f_1(y_i) is a normal density with the same variance and mean μ + a. The authors wrote the likelihood as
L(d) = Π_{i=1}^n L_i(d) = Π_{i=1}^n [ f_0(y_i) P(g_i(d) = 0 | g_i) + f_1(y_i) P(g_i(d) = 1 | g_i) ],

and the log-likelihood as

l(d) = Σ_{i=1}^n l_i(d) = Σ_{i=1}^n log[ f_0(y_i) P(g_i(d) = 0 | g_i) + f_1(y_i) P(g_i(d) = 1 | g_i) ],
where g_i is the observable genotype information (i.e. marker genotypes). Kong and Wright (1994) look at the rate of convergence of the likelihood ratio and the log likelihood ratio under three different marker densities: δ = 0 (or continuous specification of the genotype), δ = o(1/n) and δ = cn^{-s}, 0 ≤ s < 1. Here δ is the distance between markers once the chromosome has been rescaled to have length 1. We present their result for the first case in detail and mention the implications of the study for less dense maps. All of their results assume that the Haldane map function holds. The authors showed that if n|d_n - r| → ∞ as n → ∞, then

L(d_n)/L(r) →_p 0,

where d_n is a sequence of locations which depends on the sample size n (see Result 2 of Kong and Wright (1994)).
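This behaviour can be observed in a small simulation (entirely our own construction: f_0 = N(0,1), f_1 = N(1,1), genotypes observed continuously along a chromosome of length 1 with Poisson crossovers):

```python
import random

random.seed(2)

# Backcross phenotypes with f0 = N(0,1), f1 = N(1,1), genotypes observed
# continuously (delta = 0), crossovers from a Poisson process of rate 2,
# and the true locus at r = 0.4. All values are illustrative.
n, rate, r_true = 400, 2.0, 0.4

def genotype_path():
    """0/1 genotype along [0,1]: random start, flips at Poisson points."""
    flips, t = [], 0.0
    while True:
        t += random.expovariate(rate)
        if t > 1.0:
            break
        flips.append(t)
    g0 = random.randint(0, 1)
    return lambda d, g0=g0, flips=tuple(flips): (g0 + sum(f <= d for f in flips)) % 2

paths = [genotype_path() for _ in range(n)]
y = [paths[i](r_true) + random.gauss(0.0, 1.0) for i in range(n)]

def loglik(d):
    # log-likelihood l(d), dropping terms common to all d
    return sum(-0.5 * (y[i] - paths[i](d)) ** 2 for i in range(n))

grid = [j / 100 for j in range(101)]
d_hat = max(grid, key=loglik)
assert abs(d_hat - r_true) <= 0.1
```

With the genotype known at every location, the mixture weights are degenerate and l(d) is a plug-in normal log-likelihood; each individual whose genotype differs between d and the true locus penalizes d, which is what drives the fast localization.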
This convergence result implies that the rate of convergence of the maximum likelihood estimate of the gene location to the true locus is O(1/n). This result holds for continuous markers only; different rates of convergence were obtained for equispaced markers. To derive the above results, the authors made the following observations. Let P_d be the probability of observing the phenotypes {y_1, ..., y_n} and genotypes {g_1, ..., g_n}, given that there exists a gene locus at d. Then the Kullback-Leibler (KL) distance between P_d and P_r is

K(P_d, P_r) = n θ_{d,r} { K(f_0, f_1) + K(f_1, f_0) }.
Here θ_{d,r} is the recombination fraction between d and r. This equation, combined with the fact that θ_{d,r} ∼ |d - r| when small, gives the convergence result. More general results are available when d_n = r + t/n, where 0 ≤ t ≤ T for some fixed T > 0. The authors showed that the processes l(r + t/n) - l(r) and l(r - t/n) - l(r) converge weakly under the uniform metric to independent compound Poisson processes with point intensity 1. The increment distribution is known and depends on the KL distance between f_0 and f_1. Convergence rates for δ = o(1/n) and δ = cn^{-s}, 0 ≤ s < 1, vary between O(1/√n) and O(1/n). From the results for less dense maps of markers, the authors concluded that for maximum efficiency n, the number of backcross individuals, and m, the number of markers, should be of comparable sizes. Kong and Wright (1994) commented briefly on the problems of misspecification of f_0 and f_1, or of the f_i being known only up to a nuisance parameter. They argue that their results still hold in those cases. However, more work needs to be done in the case of polygenic traits.

4. Quantitative trait loci in humans. Establishing linkage for quantitative traits in humans is more complex than in experimental organisms, since the environmental factors can't be controlled by a carefully selected breeding design. Nevertheless, methods for QTL linkage in pedigrees have been studied by Boehnke (1990) and Demenais et al. (1988), amongst others. In this review we restrict our attention to the use of sibling pairs, as opposed to whole pedigrees, for finding QTLs. We begin with a description of Haseman and Elston's (1972) method for testing linkage to a QTL using identity-by-descent scores from sibling pairs. We then discuss Goldgar's (1990) and Guo's (1994) modifications of the original method. Haseman and Elston assume that a gene locus with two alleles, B and b, with respective frequencies p and q, is influencing the quantitative trait of interest in the following way.
For n sibling pairs, they wrote the phenotypes of the ith pair as

X_{i1} = μ + g_{i1} + e_{i1},
X_{i2} = μ + g_{i2} + e_{i2},
TRAIT MAPPING USING A DENSE SET OF MARKERS
where μ is the overall mean and g_{ij} and e_{ij} are the genetic and environmental effects. They defined

g_{ij} =  a    for a BB individual,
          d    for a Bb individual,
         −a    for a bb individual.
Note that this notation is consistent with Lander and Botstein's full model with the overall mean shifted by an amount −a in the present case. One can see that the additive and dominance variances can be written as

σ_a² = 2pq[a − d(p − q)]²,    σ_d² = 4p²q²d²,
and the total genetic variance σ_g² is the sum of the above two components. Moreover, σ_e² = E(e_j²) = E[(e_{j1} − e_{j2})²], which is a function of the environmental variance and the covariance between siblings. The authors showed that, based on the proportion of genes identical-by-descent (π_j) at the quantitative trait locus, the conditional expectation of the squared difference Y_j = (X_{j1} − X_{j2})² in sibling phenotypes is as follows:
E(Y_j | π_j) = σ_e² + 2σ_a² + 2σ_d²    if π_j = 0,
             = σ_e² + σ_a² + 2σ_d²     if π_j = 1/2,
             = σ_e²                    if π_j = 1.
If σ_d² = 0, one can write E(Y_j | π_j) = α + β π_j for π_j = 0, 1/2, 1, with β = −2σ_a². When σ_d² ≠ 0, E(β̂) → −2σ_g² as n → ∞. Therefore, we assume that σ_d² = 0 for the rest of this section. If π_j is unknown but is instead estimated from a marker at a recombination fraction θ from the true QTL, Haseman and Elston showed that

E(Y_j | π̂_j) = γ + (1 − 2θ)² β π̂_j,

where γ depends on θ, α, and β,
JOSEE DUPUIS
and π̂_j is the estimated identity-by-descent proportion at the marker locus. One can use regression with π_j or π̂_j to test the hypothesis σ_a² = 0. When π̂_j is used, simple regression does not allow for the estimation of both σ_a² and θ, the recombination fraction between the marker and the QTL. However, the authors devised a more complicated scheme to estimate both parameters via maximum likelihood, which will not be discussed here. The method proposed by Haseman and Elston (1972) was developed under the assumption that only one gene influences the quantitative trait, and a single marker is then tested for linkage to the QTL. However, quantitative traits are often influenced by more than one locus. This led Goldgar (1990) to extend the method proposed by Haseman and Elston in the following way. First, he allowed in the model for more than one gene influencing the quantitative trait. Second, instead of testing a single marker for linkage, he tested linkage to a region of the chromosome that could be as small as a marker or as large as an entire chromosome. Moreover, Goldgar's method is not restricted to sibling pairs but applies to sibships of any size (≥ 2). Goldgar (1990) assumes that the phenotype is determined by a genetic effect due to a chromosomal region of interest (C), genetic effects on other chromosomes (A) and some random environmental effects (E), assumed normally distributed. The total phenotypic variance is expressed as
V_T = V_C + V_A + V_E

and is assumed equal to unity. The heritability of the trait,

h² = (V_A + V_C)/V_T,
is also assumed to be known and hence the total genetic variance, V_G = V_C + V_A, is also known. Instead of taking the difference in phenotypes between siblings, Goldgar looked at the covariance between the phenotypic values of siblings i and j and showed that

Cov(X_i, X_j) = R_ij V_C + V_A/2 = [R_ij P + (1 − P)/2] V_G,

where R_ij is the true proportion of the genome that is identical-by-descent between siblings i and j in the region of the chromosome under study and P = V_C/V_G. Unless the data come from Genomic Mismatch Scanning, R_ij is not observed. However, the mean and variance of R_ij can be derived conditionally on the identity-by-descent status of the markers in the region C of the chromosome of interest. If we denote by R_m and R_p the proportions of the genome identical-by-descent between the siblings on the maternal and paternal chromosomes respectively, Goldgar used

R* = [E(R_m) + E(R_p)]/2
to estimate R, where the expectation is conditional on the markers in the region C. Under the above assumptions, the likelihood of the observed phenotypic values for a sibling pair or a sibship is given by a multivariate normal distribution with mean 0 and covariance matrix Σ with entries

Σ_ij = [R_ij P + (1 − P)/2] V_G    for i ≠ j,
Σ_ij = V_T                         for i = j.
For n independent families, the likelihood is the product of n such multivariate normal distributions. Numerical optimization techniques are used to obtain the maximum likelihood estimate for P and to test P = 0, which is equivalent to testing linkage to the region of the chromosome C under study. Goldgar proposed using a χ² approximation to the likelihood ratio test for this purpose. In a simulation study, Goldgar showed that the likelihood ratio test was, on average, 50-80% more powerful than the Haseman and Elston regression method. The appendix of Goldgar's 1990 paper is devoted to the calculation of E(R) and V(R), conditional on the marker information. Guo (1994) presented a different approach to calculating E(R) and V(R). Assuming the Haldane mapping function, Guo (1994) represented the maternal and paternal chromosomes of the siblings as independent two-state Markov chains taking value 1 when they are identical by descent and value 0 otherwise. Using this representation, Guo (1994) expressed E(R) and E(R²) as stochastic integrals of simple functions. His method for calculating the mean and variance of the proportion of the genome shared identical by descent between siblings is simpler to evaluate than Goldgar's and is more general. Both methods yield identical results. Both Haseman and Elston's regression approach and Goldgar's method are aimed at testing a specific marker or chromosomal region for linkage to a quantitative trait locus. When a search of the entire genome is envisioned, other methods might be more appropriate. Fulker and Cardon (1994) extended the Haseman and Elston procedure to perform a global search. Fulker and Cardon (1994) used a method similar to that of Haley and Knott (1992) to exploit the information from flanking markers in a map of linked markers spanning the whole genome. This method, dubbed interval mapping, has the advantage of producing an estimate of the location of the quantitative trait locus and of its effect.
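The sib-pair likelihood computation just described can be sketched numerically; this is a minimal illustration, not Goldgar's implementation (the function names, the grid search over P, and all parameter values are our own choices, and the R* values are taken as known):

```python
import numpy as np

def sib_pair_loglik(x, r_star, P, VG, VT=1.0):
    """Log-likelihood of a sib pair's phenotypes x = (x1, x2): bivariate
    normal, mean 0, variance VT, covariance [R* P + (1-P)/2] VG."""
    off = (r_star * P + (1.0 - P) / 2.0) * VG
    cov = np.array([[VT, off], [off, VT]])
    _, logdet = np.linalg.slogdet(cov)
    quad = x @ np.linalg.solve(cov, x)
    return -0.5 * (2.0 * np.log(2.0 * np.pi) + logdet + quad)

def profile_P(pairs, r_stars, VG, grid=np.linspace(0.0, 1.0, 101)):
    """Maximize the total log-likelihood over P by grid search; return
    (P_hat, likelihood ratio statistic against P = 0)."""
    total = lambda P: sum(sib_pair_loglik(x, r, P, VG)
                          for x, r in zip(pairs, r_stars))
    logliks = [total(P) for P in grid]
    i = int(np.argmax(logliks))
    return grid[i], 2.0 * (logliks[i] - total(0.0))
```

The likelihood ratio statistic would then be compared to a χ² quantile, as in Goldgar's test of linkage to the region C.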
From a simulation study, Fulker and Cardon showed that there is an increase in power from using interval mapping. All simulations were carried out under the assumption that only one locus influences the trait. The authors concluded that the interval mapping method for global search is most efficient in using a coarse map of linked markers to identify candidate loci.
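As a toy illustration of the sib-pair regression idea underlying Haseman and Elston's method, one can simulate squared sib-pair differences directly from the conditional mean E(Y_j | π_j) with σ_d² = 0 and recover the slope β = −2σ_a² (the parameter values and noise level are invented for the example, and Y is drawn around its conditional mean rather than from simulated genotypes):

```python
import numpy as np

rng = np.random.default_rng(1)
n_pairs = 20000
sigma_a2 = 1.0   # additive variance at the trait locus (d = 0)
sigma_e2 = 1.0   # variance of the sib-pair environmental difference

# IBD sharing at the trait locus: 0, 1/2, 1 with probabilities 1/4, 1/2, 1/4.
pi = rng.choice([0.0, 0.5, 1.0], size=n_pairs, p=[0.25, 0.5, 0.25])

# Squared sib difference drawn around its conditional mean
# E(Y | pi) = sigma_e2 + 2*(1 - pi)*sigma_a2, so the slope in pi is -2*sigma_a2.
y = sigma_e2 + 2.0 * (1.0 - pi) * sigma_a2 + rng.normal(0.0, 0.5, size=n_pairs)

# Least-squares slope of y on pi: a significantly negative estimate is
# evidence of linkage, and here it should be near -2*sigma_a2.
slope = np.polyfit(pi, y, 1)[0]
```

A one-sided test of a negative slope against zero corresponds to the linkage test σ_a² = 0 described above.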
5. Discussion. We have described methods to map qualitative and quantitative traits in humans using relative pairs, and quantitative traits in experimental organisms using backcross and intercross mating designs. The mathematical problems involved in mapping disease susceptibility genes using pairs of relatives are very similar to those for mapping QTLs in experimental organisms. The methods studied assumed a map of fully informative markers. Risch (1990c) studied the effect of marker polymorphism on the power of the test using relative pairs to detect linkage to a disease susceptibility gene. He suggested a two-stage strategy that would involve typing more family members at the markers showing at least suggestive evidence of linkage, in order to improve the informativeness of the markers and the power of the test. Elston (1992) discussed ways of modifying his optimal design to accommodate less polymorphic markers, and Goldgar (1990) and Guo (1994) allow for non-informative markers in their calculations of E(R) and V(R). For experimental organisms, it is usually easier to choose the pure line strains so that all the markers are close to being fully informative. The two-stage design proposed by Elston (1992) could also be beneficial for QTL mapping in experimental organisms. However, obtaining more experimental organisms is usually much easier than recruiting more affected relative pairs, and having an economical design may not be as much of an issue. The methods were presented under the simple assumption that only one gene affects the trait. However, most of the methods in this paper extend easily to multilocus diseases. See Risch (1990b) and Dupuis (1994) for the case of polygenic diseases and Jansen (1993) for the case of quantitative traits influenced by more than one locus in experimental organisms. Goldgar's method for mapping quantitative traits in humans allows for some genetic effects to be present on chromosomes other than the one under study.
For a thorough review of methods for mapping complex traits, see Lander and Schork (1994).

REFERENCES

Aldous D (1989) Probability Approximations via the Poisson Clumping Heuristic, Springer-Verlag, New York.
Blackwelder WC, Elston RC (1985) A comparison of sib-pair linkage tests for disease susceptibility loci, Genet Epidemiol 2: 85-97.
Boehnke M (1990) Sample-size guidelines for linkage analysis of a dominant locus for a quantitative trait by the method of lod scores, Am J Hum Genet 47: 218-227.
Chakravarti A, Badner JA, Li CC (1987) Test of linkage and heterogeneity in Mendelian disease using identity by descent scores, Genet Epidemiol 4: 255-266.
Demenais F, Lathrop GM, Lalouel JM (1988) Detection of linkage between a quantitative trait and a marker locus by the lod score method: sample size and sampling considerations, Ann Hum Genet 52: 237-246.
Dempster AP, Laird NM, Rubin DB (1977) Maximum likelihood from incomplete data via the EM algorithm, Journal of the Royal Statistical Society B 39: 1-22.
Dupuis J (1994) Statistical methods associated with mapping quantitative and complex traits from genomic mismatch scanning data, Ph.D. thesis, Stanford University.
Elston RC (1992) Designs for the global search of the human genome by linkage analysis, Proc Intern Biometric Conf: 39-51.
Feingold E (1993) Markov processes for modeling and analyzing a new genetic mapping method, J Appl Probab 30: 766-779.
Feingold E, Brown PO, Siegmund D (1993) Gaussian models for genetic linkage analysis using complete high-resolution maps of identity-by-descent, Am J Hum Genet 53: 234-251.
Fulker DW, Cardon LR (1994) A sib-pair approach to interval mapping of quantitative trait loci, Am J Hum Genet 54: 1092-1103.
Goldgar DE (1990) Multipoint analysis of human quantitative genetic variation, Am J Hum Genet 47: 957-967.
Guo S-W (1994) Computation of identity-by-descent proportions shared by two siblings, Am J Hum Genet 54: 1104-1109.
Haley CS, Knott SA (1992) A simple regression method for mapping quantitative trait loci in line crosses using flanking markers, Heredity 69: 315-324.
Haseman JK, Elston RC (1972) The investigation of linkage between a quantitative trait and a marker locus, Behav Genet 2: 3-19.
Jacob HJ, Lindpaintner K, Lincoln SE, Kusumi K, Bunker RK, Mao Y-P, Ganten D, Dzau VJ, Lander ES (1991) Genetic mapping of a gene causing hypertension in the stroke-prone spontaneously hypertensive rat, Cell 67: 213-224.
Jansen RC (1993) Interval mapping of multiple quantitative trait loci, Genetics 135: 205-211.
Kong A, Wright F (1994) Asymptotic theory for gene mapping, Proc Natl Acad Sci USA 91: 9705-9709.
Lander ES, Botstein D (1986) Strategies for studying heterogeneous genetic traits in humans by using a linkage map of restriction fragment length polymorphisms, Proc Natl Acad Sci USA 83: 7353-7357.
Lander ES, Botstein D (1989) Mapping Mendelian factors underlying quantitative traits using RFLP linkage maps, Genetics 121: 185-199.
Lander ES, Schork NJ (1994) Genetic dissection of complex traits, Science 265: 2037-2048.
Nelson SF, McCusker JH, Sander MA, Kee Y, Modrich P, Brown PO (1993) Genomic mismatch scanning: a new approach to genetic linkage mapping, Nature Genet 4: 11-18.
Ott J (1991) Analysis of Human Genetic Linkage, Revised Edition, Johns Hopkins University Press, Baltimore.
Paterson AH, Damon S, Hewitt JD, Zamir D, Rabinowitch HD, Lincoln SE, Lander ES, Tanksley SD (1991) Mendelian factors underlying quantitative traits in tomato: comparison across species, generations, and environments, Genetics 127: 181-197.
Risch N (1990a,b,c) Linkage strategies for genetically complex traits I, II, III. The power of affected relative pairs, Am J Hum Genet 46: 222-228, 229-241, 242-253.
S.A.G.E. (1992) Statistical Analysis for Genetic Epidemiology, Release 2.1, computer program package available from the Department of Biometry and Genetics, Louisiana State Medical Center, New Orleans.
Siegmund D (1985) Sequential Analysis: Tests and Confidence Intervals, Springer-Verlag, New York.
Siegmund D (1988) Confidence sets in change-point problems, International Statistical Review 56: 31-48.
Zhang HP (1991) A study of change-point problems, Ph.D. thesis, Stanford University.
A COMPARATIVE SURVEY OF NON-ADAPTIVE POOLING DESIGNS

D.J. BALDING*, W.J. BRUNO†, E. KNILL‡, AND D.C. TORNEY†

Abstract. Pooling (or "group testing") designs for screening clone libraries for rare "positives" are described and compared. We focus on non-adaptive designs in which, in order both to facilitate automation and to minimize the total number of pools required in multiple screenings, all the pools are specified in advance of the experiments. The designs considered include deterministic designs, such as set-packing designs, the widely-used "row and column" designs and the more general "transversal" designs, as well as random designs such as "random incidence" and "random k-set" designs. A range of possible performance measures is considered, including the expected numbers of unresolved positive and negative clones, and the probability of a one-pass solution. We describe a flexible strategy in which the experimenter chooses a compromise between the random k-set and the set-packing designs. In general, the latter have superior performance while the former are nearly as efficient and are easier to construct.
1. Introduction. We consider the problem of screening a large collection, or "library", of cloned segments of DNA against a collection of probes. For each clone-probe pair, the clone is either "positive" or "negative" for the probe. The task is to efficiently determine which clones are positive for which probes. In many situations, relatively few clones are positive for any one probe. This occurs, for example, in the case of unique-sequence probes such as Sequence-Tagged Site (STS) markers [30]. Prior knowledge that positives are rare can be exploited by implementing a pooling strategy, which can be much more efficient than testing each clone with each probe. In the pooling strategies considered here, clones are combined in several ways to form a set of "pools". Each pool is then assayed successively with each probe. In the absence of experimental error, if the pool assay outcome is negative then it follows that all the clones in the pool are negative. If at least one clone is positive for the probe, the pool assay outcome is positive and additional assays are required to identify the positive clone(s). For examples of pooling schemes recently employed for library screening, see [1,6,7,17,20,28]. Strategies based on pooling probes, rather than clones, or a combination of clones and probes, are also possible, but seem not to have been implemented. The library screening problem is an instance of the general group testing problem. Group testing strategies are methods for isolating a few

* School of Mathematical Sciences, Queen Mary and Westfield College, University of London, Mile End Road, London E1 4NS, UK. Current address: Department of Applied Statistics, University of Reading, PO Box 240, Reading RG6 2FN, UK.
† Theoretical Biology and Biophysics Group, Mail Stop K-710, Los Alamos National Laboratory, Los Alamos, New Mexico 87545.
‡ Computer Research and Applications, Mail Stop B-265, Los Alamos National Laboratory, Los Alamos, New Mexico 87545.
positive¹ entities from a large collection of entities by asking queries of the form "does this set of entities contain a positive one?". An early published discussion of an application of group testing is due to Dorfman [9], who was motivated by the need to efficiently screen military personnel for syphilis. Other applications that have been discussed include screening for defective components [36], multi-access communications [39], efficient punch card file searching [26] and magnetic core memories [26]. Possible biomedical applications include clinical testing in laboratories which process many samples [19] and screening of collections of synthesized peptides for those with highest-affinity binding in the development of new drugs [34]. Group testing problems are closely related to problems in coding theory, and many results on optimal group testing designs have their origin in optimal coding problems [26]. Criteria for the efficiency of a pooling strategy vary with the experimental constraints imposed by the specific application. In early studies [9,10,23,33,39], the primary concern was with the high cost of individual tests, and the goal was to minimize the total number of tests. Because only one probe was being considered, the cost of constructing a pool did not need to be considered separately from the cost of a test. The strategies for pooling were fully adaptive, which means that the experimenter was able to consider the outcomes of the previous experiments before deciding on which experiment to perform next. Parallelization and automation were not considered. In library screening applications, the clones are tested against many probes. Efficiency gains can thus be obtained by using the same pools for each probe. In this case the number of pools requiring construction needs to be considered as well as the number of assays to be performed. In addition, it is usually maximally efficient in library screening to construct all the pools in parallel.
It may also be important to eliminate intermediate decisions so that automation of the screening process is facilitated. Because of these important advantages, non-adaptive pooling strategies are favored [3,4,6], although highly adaptive strategies have been used in a few cases [20,35]. A non-adaptive strategy requires specifying all the pools before any assay is performed, in which case pools need only be constructed once for multiple screenings. For a single probe, adaptive strategies generally require fewer assays than non-adaptive strategies, but in repeated screenings for distinct probes a fixed non-adaptive pooling design usually requires many fewer pools overall. In this survey we describe and compare non-adaptive pooling designs for clone library screening experiments. The remainder of the survey is organized as follows: Section 2 gives a brief survey of the literature on

¹ In other contexts, positive entities have also been referred to as "defective" (e.g. in the context of quality control) or "distinguished" (in general discussions of group testing).
adaptive and non-adaptive group testing and its applications. In Section 3, important terminology and concepts are defined and introduced. In Section 4, various efficiency criteria are defined and discussed. Section 5 gives an overview of the different kinds of pooling designs and introduces a small example for comparisons. The next two sections give details on the two major classes of pooling designs: deterministic in Section 6 and random in Section 7. A larger example is considered in Section 8.2.

2. A brief overview of group testing.

2.1. General group testing. Group testing problems vary according to a number of factors, such as
• the assumptions about the number of positive entities,
• the constraints on pool sizes, and
• the constraints on the testing strategy.
The assumptions about the number of positive entities can be combinatorial or probabilistic, and this choice determines which of two general approaches to studying group testing problems is used. In combinatorial approaches [3,10,12,23,24,26,29,39], it is usually assumed that there are exactly j, or at most j, positives. The most common probabilistic approach [4,6,9,27,36] involves assuming that each clone is positive for a probe with a fixed probability, independently of the other clones and probes. Constraints on the pool size depend on the efficiency of the test. Detection efficiency is mentioned by Dorfman [9] in the context of screening samples of blood and considered by Barillot et al. [4] for library screening. Testing strategies can be categorized as adaptive or non-adaptive. (In fact, these labels apply to two ends of a spectrum of possibilities.) The general adaptive strategy consists of constructing one pool at a time, testing it, and then deciding which pool to construct and test next on the basis of all the previous test results. Most theoretical studies of group testing involve finding the best adaptive strategy for a group testing problem.
Simple adaptive strategies based on partitioning the entities were introduced in the 1940s [9]. They arise in puzzles such as that of finding a gold coin among copper coins using a balance [38]. General adaptive strategies have been studied in detail: see the recent monograph of Du and Hwang [10] for a thorough survey of the combinatorial approach. Recent successes in the study of adaptive strategies for combinatorial group testing include the determination of essentially optimal strategies for the case where the positive entities must be a pair from a given set of pairs [8], and for the case where there is only one positive entity, but the tests are performed by an adversary who may lie a small, bounded number of times [29].

2.2. Non-adaptive group testing. Non-adaptive strategies for pooling first arose in a puzzle on finding a number in a range using cards with windows cut out [18,25]. They were subsequently considered for the problem of searching punch card files [26] and have since been studied in detail [11,13,14,15,16,24,32]. The application to screening problems has also been considered by a number of authors [3,4,6,12]. Non-adaptive strategies can be classified further into strictly non-adaptive, or "one-stage", strategies, in which further tests are never performed, and "trivial two-stage" strategies, in which a second stage may be employed, consisting only of tests of individual entities. One-stage strategies are usually considered from the combinatorial group testing perspective. In combinatorial, one-stage group testing, the problem most often addressed is to determine the minimum number of pools required to guarantee finding the positive entities. If the number of positive entities does not depend on the total number of entities, then one can equivalently determine the maximum number of entities which can be resolved with a fixed number of pools. In the case that the number of positive entities is assumed to be at most j, this problem was first posed by Kautz and Singleton [26], who called a pooling design which allowed the positives to be determined "j-separable". If, in addition, the event that there are more than j positive entities can be determined from the pool outcomes whenever it occurs, the design is called "j-disjunctive" [26], "zero-false-drop of order j" [26], or "j-cover free" [16]. Coding theorists call these designs "superimposed codes" [14,26]. Asymptotic bounds on the largest number of entities n that can be accommodated by v pools, given up to j positives, for these designs are given by Dyachov & Rykov [14]:

(2.1)
C₁ j² log(n)/log(j) ≤ v ≤ C₂ j² log(n).
A further criterion that has been used to evaluate combinatorial non-adaptive pooling designs is error tolerance. Designs which additionally require the detection of up to q errors were introduced in [15] and further discussed in [3]. A more realistic treatment of errors assumes that they occur randomly with a known probability distribution, as described further in the library screening application in Section 4. Trivial two-stage strategies have been considered by Dyachov et al. [15] and Knill [27]. If up to 2j entities are allowed to be tested in the second stage, then the number of pools required is at most C j log(n).

3. Preliminaries. In the non-adaptive library screening problem we are given a collection of clones C of which, for each screening, a subset 𝒥 is positive. Information about 𝒥 is obtained by testing a set of pools Q. We introduce notation for the cardinalities of these sets: n ≡ |C|, v ≡ |Q|, J ≡ |𝒥| and N ≡ n − J. These, and other, definitions are displayed in Table 3.1. A non-adaptive pooling design is fully defined by an incidence relationship between the clones and the pools, which can be specified by an n × v incidence matrix I, where I_{i,j} = 1 if clone i is in pool j, and I_{i,j} = 0 otherwise. Such a design can also be specified by a family F of subsets
TABLE 3.1
Definitions of notations used throughout the paper
Symbol   Definition
n        number of clones
v        number of pools
k        number of pools per clone
h        number of clones per pool
p        probability a clone is positive
c        coverage (= np)
N        number of negative clones
Ñ        number of unresolved negatives
J        number of positive clones
J̃        number of unresolved positives
of the set of pools Q. For each clone i, we write B_i for the set of pools containing the clone, so that F = {B_i : i ∈ C}. After the pools have been assayed, the set of positive pools, Q(𝒥), is given by Q(𝒥) = ∪_{i∈𝒥} B_i.
Broadly speaking, a good non-adaptive pooling design F has the property that it is usually possible to infer from Q(𝒥) a large amount of information about 𝒥. If, for some clone i, we have B_i \ Q(𝒥) ≠ ∅, then we can infer that i is negative (assuming no false negative results). Such clones are called resolved negatives. The remaining clones, that is the clones i satisfying B_i ⊆ Q(𝒥), are candidate positives. Among the candidate positive clones, those which are not in 𝒥 are called unresolved negatives. The number of unresolved negatives is denoted by Ñ. If a candidate positive clone i occurs in a pool j such that every clone in j other than i is a resolved negative, then i must be positive (assuming no errors) and is called a resolved positive. The remaining candidate positive clones are unresolved positives, and the number of these is denoted by J̃. A good pooling design usually leads to small values for both Ñ and J̃. This requirement can be made precise in several distinct ways, which we now proceed to discuss. A k-set is a set with k elements. A k-subset of a set U is a k-element subset of U. The set {1, ..., v} is denoted by [v].

4. Efficiency criteria for library screening. Balding & Torney [3] investigate optimal designs for library screening from a combinatorial perspective. They consider designs which allow all the positives to be resolved whenever J ≤ j, and also require that the event J > j be
distinguished from the pool outcomes whenever it occurs. In addition, these authors require the detection of up to q errors in the pool outcomes. Designs satisfying these conditions are called (j, q)-detectors. A (j, q)-detector is optimal if it maximizes the number of clones n for a given number of pools v. These authors obtain upper bounds on n as a function of q, j, and v, in the cases j = 1 and j = 2. The bounds are achieved in some cases by set-theoretic designs having certain regularity properties which are discussed further in Section 6.1. These regularity properties suggest heuristics for random designs which are found to have good properties, as discussed in Section 7.2. For large library screening projects, the definition of optimality based on (j, q)-detectors is usually too restrictive. The number of pools required to guarantee the detection of j positives is at least j² log(n)/(2 log(j)) for large n [13,15], whereas we will see that designs with satisfactory properties can be constructed with C j log(n) pools. In addition, probabilistic specifications of 𝒥 are more realistic for library screening. Following Barillot et al. [4], we will assume in the remainder of this survey that each clone is positive for a probe with probability p, independently of the other clones and probes. Consequently the number of positive clones has a binomial distribution. Cloning bias can yield a distribution with larger variance than the binomial, but this assumption should nevertheless be adequate for the purposes of design comparison. Typically, the first step in computing the performance of a pooling design is to condition on the value of J, so that the calculations presented here are readily modified to allow an arbitrary distribution for J, provided only that when J = j all j-tuples of clones are equally likely to be the positive ones. In the case of unique markers, we write c for the coverage of the library, so that c = np.
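The resolution rules of Section 3 (resolved negatives, candidate positives, resolved positives) are mechanical to apply given an incidence matrix and error-free pool outcomes; this minimal decoder is an illustrative sketch of ours, not software from the papers surveyed:

```python
import numpy as np

def decode(incidence, positive_pools):
    """incidence: n x v 0/1 matrix (clone i in pool j); positive_pools:
    boolean length-v vector of pool outcomes. Returns boolean vectors
    (resolved_negative, candidate_positive, resolved_positive) over clones,
    assuming no assay errors."""
    inc = np.asarray(incidence, dtype=bool)
    pos = np.asarray(positive_pools, dtype=bool)
    # A clone lying in at least one negative pool is a resolved negative.
    resolved_neg = (inc & ~pos).any(axis=1)
    candidate_pos = ~resolved_neg
    # A candidate positive is resolved if some pool containing it has all
    # of its other clones already resolved negative.
    resolved_pos = np.zeros(inc.shape[0], dtype=bool)
    for i in np.flatnonzero(candidate_pos):
        for j in np.flatnonzero(inc[i]):
            others = inc[:, j].copy()
            others[i] = False
            if resolved_neg[others].all():
                resolved_pos[i] = True
                break
    return resolved_neg, candidate_pos, resolved_pos
```

For example, with three clones in the pools {0,1}, {0,2}, {1,2} and only clone 0 positive, pool 2 tests negative, so clones 1 and 2 are resolved negatives and clone 0 is a resolved positive.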
For design comparison, we will assume that n and v are fixed and seek designs which optimize one of the following performance measures:
1. The probability that all clones are resolved, P[Ñ = J̃ = 0].
2. The probability that all negative clones are resolved, P[Ñ = 0].
3. The expected number of unresolved positive clones, E[J̃].
4. The expected number of unresolved negative clones, E[Ñ].
Dyachov [11] considered P[Ñ = J̃ = 0]. It is a natural performance measure for non-adaptive group testing since it directly addresses the desirability of avoiding a second stage. Instead of P[Ñ = J̃ = 0], it may be preferable to consider P[Ñ = 0], since the latter provides an upper bound on the former and is much easier to calculate. If there are no unresolved negative clones then unresolved positives are usually very rare, and hence P[Ñ = 0] provides a tight bound on P[Ñ = J̃ = 0]. Design comparison based on E[J̃] was introduced in [6]. It is a useful performance measure for one-stage designs when the experimenter wishes to resolve as many positive clones as possible in one pass, but does not insist on finding all positive clones. This might be the case, for
example, in complete digest libraries for which, in our experience, not all of the positive clones are required for each probe, at least initially. Several designs are compared in [4] using E[Ñ]. This performance measure is suitable for analysis of the trivial two-stage design in which all or most candidate positives are screened in the second stage to confirm their status. The expected number of candidate positives is c + E[Ñ], and hence the expected number of assays in the second stage is minimized when E[Ñ] is minimized. Even if resolved positives are not confirmed in the second stage, the size of the second stage is usually dominated by E[Ñ]. We will see in Sections 7 and 8 that the choice of efficiency criterion is important: the optimal parameter values can vary substantially according to the choice of criterion. Designs which maximize P[Ñ = J̃ = 0] or P[Ñ = 0] generally perform well when J is not too large compared with c, but perform poorly when J ≫ c. Furthermore, the deterioration of performance when J > c is sensitive to the design's parameters [6]. In many experimental protocols, if two designs are otherwise equivalent, the one with the fewer clones per pool may be preferred. Experimental considerations may even rule out designs for which this parameter, which we call h, is beyond a given bound. Because there seem to be no fixed rules here, we do not explicitly consider the number of clones per pool in our comparisons, but many of the designs described here are flexible in this regard. Finally we note that the efficiency criteria discussed here depend on the notion that the status of a clone can be unequivocally determined. Under realistic assumptions about experimental error, however, unequivocal resolution of clones may not be possible. Instead, clones can be assigned posterior probabilities of being positive, based on both the pool outcomes and the prior model for the distribution of positives.
Such probabilistic "decoding" is introduced and discussed in [6]. Natural efficiency criteria for designs in this situation would be based on the increase in information in going from prior to posterior joint distributions for the positive clones.

5. Overview of non-adaptive pooling designs. Pooling designs can be classified loosely into two categories: deterministic and random. Deterministic designs are specified explicitly, whereas random designs are not unique designs but methods for design construction which, in effect, specify a probability distribution over a set of designs. The performance of a particular realization of a random design may thus vary from the expected performance calculated below. It follows that random designs can be improved by obtaining several realizations, computing the performance of each, either directly or by simulation, and choosing the best realization. For large designs, however, the realized performance may vary only slightly from the expectation.

5.1. Overview of deterministic designs. The deterministic designs to be considered are
• set packings,
• hypercube designs, and
• transversal designs.

Set packings for pooling designs were introduced by Kautz & Singleton [26]. In these designs, each clone occurs in precisely k pools and no two clones appear together in more than t pools. By choosing k and t appropriately, designs with very good properties can be obtained. In many cases, a packing design is optimal with respect to at least some of the efficiency criteria. In general, reducing t improves the performance of the design, but increases the computational cost of constructing the packing. There exist no general methods for constructing maximum-size packings, but some particular examples are known [2,5,22].

Hypercube designs are simple designs based on a geometric layout of the clones. To obtain the simplest such designs, "row and column" designs, one arrays the library on a rectangular grid and constructs pools from the columns and rows of the grid. In practice most libraries are already arrayed on a number of rectangular plates, and the pools are formed from the plates as well as the combined columns and rows. More generally, the library can be arranged on a grid in any number of dimensions, and the pools are formed from the axis-parallel hyperplanes. Although hypercube designs are simple to construct (special plasticware is available for the purpose in most laboratories), their performance is far from optimal, and in practice additional pools need to be constructed to improve their performance.

Transversal designs are generalizations of hypercube designs which have been studied in connection with problems arising in coding theory. The general transversal design is obtained by partitioning the pools into d parts such that the pools in each part form a partition of the clones. In the case of row and column designs, the pools are partitioned into d = 2 parts, the "row" pools and the "column" pools.
Transversal designs are readily constructed: several general constructions are available based on finite geometry. In addition, transversal designs are easily converted from non-adaptive to adaptive strategies, by testing the parts of the pool partition in stages. Many of the combinatorially constructed transversal designs are relatively easy to implement using plasticware. Transversal designs can also be made to satisfy packing restrictions to guarantee certain performance requirements.

5.2. Overview of random designs. The random designs to be considered are
• random incidence designs,
• random k-set designs, and
• random subdesigns.

Random incidence designs are obtained by allocating clones to pools with fixed probability, independently for each clone-pool pair. The properties of random incidence designs are easy to calculate and they are useful for theoretical comparisons, but they generally make inefficient pooling designs and are rarely used in practice.

Random k-set designs are obtained by choosing k pools randomly and independently for each clone [6]. These designs are easy to construct and perform well: their performance is close to that of the best known design in all cases considered. The choice of k can be optimized for all n, v and c, and can also allow for error tolerance as well as constraints on h. The performance of these designs can be further improved by enforcing packing constraints.

Random subdesigns are obtained by starting with any pooling design on n' > n clones and choosing a random n-subset of the clones. The pools of the subdesign are the same as those of the original design, except that the non-selected clones are removed. This construction is useful in cases where a good design, a little larger than needed, is available, and the performance of such designs is close to that of the superdesign.

5.3. A small example. We illustrate the performance of the designs with a small example in Table 5.1. The number of clones and pools is 68 and 17, respectively, except for the hypercube and transversal designs, for which parameter values are restricted and we have chosen values as near to 68 and 17 as possible. This example is discussed in the context of particular designs in Sections 6 and 7. A larger example is discussed in Section 8.2.

TABLE 5.1
Comparison of design performance for a small example. The number J of positive clones is assumed to be binomially distributed with parameters n and p. The number of clones per pool, h, is fixed for deterministic designs, while expectations are shown for random designs. The expected number of positive clones, np, is approximately 1.3 when p = 0.02 and 2.7 when p = 0.04.

Type of Design                          v    n    h       p      P[N=0]   E[N]   E[J̄]
Packing (Figure 6.1)                    17   68   20      0.02   0.86     1.07   0.40
                                                          0.04   0.51     6.10   1.78
Row and Column (8 × 8)                  16   64   8       0.02   0.69     1.09   0.81
                                                          0.04   0.33     3.80   2.24
Transversal (degree 2 over GF(4))       16   64   16      0.02   0.80     1.15   0.57
                                                          0.04   0.47     5.19   2.08
Random Incidence (r = 0.40 and 0.35)    17   68   27, 24  0.02   0.51     4.24   0.62
                                                          0.04   0.19     12.0   2.13
Random k-sets (k = 4)                   17   68   16      0.02   0.69     1.39   0.54
                                                          0.04   0.34     5.86   1.92
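Before turning to the individual designs, the resolution rules that the performance measures above presuppose can be made concrete with a short sketch. The rules encoded below are our reading of the definitions implied by Section 4: a pool is positive iff it contains at least one positive clone; a negative clone is resolved iff it occurs in some negative pool; and a positive clone is resolved iff some positive pool containing it contains no other candidate positive. The function name and the tiny four-clone design are ours, for illustration only.

```python
def decode(pools_of_clone, positives):
    """Decode one-pass pool outcomes for a non-adaptive design.

    pools_of_clone: list of sets; pools_of_clone[i] = pools containing clone i.
    positives: set of indices of the truly positive clones.
    Returns (candidates, unresolved_negatives, resolved_positives).
    """
    # A pool is positive iff it contains at least one positive clone.
    positive_pools = set()
    for i in positives:
        positive_pools |= pools_of_clone[i]
    # A clone is a candidate positive iff none of its pools is negative.
    candidates = {i for i, B in enumerate(pools_of_clone) if B <= positive_pools}
    # Unresolved negatives (the quantity N): candidates that are in fact negative.
    unresolved_neg = candidates - positives
    # A positive clone is resolved iff some pool containing it holds no other
    # candidate, so that pool's positive outcome can only be due to this clone.
    resolved_pos = set()
    for i in positives:
        for pool in pools_of_clone[i]:
            if not any(j != i and pool in pools_of_clone[j] for j in candidates):
                resolved_pos.add(i)
                break
    return candidates, unresolved_neg, resolved_pos

# Four clones, six pools, two pools per clone (a toy design, not from the text).
design = [{0, 1}, {2, 3}, {4, 5}, {0, 2}]
print(decode(design, {0, 3}))  # all four clones resolved in a single pass
```

The two unresolved sets returned here are realizations of the quantities N and, by complement within the positives, J̄ whose expectations the comparisons above summarize.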
6. Deterministic designs

6.1. Set packings. A (v, k, t)-packing is a collection F of k-subsets of [v] such that no two distinct members of F have more than t elements in common. If the pools and clones are identified with the members of [v] and F, respectively, then the corresponding pooling design is such that each clone occurs in precisely k pools and any two clones coincide in at most t pools. Consider the design corresponding to a (v, jt+q+1, t)-packing. Each clone occurs in at least q+1 pools which are not in the union of the pools of any j other clones, and it follows that this design is a (j, q)-detector on v pools (see Section 4).

The size n of a (v, k, t)-packing is bounded by A_0, defined by the recursive formula A_{t+1} = 1 and

(6.1)    A_s = \left\lfloor A_{s+1} \, \frac{v-s}{k-s} \right\rfloor,    s = 0, 1, ..., t,

where \lfloor x \rfloor is the integer part of x [5]. The bound A_0 cannot be achieved in general, but is asymptotically correct for large v and constant k and t [31]. If every t-subset of [v] is contained in precisely A_t elements of the packing, then n = A_0 and the packing is called a t-design [5]. If, further, A_t = (v-t)/(k-t), then n = A_0 = \binom{v}{t+1} / \binom{k}{t+1}. In this case each (t+1)-subset of [v] is contained in precisely one element of the packing, which is called a Steiner system [5].

The construction of maximum-size (v, k, t)-packings for appropriate choices of k and t is difficult in general, and the maximum achievable size is generally not known. However, due to their importance in many other experimental design applications, explicit constructions of numerous t-designs and Steiner systems have been documented in the literature [2,5,22]. These are mainly limited to the case t < 3, but include many designs useful for actual clone libraries. The parameters of some specific Steiner systems of interest for pooling clone libraries are given in Table 6.1. These designs are either Steiner triple systems or derived from one of several constructions based on finite fields (see [5] for details). The incidence matrix of the first Steiner system in the table is displayed in Figure 6.1. Research is continuing into heuristic methods for computing approximate maximum-size packings. Random selection and rejection is sometimes a computationally-feasible approach [37].

TABLE 6.1
The expected number of unresolved negative clones, E[N], for some Steiner system pooling designs.

     n       v     k    t      h          J               E[N]
     68      17    5    2      20     2   3   4      0   3.58    11.35
    520      65    9    2      72     4   5   8      0   0.25    12.25
    738      82   10    2      90     4   6   9      0   0.32    11.15
   4368      65    5    2     336     2   3   5      0   0.56    10.64
  16275     126    6    2     775     2   5   8      0   0.73    12.96
  19500     625    5    1     156     4  20  30      0   0.86     6.47
  19551     343    3    1     171     2   5  11      0   0.751   11.47
  19710     730   28    2     756    13  32  38      0   0.912   11.21
  22140      82    4    2    1080     1   2   4      0   0.41    15.25
  32800    1025   33    2    1056    16  41  48      0   0.94    11.63
  33227     447    3    1     223     2   5  12      0   0.58    11.95

FIG. 6.1. Steiner system pooling design on n = 68 clones and v = 17 pools. A "•" in row i and column j indicates that clone j occurs in pool i. [The 17 × 68 incidence matrix of dots is not reproduced here.]

Consider a (v, k, t)-packing F whose size achieves the upper bound A_0 defined at (6.1). In particular, t-designs and Steiner systems achieve this bound. These designs have the property that every s-set, with s ≤ t, occurs in precisely A_s members of the packing, where A_s is also defined at (6.1). Using this property, E[N] can be computed by using the inclusion-exclusion principle twice. Consider first the k pools in which an arbitrary negative clone occurs. The number μ_i of clones that occur in none of a given i-subset of these k pools is given by
(6.2)    \mu_i = \sum_{s=0}^{\min(i,t)} \binom{i}{s} (-1)^s (A_s - 1),    0 \le i \le k.
Given j positive clones, the probability, T_m^{(j)}, that a specific m of the k pools are positive is then

(6.3)    T_m^{(j)} = \sum_{i=0}^{m} \binom{m}{i} (-1)^i \binom{\mu_i}{j} \Big/ \binom{n-1}{j}.

The value of E[N] is then obtained by convolving T_k^{(j)} with the assumed binomial distribution for J:

(6.4)    E[N] = \sum_{j=0}^{n} (n-j) \binom{n}{j} p^j (1-p)^{n-j} \, T_k^{(j)}.

Alternatively, we can use inclusion-exclusion to determine the probability that all k pools in which a given clone occurs are positive and hence obtain

(6.5)    E[N] = n(1-p) \sum_{i=0}^{k} \binom{k}{i} (-1)^i (1-p)^{w_i},

in which w_i = \mu_0 - \mu_i is the number of other clones in any of i specific pools in which a given clone occurs.
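As a concrete check on (6.2) and (6.3), the following sketch (our code, not the authors'; the function name is ours) evaluates E_j[N] = (n-j) T_k^{(j)} for the Steiner system of Figure 6.1 (v = 17, k = 5, t = 2, giving n = 68) and reproduces the first row of Table 6.1.

```python
from math import comb

def packing_Ej(v, k, t, j):
    """E_j[N] for a packing design attaining the bound A_0 of (6.1),
    computed via (6.2) and (6.3) with m = k."""
    # A_s from the recursion (6.1): A_{t+1} = 1, A_s = floor(A_{s+1}(v-s)/(k-s)).
    A = [0] * (t + 2)
    A[t + 1] = 1
    for s in range(t, -1, -1):
        A[s] = A[s + 1] * (v - s) // (k - s)
    n = A[0]
    # (6.2): mu_i = clones occurring in none of i given pools of a negative clone.
    mu = [sum(comb(i, s) * (-1) ** s * (A[s] - 1) for s in range(min(i, t) + 1))
          for i in range(k + 1)]
    # (6.3) with m = k: probability that all k pools of the clone are positive.
    T = sum(comb(k, i) * (-1) ** i * comb(mu[i], j)
            for i in range(k + 1)) / comb(n - 1, j)
    return (n - j) * T

# Steiner system of Figure 6.1: n = 68 clones, v = 17 pools.
for j in (2, 3, 4):
    print(j, round(packing_Ej(17, 5, 2, j), 2))   # 0.0, 3.58, 11.35 (Table 6.1)
```

The j = 2 value is exactly zero, as guaranteed by the (j, q)-detector property (2t + 1 ≤ k here).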
6.2. Hypercube designs. A commonly used pooling scheme consists of pools made from the rows and columns of a 96-well microtitre plate. Twenty pools are constructed, one corresponding to each of the 8 rows and 12 columns. If only one clone on the plate is positive, then one row pool and one column pool will be positive and thus the location of the positive clone is uniquely determined. This method has proven to be sufficiently practical that plasticware for implementing this scheme using a 96-tip pipette is widely available.

Row and column pooling schemes provide guaranteed one-pass solutions only when there are either zero or one positive clones on the plate. For example, with two positive clones, there will usually be two positive rows and two positive columns, and therefore four candidate positive clones which must be screened in a second stage. This scheme also performs poorly in the presence of errors. If a column pool fails, for example, there are 12 candidate clones in the row that might be positive.

For comparison with the Steiner system design of Figure 6.1, consider an eight-by-eight row and column scheme on n = 64 clones and v = 16 pools. The number of clones in each pool is eight, fewer than the 20 clones per pool for the Steiner system. As remarked above, the detection of one positive is guaranteed, but with no error-detection. For J > 1, however, with high probability no positive clones are resolved and hence E[J̄] ≈ J. Row and column designs are thus inappropriate if it is desired to resolve many positives in one pass, without a second stage. Assuming no errors, the values of E[N] when J = 2 and J = 3 are, respectively, 1.5 and 4.0. Some values of the performance measures in the binomial case are given in Table 5.1. For large J, the row and column design performs better than the other designs, but all perform poorly.
When J = 6, for example, an average of 25% of the negative clones are left unresolved by the row and column design, compared with 49% for the Steiner system. The relative advantage of the row and column design for large J can be ascribed to its having fewer clones per pool than the other designs.

Row and column designs can be generalized by assigning the clones to the points of a finite d-dimensional integer lattice (for convenience, we consider here only the symmetric lattice for which each coordinate takes values in [l] [4]). A pool contains all clones having a particular value of a coordinate. The number of clones is n = l^d, there are v = dl pools, and the number of clones in each pool is h = l^{d-1} (with d and l both integers greater than 1). When d = 2, the hypercube design is a square row and column design. For d-dimensional hypercube designs we have P[N=J̄=0] = P[J ≤ 1], and given j positive clones

    E_j[N] = (n-j) \sum_{i=0}^{d} \binom{d}{i} (-1)^i \binom{\mu_i}{j} \Big/ \binom{n-1}{j},

where μ_0 = n-1 and μ_i = (l-1)^i l^{d-i}, for i ≥ 1, is the number of clones occurring in none of a given i-subset of a clone's d pools. In general, hypercube designs perform poorly according to the P[N=J̄=0] criterion, but they can perform well in terms of E[N]: a second stage will almost always be required but it will often be small. Disadvantages of hypercube designs include sensitivity to false negatives and inflexibility: for a given n there is usually only one suitable hypercube design and the experimenter cannot adjust the parameters of the design to modify the performance. To improve the properties of hypercube designs, Barillot et al. [4] proposed constructing additional hypercube configurations "as different as possible from the previous ones", using appropriate linear transformations of the coordinates. The resulting design is a special case of what we call a transversal design.

6.3. Transversal designs. If a design has the property that the pools consist of several partitions of the clones, we call it a transversal design. More precisely, a general transversal design consists of a partition of the pools into k parts, P_1, ..., P_k, such that the pools in each part form a partition of the clones. Consequently B_i, the set of pools containing clone i, satisfies |B_i ∩ P_j| = 1 for 1 ≤ j ≤ k, and hence each clone appears in exactly k pools. A subset B of the pools which satisfies |B ∩ P_j| ≤ 1 for 1 ≤ j ≤ k is called a transversal subset.

Transversal designs are frequently used in practice, particularly the hypercube designs. There are several reasons for this popularity. First, it is usually possible to find a suitable transversal design for a given n and c. Second, the performance of transversal designs can be good, usually comparable to that of random k-sets designs. Third, transversal designs can be used in both a non-adaptive and an adaptive mode. In the adaptive mode, the parts P_1, ..., P_k of the pool partition are tested in stages.
This adaptive strategy for pooling can be classified as a hierarchical interleaved strategy and has been implemented for several libraries [28]. Fourth, the implementation of transversal pooling schemes is often simplified by the ability to choose the first few sets of pools to be aligned with the layout of the library. For example, one can choose each plate to be a single pool for P_1. Such pools can be quickly constructed using standard plasticware. Finally, if care is taken in the choice of the partitions, decoding is simplified and can sometimes be achieved manually.

We briefly describe four approaches to constructing transversal designs for pooling. The first is to use the layout of the library to obtain several partitions which are easy to construct. Sometimes the library is duplicated and the copy arrayed differently to obtain sufficiently many partitions. The
rearraying needs to be done carefully to avoid large overlaps between the pools of different clones. This has been done for the CEPH MegaYAC library [7] at Los Alamos National Laboratory, using rearrangements of the plates.

The second approach is closely related to the first but employs linear algebra constructs to attain good packing properties. The clones are identified with vectors u_i in the finite vector space V = GF(q)^d, where q is a power of a prime and GF(q) is the finite field with q elements. Each P_j consists of pools P_{j,x} where x ∈ GF(q). For each j, a vector v_j is chosen and P_{j,x} consists of the clones i for which u_i · v_j = x. Here · denotes the inner product in V. Each v_j can be viewed as a linear rearrangement of the hypercube, as in [4]. The choice of v_j is crucial for obtaining a good design. As discussed by Barillot et al. [4], d-wise linear independence of the vectors v_j is useful. Suppose one can arrange that every possible choice of d vectors v_j is independent. This implies that no two distinct clones i_1 and i_2 can coincide in d or more pools. To see this, suppose without loss of generality that i_1 and i_2 are in the same pool in parts P_1, ..., P_r. This means that u_{i_1} · v_j = u_{i_2} · v_j for 1 ≤ j ≤ r. However, the dimension of the affine subspace of solutions w to u_{i_1} · v_j = w · v_j (1 ≤ j ≤ r) is max(0, d-r), by the independence assumption. This implies that if r ≥ d then u_{i_2} = u_{i_1}. The problem of finding the maximum number, l, of independent vectors in V has been studied in the context of finite projective geometry. When q is odd, l ≤ max(q+1, d+1); otherwise l ≤ max(q+2, d+1) [21]. There are simple constructions based on the Vandermonde matrix which can usually achieve these bounds.

The third approach to constructing transversal designs uses the observation that the set B_i of pools containing clone i can be considered as the graph of a function. More formally, relabel the pools such that pool (j, m) is the mth pool of P_j.
If i belongs to the k_j-th pool in part P_j, then the pools of i are (labeled by) the pairs (j, k_j), which is the graph of the function j ↦ k_j. The pool design is optimized by ensuring that the functions overlap as little as possible. The simplest method for getting good pool designs is to choose graphs corresponding to sets of polynomials over GF(q) of bounded degree with domain restricted to a subset of GF(q). It can be shown that this construction is very closely related to the Vandermonde construction referred to in the previous paragraph.

The fourth approach comes from coding theory. It is based on associating transversal sets, or graphs of functions, with codewords over a general alphabet. In particular, a code word of length n over the q-letter alphabet directly corresponds to a function from [n] to [q] and hence to a graph of a function. Good error-correcting codes satisfy the property that the overlaps of the graphs of distinct codewords are small, which is a desirable property for pooling designs.

For the regular transversal designs (also known as orthogonal arrays)
obtained using the second or third approach, E[N] can be computed using the methods discussed for t-designs. These designs have d disjoint sets of pools, each with l elements, and satisfy that for some t < d every transversal t-set is covered exactly once. This implies that for i ≤ t, the number of clones which occur in each member of a transversal i-set is given by A_i = l^{t-i}. With A_i = l^{t-i} and k = d, formulas (6.2), (6.3) and (6.4) also hold for the regular transversal designs.
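This identification is easy to check numerically. The sketch below (our code) evaluates (6.2)-(6.4) with A_i = l^{t-i} and k = d for the transversal design of Table 5.1, which we read as the graphs of polynomials of degree at most 2 over GF(4) evaluated at d = 4 points: l = 4, t = 3, so n = l^t = 64 clones and v = dl = 16 pools. It reproduces the Table 5.1 entry E[N] = 1.15 at p = 0.02.

```python
from math import comb

def transversal_EN(l, d, t, p):
    """E[N] for a regular transversal design (orthogonal array) with
    A_i = l**(t-i) and k = d, evaluated via (6.2)-(6.4)."""
    n = l ** t                                  # number of clones (= A_0)
    A = [l ** (t - s) for s in range(t + 1)]
    # (6.2): mu_i = clones occurring in none of i given pools of a negative clone.
    mu = [sum(comb(i, s) * (-1) ** s * (A[s] - 1) for s in range(min(i, t) + 1))
          for i in range(d + 1)]
    EN = 0.0
    for j in range(n + 1):
        w = comb(n, j) * p ** j * (1 - p) ** (n - j)   # binomial weight P[J = j]
        if w < 1e-14:
            continue                                   # negligible tail
        # (6.3) with m = k = d.
        T = sum(comb(d, i) * (-1) ** i * comb(mu[i], j)
                for i in range(d + 1)) / comb(n - 1, j)
        EN += w * (n - j) * T                          # (6.4)
    return EN

print(round(transversal_EN(4, 4, 3, 0.02), 2))   # 1.15, matching Table 5.1
```

Since t = 3 and k = 4, a single positive clone is always resolved (the j ≤ 1 detector guarantee), and indeed the j = 1 term contributes nothing to E[N].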
7. Random designs

7.1. Random incidence designs. A simple random pooling design consists of fixing a probability r and, given any clone and any pool, assigning the clone to the pool with incidence probability r, independently of the other clones and pools. The value of r can be chosen to optimize a selected design performance measure. Suppose that there are j positive clones. Then the number of negative pools is binomial with parameters v and (1-r)^j. The probability of resolving all negative clones and the expected number of unresolved negative clones are therefore

(7.1)    P_j[N=0] = \sum_{m=0}^{v} \binom{v}{m} (1-r)^{jm} \bigl(1-(1-r)^j\bigr)^{v-m} \bigl(1-(1-r)^m\bigr)^{n-j}

and

(7.2)    E_j[N] = (n-j) \bigl(1 - r(1-r)^j\bigr)^v.
The value of E_j[N] is minimized when r = 1/(j+1). The value of E[J̄] is computed using the same approach given in [6] for random k-set designs. Equations (7.1) and (7.2) can be used to compute P[N=0] and E[N] by convolving with the binomial distribution of positive clones. In the example of Figure 6.1, n = 68, v = 17 and p = 0.02. For a random incidence design with these parameters, the maximum value of P[N=0] is 0.55, which occurs at r = 0.53, while the minimum value 3.6 for E[N] is attained at r = 0.31. At an intermediate value r = 0.40, we have P[N=0] = 0.51 and E[N] = 4.24. When p = 0.04, P[N=0] and E[N] attain optimum values 0.23 and 9.7 at r = 0.50 and r = 0.24, respectively. At r = 0.35, we have P[N=0] = 0.19 and E[N] = 12.0. Random incidence designs therefore perform poorly in this case in comparison with the design of Figure 6.1.

If we require that E[N] ≤ a for a fixed number of positives, j, then from (7.2) with r = 1/(j+1) we obtain the bound
(7.3)    v \ge \frac{\log(a) - \log(n)}{\log\bigl(1 - j^j (j+1)^{-j-1}\bigr)},

which, for j large, is close to

(7.4)    v \ge (j+1) \, e \, \bigl(\log(n) - \log(a)\bigr).
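The numerical values quoted above are straightforward to reproduce. The following sketch (our code; the function name is ours) evaluates (7.1) and (7.2) and convolves them with the Binomial(n, p) distribution of J, recovering P[N=0] = 0.51 and E[N] = 4.24 at r = 0.40, p = 0.02.

```python
from math import comb

def rand_incidence(v, n, r, p):
    """P[N=0] and E[N] for a random incidence design, via (7.1) and (7.2)
    convolved with the Binomial(n, p) distribution of J."""
    PN0 = EN = 0.0
    for j in range(n + 1):
        w = comb(n, j) * p ** j * (1 - p) ** (n - j)   # binomial weight P[J = j]
        if w < 1e-14:
            continue                                   # negligible tail
        q = (1 - r) ** j                               # P[a given pool is negative]
        # (7.1): condition on the number m of negative pools.
        Pj = sum(comb(v, m) * q ** m * (1 - q) ** (v - m)
                 * (1 - (1 - r) ** m) ** (n - j) for m in range(v + 1))
        PN0 += w * Pj
        EN += w * (n - j) * (1 - r * q) ** v           # (7.2)
    return PN0, EN

P, E = rand_incidence(17, 68, 0.40, 0.02)
print(round(P, 2), round(E, 2))   # 0.51 4.24
```

Scanning r over (0, 1) with this routine also locates the optima quoted in the text, illustrating how strongly the best r depends on the criterion chosen.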
In general, random incidence designs are much less efficient than alternative designs (including random k-sets, discussed below) and have no guaranteed error detection. Because of the computational tractability of their properties, however, random incidence designs have theoretical uses in, for example, establishing a bound on the size of (j, q)-detectors using the "probabilistic method" [13,15].
7.2. Random k-set designs. The known optimal (j, q)-detectors have the property that every clone occurs in a fixed number of pools. This observation suggests that designs superior to random incidence designs can be obtained by restricting the randomization to satisfy this constraint. One thus obtains a "random k-sets" design: the pools in which a clone occurs are selected uniformly at random among all possible choices of precisely k pools, independently of the other clones, with k chosen to optimize a selected performance measure [6]. For several measures it can be shown that random k-sets perform better than random incidence designs or random transversal designs. For example, the bounds on the size of (j, q)-detectors computed in [15] are better than the bounds obtained using the other random designs. It can also be shown that E[N] is strictly smaller for random k-sets than for the corresponding random incidence designs.

Given exactly j positive clones, the probability, K_m^{(j)}, that m specified pools are precisely the negative pools, 0 ≤ m ≤ v-k, can be obtained using the inclusion-exclusion principle [6]:

(7.5)    K_m^{(j)} = \sum_{i=0}^{v-m} (-1)^i \binom{v-m}{i} \left[ \binom{v-m-i}{k} \Big/ \binom{v}{k} \right]^j.

Alternatively, K_m^{(j)} can be obtained from the recursive formula

(7.6)    K_m^{(j)} = \sum_{i=m}^{v} \binom{v-m}{i-m} \binom{v-i}{k-i+m} \binom{v}{k}^{-1} K_i^{(j-1)},

in which K_i^{(0)} = 1 if i = v, otherwise K_i^{(0)} = 0. Therefore, given J = j,

(7.7)    P_j[N=0] = \sum_{m=0}^{v} \binom{v}{m} K_m^{(j)} \left[ 1 - \binom{v-m}{k} \Big/ \binom{v}{k} \right]^{n-j}.

Similarly,

(7.8)    E_j[N] = (n-j) L_k^{(j)},

in which L_m^{(j)} denotes the probability that m specified pools are all positive, 0 ≤ m ≤ v, so that

(7.9)    L_m^{(j)} = \sum_{i=0}^{m} (-1)^i \binom{m}{i} \left[ \binom{v-i}{k} \Big/ \binom{v}{k} \right]^j
and also

(7.10)    L_m^{(j)} = \sum_{i=0}^{m} \binom{m}{i} \binom{v-m}{k-i} \binom{v}{k}^{-1} L_{m-i}^{(j-1)},
in which L_0^{(j)} = 1 and L_m^{(0)} = 0 for m ≥ 1. The calculation of E[J̄] for random k-sets designs was described in [6].

Consider again the example of Figure 6.1. For v = 17, n = 68 and p = 0.02, the maximum value of P[N=0] is 0.70, which is attained when k = 5. The minimum value of E[N] is 1.37, which occurs at k = 3. At the intermediate value k = 4 we obtain P[N=0] = 0.69 and E[N] = 1.39. When p = 0.04 and k = 4, we have P[N=0] = 0.34 and E[N] = 5.86 (with an optimal value for P[N=0] of 0.35 at k = 5, and for E[N] of 5.08 at k = 3). This small example illustrates properties that seem to hold generally for random k-sets designs:
• the optimum value of k varies markedly according to the performance criterion selected;
• for a given performance measure, the optimum value of k varies slowly with assumptions about J; and
• the values of the performance measures vary slowly with k in the vicinity of the optimum.

A random k-set design is a (v, k, k)-packing (defined in Section 6.1). By rejecting sets which share more than t elements with another set in the design, a (v, k, t)-packing is obtained. It is always easy and sensible to reject any k-set which already occurs in the design, hence improving the (v, k, k)-packing to a (v, k, k-1)-packing (note that the probability of having to reject a k-set is usually very small). In general, reducing the value of t improves the performance of a design, but increases the difficulty of constructing a sufficiently large design.

7.3. Random subdesigns. Suppose a pooling design with n' clones is required and an excellent design with n > n' clones is already available. A simple algorithm is to assign the n' clones to a randomly chosen n' of the n sets. If n' is much smaller than n, then it is usually a good idea to consider alternative designs. However, if n' is comparable with n, a random subdesign may be useful and the performance of the subdesign will be closely related to that of the original design.
Given j positives in the original design, the expected number of unresolved negatives for the random subdesign is (n'-j) E_j[N]/(n-j), with E_j[N] the expected number of unresolved negatives in the original design. The probability of a one-pass solution and the probability of no unresolved negatives given j positives are each at least as high for the random subdesign as for the original design.

8. Examples

8.1. A small example: n = 68. Of the five designs compared for a small example in Table 5.1, the packing design (the Steiner system of
Figure 6.1) is best in all but one of the performance measures displayed. The random 4-sets design also performs well. Certainly, the random 4-sets design is superior to random incidence, and random k-sets can beat the row and column design provided k = 3 is used when p = 0.04. We will see in Section 8.2 that the random k-sets design fares well in quantitative comparisons for a large design, and their ease of construction may make them (or modifications of them discussed below) the most suitable choice for many library screening projects.

8.2. A larger example: n = 1298. A 2.5-cover library of 1,298 clones was discussed in [6]. A 3-dimensional cube design on 1,331 clones can be constructed using 33 pools. The value of P[N=0] for this design is 0.29 while E[N] is 22. Comparisons with random designs are displayed in Table 8.1. For the random incidence design, the optimum values of 0.32 and 44 for P[N=0] and E[N] are attained at r = 0.46 and r = 0.20, respectively, highlighting the substantial effect of the choice of efficiency criterion. Table 8.1 describes the properties of an arbitrary compromise at r = 0.30. For the random k-sets, P[N=0] and E[N] take optimal values 0.48 and 15 at k = 9 and k = 4, respectively. An arbitrary compromise, k = 6, is described in Table 8.1. The random 6-sets design performs best among the three considered in this case, although the cubic design has fewer clones per pool.

An additional advantage enjoyed by the random designs is flexibility: any value of v can be specified. In particular, the value v = 47 is convenient because two sets of pools can be accommodated on a single 96-well microtiter plate, allowing for two controls. A (47,4,2)-packing design on 1,298 clones was implemented because it is a good compromise for effective determination of positives for small j while still giving information in the case where j ≥ 6 [6]. It also gives a small value for the number of clones per pool, which can make screening more reliable.
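As a numerical check on the random k-sets formulas of Section 7.2, the sketch below (our code; the function name is ours) evaluates (7.5), (7.7), (7.8) and (7.9) and convolves with the binomial distribution of J. For the unmodified random 4-sets design of the small example (v = 17, n = 68, k = 4, p = 0.02) it reproduces the values quoted there and in Table 5.1.

```python
from math import comb

def ksets_perf(v, n, k, p):
    """P[N=0] and E[N] for a random k-sets design, using (7.5), (7.7),
    (7.8) and (7.9), convolved with the Binomial(n, p) law of J."""
    C = comb(v, k)

    def K(m, j):   # (7.5): m specified pools are exactly the negative pools
        return sum((-1) ** i * comb(v - m, i) * (comb(v - m - i, k) / C) ** j
                   for i in range(v - m + 1))

    def L(m, j):   # (7.9): m specified pools are all positive
        return sum((-1) ** i * comb(m, i) * (comb(v - i, k) / C) ** j
                   for i in range(m + 1))

    PN0 = EN = 0.0
    for j in range(n + 1):
        w = comb(n, j) * p ** j * (1 - p) ** (n - j)   # binomial weight P[J = j]
        if w < 1e-14:
            continue                                   # negligible tail
        # (7.7): condition on which m pools are the negative ones.
        Pj = sum(comb(v, m) * K(m, j) * (1 - comb(v - m, k) / C) ** (n - j)
                 for m in range(v + 1))
        PN0 += w * Pj
        EN += w * (n - j) * L(k, j)                    # (7.8)
    return PN0, EN

P, E = ksets_perf(17, 68, 4, 0.02)
print(round(P, 2), round(E, 2))   # 0.69 1.39
```

Scanning k with this routine recovers the optima reported in Section 7.2 (k = 5 for P[N=0], k = 3 for E[N]), illustrating again how the criterion drives the choice of design parameter.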
This design is similar to a random 4-sets design except that it was modified so that the number of clones per pool was nearly constant and that a packing constraint was enforced: no two clones occur together in more than two pools. This design has P[N=O] = 0.54 and E[N] = 4.0, compared with P[N=O] = 0.48 and E[N] = 4.7 for the random 4-sets design. The computational cost of using additional heuristics such as packing constraints thus seems worthwhile. Further comparisons are displayed in Table 8.1. A random 7-sets design is superior to the random 4-sets in terms of both P[N=O] and E[N], at the cost of increasing the number of clones per pool. Further improvement is obtained by employing a (47,7 ,3)-packing. 9. Synopsis. We have described a range of pooling designs and assessed their performance using several criteria. Since we are motivated primarily by the application to efficient screening of large clone libraries, we have restricted attention to non-adaptive designs, which are most useful in that application. We have attempted to be thorough within this cate-
152
D.J. BALDING, W.J. BRUNO, E. KNILL, AND D.C. TORNEY TABLE 8.1
Comparison of design performance for a larger example.
Type of Design Random Incidence (r 0.3) Cube (l 11) Random 6-sets Transversal (G F( 11)) Random 4-sets (47,4,2)-packing Random 7-sets (47,7,3)-packing
=
=
v
33 33 33 44 47 47 47 47
n 1,331 1,331 1,331 1,331 1,298 1,298 1,298 1,298
c
2.5 2.5 2.5 2.5 2.5 2.5 2.5 2.5
h P[N=O] E[N] E[J] 61 1.90 0.24 399 121 0.29 20 2.28 242 0.44 18 1.60 121 0.51 6.2 1.54 110 0.48 4.7 1.14 110 0.54 4.0 1.03 193 0.65 4.2 0.77 193 0.69 4.0 0.74
gory, discussing both random and deterministic designs, including all that appear to have been employed in practice and others that may prove useful or are of theoretical interest. The performance criteria considered include the expected numbers of unresolved positive and unresolved negative clones, and the probability of resolving all of the negative clones. The discussion has been illustrated by two specific examples, one small and one of intermediate size. We believe that substantial efficiency gains are often possible over designs currently implemented. In our view, random k-set designs, with some packing constraints imposed, are likely to provide the best designs for most library screening problems, offering flexibility, efficiency and ease of construction. Transversal designs can also have good properties and may be preferred in some cases. The goals of, and constraints on, library screening vary from one project to the next, so that no single approach can be asserted to be globally optimal.

Although experimental errors are important in practice, to simplify the design comparison we have ignored the possibility of error in much of the discussion. In practice, confirmatory screenings are often employed to weed out false positives, and a small rate of false negatives may not pose a major difficulty. However, the extension of this work to include a comprehensive model for errors is an important project for further research.

Acknowledgements. DCT is grateful to the UK SERC for Visiting Fellow research grant GR/J05880, and thanks the School of Mathematical Sciences at Queen Mary & Westfield College for its hospitality. WJB acknowledges a US Department of Energy Distinguished Human Genome Postdoctoral Fellowship. This work was performed under the auspices of the US Department of Energy, and was funded both through the Center for Human Genome Studies at Los Alamos and by a LANL Laboratory-Directed Research and Development grant. DJB acknowledges grant GR/F 98727 from the UK Science and Engineering Research Council.

A COMPARATIVE SURVEY OF NON-ADAPTIVE POOLING DESIGNS
D.J. BALDING, W.J. BRUNO, E. KNILL, AND D.C. TORNEY
PARSING OF GENOMIC GRAFFITI
CLARK TIBBETTS*, JAMES GOLDEN, III*, AND DEBORAH TORGERSEN*
1. Introduction

1.1. DNA sequences and the Human Genome Project (HGP). A focal point of modern biology is the investigation of a wide variety of phenomena at the level of molecular genetics. The nucleotide sequences of deoxyribonucleic acid (DNA) and ribonucleic acid (RNA) define the ultimate resolution of this reductionist approach to understanding the determinants of heritable traits. The structure and function of genes, their composite genomic organization, and their regulated expression have been studied in systems representing every class of organism. Many human diseases or pathogenic syndromes can be directly attributed to inherited defects in either the regulated expression or the quality of the products of specific genes. Genetic determinants of susceptibility to infectious agents or environmental hazards are amply documented. Mapping and sequencing of the DNA molecules encoding human genes have provided powerful technology for pharmaceutical bioengineering and forensic investigations.

From an alternative perspective, we may anticipate that voluminous archives of singular DNA sequences alone will not suffice to define and understand the functional determinants of genome organization, allelic diversity and evolutionary plasticity of living organisms. New insights will accumulate pertaining to human evolutionary origins and the relationships of human biology to models based on other mammals. Investigators of population genetics and epidemiology now exploit the technology of molecular genetics to probe variation within the human gene pool more powerfully at the level of DNA sequences. Governmental and private sector agencies are supporting characterization of the human genome, and the genomes of model organisms, at the level of coherent expanses of chromosomal DNA sequences.
Joint working groups of the National Institutes of Health and the Department of Energy have examined the scope of the Human Genome Project (HGP) and established scientific goals for the first five-year period (NIH & DOE, 1990). This direction is complemented by the planning and resources of international consortia such as HUGO. The Human Genome Project not only sets a groundwork for the molecular biology and genetics of the future, it also foreshadows a larger if more diffuse initiative that has been labeled by some as the Human Genome Diversity Project.

The immediate goals of the HGP include extensive physical and genetic mapping of human and model organism genomes. As the mapping

* Vanderbilt University School of Medicine, Vanderbilt University School of Engineering, A 5217 Medical Center North, Nashville, TN 37232-2363.
effort approaches 100 kilobasepair resolution, landmarks are in place to support large scale genomic DNA sequencing. The informatic challenges of analyzing these mapping and sequencing results are an area of longer term emphasis. Further support is set aside for studies of the ethical, legal and social implications of the HGP. The original objectives and 15-year timeline for the HGP have recently been reviewed and updated (Collins and Galas, 1993). The requisite mapping objectives are approaching closure, well in advance of the technology required to sustain timely completion of the massive goals for DNA sequence determination (Cox et al., 1994). Three aspects of DNA sequencing are particularly challenging for the HGP:

Sequence quantity - The human genome, partitioned among its 23 chromosome pairs, represents three thousand million basepairs of A, G, C, T sequence. Including genomes of model organisms, an average of three hundred million finished basepairs of sequence per year must be sustained over the next decade. Only a few production sequencing centers have reached levels of 1 to 10 finished megabasepairs per year. The HGP sequencing objectives cannot be met within budget constraints by simply multiplying the number of centers applying contemporary technology.

Sequence quality - Experimental determination of DNA sequences is more prone to error than the biological processes of DNA replication or gene transcription. The shotgun cloning-sequencing approach exploits high redundancy to assemble sequences of cosmid inserts (about 35,000 to 40,000 basepairs) from the component chunks of 800 to 1600 individual sequencing ladders. Multiply aligned sequences present discrepancies from ladder to ladder which require objective reconciliation. Consensus by plurality alone seems insufficient to reduce remaining ambiguities to a tolerably low level. Skilled human alignment proofreaders represent a costly and inefficient approach to sequencing genomic targets.

Sequence cost - The target for the cost of finished DNA sequence is 50 cents per basepair. If the genomes of human and model organisms together represent 10 billion basepairs, then this cost would average $333 million per year over 15 years. Combined NIH and DOE support for the HGP has yet to exceed $200 million per year. International agencies also make substantial contributions to the human genome initiative, though not all of these global resources are for sequencing-related expenditures. The 50 cent per basepair target is an upper limit for the cost of large scale genomic sequencing. Accurate cost accounting for genome sequencing is difficult, involving the research infrastructure, administration, indirect costs, personnel, equipment, supplies, and communication costs. Conventional manual sequencing methods with radioactive labels approach or exceed 100 times this maximum cost target. Today's most efficient and productive large scale sequencing laboratories strive to meet $1 to $3 per finished basepair. Cost considerations fuel ongoing debates over shotgun vs. directed sequencing
strategies, over the merits of various degrees of accuracy, over the necessary level of annotation for sequences to be reported, and over the importance of complete sequencing through putatively less important, non-expressed or highly variable regions of the genome. Of greater concern is that the marginally cost-effective contemporary sequencing methods do not yet realize the throughput rates required to meet the long term sequencing targets of the HGP. Simply scaling up the numbers of groups committed to large scale sequencing with contemporary technology is not a practical solution within the cost constraints of the HGP. These cost and throughput considerations contribute to the high priority given to continuing development of automated DNA sequencing technology. The HGP also recognizes its dependence on interdisciplinary technology transfer to reach its cost and performance goals for DNA sequencing. There is little doubt that diverse technology developed for automation of DNA sequencing will transfer back to the medical, scientific and engineering communities, well into the post-HGP era. Successful automation of high throughput, large scale DNA sequencing must be effective at three levels:

Wetware - the design and protocols for preparation and delivery of sequencing templates and labeled reaction products,

Hardware - instrument design, control and data acquisition for high throughput separation and monitoring of sequencing ladder images, and

Software - high throughput signal conditioning, basecalling, editing and recording, for large parallel sample arrays, possibly all in real time.

The first generation implementations of automated DNA sequencing have been used intensively for megabase sequencing of the smaller genomes of bacteria, yeast and nematode, and for localized regions of insect, plant and mammalian chromosomes. This work has highlighted procedural bottlenecks at each level of the automated DNA sequencing process. These constraints are typically managed with labor intensive interceptions of the flow from input materials to finished sequences.

This paper describes the approach taken in our laboratory to develop more effective software for high throughput, large scale automated DNA sequencing. This software employs artificial neural networks as fast transform functions for signal conditioning and basecalling, and also to implement pattern recognition-based contextual editing of the first pass sequence. The nature of artificial neural networks supports adaptation of this software system to different instruments and instrument designs. The structure of the basecalling and editing processors supports serial or parallel implementations for high throughput operation in real time.
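The cost arithmetic quoted in the Sequence cost discussion above can be checked in a few lines. The figures (the 50 cent per basepair ceiling, 10 billion basepairs, 15 years) are those stated in the text; the script itself is purely illustrative:

```python
# Sanity check of the HGP sequencing cost arithmetic quoted above.
TARGET_COST_PER_BP = 0.50          # dollars per finished basepair (upper limit)
TOTAL_BASEPAIRS = 10_000_000_000   # human plus model organism genomes
PROJECT_YEARS = 15

total_cost = TARGET_COST_PER_BP * TOTAL_BASEPAIRS   # $5 billion overall
annual_cost = total_cost / PROJECT_YEARS            # about $333 million per year

print(f"total ${total_cost / 1e9:.0f} billion; annual ${annual_cost / 1e6:.0f} million")
```

This reproduces the $333 million per year figure against which the quoted sub-$200 million combined NIH and DOE support can be compared.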
1.2. DNA sequencing: Rationale and automation. Methods for the determination of "gene-sized" arrays of nucleotide sequence, hundreds to thousands of basepairs, emerged from the developing repertoire of modern molecular genetics: restriction enzymes, reagent grade DNA polymerases, and chemically synthesized oligonucleotides; cloning and amplification of sequences of interest in plasmid- or phage-derived vectors; and electrophoretic separation of oligonucleotides in denaturing polyacrylamide gels at single nucleotide resolution.

The most familiar approach to DNA sequencing involves preparation of four ensembles of linear DNA molecules from an initially homogeneous DNA sample. Each molecule in these four ensembles has one polar end in common, and each is distinguished by its unique length, measured in nucleotides from the common end. The four ensembles are tagged with a radioactive, fluorescent or other label in a specific fashion associated with the identity of the base, A, G, C or T, at the unique terminus. In multiplex sequencing methods, unique DNA sequence tags in the cloning vectors are exploited as targets for specific hybridization with labeled, complementary oligonucleotides.

The DNA sequencing method of Maxam and Gilbert (1977) exploits a restriction enzyme to generate the common ends for the DNA sequence ensembles. The DNA strands are typically end-labeled using radioactive nucleotide substrates with polynucleotide kinase or DNA polymerase. Some strategies may require physical separation of complementary DNA strands. Nucleotide-specific chemical cleavage reactions generate the variable ends of the DNA sequence ensembles for analysis in gels as DNA sequencing ladders.

The most widely used method today is that described by Sanger et al. (1977). In this approach the common end of the DNA sequence ensemble is the 5' terminus of an oligonucleotide primer for in vitro DNA synthesis on the template DNA strand to be sequenced. The variable lengths of the ensemble are generated by inclusion of chain-terminating dideoxynucleotides with the deoxynucleotide pools used in the template-directed DNA polymerase reaction. A variety of end-specific or inclusive labeling strategies have been developed for use with the Sanger procedure.

Each of these methods presents the labeled ensembles of DNA oligonucleotides for analysis by electrophoresis. Denaturing polyacrylamide gels (Maniatis et al., 1975) separate these oligomers with single nucleotide resolution over a range from several to several hundred nucleotides. Fluorescent nucleotide labels and laser scanning photometers have been the basis of several instruments designed to control the electrophoretic separation of sequencing ladders and support data acquisition as photometric data streams for basecalling analysis (Ansorge et al., 1987; Brumbaugh et al., 1988; Connell et al., 1987; Prober et al., 1987; Smith et al., 1986). These systems exploit either of two strategies:

Single fluorescent label - with the four base-specific ladders distributed
over four parallel lanes of the separating gel, or

Four fluorescent labels - with spectrophotometric discrimination of the nucleotide-specific labels within a single lane of the gel.

The basecalling rationale developed for these systems is primarily deterministic. Scanning along the time or displacement axis of the sequencing ladder image, the program attempts to detect the trace of each successive oligomer, and on detection then specifies the identity of its terminal nucleotide, based upon the oligomer's lane position or characteristic fluorescence. This is a direct implementation of the method described for manual reading of radioactive sequencing gel autoradiograms. It seems likely that commercial software associated with various automated sequencing instruments has incorporated some second-order heuristics for proofreading the first pass sequence results; however, these have not been openly discussed or disclosed.

Second generation, gel electrophoresis-based automated sequencing instruments are now under development. Some of these systems exploit the more rapid separations and enhanced resolution of DNA oligomers undergoing electrophoresis in thin slab or capillary gels. The greater resistance of thinner gels generates less disruptive Joule heating at higher running voltages (Smith, 1993; Brumley & Smith, 1991; Kostichka et al., 1992; Swerdlow & Gesteland, 1990; Luckey et al., 1990; Huang & Mathies, 1993). Another high throughput automated sequencing approach is based on scanning of sequencing ladders blotted onto membranes, probed with vector-complementary multiplex oligomer labels (Beck & Pohl, 1984; Church & Kieffer-Higgins, 1988; Karger et al., 1992; Church et al., 1993; Cherry et al., 1993). Lloyd Smith (1993) has recently reviewed thin gel approaches as part of the near future of automated DNA sequencing. He also discusses the potential for novel non-gel-based methods such as sequencing by hybridization (SBH) and matrix-assisted laser desorption and ionization (MALDI) of samples for analysis by mass spectrometry. DNA sequencing at less than a few pennies per finished basepair is no longer science fiction or fantasy, but the arrival of practical application may yet lie a few years in the future.

2. Parsing of genomic graffiti

2.1. The generic automated sequencer. The four-dye fluorescence automated DNA sequencer described by Smith et al. (1986) was a prototype based on electrophoretic separations of single sequencing ladders in a capillary gel. This instrument was soon upgraded to a commercially successful multi-sample slab gel format (ABI Models 370, 373, 377) by Applied Biosystems (Connell et al., 1987). Data presented in this report were generated with different ABI 373A DNA sequencers in the laboratories of collaborating genome research centers (see Acknowledgments).
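The deterministic scanning rationale described above — detect the trace of each successive oligomer along the scan axis, then call its terminal base from the channel in which the trace appears — can be caricatured in a few lines. This is an illustrative sketch under simplifying assumptions (a fixed intensity threshold, a strict local-maximum peak test, and an arbitrary channel-to-base assignment), not the software of any commercial instrument:

```python
# Naive deterministic basecaller sketch: one fluorescence trace per base
# channel; each above-threshold local maximum is taken as one oligomer
# event, and events are reported in scan order.
BASES = "CAGT"  # channel-to-base assignment is an arbitrary choice here

def call_bases(channels, threshold=100.0):
    """channels: four equal-length lists of fluorescence samples."""
    events = []
    for ch, trace in enumerate(channels):
        for i in range(1, len(trace) - 1):
            # a peak: above threshold and higher than both neighbours
            if trace[i] >= threshold and trace[i-1] < trace[i] > trace[i+1]:
                events.append((i, BASES[ch]))
    events.sort()  # order peaks along the scan axis
    return "".join(base for _, base in events)
```

Real traces exhibit the overlapping dye spectra, mobility shifts and variable band intensities discussed below, which is precisely why such a naive scheme needs signal conditioning in front of it.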
Other commercial automated DNA sequencing instruments include the Du Pont Genesis 2000, Pharmacia A.L.F., LI-COR Model 4000, and Millipore BaseStation. The photodetector assembly of the ABI 373A traverses back and forth across the slab gel face, scanning at a fixed distance from the sample loading wells and illuminating the gel with an argon laser beam. Identification of individual sequencing ladders is correlated with the lateral scanner position of their fluorescence data traces. The detector samples fluorescence from the laser-illuminated, dye-labeled oligomers during their electrophoretic transport across the detector window. The photometric data are recorded as fluorescence transmitted through an array of four optical filters, selected for discrimination of the four dyes on the basis of their fluorescence emission spectra.

A descriptive sketch of a generic DNA sequencing instrument, derived from the original description by Smith et al. (1986), is presented as Figure 2.1. The inventors' concept was to pass four raw photometric data streams as direct input to a computer. Appropriate software was anticipated to transform this input into an unambiguous temporal array of unit pulses, in four channels corresponding to the terminal nucleotides of the sequencing ladder. If the software succeeds in transforming the raw data to such an idealized presentation, then basecalling is reduced to an essentially trivial process of event recording.

Smith et al. (1986) recognized that the real informatic analysis of raw data streams from the automated DNA sequencer is complicated by: overlapping emission spectra of the four dyes, non-identical oligomer mobility offsets imparted by the four dyes, variable oligomer band intensities, and variable oligomer separations. The last of these complications was attributed to sequence-specific secondary structures of the ladder's oligomers. These occasional compressions in sequencing ladders appear as severe local distortions of the otherwise monotonic relation between average electrophoretic mobility and oligomer length. A more subtle, but more pervasive, anomaly of oligomer separations arises from nearest-neighbor interactions of terminal nucleotides during electrophoresis (Bowling et al., 1991).

Figure 2.2A illustrates 200 scans (20 minutes) across the four raw data channels from a sequencing run. Figure 2.2B shows an idealized output transformation of the raw data streams, as unit pulses appearing in base-specific channels, centered on the trace of each oligomer in the raw data streams. Generation of this idealized output transform is straightforward, particularly if the sequence of the DNA sample is known in advance (Golden et al., 1993; Tibbetts et al., 1994).

As our software system has developed, we have come to appreciate its rationale and performance through analogy with the problem of pattern recognition in handwriting analysis. Distinct continuous patterns of script correspond to ordered series of discrete letter events. The cursive traces of individual letters are specific; although similar, they are seldom identical. The traces
[Figure 2.1: instrument schematic, with the upper buffer reservoir, polyacrylamide gel slab or capillary, and lower buffer reservoir labeled.]

FIG. 2.1. Generic gel electrophoresis-based fluorescence automated DNA sequencer. The diagram illustrates a prototype DNA sequencing instrument, as originally described by Smith et al. (1986). Labeled oligonucleotide ensembles, as a DNA sequencing ladder, migrate by electrophoresis from top to bottom through the acrylamide gel-filled capillary. An argon laser illuminates the detector region of the gel, eliciting fluorescence from oligomers bearing base-specific dye-labels. The photodetector samples the fluorescence through four optical filters, passing raw data streams in four channels to the computer (black box). Smith et al. (1986) assigned the computer software to transform the complex incoming photometric raw data streams into unambiguous temporal arrays of unit oligomer signals, as indicated in the four channel idealized output transform below the instrument diagram.
[Figure 2.2: two panels. Panel 2A plots fluorescence intensity of the four raw data channels (Raw 0 to Raw 3) against scan number (approximately scans 2600 to 2800); panel 2B plots the base-specific target output functions (Tar C, Tar A, Tar G, Tar T) on a 0 to 1.0 intensity scale over the same scan range.]

FIG. 2.2. Raw photometric data streams from an ABI 373A DNA sequencer and idealized output transformation. Figure 2.2A presents 200 scans of raw data streams from sequencing ladder products of a Taq cycle sequencing reaction, using single stranded DNA template and the ABI dye primer reagents. The lower panel, Figure 2.2B, shows the placement of base-specific target output functions, unit step functions three scans wide. These output functions were placed below the traces of individual oligomers in the raw data streams above, taking advantage of the known sequence of the template DNA (a Hae III restriction fragment of the phage φX174 genome; the sequence shown in the panel is 5'-GGGTACGCAATCG-3'). As discussed by Golden et al. (1993), a neural network (or series of neural networks) may be trained to approximate the unknown function which transforms the raw data streams to the discrete output function arrays. The inset legends indicate use of filled circles (-●-), open circles (-○-), open squares (-□-) and open triangles (-△-) to specify photometric data channels 0, 1, 2, 3 and base-specific transforms C, A, G, T, respectively. Trace data output from software associated with conventional automated DNA sequencers illustrates these channels using colors in the order blue (C), green (A), black or yellow (G) and red (T).
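The construction of the target output functions described in the figure legend — unit steps, three scans wide, centred on the trace of each oligomer of a template whose sequence is known — can be sketched as follows. The peak centres are assumed to have been located beforehand; the function name and channel ordering are illustrative:

```python
# Sketch of building the idealized four-channel target arrays used as
# neural-network training output: a unit step, three scans wide, centred
# on each known oligomer trace of a benchmark template.
BASES = "CAGT"  # channel order used throughout this sketch

def target_transform(n_scans, peaks):
    """peaks: list of (scan_center, base) pairs from a known sequence."""
    target = [[0.0] * n_scans for _ in BASES]
    for center, base in peaks:
        ch = BASES.index(base)
        for s in range(center - 1, center + 2):  # 3-scan-wide unit step
            if 0 <= s < n_scans:
                target[ch][s] = 1.0
    return target
```

Paired with the corresponding raw data window, each such target line supplies one input-output vector pair for network training.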
of certain letters may differ markedly in the context of neighboring letters. The ABI 373A sequencer generates a "handwriting" image of the DNA sequence, which appears as four parallel raw data streams over time. This needs to be translated into an ordered array of the four nucleotide "letters" in the correct order. Instrumental, physical, chemical and biochemical anomalies lead to non-uniform signal traces of oligomers ending in the same nucleotide. However, these anomalous signatures are often correlated with the identities of neighboring nucleotides in the sequence.

Consistent, sequence-associated variation of oligomer yields in sequencing ladders reflects events during the in vitro synthesis of the labeled oligomer ensemble (Sanger et al., 1977; Hindley, 1983; Smith et al., 1986; Ansorge et al., 1987; Connell et al., 1987; Tabor & Richardson, 1987; Kristensen et al., 1988). The variance of oligomer signal intensity in the raw data streams thus has a biochemical, DNA polymerase-specific determinant. Similarly consistent, sequence-associated variation in the separation of successive oligomers has been described (Bowling et al., 1991). Nearest neighbor interactions among the 3' terminal nucleotides of each oligomer influence the conformation of oligomers during electrophoretic transport. This is a biophysical determinant of the variable resolution of successive oligomers in the raw data streams. The four ABI dye labels, appearing on the four subsets of the oligomer ladder ensemble, additionally contribute to the variance of oligomer separations in the raw data streams. This may nevertheless be of informative value, as the dye-mobility shifts confer particular attributes on the trace signatures of adjacent nucleotides in the sequence ladder (Golden et al., 1993). Thus the ABI (or any other) four dye sequencing system has both chemical and biophysical determinants of oligomer separation variance.

The native basecalling software of the ABI 373 system (Analysis®) reports terminal nucleotide identities of successively detected oligomers, based on the fluorescence emission of each oligomer's associated dye. Work from our laboratory (Bowling, 1991; Golden et al., 1993; Tibbetts et al., 1994; Tibbetts & Bowling, 1994) has established that basecalling accuracy can be improved through algorithmic incorporation of the additional information latent in the relative yields and relative separations of each oligomer's trace in the data stream. This favorable situation is further improved in a contextual, or pattern recognition, analysis of multiple informative parameters (fluorescence, separation, yield) over local arrays of oligomers.

Returning to the handwriting analogy, the parsing of genomic graffiti (raw data streams) operates at two levels:

Scanning analysis of the raw data streams, to detect, classify and report in order the traces of individual oligomers, and

Contextual analysis of the multiple informative parameters from the trace data representing a local array of neighboring oligomers.
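A minimal sketch of the second, contextual level, assuming each tentative call arrives as a record carrying its multiple informative attributes (the record layout and attribute names here are illustrative, not those of our software). A trained network would consume each five-call window to confirm or amend its centre base:

```python
# Illustrative five-basecall context window for the contextual editing
# level described above. Each call carries several informative
# attributes (here: base, fluorescence, separation, yield); the window
# centred on each call would be passed to a trained network to qualify
# the tentative identity of the centre base.
def context_windows(calls, width=5):
    """calls: list of dicts with keys 'base', 'fluor', 'sep', 'yield'."""
    half = width // 2
    for i in range(half, len(calls) - half):
        window = calls[i - half : i + half + 1]
        # the centre call is the one being qualified by its neighbours
        yield window[half]["base"], window
```

Because each new call simply shifts the window forward by one position, this structure streams naturally and can be pipelined behind the basecaller.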
In parsing real handwriting it is often unnecessary to identify each and every letter before words are recognized from the context of neighboring letters. The neural network implementation of a contextual basecalling editor, described later in this report, often specifies insertion of the correct nucleotide into single-base gaps of first pass sequence. The processes of scanning and contextual analysis are closely coupled in the reading of cursive text. Likewise, the basecalling and basecall editing processors can be linked, with favorable throughput for real-time applications. The basecaller operates by processing successive data transforms through a narrow sliding window of the most recently acquired data. Once initialized, each new line of raw data input generates the next fully processed line of output. On each scan cycle each nucleotide-specific channel is screened for the possible presence of an oligomer event in the window. As each new basecall is reported, it can be treated as the next incoming component of a window representing the most recent five basecalls. The multiple informative attributes (fluorescence, intensity, separation) of each of these basecalls can be passed through a trained neural network to qualify the tentative identity of the base at the center of the window. This cascade of data through the architecture of the linked processors supports parallel applications of this software model in real time basecalling operations.

2.2. Scanning analysis - ordered arrays of individual oligomers. Our current signal conditioning software for the ABI 373A DNA sequencer represents three transformations across a narrow window of the most recently acquired raw photometric data (Tibbetts et al., 1995; Golden et al., 1995). These transformations, illustrated in Figure 2.3, represent a pipeline of data processing through three stages: an algorithmic baseline subtraction for each of the four raw data streams (compare data traces of Figure 2.3A and Figure 2.3B), a neural network mapping transformation from oligomer raw data traces to narrow, unit step functions (compare data traces of Figure 2.3B and Figure 2.3C), and a neural network event filter mapping intermediate transforms to final output form (compare data traces of Figure 2.3C and Figure 2.3D). The output transform of the signal conditioning basecalling processor (Figure 2.3D) compares favorably with the idealized output transform (Figure 2.1B), as originally envisioned by Smith et al. (1986).

The first step of the three stage processor, baseline subtraction, is implemented by determination of each channel's minimum value within a sliding interval of ±100 scans. A satisfactory alternative implementation has been developed which uses a neural network, as in the second and third stages of the overall processor. This approach maps 9 scans of the 4 channel raw data to a single 4 channel line of baseline subtracted data values in the center of the narrow window. The neural network approach may
PARSING OF GENOMIC GRAFFITI

CLARK TIBBETTS ET AL.

[Figure 2.3 appears here: four panels (2.3A-2.3D) of plotted trace data.]

FIG. 2.3. Processing of raw sequencer data through serial transformations of background subtraction, core mapping and output event filter. The four panels correspond to 85 scans (8.5 minutes) of raw sequencer data, as described in the legend to Figure 2.2. The baseline components of the four raw data streams (Figure 2.3A) are removed in the first stage of the transform, resulting in the display of fluorescence signals in Figure 2.3B. The second stage core transform (Figure 2.3C) results in substantial color separation (transform from photometer space to dye space), but the output functions vary in height and width from the idealized target step functions. Output from the second stage transform is pipelined to the third stage oligomer trace event filter, with final output showing well resolved unit events corresponding to the sequence of bases (Figure 2.3D). The inset legends indicate use of filled circles, open circles, open squares and open triangles to specify photometric data channels 0, 1, 2, 3 and base-specific transforms C, A, G, T, respectively.
find greater utility in increasingly parallel implementations of automated sequencers.

The second and third stages of the signal conditioning processor are implemented as neural networks. Each has an architecture which is trained to map 9 scans of 4 data channels (36-component primary input vector) to a single line of transformation target values (4-component output vector). In lieu of conventional hidden layers for the networks, an expanded array of pairwise products of the primary input data is computed (Golden et al., 1993). The target vectors correspond to the channel-specific values for the 3-scan-wide unit step functions shown in Figure 2.1B. Unpruned neural networks for the second and third transforms each have 2668 connection weights. Training sets for the neural networks are constructed by mapping the target functions with known DNA sequences and the ladder image traces generated with the ABI 373A DNA sequencer. Suitable benchmark DNA sequences for generation of training sets have been obtained from well-characterized plasmid and phage vectors, as well as edited consensus sequences of cosmid inserts from shotgun subcloning-sequencing projects. Twenty sequencing ladders, representing different templates and gel runs, provide training sets approaching 100,000 input-output vector pairs.

Reporting basecalls in the order of their discovery in the event-filtered output window (Figure 2.3D) provides a first pass estimate of the DNA sequence. Typical first pass performance is 97 to 98% correct calls to a length of 400 called bases. This result was obtained for ABI 373A raw data with Taq or Sequenase dye-primer reactions, using standard length gels and running conditions. First pass basecalling performance, with distributions of the types and numbers of basecalling errors, is presented in Table 2.1 for three independent data sets not used for network training.
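The pairwise-product expansion used in lieu of hidden layers can be sketched as follows. This is a minimal illustration, not the original code, but the weight count of the unpruned networks follows directly from the expansion's dimensions:

```python
import numpy as np

def expand_inputs(x):
    """Expand a primary input vector with its pairwise products and a bias.

    In lieu of hidden layers, the 36 values of a 9-scan x 4-channel window
    are augmented with the C(36, 2) = 630 products x_i * x_j (i < j) and a
    bias term before a single trained layer maps them to the 4-component
    target vector. Function name is illustrative.
    """
    x = np.asarray(x, dtype=float)
    i, j = np.triu_indices(len(x), k=1)  # strict upper triangle: i < j
    return np.concatenate(([1.0], x, x[i] * x[j]))

# Each of the 4 output nodes connects to all 1 + 36 + 630 = 667 expanded
# inputs, giving 4 * 667 = 2668 connection weights, as reported for the
# unpruned second- and third-stage networks.
assert expand_inputs(np.zeros(36)).shape == (667,)
```

The same arithmetic reappears later in the basecall editor network, whose 45 primary inputs expand to 990 pairwise products.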
A small rolling memory buffer associated with the narrow window processor enables computation of the intensity and peak centers for oligomer traces associated with credible basecalls. These results are reported together with each basecall in an output file as a feature table. This anticipates the downstream analysis of local clusters of basecalls and their multiple informative parameters for pattern recognition-based editing of the first pass sequence. An inner product of the event-filtered output data window with the background-subtracted fluorescence data window, treating the 9-scan x 4-channel window matrices as 1 x 36 vectors, provides an estimate of signal intensity associated with the called base. If the individual column elements of the inner product are multiplied by the corresponding scan line numbers, then division by the estimated oligomer intensity provides a weighted estimate of the oligomer trace's peak center (centroid). The relative intensities and center-to-center separations are readily derived from the results reported by the signal conditioning basecalling software's output feature table. The informative quality of these automated measurements of oligomer intensity and separation can be appreciated as
TABLE 2.1
First Pass Basecalling Performance

First pass DNA sequences were generated using three different implementations of the signal conditioning and basecalling processor, specific for individual ABI 373A instruments and sequencing chemistry. These sequences were aligned with the known benchmark sequences of the template DNA (Hae III fragments of the phage φX174 genome or consensus sequence of the A14 cosmid insert determined by Koop et al., 1993). Discrepancies between the first pass basecalls and known sequences are categorized as: Correct, Overcalls (basecalls to be deleted), Undercalls (missing bases to be inserted) and Miscalls (substitutions to correct basecall identity). Only 27 instances of the swap and gap error categories appeared in the total survey of over 24,000 first pass basecalls, and these minor classes are omitted from the summary below.

Sequencing Data Source (number of ladders):

Basecall Category    MIT-Taq (21)    UWash-Taq (27)    UWash-T7 (23)
Correct Calls            6551             9180              7854
Overcalls                  46              193                42
Undercalls                 94               69               132
Miscalls                    7                6                 2
TOTAL CALLS              6698             9448              8060
Correct Calls (%)       97.8%            97.1%             97.3%
sequence-specific correlations from the basecalling analysis of different sequencing ladders. Figure 2.4 illustrates correlations among series of oligomer relative separations, computed as described above. These represent the same template DNA sequences, but sequencing reactions with different DNA polymerases (Figure 2.4A). Different DNA sequences, even in ladders generated with the same DNA polymerase, do not result in significant correlations of their oligomers' relative separations (Figure 2.4B). Short-range but significant correlation of series of oligomer separations is observed for short runs of identical sequences within otherwise dissimilar sequences (Figure 2.4C). Figure 2.5 illustrates similar sequence-specific correlation of the relative intensities of oligomer traces for series of identical DNA sequences. Correlations are most significant when the sequencing ladders are generated using the same DNA polymerase and reaction conditions (Figure 2.5A), but minimal when comparing ladders generated using different DNA polymerases (Figure 2.5B).

2.3. Contextual analysis - local patterns of multiple oligomer traces. Earlier reports from this laboratory described neural network mapping of multiple informative parameters (basecall, oligomer fluorescence or lane position, oligomer signal intensity, oligomer relative separation) to a battery of editing options (Golden et al., 1993; Tibbetts et al., 1994). We have extended this approach to develop an effective automated basecall editor for the first pass sequences generated by our ABI 373A-adapted signal conditioning and basecalling processor (Torgersen et al., 1995). Called sequences are first aligned as text strings with the known DNA
[Figure 2.4 appears here: three panels (2.4A-2.4C) plotting series of relative oligomer separations against the basecall sequences posted below each abscissa.]

FIG. 2.4. Correlations of series of oligomer separations. Figure 2.4A shows the successive values of measured oligomer separations corresponding to the DNA sequence posted below the abscissa. Substantial variance is indicated, but it is extraordinary in its consistency. The three ladders from which the measurements were made were generated using different sequencing strategies, and the samples were run at different times and on different gels. Figure 2.4B shows three series of relative separation values, corresponding to the three dissimilar sequences posted below the graph. There is no apparent correlation. However, a short heptamer sequence is common to part of the sequences representing Seq ladder 4 and Seq ladder 5. Figure 2.4C demonstrates that short sequence identities in otherwise unrelated sequences retain locally correlated patterns of oligomer separation values.
[Figure 2.5 appears here: two panels (2.5A, 2.5B) plotting series of relative oligomer intensities against the basecall sequences posted below each abscissa.]

FIG. 2.5. Correlations of series of oligomer intensities. Figure 2.5A illustrates the consistent complex pattern of oligomer intensities from ladders representing the same sequence and same sequencing reaction conditions. The three Taq cycle sequencing reactions represent the same sequence from different, overlapping cosmid subclones run on different gels. The lower panel, Figure 2.5B, illustrates the unrelated patterns of oligomer intensity variation. Again three ladders represent the same template DNA sequence from overlapping cosmid subclones. The identical sequences show similar patterns of intensity variation for the products of the two T7-Sequenase reactions, but these are essentially unrelated to the pattern of the Taq cycle sequencing ladder.
sequences of the benchmark DNA templates. The locations of discrepancies are readily identified. Often, in cases of short runs of identical basecalls, there may be too many or too few of the bases included in the called sequence. Some judgment must be exercised for identification of the specific basecall to be edited, using cues such as extraordinary separations (gaps for missing bases) or minute signal intensities (candidates for deletion). Our developing practice led us to identify 15 categories of basecalls with respect to possible editing actions. The most abundant category (97 to 98%) is "_nop", the passive "no operation" state for correct calls in first pass sequences. The erroneous basecalls in first pass sequences are distributed among the 14 active editing categories. The majority of these erroneous basecalls represent missing bases or extranumerary basecalls (indel errors). In a survey of 91 sequencing ladders with 24,206 total basecalls (Table 2.1), indel errors represented 97% of the 591 first pass basecalling errors.

The input data vector for the neural network basecall editor represents arrays of 5 successive basecalls, with associated values derived from the feature table output file of the first pass basecalling processor. Each basecall is represented as four Boolean variables: A?, G?, C?, or T? Each basecall's peak center and its center-to-center separation from the preceding oligomer are included as measured variables. The integrated signal intensity of each oligomer event and its relative intensity, or ratio to the intensity value of the preceding oligomer, are included variables. The current editor also includes a parameter evaluated as the degree of matching between the event filtered output across the 36 cell processor window (9 scans, 4 channels) and a mask representing the idealized output (as the corresponding group of 9 successive target vectors from the neural network training sets, as from Figure 2.1B).
This parameter is a reasonable indicator of raw data quality, reflecting the resolution of the oligomer's trace in the raw data stream and the signal-to-noise ratio across the oligomer's trace in the raw data streams. The input layer of the basecall editing neural network presents data vectors of 45 primary components (5 basecalls, 9 parameters each). This input vector is expanded as the computed array of 990 pairwise products of these 45 primary components. This is similar to the approach we have taken in development of the signal conditioning neural networks, bypassing the hidden layer array(s) of conventional feed-forward backpropagation architectures (Golden et al., 1993; Tibbetts et al., 1994). The composite input vectors (1036 nodes, including the bias = 1 node) are mapped to 15 categories of editing actions for the third called base in the center of the 5-basecall-wide window. The editing nodes of the network's output layer are:
nop - no edit, correct call;
delA, delC, delG, delT - possible position 3 overcalls;
insA, insC, insG, insT - possible undercall between positions 3 and 4;
xtoA, xtoC, xtoG, xtoT - possible miscall at position 3;
swap - reverse called bases at positions 3 and 4; and
gap - 2 or more bases are missing between positions 3 and 4.
The gap category is most often associated with simple compression artifacts in the sequencing ladders' trace data. An outline of the architecture of the basecalling editor network is presented as Figure 2.6.

Training sets for neural network basecall editors have been constructed from DNA sequencing data in sets of approximately 20 sequencing ladders, representing 6000 to 8000 basecalls. In the work presented here, basecall editor training sets for ABI 373A data were prepared from the earlier survey of first pass basecalling results, Table 2.1. The stringency and specificity of these implementations of the neural network basecall editor are remarkable. Instances of false corrections are exceedingly rare (0.02%). When an error is indicated as a high output value for one of the 14 error category output nodes (> 0.90), the outputs of the other 13 nodes remain low (< 0.05). A survey of first pass and edited sequences is presented in Table 2.2. It is important to note that the frequency of correct basecalls mapped as editor-specified false corrections is very low (3/13,790 = 0.02%). This favorable performance likely reflects the abundant representation of correct basecall vectors (_nop) in the basecall editor training data sets. Indel error types are the most abundant of basecalling errors represented in first pass sequences. It appears to be more difficult for the neural network basecall editor to specify insertion of a missing base (insN), compared to specifying the deletion of an extranumerary basecall (delN).
Although the architecture of the editor network has independent, parallel arrays of connections from the input layer to each of the output nodes, in those cases where an undercall error is recognized, only one of the four undercall nodes has a high output value to specify insertion of the correct missing base. The survey results in Table 2.2 indicate specific corrections for 29% of the miscall (xtoN), basecall order (swap) and compression (gap) error types in first pass sequences from ABI 373A raw data. This level of performance is somewhat surprising, since such error categories were so sparsely represented in the training data and evaluation survey data sets (32/17,508 for the training data set, 24/14,159 for the survey test set).

The results presented in Table 2.2 also describe the performance of similar neural network basecall editors developed for application with the Millipore BaseStation, a single-fluorescent-label, four-lane automated DNA sequencer (Golden et al., 1993; Tibbetts et al., 1994). First pass sequences, for T7-Sequenase or thermophilic Vent DNA polymerase reactions, were generated using the native basecalling software developed by Millipore/BioImage. Modified software was written by BioImage programmers to provide estimates of oligomer yields and separations. This data was dumped
to files as peak heights and the numbers of gel image pixel lines separating successive peaks, respectively. The BaseStation's first pass basecalling overall accuracy is less than we have attained in our analysis of data from the ABI 373A. Nevertheless, the neural network basecall editor implementations for the BaseStation perform remarkably well. Perhaps many of the BaseStation's more abundant first pass basecalling errors are associated with more readily recognized patterns of trace data parameter arrays.

We have also evaluated basecalling performance in terms of the average positions, in sampled sequencing ladders, at which the first, second, third, etc. basecalling errors are encountered. Figure 2.7 presents such an analysis of ordinal basecalling error distributions for a sample of 21 Taq cycle sequencing ladders. Data points are plotted with horizontal error bars representing ± one standard error about the mean position of the corresponding ordinal basecalling error. This presentation provides a basis for statistically objective evaluation of small but significant differences in basecalling accuracy across the interpretable range of oligomer lengths in different sets of DNA sequencing ladders. The overall basecalling accuracies of first pass and edited sequences are 97.5% and 99.3%, respectively, similar to results shown for different ABI 373A Taq cycle sequencing data in Table 2.1.

Sequences corrected as indicated by the neural network basecalling editor compare favorably with the sequences called by the native ABI 373A Analysis software. Although the basecalling algorithms employed in that commercial software are proprietary and undisclosed, it seems likely that their procedures invoke a deterministic first pass procedure with some form of heuristic filtering to proofread basecalls of the first pass sequence.
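The ordinal error statistics described above can be sketched as follows; function and variable names are illustrative and not from the original analysis software:

```python
import statistics

def ordinal_error_positions(ladders):
    """Mean and standard error of the position of the k-th basecalling error.

    `ladders` is a list of per-ladder lists giving the called-base positions
    at which the 1st, 2nd, 3rd, ... errors were encountered (from alignment
    against the known benchmark sequence). Returns one (mean, sem) pair per
    ordinal, computed over the ladders containing at least that many errors.
    """
    max_ord = max(len(errs) for errs in ladders)
    results = []
    for k in range(max_ord):
        positions = [errs[k] for errs in ladders if len(errs) > k]
        mean = statistics.fmean(positions)
        sem = (statistics.stdev(positions) / len(positions) ** 0.5
               if len(positions) > 1 else 0.0)
        results.append((mean, sem))
    return results
```

Plotting each ordinal's mean position with ± one standard error yields the presentation of Figure 2.7.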
[Figure 2.6 appears here: schematic of the single layer neural network basecall editor. A primary input array of multiple informative parameters spans a window of five basecalls (N-2, N-1, N, N+1, N+2; 45 terms). A computed higher order term array supplies the 990 pairwise products of the 45 terms of the primary input array. A bias node and a single layer of connections map the primary and computed input arrays (1036 nodes) to the 15 editing output node array (15,540 connections).]
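The editor network's forward pass, as dimensioned above, can be sketched as follows. Names are illustrative and the trained connection weights are of course not reproduced here; the sketch only shows how the stated dimensions fit together.

```python
import numpy as np

# The 15 editing-action output nodes named in the text.
EDIT_ACTIONS = (
    ["nop"]
    + ["del" + b for b in "ACGT"] + ["ins" + b for b in "ACGT"]
    + ["xto" + b for b in "ACGT"] + ["swap", "gap"]
)

def editor_forward(params, weights):
    """Map a 45-component parameter window to 15 editing-action outputs.

    45 primary inputs (5 basecalls x 9 parameters) are expanded with their
    C(45, 2) = 990 pairwise products and a bias node to 1036 input values,
    then mapped through a single layer of 1036 x 15 = 15,540 connection
    weights with a logistic activation.
    """
    x = np.asarray(params, dtype=float)
    assert x.shape == (45,)
    i, j = np.triu_indices(45, k=1)                     # 990 index pairs
    expanded = np.concatenate(([1.0], x, x[i] * x[j]))  # 1036 input values
    assert weights.shape == (1036, 15)                  # 15,540 connections
    return 1.0 / (1.0 + np.exp(-(expanded @ weights)))
```

An edit is then indicated when one action node's output exceeds a high threshold (the text cites > 0.90) while the remaining nodes stay low.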