Genetic Mapping and DNA Sequencing [Reprint ed.] 1461268907, 9781461268901

Genetic mapping, physical mapping and DNA sequencing are the three key components of the human and other genome projects.


English Pages 236 [229] Year 2012


The IMA Volumes in Mathematics and its Applications Volume 81 Series Editors Avner Friedman Robert Gulliver

Springer Science+Business Media, LLC

Institute for Mathematics and its Applications

IMA

The Institute for Mathematics and its Applications was established by a grant from the National Science Foundation to the University of Minnesota in 1982. The IMA seeks to encourage the development and study of fresh mathematical concepts and questions of concern to the other sciences by bringing together mathematicians and scientists from diverse fields in an atmosphere that will stimulate discussion and collaboration. The IMA Volumes are intended to involve the broader scientific community in this process.

Avner Friedman, Director
Robert Gulliver, Associate Director

IMA ANNUAL PROGRAMS

1982-1983  Statistical and Continuum Approaches to Phase Transition
1983-1984  Mathematical Models for the Economics of Decentralized Resource Allocation
1984-1985  Continuum Physics and Partial Differential Equations
1985-1986  Stochastic Differential Equations and Their Applications
1986-1987  Scientific Computation
1987-1988  Applied Combinatorics
1988-1989  Nonlinear Waves
1989-1990  Dynamical Systems and Their Applications
1990-1991  Phase Transitions and Free Boundaries
1991-1992  Applied Linear Algebra
1992-1993  Control Theory and its Applications
1993-1994  Emerging Applications of Probability
1994-1995  Waves and Scattering
1995-1996  Mathematical Methods in Material Science
1996-1997  High Performance Computing
1997-1998  Emerging Applications of Dynamical Systems

Continued at the back

Terry Speed Michael S. Waterman Editors

Genetic Mapping and DNA Sequencing With 35 Illustrations

Springer

Terry Speed Department of Statistics University of California at Berkeley Evans Hall 367 Berkeley, CA 94720-3860 USA

Michael S. Waterman Department of Mathematics and Molecular Biology University of Southern California 1042 W. 36th Place, DRB 155 Los Angeles, CA 90089-1113 USA

Series Editors: Avner Friedman Robert Gulliver Institute for Mathematics and its Applications University of Minnesota Minneapolis, MN 55455 USA

Mathematics Subject Classifications (1991): 62F05, 62F10, 62F12, 62F99, 62H05, 62H99, 62K99, 62M99, 62P10

Library of Congress Cataloging-in-Publication Data
Genetic mapping and DNA sequencing/[edited by] Terry Speed, Michael S. Waterman. p. cm. - (IMA volumes in mathematics and its applications; v. 81) Includes bibliographical references. ISBN 978-1-4612-6890-1 ISBN 978-1-4612-0751-1 (eBook) DOI 10.1007/978-1-4612-0751-1 1. Gene mapping-Mathematics. 2. Nucleotide sequence-Mathematics. I. Speed, T.P. II. Waterman, Michael S. III. Series. QH445.2.G448 1996 574.87'322'0151-dc20 96-18414

Printed on acid-free paper.

© 1996 Springer Science+Business Media New York
Originally published by Springer-Verlag New York, Inc. in 1996
Softcover reprint of the hardcover 1st edition 1996

All rights reserved. This work may not be translated or copied in whole or in part without the written permission of the publisher Springer Science+Business Media, LLC, except for brief excerpts in connection with reviews or scholarly analysis. Use in connection with any form of information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed is forbidden. The use of general descriptive names, trade names, trademarks, etc., in this publication, even if the former are not especially identified, is not to be taken as a sign that such names, as understood by the Trade Marks and Merchandise Marks Act, may accordingly be used freely by anyone. Authorization to photocopy items for internal or personal use, or the internal or personal use of specific clients, is granted by Springer Science+Business Media, LLC, provided that the appropriate fee is paid directly to Copyright Clearance Center, 222 Rosewood Drive, Danvers, MA 01923, USA (Telephone: (508)750-8400), stating the ISBN and title of the book and the first and last page numbers of each article copied. The copyright owner's consent does not include copying for general distribution, promotion, new works, or resale. In these cases, specific written permission must first be obtained from the publisher.

Production managed by Hal Henglein; manufacturing supervised by Jacqui Ashri. Camera-ready copy prepared by the IMA.

987654321 ISBN 978-1-4612-6890-1

SPIN 10524747

FOREWORD This IMA Volume in Mathematics and its Applications

GENETIC MAPPING AND DNA SEQUENCING

is one of the two volumes based on the proceedings of the 1994 IMA Summer Program on "Molecular Biology" and comprises Weeks 1 and 2 of the four-week program. Weeks 3 and 4 will appear as Volume 82: Mathematical Approaches to Biomolecular Structure and Dynamics. We thank Terry Speed and Michael S. Waterman for organizing Weeks 1 and 2 of the workshop and for editing the proceedings. We also take this opportunity to thank the National Institutes of Health (NIH) (National Center for Human Genome Research), the National Science Foundation (NSF) (Biological Instrumentation and Resources), and the Department of Energy (DOE), whose financial support made the summer program possible.

Avner Friedman Robert Gulliver


PREFACE Today's genome projects are providing vast amounts of information that will be essential for biology and medical science in the 21st century. The worldwide Human Genome Initiative has as its primary objective the characterization of the human genome. Of immediate interest and importance are the locations and sequences of the 50,000 to 100,000 genes in the human genome. Many other organisms, from bacteria to mice, have their own genome projects. The genomes of these model organisms are of interest in their own right, but in many cases they provide valuable insight into the human genome as well. High-resolution linkage maps of genetic markers will play an important role in completing the human genome project. Genetic maps describe the location of genetic markers along chromosomes in relation to one another and to other landmarks such as centromeres. Genetic markers in humans include thousands of genetic variants that have been described by clinicians and that in other organisms are called mutants as well as the more recent molecular markers, which are based on heritable differences in DNA sequences that may not result in externally observable differences among individuals. Such molecular genetic markers are being identified at an increasing rate, and so the need for fast and accurate linkage and mapping algorithms of ever-increasing scope is also growing. In addition to playing an important role in long-term genome projects, genetic maps have many more immediate applications. Given data from suitably designed crosses with experimental organisms, or from pedigrees with humans and other animals, new mutations, genes, or other markers can frequently be mapped into close proximity to a well-characterized genetic marker. This can then become the starting point for cloning and sequencing the new mutation or gene. 
Approaches like this have given detailed information about many disease genes and have led to success in determining genes causing cystic fibrosis and Huntington's disease. During meiosis prior to the formation of gametes, a random process known as crossing over takes place one or more times on the average on each chromosome. Crossovers cannot be observed directly, but they can leave evidence of having occurred by causing recombination among nearby genetic markers. When two (or more) markers are inherited independently, recombinants and non-recombinants are expected in equal proportions among offspring. When the markers appear to be co-inherited more frequently than would be expected under independence, a phenomenon called genetic linkage, this is taken as evidence that they are located together on a single chromosome. The first paper in this volume, by McPeek, explains this process in greater detail than can be done here. The genetic distance between


two markers is defined to be the expected number of crossovers per meiosis occurring between the two markers on a single chromosome strand. Since crossovers cannot be observed, only recombination patterns between markers can be counted. Thus, the quantities that can be estimated from cross or pedigree data are recombination fractions, and these need to be connected to genetic distances using a statistical model. Most workers use a model based on the Poisson distribution, which is known not to be entirely satisfactory, and some current research addresses the question of just what is a suitable model in this context. The appropriateness of the Poisson model is considered in the papers by Keats and Ott, and alternatives to it are discussed by Speed. Given a statistical model for the crossover-recombination process, there remain formidable problems in ordering and mapping a number of markers from a single experiment or set of pedigrees, as well as difficulties of incorporating new data into existing maps. Most of the problems of the first kind stem from the many forms of incompleteness that arise with genetic data. At the lowest level, data may simply be missing. However, we may have data, e.g. on disease status, that can change over time, so that even a disease phenotype is not unambiguously observed. Many genetic diseases exhibit this so-called incomplete penetrance. At the next level, we may have certain knowledge of phenotypes but, because of the trait being dominant or recessive, not know the genotype. Finally, to carry out linkage or mapping studies, calculations need to be based on the haplotypes of a set of markers; that is, we need to know which alleles go together on each chromosome. A special class of missing data problems arises when we attempt to locate genes that contribute to quantitative traits, which are not simply observable.
Standard statistical methods such as maximum likelihood remain appropriate for these problems, but their computational burden grows quickly with the number of markers and the size and complexity of pedigrees. Similar difficulties arise with other organisms, and each presents its own problems, for cross or pedigree data from, say, maize, fruit flies, mice, cattle, pigs and humans all have their own unique features. There are likely to be many challenging statistical and computational problems in this area for some time to come. For an indication of some of these challenges, the reader is referred to the papers in this volume by Dupuis, Lin and Sobel et al. Together they survey many of the problems in this area of current interest. The next level of DNA mapping is physical mapping, consisting of overlapping clones spanning the genome. These maps, which can cover the entire genome of an organism, are extremely useful for genetic analysis. They provide the material for less redundant sequencing and for detailed searches for a gene, among other things. Complete or nearly complete physical maps have been constructed for the genomes of Escherichia coli, Saccharomyces cerevisiae, and Caenorhabditis elegans. Many efforts are under


way to construct physical maps of other organisms, including man, mouse and rice. Just as in DNA sequencing, to be mentioned below, most mapping experiments proceed by overlapping randomly chosen clones based on experimental information derived from the clones. In sequencing, the available information consists of a sequence of the clone fragment. In physical mapping, the information is a less detailed "fingerprint" of the clone. The fingerprinting scheme is dependent on the nature of the clones, the organisms under study, and the experimental techniques available. Clones with fingerprints that have sufficient features in common are declared to overlap. These overlapping clones are assembled into islands of clones that cover large portions of the genome. Physical mapping projects are very expensive in labor and materials, and they involve many choices as to experimental technique. The very choice of clone type varies from about 15,000 bases (Lambda clones) up to several hundred thousand bases (yeast artificial chromosomes or YACs). In addition, the fingerprint itself can range from a simple list of selected restriction fragment sizes to a set of sites unique in the genome. Different costs, in material and labor, as well as different amounts of information will result from these choices. Statistics and computer science are critical in providing important information for making these decisions. The paper of Balding et al. develops strategies using pools of clones to find those clones possessing particular markers (small pieces of DNA called sequence tagged sites or STSs). Their work involves some interesting statistics. The most detailed mapping of DNA is the reading of the sequence of nucleotides. One classic method is called shotgun sequencing. Here a clone of perhaps 15,000 letters is randomly broken up into fragments that are read by one run of a sequencing machine. These reads are about 300-500 letters in length.
The sequence is assembled by determining overlap between the fragments by sequence matching. The sequence is not perfectly read at the fragment level, and this is one source of sequencing errors. Another source of errors comes from the repetitive nature of higher genomes such as human. Repeated sequences make it very difficult to find the true overlap between the fragments and therefore to assemble the sequence. Statistical problems arise in estimating the correct sequence from assembled fragments and in estimating the significance of the pairwise and multiple overlaps. The paper of Huang is an update of the original "greedy" approach of Staden. This paper takes the fragment sequences as input. Of particular note is the use of large deviation statistics and computer science to very rapidly make all pairwise comparisons of fragments and their reverse complements. Scientists are working to make the existing sequencing methods more efficient and to find new methods that allow more rapid sequence determination. For example, in multiplex sequencing, the information of several gel runs is produced in a single experiment. In another direction, automated machines such as the Applied Biosystems 373A sequencer produce


machine-readable data for several gel runs in parallel. Two of the papers in this volume, Nelson and Tibbetts et al., are about the inference of sequence from raw data produced by these machines. Modern molecular genetics contains many challenging problems for mathematicians and statisticians, most deriving from technological advances in the field. We hope that the topics discussed in this volume give you a feel for the range of possibilities in this exciting and rapidly developing area of applied mathematics.

Terry Speed
Michael S. Waterman

CONTENTS

Foreword

Preface

An introduction to recombination and linkage analysis
    Mary Sara McPeek

Monte Carlo methods in genetic analysis
    Shili Lin

Interference, heterogeneity and disease gene mapping
    Bronya Keats

Estimating crossover frequencies and testing for numerical interference with highly polymorphic markers
    Jurg Ott

What is a genetic map function?
    T.P. Speed

Haplotyping algorithms
    Eric Sobel, Kenneth Lange, Jeffrey R. O'Connell, and Daniel E. Weeks

Statistical aspect of trait mapping using a dense set of markers: a partial review
    Josee Dupuis

A comparative survey of non-adaptive pooling designs
    D.J. Balding, W.J. Bruno, E. Knill, and D.C. Torney

Parsing of genomic graffiti
    Clark Tibbetts, James Golden, III, and Deborah Torgersen

Improving DNA sequencing accuracy and throughput
    David O. Nelson

Assembly of shotgun sequencing data
    Xiaoqiu Huang


AN INTRODUCTION TO RECOMBINATION AND LINKAGE ANALYSIS

MARY SARA McPEEK*

* Department of Statistics, University of Chicago, Chicago, IL 60637.

Abstract. With a garden as his laboratory, Mendel (1866) was able to discern basic probabilistic laws of heredity. Although it first appeared as a baffling exception to one of Mendel's principles, the phenomenon of variable linkage between characters was soon recognized to be a powerful tool in the process of chromosome mapping and location of genes of interest. In this introduction, we first describe Mendel's work and the subsequent discovery of linkage. Next we describe the apparent cause of variable linkage, namely recombination, and we introduce linkage analysis.

Key words. genetic mapping, linkage, recombination, Mendel.

1. Mendel. Mendel's (1866) idea of enumerating the offspring types of a hybrid cross and his model for the result provided the basis for profound insight into the mechanisms of heredity. Carried out over a period of eight years, his artificial fertilization experiments involved the study of seven characters associated with the garden pea (various species of genus Pisum), with each character having two phenotypes, or observed states. The characters included the color of the petals, with purple and white phenotypes, the form of the ripe seeds, with round and wrinkled phenotypes, and the color of the seed albumen, i.e. endosperm, with yellow and green phenotypes.

Mendel first considered the characters separately. For each character, he grew two true-breeding parental lines, or strains, of pea, one for each phenotype. For instance, in one parental line, all of the plants had purple petals, and furthermore, over a period of several years, the offspring from all self-fertilizations within that line also had purple petals. Similarly, he grew a true-breeding parental line of white-flowered peas. When he crossed one line with the other by artificial fertilization, all the resulting offspring, called the first filial or F1 generation, had purple petals. Therefore, the purple petal phenotype was called dominant and the white petal phenotype recessive. After self-fertilization within the F1 generation, among the offspring, known as the second filial or F2 generation, 705 plants had purple and 224 plants had white petals out of a total of 929 F2 plants. This approximate 3:1 ratio (p-value .88) of the dominant phenotype to the recessive held for the other six characters as well. Mendel found that when F2 plants with the recessive phenotype were self-fertilized, the resulting offspring were all of the recessive type. However, when the F2 plants with the dominant phenotype were self-fertilized, 1/3 of them bred true, while the other 2/3 produced offspring of both phenotypes, in a dominant to recessive ratio of approximately 3:1. For instance, among 100 F2 plants with purple petals, 36 bred true, while 64 had both purple and white-flowered offspring (the numbers of these were not reported). Mendel concluded that among the plants with the dominant phenotype, there were actually two types, one type which bred true and another hybrid type which bred in a 3:1 ratio of dominant to recessive.

Mendel's explanation for these observations is that each plant has two units of heredity, now known as genes, for a given character, and each of these may be one of two (or more) types now known as alleles. Furthermore, in reproduction, each parent plant forms a reproductive seed or gamete containing, for each character, one of its two alleles, each with equal chance, which is passed on to a given offspring. For instance, in the case of petal color, the alleles may be represented by P for purple and p for white. (In this nomenclature, the dominant allele determines the letter of the alphabet to be used, and the dominant allele is uppercase while the recessive allele is lowercase.) Each plant would have one of the following three genotypes: pp, pP or PP, where types pp and PP are known as homozygous and type pP is known as heterozygous. Plants with genotype pp would have white petals, while those with genotype pP or PP would have purple petals. The two parental lines would be of genotypes pp and PP, respectively, and would pass on gametes of type p and P, respectively. The F1 generation, each having one pp parent and one PP parent, would then all be of genotype pP. A given F1 plant would pass on a gamete of type p or of type P to a given offspring, each with chance 1/2, independent from offspring to offspring. Then assuming that maternal and paternal gametes are passed on independently, each plant in the F2 generation would have chance 1/4 to be of genotype pp, 1/2 to be of genotype pP, and 1/4 to be of genotype PP, independently from plant to plant.
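As a rough illustration of the single-character model just described, the following Python sketch enumerates the four equally likely pairs of gametes produced when a heterozygous F1 plant is self-fertilized. The function names and the string encoding of genotypes are assumptions made for this example and are not notation from the text; in particular, the heterozygote written pP in the text appears here under the sorted key "Pp".

```python
from collections import Counter
from fractions import Fraction
from itertools import product


def offspring_genotype_distribution(mother: str, father: str) -> dict:
    """Genotype probabilities for one character under Mendel's model.

    Each parent is a two-letter genotype such as "pP"; each of its two
    alleles is passed on with chance 1/2, independently of the other parent.
    """
    counts = Counter()
    for maternal, paternal in product(mother, father):
        # Sort the two letters so that "pP" and "Pp" share one key ("Pp").
        genotype = "".join(sorted(maternal + paternal))
        counts[genotype] += 1
    total = sum(counts.values())
    return {g: Fraction(n, total) for g, n in counts.items()}


def is_purple(genotype: str) -> bool:
    """P (purple) is dominant, so a single copy of P gives purple petals."""
    return "P" in genotype


# Self-fertilization of an F1 plant (genotype pP) gives the F2 generation.
f2 = offspring_genotype_distribution("pP", "pP")
purple = sum(prob for g, prob in f2.items() if is_purple(g))
true_breeding_among_purple = f2["PP"] / purple

print(f2)                          # genotype probabilities in the F2 generation
print(purple)                      # chance of the dominant (purple) phenotype
print(true_breeding_among_purple)  # share of purple plants that breed true
```

Exact fractions are used instead of floating-point numbers so that the 1/4, 1/2, 1/4 split and the 3:1 phenotype ratio can be read off directly.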
In a large sample of plants, this multinomial model would result in an approximate 3:1 ratio of purple to white plants with all of the white plants and approximately 1/3 of the purple plants breeding true and the other approximately 2/3 of the purple plants breeding as in the F1 generation. Mendel's (1866) observations are consistent with this multinomial hypothesis. Mendel's model for the inheritance of a single character, in which the particles of inheritance from different gametes come together in an organism and then are passed on unchanged in future gametes, has become known as Mendel's First Law.

Mendel (1866) also considered the characters two at a time. For instance, he considered the form of the ripe seeds, with round (R) and wrinkled (r) alleles, and the color of the seed albumen, with yellow (Y) and green (y) alleles. Mendel crossed a true-breeding parental line in which the form of the ripe seeds was round and the color of the seed albumen was green (genotype RRyy) with another true-breeding parental line in which the form of the ripe seeds was wrinkled and the color of the seed albumen was yellow (genotype rrYY). When these characters were considered singly, round seeds were dominant to wrinkled and yellow albumen was dominant to green. All of the F1 offspring had the yellow and round phenotypes, with genotype RrYy. In the F2 generation, according to the results of the previous experiments, 1/4 of the plants would have the green phenotype and the other 3/4 the yellow phenotype, and 1/4 would have the wrinkled phenotype and the other 3/4 the round phenotype. Thus, if these characters were assumed to segregate independently, we would expect to see 1/16 green and wrinkled, 3/16 yellow and wrinkled, 3/16 green and round, and 9/16 yellow and round, i.e. these phenotypes would occur in a ratio of 1:3:3:9. The experimental numbers corresponding to these categories were 32, 101, 108, and 315, respectively, which is consistent with the 1:3:3:9 ratio (p-value .93). Mendel further experimented with these F2 plants to verify that each possible combination of gametes from the F1 generation was, in fact, equally likely (see Table 1.1). From these and other similar experiments in which characters were considered two or three at a time, Mendel concluded that the characters did segregate independently. The hypothesis of independent segregation has become known as Mendel's Second Law.

TABLE 1.1
The sixteen equally-likely genotypes among the F2 generation (top margin represents gamete contributed by father, left margin represents gamete contributed by mother).

        RY      Ry      rY      ry
RY      RRYY    RRYy    RrYY    RrYy
Ry      RRYy    RRyy    RrYy    Rryy
rY      RrYY    RrYy    rrYY    rrYy
ry      RrYy    Rryy    rrYy    rryy

The above example provides an opportunity to introduce the concept of recombination. When two characters are considered, a gamete is said to be parental, or nonrecombinant, if the genes it contains for the two characters were both inherited from the same parent. It is said to be recombinant if the genes it contains for the two characters were inherited from different parents. For instance, in the previous example, an F1 individual may pass on to an offspring one of the four gametes, RY, Ry, rY, or ry. Ry and rY are the parental gametes, because they are each directly descended from parental lines. RY and ry are recombinant gametes because they represent a mixing of genetic material which had been inherited separately. Mendel's Second Law specifies that a given gamete has chance 1/2 to be a recombinant.
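The consistency of the counts 32, 101, 108 and 315 with the 1:3:3:9 ratio can be checked with a short calculation. The sketch below assumes that the quoted p-value comes from a Pearson chi-square goodness-of-fit test with 3 degrees of freedom; the text does not say which test was used, so this is an assumption, and the function names are hypothetical. To keep the example free of external libraries, the tail probability uses a closed form that is valid only for 3 degrees of freedom.

```python
import math

# F2 counts reported in the text, in the order
# green/wrinkled, yellow/wrinkled, green/round, yellow/round.
observed = [32, 101, 108, 315]
ratio = [1, 3, 3, 9]


def chi_square_statistic(observed, ratio):
    """Pearson chi-square statistic for counts against a hypothesized ratio."""
    n = sum(observed)
    total = sum(ratio)
    expected = [n * r / total for r in ratio]
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


def chi_square_sf_3df(x):
    """Upper-tail probability of a chi-square variable with 3 degrees of freedom.

    Uses the closed form erfc(sqrt(x/2)) + sqrt(2x/pi) * exp(-x/2),
    which holds for 3 degrees of freedom only.
    """
    return math.erfc(math.sqrt(x / 2)) + math.sqrt(2 * x / math.pi) * math.exp(-x / 2)


stat = chi_square_statistic(observed, ratio)
p_value = chi_square_sf_3df(stat)
print(round(stat, 3), round(p_value, 2))
```

Under these assumptions the statistic is small and the tail probability is close to the p-value of .93 quoted above, which is what a good fit to the 1:3:3:9 ratio looks like.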
Fisher (1936) provides an interesting statistical footnote to Mendel's work. His analysis of Mendel's data shows that the observed numbers of plants in different classes actually fit too well to the expected numbers, given that the plant genotypes are supposed to follow a multinomial model (overall p-value .99993). That Mendel's data fit the theoretical ratios too well suggests some selection or adjustment of the data by Mendel. Of course, this in no way detracts from the brilliance and importance of Mendel's discovery.

2. Linkage and recombination. Mendel's work appeared in 1866, but languished in obscurity until it was rediscovered by Correns (1900), Tschermak (1900) and de Vries (1900). These three had independently conducted experiments similar to Mendel's, verifying his results. This began a flurry of research activity. Correns (1900) drew attention to the phenomenon of complete gametic coupling or complete linkage, in which alleles of two or more different characters appeared to be always inherited together rather than independently, i.e. no recombination was observed between them. Although this seems to violate Mendel's Second Law, an obvious extension of his theory would be to assume that the genes for these characters are physically attached.

Sutton (1903) formulated the chromosome theory of heredity, a major development. He pointed out the similarities between experimental observations on chromosomes and the properties which must be obeyed by the hereditary material under Mendel's Laws. In various organisms, chromosomes appeared to occur in homologous pairs, each pair sharing very similar physical characteristics, with one member of each pair inherited from the mother and the other from the father. Furthermore, during meiosis, i.e. the creation of gametes, the two chromosomes within each homologous pair line up next to each other, with apparently random orientation, and then are pulled apart into separate cells in the first meiotic division, so that each cell receives one chromosome at random from each homologous pair.
In fact, the chromosomes each duplicate when they are lined up before the first meiotic division, so after that division, each cell actually contains two copies of each of the selected chromosomes. During the second meiotic division, these cells divide again, forming gametes, with each resulting gamete getting one copy of each chromosome from the cell. Still, the net result is that each gamete inherits from its parent one chromosome at random from each homologous pair. The chromosome theory of heredity provided a physical mechanism for Mendel's Laws if it were assumed that the independent Mendelian characters lay on different chromosomes, and that those which were completely linked lay on the same chromosome.

An interesting complication to this simple story was first reported by Bateson, Saunders and Punnett (1905; 1906). In experiments on the sweet pea (Lathyrus odoratus), they studied two characters: flower color, with purple (dominant) and red (recessive) phenotypes, and form of pollen, with long (dominant) and round (recessive) phenotypes. They found that the two characters did not segregate independently, nor were they completely linked (see Table 2.1). When crosses were performed between a true-breeding parental line with purple flowers and long pollen (genotype PPLL) and one with red flowers and round pollen (genotype ppll), in the F2 generation, there were long and round pollen types and purple and red flowers, both in ratios of 3 to 1 of dominant to recessive types, following Mendel's First Law. However, among the purple flowered plants, there was a preponderance of long-type pollen over round in a ratio of 12 to 1, whereas among the red flowered plants, the round-type pollen was favored, with a ratio of long to round type pollen of 1 to 3. The authors were baffled as to the explanation for this phenomenon which is now known as linkage or partial coupling, of which complete linkage or complete coupling is a special case.

TABLE 2.1
The counts of observed and expected genotypes in Bateson, Saunders and Punnett's (1906) data. In each of the three subtables, the top margin represents form of pollen, and the left margin represents flower color.

        expected,             observed           expected,
        no linkage            data               complete linkage
        L          l          L        l         L        l
P       1199.25    399.75     1528     106       1599     0
p       399.75     133.25     117      381       0        533

It was Thomas Hunt Morgan who was able to provide an explanation for Bateson, Saunders and Punnett's observations of linkage and similar observations of his own on Drosophila melanogaster. Morgan (1911), building on a suggestion of de Vries (1903), postulated that exchanges of material, called crossovers, occurred between homologous chromosomes when they were paired during meiosis (see Figure 2.1). In the example of Bateson, Saunders, and Punnett (1905; 1906), if a parental line with purple flowers and long pollen were crossed with another having red flowers and round pollen, then the members of the F1 generation would each have, among their pairs of homologous chromosomes, a pair in which one of the chromosomes had genes for purple flowers and long pollen (PL) and the other had genes for red flowers and round pollen (pl). During meiosis, when these homologous chromosomes paired, if no crossovers occurred between the chromosomes in the interval between the genes for flower color and pollen form, then the resulting gamete would be of parental type, i.e. PL or pl. If crossing-over occurred between the chromosomes in the interval between the genes, the resulting gamete could instead be recombinant, Pl or pL (see Figure 2.1).
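The expected counts in Table 2.1 follow from scaling the hypothesized phenotype ratios to the total number of F2 plants observed. The sketch below reproduces that arithmetic and, as a rough measure of how far the data lie from independence, also computes a Pearson chi-square statistic against the no-linkage expectation. The 9:3:3:1 and 3:0:0:1 ratios are the ones implied by the table; the chi-square comparison and all names in the code are additions for illustration, not part of the original analysis.

```python
# Observed F2 counts from Table 2.1, keyed by (flower color, pollen form):
# "P" purple, "p" red, "L" long, "l" round.
observed = {("P", "L"): 1528, ("P", "l"): 106, ("p", "L"): 117, ("p", "l"): 381}


def expected_counts(n, ratio):
    """Scale a phenotype ratio so that the expected counts sum to n."""
    total = sum(ratio.values())
    return {cell: n * r / total for cell, r in ratio.items()}


n = sum(observed.values())

# No linkage: the two characters segregate independently (9:3:3:1).
no_linkage = expected_counts(
    n, {("P", "L"): 9, ("P", "l"): 3, ("p", "L"): 3, ("p", "l"): 1}
)

# Complete linkage: only the parental combinations appear (3:0:0:1).
complete_linkage = expected_counts(
    n, {("P", "L"): 3, ("P", "l"): 0, ("p", "L"): 0, ("p", "l"): 1}
)

# Pearson chi-square statistic against independence; a very large value
# signals that the no-linkage hypothesis fits the observed data poorly.
chi_square = sum(
    (observed[cell] - no_linkage[cell]) ** 2 / no_linkage[cell] for cell in observed
)

print(n)
print(no_linkage)
print(complete_linkage)
print(round(chi_square, 1))
```

The observed counts sit between the two sets of expected counts, which is the partial coupling that puzzled Bateson, Saunders and Punnett.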
Without the crossover process, genes on the same chromosome would be completely linked with no recombination allowed, but they typically exhibit an amount of recombination somewhere

6

MARY SARA McPEEK

FIG. 2.1. (a) During meiosis, each chromosome duplicates to form a pair of sister chromatids that are attached to one another at the centromere. The sister chromatids from one chromosome are positioned near those from the homologous chromosome, and those four chromatid strands become aligned so that homologous regions are near to one another. (b) At this stage, crossovers may occur, with each crossover involving a nonsister pair of chromatids. (c) At the first meiotic division, the chromatids are separated again into two pairs that are each joined by a centromere. (d) The resulting chromatids will be mixtures of the original two chromosome types due to crossovers. (e) In the second meiotic division, each product of meiosis receives one of the four chromatids. (f) depicts the same stage of meiosis represented by (b), but here only a portion of the length of the four chromatids is shown. Suppose that the interval depicted is flanked by two genetic loci. Consider the chromatid whose lower end is leftmost. That chromatid was involved in one crossover in the interval, thus its lower portion is dark and its upper portion is light, showing that it is a recombinant for the flanking loci. On the other hand, consider the chromatid whose lower edge is second from the left. That chromatid was involved in two crossovers in the interval, thus its lowermost and uppermost portions are both dark, showing that it is non-recombinant for loci at the ends of the depicted interval. In general, a resulting chromatid will be recombinant for an interval if it was involved in an odd number of crossovers in that interval.

AN INTRODUCTION TO RECOMBINATION AND LINKAGE ANALYSIS


between perfect linkage (0% recombination) and independence (50% recombination). That the chance of recombination between genes on the same chromosome should be between 0 and 1/2 is a mathematical consequence of a rather general assumption about the crossover process, no chromatid interference, described later. Although we now know that crossing-over takes place among four chromosome strands, rather than just two, the essence of Morgan's hypothesis is correct. In diploid eukaryotes, during the pachytene phase of meiosis, the two chromosomes in each homologous pair have lined up next to each other in a very precise way, so that homologous regions are adjacent. Both chromosomes in each pair duplicate, and the four resulting chromosome strands, called chromatids, are lined up together forming a very tight bundle. The two copies of one chromosome are called sister chromatids. Crossing-over occurs among the four chromatids during this phase, with each crossover involving a non-sister pair of chromatids. After crossing-over has occurred, the four resulting chromatids are mixtures of the original parental types. Following the two meiotic divisions, each gamete receives one chromatid. For genes on the same chromosome, a recombination occurs whenever the chromatid which is passed on to the gamete and which contains the two genes was involved in an odd number of crossovers between the genes (see Figure 2.1).

3. Linkage Analysis. A consequence of the crossover process, Morgan (1911) suggested, would be that characters whose genes lay closer together on a chromosome would be less likely to recombine because there would be a smaller chance of crossovers occurring between them. This is the key to linkage analysis: the smaller the amount of recombination observed between genes, i.e. the more tightly linked they are, the closer we could infer that they lie on a chromosome.
This provides a way of locating genes relative to one another by observing the pattern of inheritance of the traits which they cause. It is remarkable that a comparison of various traits among family members may yield information on the microscopic structure of chromosomes. Despite many important advances in molecular biology since Morgan's suggestion in 1911, linkage analysis is still a very powerful tool for localizing a gene of interest to a chromosome region, particularly because it may be used in cases where one has no idea where the gene is or how it acts on a biochemical level. Modern linkage analysis uses not only genes that code for proteins that produce observable traits, but also neutral markers. These are regions of DNA that are polymorphic, that is, they tend to differ from individual to individual, but unlike genes, the differences between alleles of neutral markers may have no known effect on the individual, although they can be detected by biologists. While these markers may not be of interest themselves, they can be mapped relative to one another on chromosomes and used as signposts against which to map genes of interest. Genes and


markers are both referred to as genetic loci. As an undergraduate student of Thomas Hunt Morgan, Sturtevant (1913) applied the principle of linkage to make the first genetic map. This consisted of a linear ordering of six genes on the X-chromosome of Drosophila, along with genetic distances between them, where he defined the genetic distance between two loci to be the expected number of crossovers per meiosis between the two loci on a single chromatid strand. He called this unit of distance one Morgan, with one one-hundredth of a Morgan, called a centiMorgan (cM), being the unit actually used in practice. Sturtevant (1913) remarked that genetic distance need not have any particular correspondence with physical distance, since as we now know, the crossover process varies in intensity along a chromosome. The crossover process generally cannot be observed directly, but only through recombination between the loci. For nearby loci, Sturtevant (1913) took the genetic distance to be approximately equal to the recombination fraction, i.e. proportion of recombinants, between them. Once he had a set of pairwise distances between the loci, he could order them. Of course, it is possible to have a set of pairwise distances which are compatible with no ordering, but in practice, with the large amount of recombination data typically obtained in Drosophila experiments, this does not occur. Sturtevant realized that the recombination fraction would underestimate the genetic distance between more distant loci, because of the occurrence of multiple crossovers. There are several obvious ways in which Sturtevant's (1913) method could be improved. First, the recombination fraction is not the best estimate of genetic distance, even for relatively close loci. Second, it is desirable to have some idea of the variability in the maps. 
Also, depending on what is known or assumed about the crossover process, it may be more informative to consider recombination events among several loci simultaneously. In order to address these issues properly it is necessary to have a statistical model relating observed recombinations to the unobserved underlying crossovers. We proceed to outline some of the issues involved.

Haldane (1919) addressed the relationship between recombination and crossing-over through the notion of a map function, that is, a function M connecting a recombination probability r across an interval with the interval's genetic length d by the relation r = M(d). Haldane's best-known contribution is the map function he introduced, which is now known by his name, M(d) = [1 - exp(-2d)]/2. The Haldane map function arises under some very simple assumptions about the crossover process. Recall that crossing-over occurs among four chromatid strands, and that each gamete receives only one of the four resulting strands. We refer to the occurrence of crossovers along the bundle of four chromatid strands as the chiasma process. Each crossover involves exactly two of the four chromatids, so any given chromatid will be involved in some subset of the crossovers of the full chiasma process. The occurrence of crossovers along a given chromatid will be referred to as the crossover process. To obtain the Haldane map function, assume first that the chiasma process is a (possibly inhomogeneous) Poisson process. Violation of the assumption is known as chiasma interference or crossover position interference. Second, assume that each pair of non-sister chromatids is equally likely to be involved in a crossover, independent of which were involved in other crossovers. This assumption is equivalent to specifying that the crossover process is obtained from the chiasma process by independently thinning (deleting) each point with chance 1/2. Violation of this assumption is known as chromatid interference, and the assumption itself is referred to as no chromatid interference (NCI). This pair of assumptions specifies a model for the occurrence of crossovers which is known as the No-Interference (NI) model. Deviation from this model is known as interference, which encompasses both chiasma interference and chromatid interference. Since genetic distance is the expected number of crossovers d in an interval on a single chromatid strand, the assumption of NCI implies that the expected number of crossovers of the full chiasma process in the interval is 2d. Under the assumption of no chiasma interference, the chiasma process is then a Poisson process with intensity 2 per unit of genetic distance. To obtain the Haldane map function, we apply Mather's Formula (1935), which says that under the assumption of NCI, r = [1 - P(N = 0)]/2, where r is the recombination probability across an interval, and N is the random variable corresponding to the number of crossovers in the chiasma process in that interval. Under the NI model, P(N = 0) = exp(-2d), giving the Haldane map function.

Following is a well-known derivation of Mather's Formula (see e.g. Karlin and Liberman 1983): If we assume NCI, then each crossover has chance 1/2 to involve a given chromatid, independent of which chromatids are involved in other crossovers.
In that case, if there are N crossovers in the chiasma process on an interval, with N > 0, then the chance of having i crossovers in the crossover process on a given chromatid is

$$\binom{N}{i} \times \frac{1}{2^{i}} \times \frac{1}{2^{N-i}},$$

for 0 ≤ i ≤ N. On a given chromatid, a recombination will occur in the interval if the chromatid is involved in an odd number of crossovers in the interval. Thus, the chance of a recombination given that N > 0 crossovers have occurred in the chiasma process is

$$\sum_{i \text{ odd}} \binom{N}{i} \times \frac{1}{2^{N}} = \frac{1}{2},$$

and the chance is 0 if N = 0, so the chance of a recombination is Pr(N > 0)/2. One consequence of Mather's Formula is that under NCI, the chance of recombination across an interval increases, or, at least, does not decrease,


as the interval is widened. Another is that the chance of recombination across any interval has upper bound 1/2 under NCI. These two observations appear to be compatible with virtually all published experimental results.

Haldane's map function provides a better estimate of genetic distance than the recombination fraction used by Sturtevant (1913). Rather than estimating d by the observed value of r, one could plug the observed value of r into the formula d = -(1/2) ln(1 - 2r). One could perform separate experiments for the different pairs of loci to estimate the genetic distances and hence obtain a map. Standard deviations could easily be attached to the estimates, since the number of recombinants in each experiment is binomial. One could also look at a number of loci simultaneously in a single experiment. Assuming that the experiment was set up so that all recombination among the loci could be observed, the data would be in the form of $2^m$ counts, where m is the number of loci considered. This is because for each locus, it would be recorded whether the given chromosome contained the maternal or paternal allele at that locus. If we number the loci arbitrarily and assume that, for instance, the probability of maternal alleles at loci 1, 3, 4 and 5 and paternal alleles at loci 2 and 6 is equal to the probability of paternal alleles at loci 1, 3, 4 and 5 and maternal alleles at loci 2 and 6, then we could combine all such dual events and summarize the data in $2^{m-1}$ counts. We index these counts by i, where $i = (i_1, i_2, \ldots, i_{m-1}) \in \{0, 1\}^{m-1}$ and $i_j = 0$ implies that loci j and j+1 are from the same parent, i.e. there is no recombination between them, while $i_j = 1$ implies that loci j and j+1 are from different parents, i.e. they have recombined. Fisher (1922) proposed using the method of maximum likelihood for linkage analysis, and this is the method largely used today.
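The plug-in estimate of genetic distance described above can be sketched in a few lines of Python. This is a minimal illustration, not part of the original text: the function names are hypothetical, the standard error uses the delta method (one common way to attach a standard deviation to a transformed binomial proportion), and the small simulation simply checks the Haldane map function against the NI model (Poisson chiasma process with mean 2d, each chiasma retained on a given chromatid with chance 1/2).

```python
import math
import random


def haldane(d):
    """Haldane map function: recombination probability for genetic length d (Morgans)."""
    return 0.5 * (1.0 - math.exp(-2.0 * d))


def haldane_inverse(r):
    """Genetic distance d = -(1/2) ln(1 - 2r) for a recombination fraction r in [0, 1/2)."""
    if not 0.0 <= r < 0.5:
        raise ValueError("r must lie in [0, 0.5)")
    return -0.5 * math.log(1.0 - 2.0 * r)


def distance_estimate(recombinants, n):
    """Plug-in estimate of d with a delta-method standard error (an assumption here)."""
    r_hat = recombinants / n
    d_hat = haldane_inverse(r_hat)
    se_r = math.sqrt(r_hat * (1.0 - r_hat) / n)   # binomial standard error of r_hat
    se_d = se_r / (1.0 - 2.0 * r_hat)             # derivative of d with respect to r
    return d_hat, se_d


def poisson(mean, rng):
    """Draw a Poisson variate by Knuth's multiplication method (fine for small means)."""
    limit = math.exp(-mean)
    k, p = 0, 1.0
    while True:
        k += 1
        p *= rng.random()
        if p <= limit:
            return k - 1


def simulate_recombination_fraction(d, n_meioses, rng):
    """Simulate the NI model and return the proportion of recombinant chromatids."""
    recombinants = 0
    for _ in range(n_meioses):
        chiasmata = poisson(2.0 * d, rng)          # chiasma process on the bundle
        crossovers = sum(1 for _ in range(chiasmata) if rng.random() < 0.5)
        recombinants += crossovers % 2             # odd number means recombinant
    return recombinants / n_meioses
```

For instance, `distance_estimate(20, 100)` converts an observed recombination fraction of 0.2 into a distance of roughly 0.255 Morgans, somewhat larger than the raw fraction, which is the correction for multiple crossovers that Sturtevant anticipated.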
We now describe the application, to the type of data described above, of the method of maximum likelihood using Haldane's NI model. This is the simplest form of what is known as multilocus linkage analysis. In a given meiosis, the NI probability of the event indexed by i is simply

$$p_i = \prod_{j=1}^{m-1} \theta_j^{i_j} (1 - \theta_j)^{1 - i_j} = \frac{1}{2^{m-1}} \prod_{j=1}^{m-1} \left(1 - e^{-2d_j}\right)^{i_j} \left(1 + e^{-2d_j}\right)^{1 - i_j},$$

where $\theta_j$ is the probability of recombination between loci j and j+1 and $d_j$ is the genetic distance between them. The formula reflects the fact that under NI, recombination in disjoint intervals is independent. Note that the formulation depends crucially on the presumptive order of the markers. The same recombination event will have a different index i if the order of the markers is changed, and a different set of recombination probabilities or genetic distances will be involved in the above formula. For a given order,


one can write down the likelihood of the data as

$$L = \prod_{i} p_i^{n_i},$$

where $n_i$ is the number of observations of type i. The likelihood is maximized by

$$\hat{\theta}_j = \sum_{i : i_j = 1} n_i \Big/ \sum_{i'} n_{i'},$$

for all j, that is, just the observed proportion of recombinants between loci j and j+1. Since the assumption of NCI implies $\theta_j \le 1/2$, one usually takes the constrained maximum likelihood estimate, $\hat{\theta}_j = \min\left(\sum_{i : i_j = 1} n_i \big/ \sum_{i'} n_{i'},\ 1/2\right)$. All other recombination fractions between non-adjacent pairs of loci can be estimated by using the fact that under NI, if loci A, B, and C are in order ABC, then the chance of recombination between A and C, $\theta_{AC}$, is related to the chance of recombination between A and B, $\theta_{AB}$, and that between B and C, $\theta_{BC}$, by the formula $\theta_{AC} = \theta_{AB}(1 - \theta_{BC}) + (1 - \theta_{AB})\theta_{BC}$. The variance of the estimate $\hat{\theta}_j$ is $\theta_j(1 - \theta_j)/n$, where $n = \sum_i n_i$ is the total number of observations, and $\hat{\theta}_j$ and $\hat{\theta}_k$ are independent for $j \ne k$. Thus, under the assumption of NI, the multilocus linkage analysis reduces to a pairwise analysis of recombination between adjacent markers when the data are in the form given above.

To estimate order, one may consider several candidate orders and maximize the appropriate likelihood under each of them. The maximum likelihood estimate of order is that order whose maximized likelihood is highest. When one wants to map a new locus onto a previously existing map, one can follow this procedure, considering as candidate orders those orders in which the previously mapped loci are in their mapped positions and the new locus is moved to different positions between them.

Outside of the world of experimental organisms, the reality of multilocus linkage analysis is quite different from what has been portrayed so far. Humans cannot be experimentally crossed, and therefore human linkage data do not fit neatly into $2^{m-1}$ observed counts. In some individuals, maternal and paternal alleles may be identical at some loci, so that recombination involving those loci cannot be observed in their offspring. Ancestors may not be available for the analysis, so it may not be possible to definitively determine whether particular alleles are maternally or paternally inherited.
When some information is missing, the information that is available may be in the form of complicated pedigrees representing interrelationships among individuals. In these cases, multilocus linkage analysis under NI does not reduce to a pairwise analysis. Maximization of the NI likelihood is an extremely complex undertaking and is the subject of considerable current research. For an introduction to linkage analysis in humans, see Ott (1991).
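For the fully informative experimental-cross setting described earlier (not the pedigree setting), the counting estimates and the comparison of candidate orders can be sketched as follows. This is a hedged illustration only: the data set `EXAMPLE_GAMETES` is invented for the example, the function names are hypothetical, and each gamete is recorded as a tuple of parental origins (0 or 1) per locus, with dual events already combined so that the first locus is always 0.

```python
import math
from itertools import permutations

# Hypothetical counts for three loci (A, B, C), keyed by parental origin at each locus.
EXAMPLE_GAMETES = {
    (0, 0, 0): 70,  # no recombination
    (0, 1, 1): 12,  # recombination between A and B only
    (0, 0, 1): 16,  # recombination between B and C only
    (0, 1, 0): 2,   # recombination in both intervals
}


def pattern_counts(gametes, order):
    """Re-index gametes by recombination pattern i under a presumed locus order."""
    counts = {}
    for origins, n in gametes.items():
        pattern = tuple(int(origins[a] != origins[b]) for a, b in zip(order, order[1:]))
        counts[pattern] = counts.get(pattern, 0) + n
    return counts


def ni_mle(counts):
    """Constrained estimates: proportion of recombinants per interval, capped at 1/2."""
    n = sum(counts.values())
    k = len(next(iter(counts)))
    return [min(sum(c for p, c in counts.items() if p[j] == 1) / n, 0.5) for j in range(k)]


def ni_log_likelihood(counts, theta):
    """Log of prod_i p_i^{n_i}, with p_i the NI product over adjacent intervals."""
    total = 0.0
    for pattern, c in counts.items():
        if c == 0:
            continue
        for i_j, t in zip(pattern, theta):
            total += c * math.log(t if i_j == 1 else 1.0 - t)
    return total


def compose(theta_ab, theta_bc):
    """Recombination fraction between A and C for order ABC under NI."""
    return theta_ab * (1.0 - theta_bc) + (1.0 - theta_ab) * theta_bc


def best_order(gametes, m):
    """Maximize the NI likelihood under each candidate order; keep the best one."""
    best = None
    for order in permutations(range(m)):
        if order[0] > order[-1]:
            continue  # an order and its reverse give the same likelihood
        counts = pattern_counts(gametes, order)
        theta = ni_mle(counts)
        ll = ni_log_likelihood(counts, theta)
        if best is None or ll > best[2]:
            best = (order, theta, ll)
    return best
```

With the invented counts above, the order A, B, C attains the highest maximized likelihood, with estimated adjacent recombination fractions 0.14 and 0.18. Enumerating every permutation is only practical for a handful of loci; it is used here purely to keep the sketch short.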


Most linkage analyses, whether in humans or in experimental organisms, are today still performed using the NI model. Yet the phenomenon of interference is well documented in a wide range of organisms. In their experiments on Drosophila, Sturtevant (1915) and Muller (1916) noticed that crossovers did not seem to occur independently, but rather the presence of one seemed to inhibit the formation of another nearby. From recombination data, it may be impossible to distinguish whether observed interference is due to chromatid interference, chiasma interference, or both, because of a lack of identifiability. If the chiasma and crossover processes themselves could be observed, this would eliminate the difficulty. In certain fungi such as Saccharomyces cerevisiae, Neurospora crassa, and Aspergillus nidulans, the problem is made less acute for two reasons. First of all, these genomes are very well mapped, with many closely spaced loci, and for certain very near loci, the observation of a recombination or not between them is nearly equivalent to the observation of a crossover or not between them. Secondly, in these organisms, all four of the products of meiosis can be recovered together and tested for recombination. This type of data is known as tetrad data, as opposed to single spore data in which only one of the products of meiosis is recovered. As a result of these features, some tetrad data give approximate discretized versions of the chiasma and crossover processes. From this sort of data, it is clear that chiasma or position interference is present, and that the occurrence of one crossover inhibits the formation of another nearby (Mortimer and Fogel 1974). The existence and nature of chromatid interference has proved more difficult to detect than position interference.
Statistical tests of chromatid interference based on generalizations of Mather's formula demonstrate some degree of chromatid interference, but the results are not consistent from experiment to experiment (Zhao, McPeek, Speed, 1995). Various crossover models that allow for interference of one or both types have been put forward and examined. These include Fisher, Lyon and Owen (1947), Owen (1949, 1950), Carter and Robertson (1952), Karlin and Liberman (1979), Risch and Lange (1979), Goldgar and Fain (1988), King and Mortimer (1990), Foss, Lande, Stahl, and Steinberg (1993), McPeek and Speed (1995), and Zhao, Speed, and McPeek (1995). The model used overwhelmingly today in linkage analysis is still the no interference model, due to its mathematical tractability. However, the chi-square model of Foss, Lande, Stahl, and Steinberg (1993), McPeek and Speed (1995), and Zhao, Speed, and McPeek (1995) may now be a viable contender.

4. Conclusion. Mendel showed that through careful quantitative observation of related individuals, the mechanism of heredity of traits could be studied. Linkage analysis, proposed by Morgan in 1911 and still used today, is equally startling in that it is based on the principle that careful quantitative observation of related individuals can actually illuminate the positions of genes on chromosomes. While the phenomenon of linkage


between traits allows one to infer that their genes are on the same chromosome, it is the phenomenon of recombination, that has the effect of varying the degree of linkage, which allows these traits to be mapped relative to one another on the chromosome. One of the most useful characteristics of linkage analysis is the fact that it can be used to map genes that are identified only through their phenotypes, and about which one may have no other information.

5. Recommended reading. Whitehouse (1973) gives a thorough historical introduction to genetics. Bailey (1961) is a detailed mathematical treatment of genetic recombination and linkage analysis, while Ott (1991) is an introductory reference for genetic linkage analysis in humans.

Acknowledgements. I am greatly indebted to Terry Speed for much of the material in this manuscript. This work was supported in part by NSF Grant DMS 90-05833 and NIH Grant R01-HG01093-01.

REFERENCES

Bailey, N. T. J. (1961) Introduction to the Mathematical Theory of Genetic Linkage, Oxford University Press, London.
Bateson, W., Saunders, E. R., and Punnett, R. C. (1905) Experimental studies in the physiology of heredity, Rep. Evol. Comm. R. Soc., 2: 1-55, 80-99.
Bateson, W., Saunders, E. R., and Punnett, R. C. (1906) Experimental studies in the physiology of heredity, Rep. Evol. Comm. R. Soc., 3: 2-11.
Carter, T. C., and Robertson, A. (1952) A mathematical treatment of genetical recombination using a four-strand model, Proc. Roy. Soc. B, 139: 410-426.
Correns, C. (1900) G. Mendels Regel über das Verhalten der Nachkommenschaft der Rassenbastarde, Ber. dt. bot. Ges., 18: 158-168. (Reprinted in 1950 as "G. Mendel's law concerning the behavior of progeny of varietal hybrids" in Genetics, Princeton, 35: suppl. pp. 33-41).
de Vries, H. (1900) Das Spaltungsgesetz der Bastarde, Ber. dt. bot. Gesell., 18: 83-90. (Reprinted in 1901 as "The law of separation of characters in crosses", J. R. Hort. Soc., 25: 243-248).
de Vries, H. (1903) Befruchtung und Bastardierung, Leipzig. (Reprinted as "Fertilization and hybridization" in C. S. Gager (1910) Intracellular pangenesis including a paper on fertilization and hybridization, Open Court Publ. Co., Chicago, pp. 217-263).
Fisher, R. A. (1922) The systematic location of genes by means of crossover observations, American Naturalist, 56: 406-411.
Fisher, R. A. (1936) Has Mendel's work been rediscovered? Ann. Sci., 1: 115-137.
Fisher, R. A., Lyon, M. F., and Owen, A. R. G. (1947) The sex chromosome in the house mouse, Heredity, 1: 335-365.
Foss, E., Lande, R., Stahl, F. W., Steinberg, C. M. (1993) Chiasma interference as a function of genetic distance, Genetics, 133: 681-691.
Goldgar, D. E., Fain, P. R. (1988) Models of multilocus recombination: nonrandomness in chiasma number and crossover positions, Am. J. Hum. Genet., 43: 38-45.
Haldane, J. B. S. (1919) The combination of linkage values, and the calculation of distances between the loci of linked factors, J. Genetics, 8: 299-309.
Karlin, S. and Liberman, U. (1979) A natural class of multilocus recombination processes and related measures of crossover interference, Adv. Appl. Prob., 11: 479-501.
Karlin, S. and Liberman, U. (1983) Measuring interference in the chiasma renewal formation process, Adv. Appl. Prob., 15: 471-487.
King, J. S., Mortimer, R. K. (1990) A polymerization model of chiasma interference and corresponding computer simulation, Genetics, 126: 1127-1138.
Mather, K. (1935) Reduction and equational separation of the chromosomes in bivalents and multivalents, J. Genet., 30: 53-78.
McPeek, M. S., Speed, T. P. (1995) Modeling interference in genetic recombination, Genetics, 139: 1031-1044.
Mendel, G. (1866) Versuche über Pflanzenhybriden, Verh. naturf. Ver. Bruenn, 4: 3-44. (Reprinted as "Experiments in plant-hybridisation" in Bateson, W. (1909) Mendel's principles of heredity, Cambridge Univ. Press, Cambridge, pp. 317-361.)
Morgan, T. H. (1911) Random segregation versus coupling in Mendelian inheritance, Science, 34: 384.
Mortimer, R. K. and Fogel, S. (1974) Genetical interference and gene conversion, in R. F. Grell, ed., Mechanisms in Recombination, Plenum Publishing Corp., New York, pp. 263-275.
Muller, H. J. (1916) The mechanism of crossing-over, American Naturalist, 50: 193-221, 284-305, 350-366, 421-434.
Ott, Jurg (1991) Analysis of human genetic linkage, rev. ed., The Johns Hopkins University Press, Baltimore.
Owen, A. R. G. (1949) The theory of genetical recombination, I. Long-chromosome arms. Proc. R. Soc. B, 136: 67-94.
Owen, A. R. G. (1950) The theory of genetical recombination, Adv. Genet., 3: 117-157.
Risch, N. and Lange, K. (1979) An alternative model of recombination and interference, Ann. Hum. Genet. Lond., 43: 61-70.
Sturtevant, A. H. (1913) The linear arrangement of six sex-linked factors in Drosophila, as shown by their mode of association, J. Exp. Zool., 14: 43-59.
Sturtevant, A. H. (1915) The behavior of the chromosomes as studied through linkage, Zeit. f. ind. Abst. u. Vererb., 13: 234-287.
Sutton, W. S. (1903) The chromosomes in heredity, Biol. Bull. mar. biol. Lab., Woods Hole, 4: 231-248.
Tschermak, E. von (1900) Über künstliche Kreuzung bei Pisum sativum, Ber. dt. bot. Ges., 18: 232-239. (Reprinted in 1950 as "Concerning artificial crossing in Pisum sativum" in Genetics, Princeton, 26: 125-135).
Whitehouse, H. L. K. (1973) Towards an understanding of the mechanism of heredity, St. Martin's Press, New York.
Zhao, H., McPeek, M. S., Speed, T. P. (1995) A statistical analysis of chromatid interference, Genetics, 139: 1057-1065.
Zhao, H., Speed, T. P., McPeek, M. S. (1995) A statistical analysis of crossover interference using the chi-square model, Genetics, 139: 1045-1056.

MONTE CARLO METHODS IN GENETIC ANALYSIS

SHILI LIN*

Abstract. Many genetic analyses require computation of probabilities and likelihoods of pedigree data. With more and more genetic marker data deriving from new DNA technologies becoming available to researchers, exact computations are often formidable with standard statistical methods and computational algorithms. The desire to utilize as much available data as possible, coupled with complexities of realistic genetic models, push traditional approaches to their limits. These methods encounter severe methodological and computational challenges, even with the aid of advanced computing technology. Monte Carlo methods are therefore increasingly being explored as practical techniques for estimating these probabilities and likelihoods. This paper reviews the basic elements of the Markov chain Monte Carlo method and the method of sequential imputation, with an emphasis upon their applicability to genetic analysis. Three areas of applications are presented to demonstrate the versatility of Markov chain Monte Carlo for different types of genetic problems. A multilocus linkage analysis example is also presented to illustrate the sequential imputation method. Finally, important statistical issues of Markov chain Monte Carlo and sequential imputation, some of which are unique to genetic data, are discussed, and current solutions are outlined.

* Department of Statistics, University of California, Berkeley, CA 94720.

1. Introduction. Most human genetic analyses require computation of probabilities and likelihoods of genetic data from pedigrees. Statistical methods and computational algorithms have been developed to accomplish this task. The most efficient ones have been based on a recursive algorithm, the simplest case of which was developed by Elston and Stewart (1971). Successive algorithms for more complex cases were given in Lange and Elston (1975), Cannings, Thompson and Skolnick (1978), Lange and Boehnke (1983), and Lathrop et al. (1984). Unfortunately, these methods are sometimes incapable of handling the data that geneticists and genetic epidemiologists are facing today. The past decade has seen an explosive growth of molecular genetic technology which has led to a massive amount of DNA data becoming available to researchers; see Murray et al. (1994) for a comprehensive human linkage map. It is imperative that these data be utilized as much as possible to maximize the power of, for example, constructing genetic maps, mapping disease genes and finding plausible genetic models. Practical and theoretical bounds on computational feasibility of probabilities and likelihoods become a major limitation of genetic analysis and a great challenge in statistical genetics. A routine multipoint linkage analysis using LINKAGE, developed by Lathrop et al. (1984), may take a few weeks or even months to complete for certain problems, such as those encountered by Schellenberg et al. (1990, 1992) and Easton et al. (1993). This is impractical and expensive, and hence unacceptable for regular screening processes.

Advanced computing technology and good computer programming practice have been used to overcome some of the computational difficulties with multipoint analysis. Cottingham et al. (1993) and Schaffer et al. (1994) demonstrated that basic computer science techniques, such as "common sub-expression elimination" by factoring expressions to reduce arithmetic evaluations, can be used to improve the performance of algorithms. These techniques have proved to be quite effective in exploiting the basic biological features such as the "sparsity" of the joint genotype array and the "similarity" of genotypes. Furthermore, Miller et al. (1991), Goradia et al. (1992), and Dwarkadas et al. (1994) investigated the usage of parallel computers as another way to achieve speedup, which has become less and less expensive with the advance of computer technology. However, as pointed out by Cottingham et al. (1993), although the improvements are substantial, there will always be more difficult problems that geneticists want to solve and that will demand yet more computer power. Therefore, good computer programming practice should be combined with advances in statistical methods to achieve even greater improvements.

In the last few years, a completely different approach involving the estimation of probabilities and likelihoods via the Monte Carlo method has emerged. We include under this heading the sequential imputation approach of Kong et al. (1994) and Irwin et al. (1994), and the Markov chain Monte Carlo (MCMC) approaches of Lange and Matthysse (1989), Sheehan (1990), Lange and Sobel (1991), Thompson and Guo (1991), Guo and Thompson (1992), Thomas and Cortessis (1992), Sheehan and Thomas (1993), Lin et al. (1993), and Thompson (1994a,b). These methods have been successfully applied to various problems, some of which will be demonstrated later as examples.

2. Monte Carlo methods in genetic analysis. Although Monte Carlo simulation methods have been proposed for some time in human pedigree analysis, they have only recently emerged as a practical alternative to analytical statistical methods.
Traditionally, simulation methods have been used to study some unknown properties of analysis methods, or to compare the performances of alternative methods. The use of Monte Carlo simulation methods as tools to provide solutions to problems for which analytical solutions are impractical was not pursued until quite recently. Preliminary investigations have revealed that these methods are of particular relevance to genetic analysis problems for which complex traits, complex genealogical structures or large numbers of polymorphic loci are involved. Simulation methods in genetics can be traced back to the 1920s, when Wright and McPhee (1925) estimated inbreeding by making random choices in tracing ancestral paths for livestock. Ott (1974, 1979) advocated the use of simulation methods as a tool for human pedigree analysis, but this did not receive much attention at the time. More recently, a straightforward Monte Carlo method known as gene-dropping was proposed by MacCluer et al. (1986). First, genotypes for founders are generated according to


the relevant population probabilities. Next, gene flow down the pedigree is simulated according to the rules of inheritance postulated by Mendel (1865). Finally, outcomes which are inconsistent with the observed phenotypes are discarded. This results in a random sample from the genotypic configuration space. Approximations to any desired probabilities can thus be obtained by Monte Carlo methods. In small pedigrees, the method will successfully produce realizations of genotypes consistent with phenotypes. However, this method does not work well in pedigrees of even moderate size, for in such cases it is extremely unlikely to give samples which are compatible with observed phenotypes. Ploughman and Boehnke (1989) described a Monte Carlo method to estimate the power of a study to detect linkage for a complex genetic trait, given a hypothesized genetic model for the trait. They proposed to calculate conditional probabilities recursively and then sample from the posterior genotype distribution conditional on the observed phenotypes at the trait locus. These conditional probabilities are generated in the process of calculating the likelihood of a pedigree by using the procedure of Lange and Boehnke (1983), a generalization of Elston and Stewart (1971). Marker genotypes are then simulated conditional on the simulated trait genotypes (Boehnke, 1986). This method reduces exact computations on two loci jointly to exact computations on the trait locus only. However, it is necessary to store a large amount of intermediate data, especially when the method is extended to complex pedigrees with inbreeding loops. The limitations of this method are the same as those of other methods based on the Elston-Stewart algorithm. Ott (1989) also described a simulation method for randomly generating genotypes at one or more marker loci, given observed phenotypes at loci linked among themselves and with the marker.
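The three steps of gene-dropping described above (draw founder genotypes, simulate transmission, discard inconsistent outcomes) can be sketched on a toy nuclear family. Everything specific here is an assumption made for illustration: a diallelic locus with a fully penetrant recessive trait, founder genotypes drawn under Hardy-Weinberg proportions with a hypothetical trait-allele frequency `q`, a family of two parents and two children, and hypothetical function names.

```python
import random


def draw_founder(q, rng):
    """Founder genotype as a pair of alleles (1 = trait allele), Hardy-Weinberg assumed."""
    return (int(rng.random() < q), int(rng.random() < q))


def transmit(parent, rng):
    """Mendelian segregation: each allele of the parent is passed on with chance 1/2."""
    return parent[rng.randrange(2)]


def affected(genotype):
    """Fully penetrant recessive trait: affected only with two trait alleles."""
    return genotype == (1, 1)


def gene_drop(observed, q, n_draws, rng):
    """Return simulated families [father, mother, child1, child2] matching the phenotypes."""
    accepted = []
    for _ in range(n_draws):
        father = draw_founder(q, rng)
        mother = draw_founder(q, rng)
        children = [(transmit(father, rng), transmit(mother, rng)) for _ in range(2)]
        family = [father, mother] + children
        # Discard any outcome inconsistent with the observed phenotypes.
        if [affected(g) for g in family] == observed:
            accepted.append(family)
    return accepted
```

With unaffected parents, an affected first child and an unaffected second child (`observed = [False, False, True, False]`) and `q = 0.3`, only about 3% of draws are retained, and the fraction of retained families in which the second child is a carrier approximates the conditional probability 2/3. Even in this four-person family most simulations are wasted, which hints at why the rejection step becomes prohibitive as pedigrees grow.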
In the past decade, statisticians have realized that many problems previously thought intractable can be solved fairly straightforwardly by Markov chain Monte Carlo (MCMC) methods. The method was proposed long ago and has been widely used in statistical physics; see Metropolis et al. (1953) for the original work, and Rikvold and Gorman (1994) and references therein for a review of recent work. Since the work of Geman and Geman (1984), MCMC has received a great deal of attention in the statistical community, especially in Bayesian computation. The papers of Tanner and Wong (1987), Gelfand and Smith (1990), Smith and Roberts (1993) and Gilks et al. (1993) are a few examples of recent research in this area. Following its entry into statistics, MCMC was quickly adapted to genetic analysis. The basic idea is to obtain dependent samples (essentially realizations of Markov chains) of underlying genotypes consistent with the observed phenotypes. Probabilities and likelihoods can then be estimated from these dependent samples. Lange and Matthysse (1989) investigated the feasibility of one MCMC method, the Metropolis algorithm, to simulate genotypes for traits conditional upon observed data. Independent


of the work of Lange and Matthysse, Sheehan, in her 1990 PhD thesis, investigated the use of the Gibbs sampler of Geman and Geman (1984) to sample genotypes underlying simple discrete genetic traits observed on large pedigrees. She demonstrated that, for a trait at a single diallelic locus, the Gibbs sampler provided quite accurate estimates of the ancestral probabilities of interest in a complex pedigree of Greenland Eskimos. Guo and Thompson (1992) showed that the Gibbs sampler can also be applied to quantitative traits. Monte Carlo EM algorithms were developed, in conjunction with Monte Carlo likelihood ratio evaluation by Thompson and Guo (1991), to estimate parameters of complex genetic models. Lange and Sobel (1991) and Thomas and Cortessis (1992) developed MCMC methodologies relevant for two-point linkage analysis. The validity of these methods rests on the crucial assumption that any locus involved must be diallelic. This is undesirable, particularly in linkage analysis, because multiallelic markers are in general much more informative, and thus highly preferred. The research of Sheehan and Thomas (1993), Lin et al. (1993, 1994b) and Lin (1995) has addressed this issue so that MCMC methods can be applied to more realistic genetic data where other methods fail. The sequential imputation method of Kong et al. (1994) is another Monte Carlo method that has recently been implemented for multilocus linkage problems by Irwin et al. (1994) and Irwin (1995). It is essentially an importance sampling technique (see, e.g., Hammersley and Handscomb (1964)) in which missing data on genetic loci are imputed conditional on the observed data. Genetic loci are ordered and processed one at a time. Previously imputed values are treated as observed for later conditioning. By repeating the process many times, a collection of complete data sets is obtained, with associated weights to assure appropriate representation of the probability distribution.
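The locus-by-locus imputation and weighting just described can be sketched on a toy model. Everything specific here is an assumption made for illustration and is not the pedigree computation of Kong et al. (1994): each locus carries a binary hidden state, the hidden states form a Markov chain along the loci with switch probability `r`, each observation equals the hidden state except with error probability `err`, and `None` marks a missing observation. Under this Markov assumption, conditioning on all earlier data and imputed states reduces to conditioning on the previously imputed state. The exact likelihood from a forward recursion is included only so that the Monte Carlo estimate can be checked.

```python
import random

def emit(d, g, err):
    """P(d_l | g_l); a missing observation (None) carries no information."""
    if d is None:
        return 1.0
    return 1.0 - err if d == g else err

def trans(g_prev, g, r):
    """P(g_l | g_{l-1}): the hidden state switches with probability r."""
    return 1.0 - r if g == g_prev else r

def exact_likelihood(data, r, err):
    """Exact P(d) by a forward recursion (feasible because the toy is tiny)."""
    alpha = [0.5 * emit(data[0], g, err) for g in (0, 1)]
    for d in data[1:]:
        alpha = [emit(d, g, err) * sum(alpha[h] * trans(h, g, r) for h in (0, 1))
                 for g in (0, 1)]
    return sum(alpha)

def sequential_imputation(data, r, err, n, seed=0):
    """Average of n accumulated predictive weights, an estimate of P(d)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        # First locus: weight is P(d_1); impute g_1 from P(g_1 | d_1).
        joint = [0.5 * emit(data[0], g, err) for g in (0, 1)]
        w = sum(joint)
        g_prev = 0 if rng.random() * w < joint[0] else 1
        for d in data[1:]:
            # Later loci: treat the previously imputed state as observed.
            joint = [trans(g_prev, g, r) * emit(d, g, err) for g in (0, 1)]
            pred = sum(joint)          # predictive probability of d_l
            w *= pred                  # accumulate the weight
            g_prev = 0 if rng.random() * pred < joint[0] else 1
        total += w
    return total / n
```

For example, `sequential_imputation([0, 0, None, 1, 1], 0.1, 0.05, 40000)` can be compared with `exact_likelihood([0, 0, None, 1, 1], 0.1, 0.05)`; the two should agree to within Monte Carlo error.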
This method has been demonstrated to be a computationally efficient approach to problems with a large number of loci and simple pedigrees, i.e. pedigrees without loops. For pedigrees with many loops, it has the same limitations as other methods based on the Elston-Stewart algorithm. The rest of this paper is devoted to the discussion of methodology and applications of MCMC and sequential imputation to genetic problems. We first review the basic MCMC algorithm and how it can be applied to genetic analysis. We then present three applications of MCMC to genetic problems. The first application is on inference of ancestral probabilities on complex pedigrees, the second application is on estimating likelihoods in multipoint linkage analysis, and the last is on inference with complex traits. The method of sequential imputation and its application to a multilocus linkage problem will follow. Finally, we discuss several specific statistical issues associated with the applications of MCMC and sequential imputation to genetic problems.


3. Markov chain Monte Carlo methods. Whether one is interested in computing the probability that a certain individual carries a gene for a recessive trait, or the multilocus likelihood function in a linkage analysis, the problem can almost always be viewed as estimating an expectation with respect to the conditional genotype distribution $P_\theta(g \mid d)$. Here, $g$ is the configuration of genotypes (they could be either single locus or multilocus, depending on the context of the application), $d$ is the observed phenotypic data and $\theta$ is a vector of parameters. Thus, the objective is to simulate from the distribution $P_\theta(g \mid d)$, so that the relevant expectation can be estimated by a sample average. Note that although

$$P_\theta(g \mid d) \propto P_\theta(d \mid g)\, P_\theta(g),$$

computation of the normalizing constant

$$P_\theta(d) = \sum_{g} P_\theta(d \mid g)\, P_\theta(g)$$

is usually formidable. Since the distribution of interest $P_\theta(g \mid d)$ is therefore known only up to a normalizing constant, direct simulation from it is impossible. Note that $P_\theta(d)$ is the likelihood and is sometimes of interest itself.

The Metropolis-Hastings family of algorithms are MCMC methods which provide ways of simulating dependent realizations that are approximately from a distribution that is known only up to a constant of proportionality (Hastings, 1970). In other words, Metropolis-Hastings algorithms are methods of constructing Markov chains with the distribution of interest as a stationary distribution. In the genetic analysis setting discussed in the current paper, the distribution of interest is discrete and the state space is finite.

The general Hastings algorithm employs an auxiliary function $q(g^*, g)$ such that $q(\cdot, g)$ is a probability distribution for each $g$. The following algorithm defines the required Markov chain (Hastings, 1970). Let $g^{(1)}$ be the starting state of the Markov chain. Successive states are then generated iteratively. Given that the current state is $g^{(t)}$, $t = 1, 2, \ldots$, generation of the next state $g^{(t+1)}$ follows these steps:

1. Simulate a candidate state $g^*$ from the proposal distribution $q(\cdot, g^{(t)})$ as specified above;

2. Compute the Hastings acceptance probability

$$r = r(g^*, g^{(t)}) = \min\left\{ \frac{P_\theta(g^* \mid d)\, q(g^{(t)}, g^*)}{P_\theta(g^{(t)} \mid d)\, q(g^*, g^{(t)})},\; 1 \right\},$$

which is so designed that the Markov chain will indeed have $P = P_\theta(\cdot \mid d)$ as a stationary distribution;

3. Accept $g^*$ with probability $r$. That is, with probability $r$, the Markov chain moves to $g^{(t+1)} = g^*$. Otherwise, the chain remains at $g^{(t+1)} = g^{(t)}$.
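The three steps above can be sketched for a generic finite state space. This is a minimal illustration rather than an implementation used in any of the cited work: the target is supplied only as an unnormalised weight function, and the five-state target `WEIGHTS` and the asymmetric ring proposal (`propose_ring`, `q_ring`) are hypothetical choices made so that the proposal ratio in the acceptance probability actually matters.

```python
import random

def metropolis_hastings(weight, propose, q, start, n_steps, seed=0):
    """Hastings algorithm on a finite state space.

    weight(g)       : target probability of g, up to a constant
    propose(rng, g) : draws a candidate from q(., g)
    q(new, old)     : probability of proposing `new` when at `old`
    The start state must have positive weight.
    """
    rng = random.Random(seed)
    g = start
    chain = []
    for _ in range(n_steps):
        cand = propose(rng, g)
        # Hastings ratio: only ratios of the target enter, so the
        # normalizing constant is never needed.
        ratio = (weight(cand) * q(g, cand)) / (weight(g) * q(cand, g))
        if rng.random() < min(1.0, ratio):
            g = cand            # accept the candidate
        chain.append(g)         # on rejection the chain stays put
    return chain

# Hypothetical toy target on states 0..4 (normalised: 0.1, 0.2, 0.4, 0.2, 0.1).
WEIGHTS = [1.0, 2.0, 4.0, 2.0, 1.0]

def propose_ring(rng, g):
    """Asymmetric proposal on a ring: step +1 w.p. 0.7, step -1 w.p. 0.3."""
    step = 1 if rng.random() < 0.7 else -1
    return (g + step) % 5

def q_ring(new, old):
    """Proposal probability matching propose_ring."""
    if new == (old + 1) % 5:
        return 0.7
    if new == (old - 1) % 5:
        return 0.3
    return 0.0
```

Running `metropolis_hastings(lambda g: WEIGHTS[g], propose_ring, q_ring, 0, 200000)` and tabulating the visited states should give frequencies close to the normalised weights, even though the proposal itself favours clockwise moves.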


It can be verified easily that the distribution of interest $P_\theta(g \mid d)$ is indeed a stationary distribution of the Markov chain just defined (Lin, 1993). Note that $P$ enters the algorithm only through the ratio in the Hastings acceptance probability, which is why we emphasize that $P$ needs to be known only up to a constant. Provided that the auxiliary function is chosen so that the chain is ergodic, that is, aperiodic and irreducible, realizations of the chain (after a sufficient number of steps for convergence) can be regarded as from $P_\theta(g \mid d)$. These realizations can then be used to estimate the required expectation. Performance of the estimate depends on the choice of the auxiliary function $q$.

A special case of the Hastings algorithm is the Metropolis algorithm (Metropolis et al., 1953). If the auxiliary function is symmetric, that is, $q(g^*, g) = q(g, g^*)$, then the acceptance probability is $\min\{P_\theta(g^* \mid d) / P_\theta(g \mid d),\, 1\}$. Therefore, if the candidate state is at least as probable as the current state, the process moves to the new state; otherwise, the process moves to the new state with probability equal to the ratio of the probability of the proposed state to that of the current state.

Another special case of the Hastings algorithm is the Gibbs sampler (Geman and Geman, 1984). Specifically, for the Gibbs sampler, each coordinate of $g = (g_1, g_2, \ldots, g_n)$ is updated in turn, where $g_i$ is the genotype (again, it could be single-locus or multi-locus) of the $i$th individual in the pedigree and $n$ is the size of the pedigree. When updating the $i$th coordinate $g_i$, the proposal distribution $q$ is chosen to be $P_\theta^{(i)}(g_i \mid g_{-i}, d)$, where $g_{-i} = (g_1, \ldots, g_{i-1}, g_{i+1}, \ldots, g_n)$ is the configuration of genotypes of individuals in the pedigree except the $i$th individual. Denote $g^* = (g_1, \ldots, g_{i-1}, g_i^*, g_{i+1}, \ldots, g_n)$. Since $P_\theta(g^* \mid d)\, P_\theta^{(i)}(g_i \mid g^*_{-i}, d) = P_\theta(g \mid d)\, P_\theta^{(i)}(g_i^* \mid g_{-i}, d)$ for any $i \in \{1, \ldots, n\}$, any proposed candidate $g^*$ is accepted with probability 1.
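A coordinate-by-coordinate Gibbs update of this kind can be sketched as follows. The sketch assumes a generic finite target given by an unnormalised weight function; the two-coordinate target `toy_weight` is hypothetical and has no zero entries, so the reducibility problems that arise with pedigree data do not occur here. Each coordinate is redrawn from its full conditional, and no accept/reject step is needed.

```python
import random

def gibbs_scan(rng, g, values, weight):
    """One scan: update every coordinate in turn from its full conditional.

    g      : current configuration (tuple), assumed to have positive weight
    values : the values each coordinate may take
    weight : unnormalised joint probability of a configuration
    """
    g = list(g)
    for i in range(len(g)):
        cond = []
        for v in values:
            g[i] = v
            # Joint weight with coordinate i set to v, others held fixed;
            # proportional to the conditional of coordinate i.
            cond.append(weight(tuple(g)))
        g[i] = rng.choices(values, weights=cond)[0]
    return tuple(g)

def toy_weight(g):
    """Hypothetical unnormalised target on {0,1,2} x {0,1,2}."""
    same = 1.0 if g[0] == g[1] else 0.5
    return (1 + g[0]) * (1 + g[1]) * same

def run_gibbs(start, n_scans, seed=0):
    """Run n_scans scans from `start` and return the visited configurations."""
    rng = random.Random(seed)
    g = tuple(start)
    out = []
    for _ in range(n_scans):
        g = gibbs_scan(rng, g, (0, 1, 2), toy_weight)
        out.append(g)
    return out
```

Because only one coordinate changes at a time, successive configurations differ by small moves; when the weight function contains many zeros, as segregation and penetrance probabilities often do, some configurations may become unreachable from others.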
When all the coordinates are updated once, that constitutes a scan. Assuming Mendelian segregation, the conditional genotype distribution $P_\theta^{(i)}(g_i \mid g_{-i}, d)$ of an individual for Gibbs updating depends only on the phenotype of the individual and the current genotypes of the neighbors, who are the parents (if the individual is not a founder), spouses and offspring. Hence the Gibbs sampler is easy to implement due to this local dependence structure. However, one should note that the absence of rejection is not necessarily advantageous; the Gibbs sampler can make only small changes in $g$. Nevertheless, the Gibbs sampler has been used extensively in genetic analysis, not only because it is easy to sample from the conditional distributions, but also because other proposal distributions may result in rejecting almost all the proposed candidate states. Standard errors are frequently employed to assess the estimates. If a Markov chain is aperiodic and irreducible with a finite state space, then the following central limit theorem holds. That is, in estimating an expectation

$$\mu = E_P(f(g)) \quad \text{by} \quad \hat{\mu} = \frac{1}{N} \sum_{t=1}^{N} f(g^{(t)}),$$

we may assert that

$$\sqrt{N}\,(\hat{\mu} - \mu) \to N(0, \sigma_f^2) \quad \text{in distribution},$$

where $f$ is $P$-integrable and $\sigma_f^2$ can be estimated. Following Hastings (1970), we divide the realization $\{g^{(t)};\ 1 \le t \le N\}$ into $L$ batches, each of which consists of $K$ consecutive observations ($KL = N$) of the genotypic configuration $g$. Let $\hat{\mu}_l$ denote the $l$th batch mean; then

$$s_p^2 = \frac{\sum_{l=1}^{L} (\hat{\mu}_l - \hat{\mu})^2}{L(L-1)}$$

provides a satisfactory estimate of $\sigma_f^2 / N$, provided the batch means are not significantly autocorrelated. Hence $s_p$ is the estimated Monte Carlo standard error of $\hat{\mu}$.

In theory, MCMC methods can be easily applied to estimate probabilities and likelihoods of interest in many areas of application. Many technical problems exist in practice, however. Specifically, the following are some of the main problems associated with the application of MCMC to genetic analysis. First of all, finding a starting configuration of genotypes which is consistent with the observed data is a non-trivial problem. Furthermore, a Markov chain constructed from the Gibbs sampler may not be irreducible, a necessary requirement for the inference to be valid. The distribution of interest $P_\theta(g \mid d)$ usually has multiple modes, which is another difficult problem facing MCMC exploration of the probability surface. These problems will be addressed in detail in section 6.

4. Applications of MCMC to three genetic problems. Three specific types of problems using MCMC methods are discussed and possible solutions are described in this section. Genetic pedigree analysis consists of three components: the genealogical structure (pedigree), the mode of inheritance (genetic model) for the trait of interest, and the observed data (phenotypes). Our first application assumes that all three components are known, and that one is primarily interested in the probability that a certain individual carries a specific gene. This type of problem usually occurs with large and complex genealogical structures. The second application is to map a locus to a known map of markers using multipoint linkage analysis, where the number of markers and the number of alleles per marker are too large to be treated by analytical methods using standard packages. The third application involves inference concerning the mode of inheritance of a complex trait, assuming that the other two components are known.


Complex models are usually needed to describe this type of genetic data adequately. These three examples demonstrate that MCMC methods can be applied to a large class of problems that are not amenable to treatment by standard exact methods and pedigree analysis packages.

4.1. Inference of ancestral probabilities on complex pedigrees. MCMC methods are applied here to estimate the probabilities that specific founder individuals carry a gene, given the phenotypic data on large pedigrees which are also very complex, i.e. with many inbreeding loops. These probabilities may be of interest in population genetics or genetic counseling. One example is the estimation of the allele frequency of the B-gene among Greenland Eskimos (Sheehan, 1990). Another example is the estimation of founder carrier probabilities for a very rare recessive lethal trait in a Hutterite genealogy (Lin et al., 1994a). Genetic models for this type of problem are usually quite simple. However, these populations are often isolated for geographic or religious reasons. The pedigrees are thus very complex, with many loops, which makes exact computation by standard methods of pedigree analysis impossible because of insufficient computer memory. Figure 4.1 depicts the complexity of the Hutterite genealogy studied by Lin et al. (1994a). Two Hutterite families were observed to segregate the very rare recessive lethal infantile hypophosphatasia. The ancestors of the two affected individuals were traced back 11 generations to 48 founders, giving a 221-member pedigree. The genealogy of the Greenland Eskimos studied by Sheehan (1990) is even more complex and will not be shown here. By employing an MCMC algorithm with an appropriately chosen auxiliary function $q$, one obtains $N$ Monte Carlo realizations $g^{(t)}$, $t = 1, \ldots, N$.
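Given realizations such as these, an expectation is estimated by a sample average, and its Monte Carlo standard error can be estimated by the batch-means formula of section 3. The sketch below is a minimal illustration of that formula only; the function name `batch_means` is hypothetical, and `values` stands for any sequence $f(g^{(t)})$, for instance 0/1 indicators of whether a given individual is a carrier in realization $t$.

```python
import math

def batch_means(values, n_batches):
    """Sample mean and batch-means standard error.

    values    : f(g(1)), ..., f(g(N)) in the order generated by the chain
    n_batches : number of batches L; each batch has K = N // L values
                (any leftover values at the end are ignored)
    Returns (mean over the K*L values used, estimated standard error).
    """
    k = len(values) // n_batches
    if n_batches < 2 or k < 1:
        raise ValueError("need at least 2 batches with at least 1 value each")
    used = values[: k * n_batches]
    # Mean of each batch of K consecutive values.
    means = [sum(used[l * k:(l + 1) * k]) / k for l in range(n_batches)]
    grand = sum(means) / n_batches
    # s_p^2 = sum_l (mean_l - grand)^2 / (L (L - 1))
    s2 = sum((m - grand) ** 2 for m in means) / (n_batches * (n_batches - 1))
    return grand, math.sqrt(s2)
```

The estimate is only trustworthy when the batches are long enough for the batch means to be nearly uncorrelated, so in practice one would check the sensitivity of the result to the number of batches.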
These realizations can be regarded (approximately) as from $P(g \mid d)$, the joint posterior distribution of genotypes on the pedigree, conditional on the phenotypic data. From these realizations, any expectation under the conditional distribution can be estimated. To be specific, consider a recessive lethal trait with $A$ denoting the normal allele and $a$ the disease allele. Then the estimate of the probability that individual $j$ was a carrier is

$$\hat{P}(g_j = Aa) = \frac{1}{N} \sum_{t=1}^{N} I(g_j^{(t)} = Aa),$$

where $I$ is the indicator function. That is, the estimated probability is simply the proportion of realizations in which $j$ has genotype $Aa$. Lin et al. (1994a) used a modified Gibbs sampler with $N = 1{,}000{,}000$ realizations to obtain their results. There, they were mainly interested in which one of the 48 founders was most likely to have introduced the mutant gene into the population. The estimated probabilities show that founders 1, 2, 3, 4, 6 and 7 (shaded grey in Figure 4.1) were all much more probable carriers than the other founders. Founder 1 (with probability 0.197) was


FIG. 4.1. Marriage node graph of a Hutterite pedigree with the two individuals affected by HOPS shaded black. The six founders of main interest, shaded grey, are 1, 2, 3, 4, 6 and 7.

by far the most probable carrier, as would be expected simply from the relationships of individuals in the pedigree. The carrier probabilities of these six founders and their estimated standard errors are shown in Table 4.1. Founders 17, 18, 56, 57 and 58 (also shaded grey in Figure 4.1) were the only additional founders whose probabilities of being carriers were higher than 5%. See Lin et al. (1994a) for more details.

4.2. Estimation of likelihoods in multipoint linkage analysis. Computing multilocus likelihoods is an essential part of multipoint linkage analysis. However, due to the large amounts of data now available, standard methods and algorithms, such as LINKAGE (Lathrop et al., 1984), are sometimes impractical. Ott (1991) provides a detailed account of linkage analysis and of the basic genetic elements pertinent to it. The computation required for likelihood analysis using LINKAGE grows exponentially. The main factors that increase the computational demand are the number of markers, the number of alleles per marker, the number of unobserved individuals and the degree of complexity of the pedigree (Cottingham

TABLE 4.1
Estimated posterior carrier probabilities, conditional on the data, obtained by Lin et al. (1994a), for the Hutterite pedigree and data in Figure 4.1. Listed are the six founders with relatively higher probabilities of being carriers.

founder label    carrier probability    standard error
1                0.197                  0.012
2                0.099                  0.005
3                0.109                  0.006
4                0.109                  0.006
6                0.105                  0.010
7                0.113                  0.010

et al., 1993). The lod score of multipoint linkage analysis is the common logarithm of the likelihood ratio $L_1 / L_0$, where $L_1$ is the likelihood under linkage and $L_0$ is the likelihood in the absence of linkage. In the context of mapping a new locus to a known map of markers, the multipoint lod score can be expressed as $\mathrm{lod}(\theta) = \log_{10}\left(L(\theta) / L(\theta_0)\right)$, where $\theta$ specifies the map position of the locus in question relative to the known marker map, and $\theta_0$ is the special case in which the new locus is unlinked to any of the markers. Note that

$$L(\theta) = P_\theta(d) = \sum_{g} P_\theta(d \mid g)\, P_\theta(g),$$

where $g = (g_1, \ldots, g_n)$ is a configuration of multilocus genotypes. A straightforward approximation of $L(\theta)$ would use the method of gene-dropping as described in section 2. Outcomes which are incompatible with the observed phenotypic data are discarded and the likelihood is approximated by averaging over the remaining ones. As pointed out earlier, this method does not work in pedigrees of even moderate size because it is extremely unlikely to produce samples which are compatible with observed phenotypes in such cases. Note that

$$\mathrm{lod}(\theta) = \log_{10} \frac{P_\theta(d)}{P_{\theta_0}(d)} = \log_{10} \sum_{g} \frac{P_\theta(g, d)}{P_{\theta_0}(g, d)}\, P_{\theta_0}(g \mid d) = \log_{10} E_{\theta_0}\!\left[ \frac{P_\theta(g, d)}{P_{\theta_0}(g, d)} \,\Big|\, d \right].$$

The argument of the logarithm in the last expression is a conditional expectation with respect to the distribution $P_{\theta_0}(g \mid d)$.
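The identity underlying this expectation, $L(\theta)/L(\theta_0) = E_{\theta_0}[P_\theta(g, d)/P_{\theta_0}(g, d) \mid d]$, can be checked numerically on a model small enough to enumerate. The toy model below is entirely hypothetical and is not a linkage model: the latent $g$ takes values 0, 1, 2 with a binomial prior governed by a scalar `theta`, and the fixed vector `PEN` plays the role of $P(d \mid g)$ for the observed data. Exact draws from $P_{\theta_0}(g \mid d)$ stand in for the realizations that an ergodic Markov chain would supply.

```python
import math
import random

# Hypothetical penetrances P(d | g) for the observed data, g = 0, 1, 2.
PEN = (0.1, 0.5, 0.9)

def joint(theta, g):
    """P_theta(g, d) = P_theta(g) * P(d | g), with a Binomial(2, theta) prior."""
    prior = math.comb(2, g) * theta ** g * (1 - theta) ** (2 - g)
    return prior * PEN[g]

def exact_lod(theta, theta0):
    """log10 of L(theta) / L(theta0), computed by summing over all g."""
    l1 = sum(joint(theta, g) for g in range(3))
    l0 = sum(joint(theta0, g) for g in range(3))
    return math.log10(l1 / l0)

def mc_lod(theta, theta0, n, seed=0):
    """Monte Carlo lod estimate from draws of g under P_theta0(g | d)."""
    rng = random.Random(seed)
    # Unnormalised posterior at theta0; choices() normalises the weights.
    post = [joint(theta0, g) for g in range(3)]
    draws = rng.choices(range(3), weights=post, k=n)
    mean_ratio = sum(joint(theta, g) / joint(theta0, g) for g in draws) / n
    return math.log10(mean_ratio)
```

The same set of draws can be reused for every value of `theta`, which is what makes it possible to trace a whole curve from simulation at a single `theta0`; the estimate degrades as `theta` moves away from `theta0` and the ratios become more variable.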

Estimation of the whole lod score curve as a function of $\theta$ can therefore be done by simulation at a single $\theta_0$. Specifically, let $g^{(t)}$, $t = 1, 2, \ldots, N$, be $N$ realizations of an ergodic Markov chain with $P_{\theta_0}(g \mid d)$ as its equilibrium distribution. Then

$$\log_{10} \frac{1}{N} \sum_{t=1}^{N} \frac{P_\theta(g^{(t)}, d)}{P_{\theta_0}(g^{(t)}, d)}$$

provides an estimate for $\mathrm{lod}(\theta)$. For $\theta$ close enough to $\theta_0$, the estimate will be good, as the sampling distribution $P_{\theta_0}$ is not far from the target distribution $P_\theta$. Therefore, it is desirable to sample at several values of $\theta_0$ spread out through the range and perform likelihood ratio evaluations at nearby values only.

The following example offers an illustration of the effectiveness of the Monte Carlo multipoint linkage analysis method described above. The data come from a set of pedigrees studied by Palmer et al. (1994). The objective here is to map CSF1R relative to a map spanned by the markers D5S58, D5S72, D5S61, D5S211, in that order, on Chromosome 5. The recombination frequencies between the successive pairs of adjacent markers are 0.22, 0.09, and 0.36. The number of alleles for these loci ranges from 3 to 8. The multilocus genotypic configurations $g^{(t)}$, $t = 1, \ldots, N$, were generated using a modified Gibbs sampler in which multilocus genotypes are updated individual-by-individual and locus-by-locus (Lin and Wijsman, 1994). Figure 4.2 shows a lod score curve with the lod scores estimated by the method described above. The x-axis plots genetic distance in centimorgans, while the y-axis plots the lod score. For this example, exact computation is still feasible, so the exact solutions can be compared to the estimates from MCMC, as shown in Figure 4.2. It is clear from the picture that MCMC produces a satisfactory estimate of the exact lod score curve, and it required only 1/15 of the CPU time needed for computation using LINKAGE (Lin and Wijsman, 1994). With an additional marker, exact computation would no longer be practical, so that MCMC approximation becomes an essential tool.

4.3. Inference of the mode of inheritance for complex traits. Many common genetic diseases exhibit both genetic and non-genetic components. These components may interact with one another, leading to the manifestation of the disease. These traits are not simple Mendelian traits.
In order to describe them adequately, complex models are usually needed. This is especially important for localizing disease genes, because linkage analysis is sensitive to misspecification of the model. Furthermore, using larger pedigrees is usually more powerful than using smaller pedigrees, such as nuclear families. The complexity of the model and the large, complex pedigrees render the usual methods infeasible. Approximation methods exist, such as PAP (Hasstedt and Cartwright, 1979). However, it has been almost impossible to evaluate performance

[Figure 4.2: lod score (y-axis) plotted against genetic distance in centimorgans (x-axis); two curves, labelled Linkage and MCMC.]

FIG. 4.2. Five-point lod score curve obtained by MCMC using the method of Lin (1995). Exact values from LINKAGE (Lathrop et al., 1984) are also shown for comparison.

of these methods. Therefore, MCMC has been explored as an alternative technique to fully utilize the genetic information available. The role of MCMC is two-fold. On one hand, MCMC can itself be used as a method to estimate parameters of the model. On the other hand, MCMC can be used to check the validity of other approximation methods, because MCMC can achieve any degree of accuracy as long as the process is run for sufficient time. The latter may be of greater value, because other approximation methods are usually less computationally intensive and hence are preferred if they yield satisfactory results. Guo and Thompson (1992) proposed a Monte Carlo method for estimating the parameters of a complex model by utilizing realizations from the Gibbs sampler. The method was, however, restricted to data from diallelic genetic systems only. Further work was undertaken by Lin (1993) to extend these methods to data from multiallelic loci. We consider a mixed model, which is usually used for investigating the mode of inheritance of complex traits. The observed quantitative trait data, $d$, is modeled as influenced additively by the covariates (e.g. sex, age), the major gene, the additional polygenic heritable component, and the environment. Let $\beta$ denote the vector of fixed effects, including the major gene effects for a given configuration of genotypes. Let $a$ denote the vector of polygenic effects, which are assumed jointly distributed as $N(0, \sigma_a^2 A)$, where $A$ is the known numerator relationship matrix (Henderson, 1976). Let $e$ denote the vector of residuals (thought of as the environmental effects)

with a joint distribution $N(0, \sigma_e^2 I)$. Then for a given configuration of major genotypes and polygenic effects, the mixed model can be specified as the following:

$$d = X\beta + a + e,$$

where $X$ is the design matrix for fixed effects. We are mainly interested in estimating the vector $\beta$ and the variances $\sigma_a^2$ and $\sigma_e^2$. Data from an informative genetic marker is incorporated into the estimation process so that the parameters of the model can be estimated more accurately. Therefore, if we let $m$ denote the observed marker data and $\theta$ denote the vector of parameters, including $\beta$ and the recombination frequency $r$ between the marker and the major gene locus, then the likelihood can be written as

$$L(\theta) = P_\theta(d, m) = \sum_{g} f_\theta(d \mid g)\, P_\theta(m \mid g)\, P_\theta(g),$$

since $d$ and $m$ are conditionally independent given the 2-locus joint genotype $g$. The sum in the above formula is over all 2-locus genotypic configurations in the pedigree. Since the joint genotypes and the polygenic values are independent, the likelihood can also be written as

$$L(\theta) = \sum_{g} \int_{a} f_\theta(d \mid g, a)\, P_\theta(m \mid g)\, P_\theta(g)\, f_\theta(a)\, da,$$

which is an explicit formula for evaluating the likelihood. The EM algorithm (Dempster et al., 1977) is employed to obtain estimates of parameters, since this is essentially a missing data problem in that both $g$ and $a$ are unobserved (Guo and Thompson, 1992). For example, the EM equation for the recombination frequency $r$ between the trait and marker locus is

$$r^* = \frac{E_\theta(R \mid d, m)}{E_\theta(H \mid d, m)},$$

where $H = \sum_i H_i$ and $R = \sum_i R_i$ are the sufficient statistics for the recombination frequency $r$ (Thomas and Cortessis, 1992). The sums are over all parent-offspring triples of the pedigree, where $H_i$ is the number (0, 1, or 2) of doubly heterozygous parents in the $i$th parent-offspring triple, while $R_i$ is the number of recombination events in segregation from parents to offspring. Despite the simplicity of the EM framework, it is very difficult to evaluate these conditional expectations explicitly. The joint distribution $P_\theta(g, a \mid d, m)$ of genotypes and polygenic values given the observed data, which is the centerpiece for evaluating the conditional expectations, is intractable. Therefore, Monte Carlo estimation of these conditional expectations will be obtained instead, using realizations from a Markov chain with the joint conditional distribution as its equilibrium distribution.
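Once realizations are available, the Monte Carlo version of the EM update for $r$ amounts to replacing each conditional expectation by a sample average. The sketch below assumes that the sufficient statistics have already been computed for each realization; the function name `mc_em_update_r` and the input format (one `(R, H)` pair per realization) are hypothetical conveniences, not part of the cited methods.

```python
def mc_em_update_r(stats):
    """Monte Carlo EM update r* = mean(R) / mean(H).

    stats : iterable of (R, H) pairs, one per realization, where R is the
            total number of recombination events and H the total number of
            doubly heterozygous parents counted over all triples.
    """
    stats = list(stats)
    if not stats:
        raise ValueError("no realizations supplied")
    mean_r = sum(r for r, _ in stats) / len(stats)
    mean_h = sum(h for _, h in stats) / len(stats)
    if mean_h == 0:
        raise ValueError("no informative (doubly heterozygous) parents")
    return mean_r / mean_h
```

For instance, `mc_em_update_r([(1, 4), (0, 4), (2, 4), (1, 4)])` averages one recombinant per four informative parents and returns 0.25. In a full Monte Carlo EM run, the chain would be re-simulated at the updated parameter values before the next update.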


Thompson et al. (1993) applied these methods to a large family with elevated cholesterol levels. See Elston et al. (1975) for more about the pedigree and data. Estimates from MCMC were very similar to those from a different approximation method (Hasstedt and Cartwright, 1979) that is currently being used routinely in the pedigree analysis of mixed models.

5. Sequential imputation and the MODY example. For multilocus linkage analysis, the sequential imputation method of Kong et al. (1994) has been implemented by Irwin et al. (1994) and Irwin (1995). Suppose that there are $L$ loci under consideration. Let $d_l$ and $g_l$ denote the data and the underlying genotypes at locus $l$ respectively, for $l = 1, 2, \ldots, L$. For a given parameter value $\theta$, the multilocus linkage likelihood $L(\theta) = P_\theta(d)$, where $d = (d_1, d_2, \ldots, d_L)$, can be estimated by the method of sequential imputation. The basic idea of sequential imputation is to generate independent samples of the genotypes $g = (g_1, \ldots, g_L)$ from a distribution $P^*_\theta(g \mid d)$ whose relationship to $P_\theta(g \mid d)$ will be specified below. These samples can also be used to estimate likelihoods of other parameter values by an appropriately specified weighting scheme. To obtain a realization of $g$, the method draws the genotypes locus by locus from the appropriate sampling distributions. First $g_1^*$ is drawn from $P_\theta(g_1 \mid d_1)$ and the predictive weight $w_1 = P_\theta(d_1)$ is computed. Then, for each successive locus $l = 2, 3, \ldots, L$, $g_l^*$ is drawn from $P_\theta(g_l \mid d_1, \ldots, d_l, g_1^*, \ldots, g_{l-1}^*)$ and the accumulated predictive weight $w_l = w_{l-1} P_\theta(d_l \mid d_1, \ldots, d_{l-1}, g_1^*, \ldots, g_{l-1}^*)$ is computed. Note that the joint sampling distribution for $g^* = (g_1^*, \ldots, g_L^*)$ is

$$P^*_\theta(g \mid d) = P_\theta(g_1 \mid d_1) \prod_{l=2}^{L} P_\theta(g_l \mid d_1, \ldots, d_l, g_1, \ldots, g_{l-1}) = w^{-1} P_\theta(d)\, P_\theta(g \mid d),$$

where $w = w_L = P_\theta(d_1) \prod_{l=2}^{L} P_\theta(d_l \mid d_1, \ldots, d_{l-1}, g_1, \ldots, g_{l-1})$. Consequently, averaging over $g$ using $P^*_\theta(g \mid d)$ we obtain

$$E_{P^*_\theta}(w) = E_{P^*_\theta}\!\left[ \frac{P_\theta(g \mid d)}{P^*_\theta(g \mid d)}\, P_\theta(d) \right] = P_\theta(d).$$

It follows that $L(\theta) = P_\theta(d)$ can be estimated by

$$\hat{L}(\theta) = \frac{1}{N} \sum_{i=1}^{N} w^{(i)},$$

where $w^{(1)}, \ldots, w^{(N)}$ are the accumulated weights of $N$ independent realizations $g^{(1)}, \ldots, g^{(N)}$ of $g$. In fact, the whole likelihood curve can be estimated via importance sampling from a set of such realizations based on a single parameter $\theta_0$. For instance, letting $\theta_1$ be any parameter value other than $\theta_0$, then

$$\hat{L}(\theta_1) = \frac{1}{N} \sum_{i=1}^{N} \frac{P_{\theta_1}(g^{(i)}, d)}{P_{\theta_0}(g^{(i)}, d)}\, w^{(i)},$$

with the realizations and weights generated at $\theta_0$, provides an unbiased estimate for $L(\theta_1)$. However, one should note that $\hat{L}(\theta_1)$ would provide a good estimate only if $\theta_1$ is close to $\theta_0$.

The MODY example. A pedigree which was diagnosed to segregate Maturity Onset Diabetes of the Young (MODY) was used as an example by Irwin (1995) to demonstrate the method. See Irwin (1995) for a diagram of the 155-member simple pedigree and Bell et al. (1991) for a detailed description of the data. A multipoint linkage analysis was performed to study the location of the MODY gene relative to eight markers on chromosome 20. An estimated lod score curve was obtained by Irwin (1995) and is shown as Figure 5.1. The x-axis plots the distances in centimorgans while the y-axis

[Figure 5.1: lod score (y-axis) plotted against distance in centimorgans (x-axis).]

FIG. 5.1. Nine-point lod score curve obtained by the method of sequential imputation for the MODY trait. (Figure 4.3 from Irwin (1995))

plots the lod scores. Exact computation of the likelihoods would have been impossible due to the large number of loci involved. The method of sequential imputation is feasible because one is never processing more than one locus at a time. However, in some cases, the sequential imputation computations are also impossible. The computations required for drawing realizations from $P^*_\theta$ are performed by the recursive algorithm of Elston and Stewart (1971) which, as discussed in earlier sections, has computational difficulties if the pedigree is complex with many loops. Therefore, although the sequential imputation method has been demonstrated to be feasible and successful

for this large simple pedigree, it may fail to provide a practicable solution when data come from more complex pedigrees.

6. Some specific technical issues.

6.1. Finding a starting configuration. The convergence and ergodic theorems guarantee that appropriate probability estimates from the Markov chain realizations converge to the true probabilities, regardless of the starting state, as long as the Markov chain is aperiodic and irreducible. However, convergence can be very slow unless the starting point is chosen appropriately. Thompson (1994a) and Gelman and Rubin (1992) provided examples which illustrate that a Markov chain can "get stuck" at a local mode which has negligible support from the data. Since good estimates depend on thorough exploration of the state space, a Markov chain starting from a poor initial state may provide poor probability estimates within a given amount of computing time. Therefore, for applications of MCMC methods, it is of practical importance that the Markov chain starts from a "good" state, not just any state with positive probability. Ideally, one would want to start from a state with high probability under the equilibrium distribution. For pedigree data, however, even just finding a "legal" state of genotypes, i.e. genotypes consistent with the observed phenotypic data, is difficult for a multiallelic genetic system. This is because of the constraint imposed by the first law of Mendelian inheritance (Mendel, 1865), and the fact that phenotypic data are usually missing for several upper generations. One approach to finding an initial genotypic configuration would be the method of gene-dropping described in section 2 above. This gene-dropping process would be repeated until an outcome consistent with the observed phenotypes is obtained.
However, the process might have to be repeated millions of times, even for pedigrees of moderate size, because in all but very small pedigrees it is virtually impossible to obtain samples which are compatible with the observed phenotypes. The method of Sheehan and Thomas (1993) offers another approach. With modified penetrances, it is guaranteed that the Markov chain will eventually find a legal state. In practice, this method may not find a legal state for quite a large number of scans, especially when the pedigree is large and the genetic system is highly polymorphic. Therefore, Wang and Thomas (1994) proposed a modification to the method. Instead of beginning with an arbitrary configuration of genotypes, they described a method to find a more "likely" genotypic configuration from which to start the search for a legal one. They first assigned founder genotypes by sampling only from the set of genes that were present among their descendants but had not been assigned to their spouses. They then assigned genotypes to non-founders conditional on the parents' genotypes and on the genotypes among their descendants.

The following describes a deterministic algorithm for finding a probable starting configuration of genotypes. Individuals in the pedigree whose genotypes can be determined unequivocally from the phenotypes are assigned first. Then genotypes are assigned to the rest of the individuals in the pedigree backward in time, with the last generation processed first and the founders last. When assigning a genotype to an individual, it is made certain that the genotype assigned is consistent with the genotypes of his/her spouses and children (including other children of his/her spouses), and with the genotypes of his/her parents and siblings (including half-sibs). This algorithm produces valid genotype assignments for pedigrees that we have encountered in medical genetics studies. However, artificial counterexamples exist. When an illegal genotypic configuration does result, the algorithm needs to be fine-tuned and more care must be taken to reassign genotypes. Several examples have demonstrated that starting configurations found using this algorithm can be much more probable. Such a state is usually a better place to start a Markov chain, to avoid being trapped in a low probability region.

6.2. Multiallelic locus and irreducibility. General Hastings-Metropolis algorithms do not guarantee that the constructed Markov chains are ergodic, a necessary condition for inferences from the realizations. Ergodicity needs to be checked for each specific problem. In many areas of MCMC applications, ergodicity is not a problem, but it can be in genetic applications. It has been proved that Markov chains constructed from the Gibbs sampler are irreducible for most traits associated with two alleles (Sheehan and Thomas, 1993). However, for a locus with at least three alleles, examples exist where the Markov chains associated with the Gibbs sampler are not irreducible (Lin et al., 1993). The limitation to diallelic loci is a major problem, especially in linkage analysis, because multiallelic marker loci are much more informative than diallelic loci and hence preferred. For MCMC methods to be useful for linkage analysis, irreducibility for multiple alleles must be achieved to ensure validity of results.

Reducibility of the Gibbs sampler applied to pedigree data results from the strong constraints on the joint genotypes of neighboring individuals in a pedigree. Many components of segregation and penetrance are 0. By updating only one individual at a time, part of the genotypic configuration space may never be sampled. The state space is then divided into several communicating classes. States in different classes do not communicate. As a consequence, the ergodic theorem does not hold, and any inference made from the samples is thus invalid.

Several methods have been proposed to solve this problem. Sheehan and Thomas (1993) proposed an importance sampling method. A small positive probability p is assigned to all zero penetrance probabilities or to all zero transmission probabilities, so that transition between states in different classes can be realized via "illegal" states introduced by the relaxation parameter p. Although in principle this circumvents the problem of reducibility, the practicality of the method raises some questions. There is an obvious trade-off between the size of p and efficiency of the algorithm (Sheehan and Thomas, 1993; Gilks et al., 1993). Lin et al. (1993) showed that irreducibility for the Gibbs sampling Markov chain is achieved by assigning a small positive probability to all zero penetrances with heterozygote genotypes only. They further proved, without identifying all the communicating classes, that these penetrances are the minimum set of probabilities that need to be modified to ensure that states in different classes communicate. The irreducible chain so constructed is then coupled with the original Gibbs sampling chain to form a new integrated process. By switching between chains after every scan with a suitable probability, the correct limiting distribution is preserved. Estimates of the desired probabilities and expectations are obtained using realizations from the distribution of interest, whereas the auxiliary chain only serves to facilitate such simulations from the "right" distribution. This is in contrast to importance sampling methods in which realizations are simulated from the "wrong" distribution and then reweighted. Although the method of Lin et al. (1993) was shown to work well for a triallelic data set from a large complex pedigree, it is unlikely that good results will still be obtainable with highly polymorphic loci. From an example in Lin et al. (1993), it becomes quite clear that, in order to have a more efficient algorithm, one needs to identify the communicating classes explicitly. This task was undertaken by Lin et al. (1994b). They noted that it was observed data on children that were responsible for creating noncommunicating alternatives for unobserved parents.
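A small enumeration may help to see how observed children create noncommunicating alternatives for unobserved parents. The sketch below is an illustration under assumptions of our own: a nuclear family with two untyped parents and two typed children, AB and CC, at a triallelic locus. The family and the function names are hypothetical and are not taken from the cited papers. States are linked when they differ in the genotype of exactly one individual, which mimics a single-site update.

```python
from itertools import combinations_with_replacement, product

ALLELES = "ABC"
# Unordered genotypes at a triallelic locus: AA, AB, AC, BB, BC, CC.
GENOTYPES = [a + b for a, b in combinations_with_replacement(ALLELES, 2)]


def can_produce(father, mother, child):
    """True if the child genotype can arise from one allele of each parent."""
    return any(sorted(x + y) == sorted(child) for x in father for y in mother)


def legal_states(children):
    """All (father, mother) genotype pairs compatible with every typed child."""
    return [
        (f, m)
        for f, m in product(GENOTYPES, repeat=2)
        if all(can_produce(f, m, c) for c in children)
    ]


def communicating_classes(states):
    """Group legal states linked by chains of single-individual changes."""
    remaining, classes = set(states), []
    while remaining:
        stack = [remaining.pop()]
        cls = set(stack)
        while stack:
            s = stack.pop()
            # Neighbours differ from s in exactly one individual's genotype.
            nbrs = {t for t in remaining if sum(a != b for a, b in zip(s, t)) == 1}
            remaining -= nbrs
            cls |= nbrs
            stack.extend(nbrs)
        classes.append(cls)
    return classes


legal = legal_states(["AB", "CC"])
classes = communicating_classes(legal)
```

In this family the only legal parental configurations are (AC, BC) and (BC, AC). Moving from one to the other requires changing both parents at once, so a sampler that updates one individual at a time stays in whichever class it starts in.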
Hence, it was possible to search for communicating classes by looking at each nuclear family successively, from the bottom of a pedigree, tracing up. This lays the basis for the work of Lin (1995), who proposes a new scheme for constructing an irreducible chain by "jumping" from one communicating class to another directly without the need of stepping through illegal configurations. Every realization can be used for making inferences. Furthermore, switching from one communicating class to another is much more frequent. This leads to better sampling of the space of genotypic configurations and hence provides much more accurate probability estimates, compared to other methods for the same amount of computing time. For the pedigree considered in Lin (1995), it took only 1/30 of the time needed for the method of Sheehan and Thomas (1993) to achieve the same degree of accuracy. For larger pedigrees, such as the Alzheimer pedigree considered in Lin et al. (1993) and the hypercholesterolemia pedigree considered in Thompson et al. (1993), the method achieved even better results.

6.3. Multimodality and more efficient samplers. The Gibbs sampler is often chosen as an MCMC algorithm for sampling the space of genotypes because of its simplicity: the conditional genotype distribution of an individual depends only on the phenotype and genotypes of the neighbors. More importantly, the Gibbs sampler avoids problems caused by sparsity of the genotypic configuration space. MCMC algorithms that make changes to several individuals simultaneously are much harder to implement due to the zeros imposed by Mendelian segregation and the difficulty in computing the requisite ratios. However, the Gibbs sampler can be very slow to sample the space of genotypes. If the equilibrium distribution is multimodal, the sampler may remain near a local mode for a long time. It is often quite informative to run a few chains from different starting points, but any formal conclusion will be impossible, as there is no framework for combining results from multiple runs. Even if it were possible to identify all the local modes and then start a chain from each local mode, we still would not know how to combine the results since we would not know the weight for each mode (Geyer, 1992; Raftery and Lewis, 1992). We therefore need more efficient algorithms than the Gibbs sampler to adequately sample the space. Although multimodality is one of the major general problems facing MCMC exploration of a probability surface, algorithms which are efficient for one particular application may not be advantageous for others; see, e.g., Besag and Green (1993). Hence it is clear that more efficient algorithms specifically tailored to genetic applications should be designed. We need an algorithm which will facilitate movement from one local mode to another. Unless one can design an algorithm which jumps between modes, such transitions can only be realized by stepping through low probability states between modes. Therefore any such algorithm must allow the Markov chain to stay at low probability states long enough to move to another mode, rather than moving back to the original mode. This idea leads to the construction of the heated-Metropolis algorithm proposed by Lin et al. (1994a).
The easily computed local conditional distributions of the Gibbs sampler are raised to the power $1/T$, where $T \geq 1$ is a parameter known as "temperature". This modified local conditional distribution is used as the proposal distribution of a Metropolis-Hastings algorithm. It has been successfully applied to estimate carrier probabilities on the Hutterite pedigree described earlier.
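A single-site update of this kind can be sketched as follows. The sketch assumes a finite set of genotypes with strictly positive local conditional probabilities, and it uses the standard Hastings acceptance ratio for a proposal that does not depend on the current state. The function names and the numerical values are hypothetical, and the code is not the implementation of Lin et al. (1994a).

```python
def heated_proposal(cond, T):
    """Proposal proportional to cond ** (1/T); T = 1 recovers the Gibbs conditional."""
    weights = [p ** (1.0 / T) for p in cond]
    total = sum(weights)
    return [w / total for w in weights]


def transition_matrix(cond, T):
    """One-site Metropolis-Hastings kernel with the heated proposal.

    cond[x] is the local conditional probability of genotype x given the
    neighbours; all entries are assumed strictly positive in this sketch.
    """
    q = heated_proposal(cond, T)
    n = len(cond)
    P = [[0.0] * n for _ in range(n)]
    for x in range(n):
        for y in range(n):
            if y != x:
                # Hastings ratio: target cond, proposal q independent of x.
                accept = min(1.0, (cond[y] * q[x]) / (cond[x] * q[y]))
                P[x][y] = q[y] * accept
        P[x][x] = 1.0 - sum(P[x])  # rejected proposals leave the state unchanged
    return P


cond = [0.7, 0.2, 0.1]  # hypothetical local conditional distribution
q = heated_proposal(cond, T=3.0)
P = transition_matrix(cond, T=3.0)
# The heated kernel should still leave the local conditional invariant.
stationary = [sum(cond[x] * P[x][y] for x in range(3)) for y in range(3)]
```

Heating flattens the proposal, so low probability genotypes are proposed more often than under the Gibbs sampler, while the acceptance step keeps the original conditional distribution as the invariant distribution of the update.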

6.4. Order of loci and other issues in sequential imputation. Efficiency of the estimates obtained from the method of sequential imputation depends on the order in which the loci are processed in the imputation procedure. Since the Monte Carlo estimate for the multilocus likelihood is the average of the accumulated weights over a collection of imputations, the best order of loci is the one that minimizes the variance of the accumulated weight. Note that, at each step of imputation, the sampling distribution is conditional not only on the observed data, but also on any previously imputed values. Therefore, intuitively, one would like to order the marker loci according to the amount of data available at each locus. That is, the locus with the most individuals typed is processed first, while the least typed locus is processed last. For two loci with about the same number of individuals typed, the more informative one, i.e., the one with more alleles, should be processed ahead of the other. The goal of this simple rule is to use as much of the available information as possible to reduce the variance of the estimate. This is, however, only a rule of thumb, and it does not guarantee that the best ordering will result. It also ignores the importance of who is typed, as opposed to just the number of individuals typed.

For mapping a disease gene against a set of known genetic markers, the disease locus can be processed either first or last in the sequential imputation procedure. For the MODY example in section 5, the disease gene was processed last. This allows calculation of likelihoods at various locations with a single collection of marker imputations. However, as we point out in section 5, the likelihood estimate is unlikely to be accurate unless the sampling distribution is close to the target distribution. The alternative strategy of processing the disease gene first may work better when the disease status is known for many individuals in the upper generations of the pedigree while their marker genotypes are unknown. Details can be found in Irwin et al. (1994).

For the algorithm described in section 5, genotypes are generated one locus at a time. In particular, $g_1$ is sampled from the distribution $P_0(g_1 \mid d_1)$, where $d_1$ is the observed data at the corresponding locus. However, as long as it is possible to sample from the distribution, $d_1$ should include observed data from as many loci as possible to achieve more efficient estimates (Irwin et al., 1994).
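The rule of thumb for ordering marker loci given earlier in this section can be written as a simple sort. The sketch below makes two simplifying assumptions: "about the same number of individuals typed" is reduced to exact equality, and the number of alleles stands in for informativeness. The marker names and counts are invented for illustration.

```python
from typing import NamedTuple


class Locus(NamedTuple):
    name: str
    n_typed: int    # individuals with observed genotypes at this locus
    n_alleles: int  # crude proxy for how informative the locus is


def imputation_order(loci):
    """Most-typed locus first; ties broken in favour of the locus with more alleles."""
    return sorted(loci, key=lambda locus: (-locus.n_typed, -locus.n_alleles))


# Hypothetical markers, for illustration only.
loci = [
    Locus("M1", n_typed=40, n_alleles=4),
    Locus("M2", n_typed=55, n_alleles=2),
    Locus("M3", n_typed=40, n_alleles=8),
    Locus("M4", n_typed=12, n_alleles=10),
]
order = [locus.name for locus in imputation_order(loci)]
```

As the text notes, such an ordering is only a heuristic: it takes no account of which individuals are typed, and it does not guarantee the minimum variance of the accumulated weight.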
It should be pointed out here again that sampling from $P_0(g_1 \mid d_1)$ requires computations using the recursive algorithm of Elston and Stewart (1971), which may be impractical when data from more than one locus are involved.

7. Concluding remarks. Markov chain Monte Carlo has been shown to be a powerful technique for estimating probabilities and likelihoods in genetic analysis, when exact computations are not feasible. It is applicable to many different types of problems, illustrated in this paper through three such applications. Although the fundamental theory of MCMC is simple, finding a suitable algorithm to ensure efficient results can be very difficult. Some of the technical problems associated with MCMC are common to many areas of application. Some, however, are unique to problems from genetic analysis with complex pedigree structures and data. The foremost issue is to ensure irreducibility of the Markov chain. Although this is almost always satisfied for problems from many applications, it is often not the case with data arising from pedigrees. It should be emphasized that, if irreducibility is violated, then any inference from such realizations is invalid, no matter how long the process is run. This problem is not solved by running multiple processes from several starting points either. Among various solutions proposed, the method of Lin (1995), which jumps directly between communicating classes, seems to be quite promising. Efficient results have been obtained for several problems considered. However, there are always more difficult problems which would defeat the method. Solutions will have to be invented to meet new challenges.

The method of sequential imputation has been shown to be a successful technique for estimating likelihoods for multilocus linkage analysis. However, the method may not be applicable to other genetic pedigree analysis problems where other factors of complexity are involved, such as complex traits and complex pedigrees. MCMC and sequential imputation may be viewed as complementary techniques. Whereas the method of sequential imputation may be more efficient in multipoint computations with simple traits and simple pedigrees, MCMC is more suitable for complex traits and pedigrees with many loops.

Acknowledgment. I am grateful to Professor Terry Speed for helpful comments on earlier versions of this manuscript, to Dr. Mark Irwin for permission to use Figure 5.1 and comments on the manuscript, and to Dr. Ellen Wijsman for computing the exact lod scores for Figure 4.2. This work is supported in part by NIH grant R01 HG01093-01.

REFERENCES

Bell, G. I., Xiang, K. S., Newman, M. V., Wu, S. H., Wright, L. G., Fajans, S. S., Spielman, R. S., and Cox, N. J. (1991) Gene for the non-insulin-dependent diabetes mellitus (Maturity Onset Diabetes of the Young) is linked to DNA polymorphism on human chromosome 20q. Proc. Natl. Acad. Sci. USA 88, 1484-1488.
Besag, J. and Green, P. J. (1993) Spatial statistics and Bayesian computation (with discussion). J. Roy. Statist. Soc. B 55, 25-37.
Boehnke, M. (1986) Estimating the power of a proposed linkage study: a practical computer simulation approach. Am. J. Hum. Genet. 39, 513-527.
Cannings, C., Thompson, E. A., and Skolnick, M. H. (1978) Probability functions on complex pedigrees. Adv. Appl. Prob. 10, 26-61.
Cottingham, R. W. Jr., Idury, R. M., and Schaffer, A. A. (1993) Faster sequential genetic linkage computations. Am. J. Hum. Genet. 53, 252-263.
Dempster, A. P., Laird, N. M., and Rubin, D. B. (1977) Maximum likelihood from incomplete data via the EM algorithm (with discussion). J. Roy. Statist. Soc. B 39, 1-38.
Dwarkadas, S., Schaffer, A. A., Cottingham, R. W. Jr., Cox, A. L., Keleher, P., and Zwaenepoel, W. (1994) Parallelization of general-linkage analysis problems. Hum. Hered. 44, 127-141.
Easton, D. F., Bishop, D. T., Ford, D., Crockford, G. P., and the Breast Cancer Linkage Consortium (1993) Genetic Linkage Analysis in familial breast and ovarian cancer: results from 214 families. Am. J. Hum. Genet. 52, 678-701.
Elston, R. C. and Stewart, J. (1971) A general model for the genetic analysis of pedigree data. Hum. Hered. 21, 523-542.
Elston, R. C., Namboodiri, K. K., Glueck, C. J., Fallat, R., Tsang, R., and Leuba, V. (1975) Study of the genetic transmission of hypercholesterolemia and hypertriglyceridemia in a 195 member kindred. Am. J. Hum. Genet. 39, 67-83.
Gelfand, A. E. and Smith, A. F. M. (1990) Sampling-based approaches to calculating marginal densities. J. Am. Statist. Assoc. 85, 398-409.


Gelman, A. and Rubin, D. (1992) Inference from iterative simulation using multiple sequences. Statist. Sci. 7, 457-472.
Geman, S. and Geman, D. (1984) Stochastic relaxation, Gibbs distribution, and the Bayesian restoration of images. IEEE Trans. Pattern Anal. Machine Intell. 6, 721-741.
Geyer, C. J. (1992) A practical guide to Markov chain Monte Carlo. Statist. Sci. 7, No. 4, 473-483.
Gilks, W. R., Clayton, D. G., Spiegelhalter, D. J., Best, N. G., McNeil, A. J., Sharples, L. D., and Kirby, A. J. (1993) Modelling complexity: Applications of Gibbs sampler in medicine (with discussion). J. Roy. Statist. Soc. B 55, 39-52.
Goradia, T. M., Lange, K., Miller, P. L., and Nadkarni, P. M. (1992) Fast computation of genetic likelihoods on human pedigree data. Hum. Hered. 42, 42-62.
Guo, S. and Thompson, E. (1992) A Monte Carlo method for combined segregation and linkage analysis. Am. J. Hum. Genet. 51, 1111-1126.
Hammersley, J. M. and Handscomb, D. C. (1964) Monte Carlo methods. John Wiley & Sons Inc., New York.
Hasstedt, S. J. and Cartwright, P. (1979) PAP - Pedigree Analysis Package. Technical Report 13, Department of Medical Biophysics and Computing, University of Utah, Salt Lake City, Utah.
Hastings, W. K. (1970) Monte Carlo sampling methods using Markov chains and their applications. Biometrika 57, 97-109.
Henderson, C. R. (1976) A simple method for computing the inverse of a numerator relationship matrix used in prediction of breeding values. Biometrics 32, 69-83.
Irwin, M., Cox, N., and Kong, A. (1994) Sequential imputation for multilocus linkage analysis. Proc. Natl. Acad. Sci. USA 91, 11684-11688.
Irwin, M. (1995) Sequential imputation and multilocus linkage analysis. Ph.D. Thesis, Department of Statistics, University of Chicago, Chicago, IL.
Kong, A., Liu, J., and Wong, W. H. (1994) Sequential imputations and Bayesian missing data problems. J. Am. Statist. Assoc. 89, 278-288.
Lange, K., and Elston, R. C. (1975) Extensions to pedigree analysis: likelihood computations for simple and complex pedigrees. Hum. Hered. 25, 95-105.
Lange, K., and Boehnke, M. (1983) Extensions to pedigree analysis. V. Optimal calculation of Mendelian likelihoods. Hum. Hered. 33, 291-301.
Lange, K., and Matthysse, S. (1989) Simulation of pedigree genotypes by random walks. Am. J. Hum. Genet. 45, 959-970.
Lange, K., and Sobel, E. (1991) A random walk method for computing genetic location scores. Am. J. Hum. Genet. 49, 1320-1334.
Lathrop, G. M., Lalouel, J. M., Julier, C., and Ott, J. (1984) Strategies for multilocus linkage analysis in humans. Proc. Natl. Acad. Sci. USA 81, 3443-3446.
Lin, S. (1993) Markov chain Monte Carlo estimates of probabilities on complex structures. Ph.D. Thesis, Department of Statistics, University of Washington, Seattle, WA.
Lin, S., Thompson, E., and Wijsman, E. (1993) Achieving irreducibility of the Markov chain Monte Carlo method applied to pedigree data. IMA J. Math. Appl. Med. Biol. 10, 1-17.
Lin, S., Thompson, E., and Wijsman, E. (1994a) An Algorithm for Monte Carlo Estimation of Genotype Probabilities on Complex Pedigrees. Ann. Hum. Genet. 58, 343-357.
Lin, S., Thompson, E., and Wijsman, E. (1994b) Finding noncommunicating sets for Markov chain Monte Carlo estimations on pedigrees. Am. J. Hum. Genet. 54, 695-704.
Lin, S., and Wijsman, E. (1994) Monte Carlo multipoint linkage analysis. Am. J. Hum. Genet. 55, A40.
Lin, S. (1995) A scheme for constructing an irreducible Markov chain for pedigree data. Biometrics 51, 318-322.
MacCluer, J. W., Vandeburg, J. L., Read, B., and Ryder, O. A. (1986) Pedigree analysis by computer simulation. Zoo Biol. 5, 149-160.
Mendel, G. (1865) Experiments in Plant Hybridisation. Mendel's original paper in English translation, with a commentary by R. A. Fisher. Oliver and Boyd, Edinburgh, 1965.
Metropolis, N., Rosenbluth, A. W., Rosenbluth, M. N., Teller, A. H., and Teller, E. (1953) Equation of state calculations by fast computing machines. J. Chem. Phys. 21, 1087-1092.
Miller, P. L., Nadkarni, P., Gelernter, J. E., Carriero, N., Pakstis, A. J., and Kidd, K. K. (1991) Parallelizing genetic linkage analysis: a case study for applying parallel computation in molecular biology. Comput. Biomed. Res. 24, 234-248.
Murray, J. C., Buetow, K. H., Weber, J. L., Ludwigson, S., Scherpier-Heddema, T., Manion, F., Quillen, J., Sheffield, V. C., Sunden, S., Duyk, G. M., Weissenbach, J., Gyapay, G., Dib, C., Morrissette, J., Lathrop, G. M., Vignal, A., White, R., Matsunami, N., Gerken, S., Melis, R., Albertsen, H., Plaetke, R., Odelberg, S., Ward, D., Dausset, J., Cohen, D., and Cann, H. (1994) A comprehensive human linkage map with centimorgan density. Science 265, 2049-2064.
Ott, J. (1974) Computer simulation in human linkage analysis. Am. J. Hum. Genet. 26, 64A.
Ott, J. (1979) Maximum likelihood estimation by counting methods under polygenic and mixed models in human pedigrees. Am. J. Hum. Genet. 31, 161-175.
Ott, J. (1989) Computer-simulation methods in human linkage analysis. Proc. Natl. Acad. Sci. USA (Genetics) 86, 4175-4178.
Ott, J. (1991) Analysis of Human Genetic Linkage. The Johns Hopkins University Press, Baltimore, MD.
Palmer, S. E., Dale, D. C., Livingston, R. J., Wijsman, E. M., and Stephens, K. (1994) Autosomal dominant hematopoiesis: exclusion of linkage to the major hematopoietic regulatory gene cluster on chromosome 5. Hum. Genet. 93, 195-197.
Ploughman, L. M. and Boehnke, M. (1989) Estimation of the power of a proposed linkage study for a complex genetic trait. Am. J. Hum. Genet. 44, 543-551.
Raftery, A. and Lewis, S. (1992) How many iterations in the Gibbs sampler? In Bayesian Statistics 4 (eds. J. M. Bernardo, J. Berger, A. P. Dawid and A. F. M. Smith), 765-776.
Rikvold, P. A., and Gorman, B. M. (1994) Recent results on the decay of metastable phases. Technical report 64, Supercomputer Computations Research Institute, Florida State University, Tallahassee, Florida.
Schaffer, A. A., Gupta, S. K., Shriram, K., and Cottingham, R. W. Jr. (1994) Avoiding recomputation in linkage analysis. Hum. Hered. 44, 225-237.
Schellenberg, G. D., Pericak-Vance, M. A., Wijsman, E. M., Boehnke, M., Moore, D. K., Gaskell, P. C. Jr., Yamaoka, L. A. et al. (1990) Genetic analysis of familial Alzheimer's disease using chromosome 21 markers. Neurobiol. Aging 11, 320.
Schellenberg, G. D., Bird, T. D., Wijsman, E. M., Orr, H. T., Anderson, L., Nemens, E., White, J. A., Bonnycastle, L., Weber, J. L., Alonso, M. E., Potter, H., Heston, L. L., and Martin, G. M. (1992) Genetic linkage evidence for a familial Alzheimer's disease locus on chromosome 14. Science 258, 668-671.
Sheehan, N. (1990) Genetic reconstruction on pedigrees. Ph.D. Thesis, Department of Statistics, University of Washington, Seattle, WA.
Sheehan, N. and Thomas, A. (1993) On the irreducibility of a Markov chain defined on a space of genotype configurations by a sampling scheme. Biometrics 49, 163-175.
Smith, A. F. M. and Roberts, G. O. (1993) Bayesian computation via the Gibbs sampler and related Markov chain Monte Carlo methods. J. Roy. Statist. Soc. B 55, 3-23.
Tanner, M. A. and Wong, W. H. (1987) The calculation of posterior distributions by data augmentation (with discussion). J. Am. Statist. Assoc. 82, 528-550.
Thomas, D. C. and Cortessis, V. (1992) A Gibbs sampling approach to linkage analysis. Hum. Hered. 42, 63-76.
Thompson, E. A. and Guo, S.-W. (1991) Evaluation of likelihood ratios for complex genetic models. IMA J. Math. Appl. Med. Biol. 8, 149-169.


Thompson, E., Lin, S., Olshen, A., and Wijsman, E. (1993) Monte Carlo analysis of a large hypercholesterolemia pedigree. Genet. Epidemiol. 10, 677-682.
Thompson, E. A. (1994a) Monte Carlo likelihood in the genetic analysis of complex traits. Phil. Trans. Roy. Soc. London Ser. B 344, 345-351.
Thompson, E. A. (1994b) Monte Carlo likelihood in genetic mapping. Statist. Sci. 9, 355-366.
Wang, S. J., and Thomas, D. (1994) A Gibbs sampling approach to linkage analysis with multiple polymorphic markers. Technical report 85, Department of Preventive Medicine, University of Southern California, Los Angeles.
Wright, S. and McPhee, H. C. (1925) An approximate method of calculating coefficients of inbreeding and relationship from livestock pedigrees. J. Agricul. Res. 31, 377-383.

INTERFERENCE, HETEROGENEITY AND DISEASE GENE MAPPING

BRONYA KEATS*

The Human Genome Project has had a major impact on genetic research over the past five years. The number of mapped genes is now over 3,000 compared with approximately 1,600 in 1989 (Human Gene Mapping 10, [5]) and only about 260 ten years before that (Human Gene Mapping 5, [4]). The realization that extensive variation could be detected in anonymous DNA segments (Botstein et al. [1]) greatly enhanced the potential for mapping by linkage analysis. Previously, linkage studies had depended on polymorphisms that could be detected in red blood cell antigens, proteins (revealed by electrophoresis and isoelectric focusing), and cytogenetic heteromorphisms. The identification of thousands of polymorphic DNA markers throughout the human genome has led to the construction of high density genetic linkage maps. These maps provide the data necessary to test hypotheses concerning differences in recombination rates and levels of interference. They are also important for disease gene mapping because the existence of these genes must be inferred from the phenotype. Showing linkage of a disease gene to a DNA marker is the first step towards isolating the disease gene, determining its protein product, and developing effective therapies. However, interpretation of results is not always straightforward. Factors such as etiological heterogeneity and undetected irregular segregation can lead to confusing linkage results and incorrect conclusions about the locations of disease genes. This paper will discuss these phenomena and present examples that illustrate the problems, as well as approaches to dealing with them.

Genetic markers. Any detectable variation provides a potential marker for linkage analysis. Several different types of DNA polymorphisms have been developed. Those that are easy to detect and have high heterozygosity ($1 - \sum_i p_i^2$, where $p_i$ is the frequency of the i-th allele) are preferred, and many such markers have been placed on genetic linkage maps. This endeavor has been helped by the Centre D'Etude Polymorphisme Humain (CEPH) collaboration, in which many markers have been typed using the same set of families in different laboratories (Dausset et al. [3]). The majority of DNA markers used for linkage studies are short tandem repeat polymorphisms (STRPs) or microsatellites (Weber and May, [21]). They are very short repeated sequences, usually 2-4 base pairs. The variation in number of repeats is easily detected by first using the polymerase chain reaction (PCR) with appropriate primers to amplify the relevant piece of DNA and then separating the fragments by electrophoresis on polyacrylamide sequencing gels. Bands are generally visualized by autoradiography or fluorescence. Most STRPs on linkage maps have much higher heterozygosities than another type of DNA marker, the Restriction Fragment Length Polymorphism (RFLP). Detection of an RFLP requires Southern blotting and hybridization to a cloned DNA probe after digestion of genomic DNA with a restriction endonuclease. In addition to having higher heterozygosities than RFLPs, STRPs are far less time consuming to genotype and are much more abundant in the genome. Variable number of tandem repeat (VNTR) markers, or minisatellites, are detected in the same way as RFLPs, but the variation is a result of differences in the number of times a sequence is repeated between two restriction sites. They have high heterozygosities but are found far less often than STRPs and tend to congregate near the telomeres.

* Department of Biometry and Genetics, and Center for Molecular and Human Genetics, Louisiana State University Medical Center, New Orleans, Louisiana 70112.

Genetic linkage map. The physical map and the genetic linkage map must have the same marker order. Distances on the two maps, however, are not closely proportional, and male genetic distance differs from that in females. Distance on the physical map is measured in base pairs while genetic distance is a function of meiotic recombination rate. Genetic map distances are additive; recombination fractions are not. The genetic map distance between markers is measured in terms of the average number of crossovers per chromatid that occur between them. The unit of genetic distance is the Morgan, one Morgan being the interval that yields an average of one crossover per chromatid. As each crossover event involves two chromatids and there are four chromatids present during meiosis when crossing over occurs, an average of one crossover per chromatid is equivalent to an average of two chiasmata. Thus, genetic distance is equal to half the mean number of chiasmata occurring between two markers.
Genetic distance may also be given in centiMorgans (cM): 1 Morgan = 100 cM. If the genetic length of a chromosome is 2 Morgans, then an average of two crossovers per chromatid or four chiasmata occur on this chromosome. In males approximately 53 chiasmata per cell are observed cytogenetically. Therefore, male genetic length is about 26.5 Morgans. Although genetic distance is not proportional to physical distance, in general, the longer the physical length, the longer the genetic length. The total human haploid genome is approximately $3 \times 10^9$ base pairs, and the total sex-averaged genetic length is estimated to be about 33 Morgans. Thus, on average, one centiMorgan is equivalent to about a million base pairs, although this correspondence varies throughout the genome; there are both short physical segments with high recombination rates and long segments with low recombination rates. For example, chromosome 19 is one of the shortest chromosomes with a physical length of only about 62 megabases, while its male genetic length is 114 cM and its female genetic length is 128 cM (Weber et al. [22]). Thus, for this chromosome, one centiMorgan is equivalent to about 500,000 base pairs.
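The arithmetic in this paragraph can be checked with a few lines of code. The figures are those quoted in the text; taking the sex-averaged length of chromosome 19 as the simple mean of the male and female lengths is an assumption made here for illustration, and the function names are hypothetical.

```python
def morgans_from_chiasmata(mean_chiasmata):
    """Genetic length in Morgans: half the mean number of chiasmata."""
    return mean_chiasmata / 2.0


def bp_per_centimorgan(physical_bp, genetic_cm):
    """Average number of base pairs corresponding to one centiMorgan."""
    return physical_bp / genetic_cm


# Male genetic length from about 53 chiasmata per cell.
male_length_morgans = morgans_from_chiasmata(53)

# Genome-wide average: 3e9 base pairs over 33 Morgans (3300 cM).
genome_bp_per_cm = bp_per_centimorgan(3e9, 33 * 100)

# Chromosome 19: 62 megabases; sex-averaged length assumed to be the
# simple mean of the male (114 cM) and female (128 cM) lengths.
chr19_sex_avg_cm = (114 + 128) / 2
chr19_bp_per_cm = bp_per_centimorgan(62e6, chr19_sex_avg_cm)
```

The genome-wide value comes out at roughly 0.9 million base pairs per centiMorgan and the chromosome 19 value at roughly 0.5 million, in line with the rounded figures given above.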


Keats et al. [9] presented guidelines for genetic linkage maps. The linkage map is constructed by statistical analysis and the logarithm of the likelihood ratio, $\log_{10}(L_1/L_2)$, is generally used to measure support. A map consisting of markers for which order is well-supported is called a framework map. At least three measures of support are of relevance in building a linkage map. Global support is the evidence that a marker belongs to a linkage group; it is calculated by setting $L_1$ as the maximum likelihood when the marker is placed in the linkage group and $L_2$ as the likelihood when the marker has free recombination with the linkage group. Interval support provides the evidence that a marker is in a specified order relative to a set of framework markers. In this case, $L_1$ is the likelihood under the given order and $L_2$ is the highest likelihood obtained by placing the marker in any other interval on the framework map. Support for order of a set of markers is calculated by taking $L_1$ as the likelihood under the favored order and $L_2$ as the likelihood under a competing order. For each of these measures of support a value of at least 3 is recommended. Accurate genotyping is essential for the construction of linkage maps. Even a low error rate can substantially inflate map length (Buetow [2]), and typing errors may sometimes lead to incorrect orders.

Interference. The phenomenon of interference needs to be considered in the construction of linkage maps. Recombination frequencies are not additive because multiple crossovers may occur between markers. An offspring is a nonrecombinant if an even number of crossovers occurs between two markers, and a recombinant if an odd number of crossovers occurs between the two markers. In addition, crossing over in one region interferes with crossing over in a neighboring region. Two types of interference may be differentiated: chiasma interference and chromatid interference.
Chiasma interference is the influence of an already formed chiasma on the formation of a new one. If the interference is positive, then a second chiasma is less likely to occur, and if it is negative, a second chiasma is more likely to occur than would be expected by chance. Chromatid interference is the departure from random occurrence of any of the four strands in the formation of chiasmata. It is difficult to detect and good evidence that it exists has not yet been found. Under complete interference the genetic map distance is equal to the recombination fraction and the distance between two markers is at most 50 centiMorgans. Assuming no interference simplifies calculations but it leads to considerable overestimation of map distances. Weber et al. [22] obtained strong evidence for chiasma interference on chromosome 19. Although they made a number of simplifying assumptions, the observed number of double recombinants was significantly lower than that expected if there is no interference. Sex heterogeneity. Male and female estimates of the recombination fraction are different for many regions of the genome. Thus, male and female linkage maps need to be estimated separately, constrained by order.
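The relation between recombination fraction and map distance under the two extreme assumptions in the interference discussion above can be sketched numerically. For the no-interference case the sketch uses Haldane's map function, a standard result that assumes crossovers occur as a Poisson process; it is not named in the text and is brought in here only as an illustration. The function names are hypothetical.

```python
import math


def haldane_distance(theta):
    """Map distance in Morgans under no interference (Haldane's map function)."""
    if not 0.0 <= theta < 0.5:
        raise ValueError("recombination fraction must lie in [0, 0.5)")
    return -0.5 * math.log(1.0 - 2.0 * theta)


def haldane_theta(distance):
    """Recombination fraction for a map distance in Morgans, no interference."""
    return 0.5 * (1.0 - math.exp(-2.0 * distance))


def complete_interference_distance(theta):
    """Under complete interference the map distance equals the recombination fraction."""
    return theta


# Two adjacent intervals, each with recombination fraction 0.1.
d_each = haldane_distance(0.1)
d_total = 2 * d_each                  # map distances add
theta_total = haldane_theta(d_total)  # recombination fractions do not

# The same recombination fraction gives a longer map without interference.
d_no_interference = haldane_distance(0.2)
d_complete = complete_interference_distance(0.2)
```

With no interference the recombination fraction across both intervals is 0.18 rather than 0.2, since double crossovers restore the parental combination. The last two lines show why assuming no interference yields larger map distances than complete interference for the same observed recombination fraction.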


BRONYA KEATS

TABLE 1
Male and female genetic distances in telomeric (short arm and long arm) and centromeric regions of chromosome 19.

Location       Male distance (cM)   Female distance (cM)
short arm      28.8                 7.1
centromeric    2.4                  11.3
long arm       27.4                 6.8

Markers: D19S20, D19S247, D19S199, D19S49, D19S180, D19S254

Overall, female genetic length is longer than male genetic length, but the ratio varies with position on the chromosome. For some chromosomes there appears to be an excess of female recombination near the centromere and an excess of male recombination near the telomeres, but these relationships are not yet known precisely. Table 1 shows male and female map distances for regions of chromosome 19 near the centromere and near the telomeres of the short arm and the long arm.

Etiological heterogeneity. Linkage studies to map disease genes show that identical clinical phenotypes do not necessarily mean that the disease is caused by a mutation in the same gene in all affected individuals. Morton [14] analysed families with elliptocytosis and showed that the gene causing the disease was linked to the Rhesus blood group on the short arm of chromosome 1 in some families but not in others. This conclusion was based on his finding that there was significant heterogeneity of the recombination fraction among families. Thus, variation in the recombination fraction suggests that genes at more than one chromosomal location may cause the same clinical phenotype. Another example of this heterogeneity is for the neuropathy, Charcot-Marie-Tooth type I, in which patients have very slow nerve conduction velocities. Initial studies suggested linkage to the Duffy blood group on chromosome 1 in a few families but not in others. Additional studies showed that in many of the unlinked families the disease gene was linked to markers on the short arm of chromosome 17 (Vance et al. [20]). Thus heterogeneity of the recombination fraction first indicated that more than one gene may cause this neuropathy, and proof of this was obtained when the location of a second gene for the disease was found. Two further diseases for which several genes cause the same clinical phenotype

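INTERFERENCE, HETEROGENEITY AND DISEASE GENE MAPPING

FIG. 1. Haplotypes for family showing recombination between D11S1397 and D11S921. [Pedigree not reproduced; the markers shown are D11S861, D11S419, D11S1397, D11S921, D11S1310, and D11S899, and a single haplotype, 8 2 1 5 5 7, is legible in the source.]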

are discussed below. They are Usher syndrome type I and spinocerebellar ataxia.

Usher Syndrome. Usher syndrome is characterized by hearing impairment, retinitis pigmentosa, and recessive inheritance. Three types are distinguished clinically based on severity and progression of the hearing impairment. Family studies of the three types of Usher syndrome have demonstrated genetic as well as clinical heterogeneity. Three genes for type I have been localized to the short arm of chromosome 11 (Smith et al. [18]), the long arm of chromosome 11 (Kimberling et al. [10]), and the long arm of chromosome 14 (Kaplan et al. [6]). Kimberling et al. [11] and Lewis et al. [12] assigned a gene for type II to chromosome 1, and a gene for type III was recently assigned to chromosome 3 (Sankila et al. [17]). One strategy to reduce the chance that different genes are responsible for Usher syndrome type I in a set of families is to select families from an isolated population such as the Acadians of southwestern Louisiana. According to Rushton [16], about 4,000 Acadians made their way to Louisiana during the second half of the 18th century when the English ordered their expulsion from Acadia (now Nova Scotia and surrounding areas). They settled on the plains among the bayous of southwestern Louisiana and remained relatively isolated because of linguistic, religious, and cultural cohesiveness, as well as geographic isolation. The gene for Usher syndrome type I (USH1C) on the short arm of chromosome 11 has been found only in the Acadian population, and the region containing the disease gene was refined by Keats et al. [7]. Figure 1

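FIG. 2. Map showing location of the Acadian Usher syndrome type I gene (USH1C). [Map not reproduced. Chromosome bands labeled 15.5, 15.4, 15.3, 15.2, 15.1, and 14; marker order with the distance in cM between adjacent markers: D11S861 -(1)- D11S419 -(1)- D11S1397 -(0.5)- D11S921 -(0.5)- D11S1310 -(1)- D11S899.]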

shows a family in which recombination between the markers D11S1397 and D11S921 is observed in one of the affected offspring. This result provides strong evidence that USH1C is flanked on one side by the marker D11S1397. In order to find a flanking marker on the other side of USH1C, we examined the marker alleles that were inherited with the disease alleles in each affected individual. Table 2 shows that the same D11S921 allele was found on all 54 chromosomes with the disease allele but four of these chromosomes had a different allele for D11S1310. Thus, USH1C is likely to be between D11S1397 and D11S1310. Figure 2 shows the map giving the order of the markers and the distances between them measured in centiMorgans. The region to which we have mapped the gene for Acadian Usher syndrome type I is about 1.2 centiMorgans, which is probably less than 1.5 megabases of DNA, and we are continuing our efforts to isolate and characterize this disease gene.

Spinocerebellar Ataxia. The spinocerebellar ataxias are a heterogeneous group of disorders characterized by lack of coordination of movements due to progressive neurodegeneration in the cerebellum. The age of onset of symptoms is usually between the third and fifth decades, and death occurs 10 to 15 years later. Several different genes that cause dominantly inherited spinocerebellar ataxia have now been localized. Genetic heterogeneity complicates the search for disease genes. Finding a recombination event is critical to defining flanking markers, but the possibility that the disease gene is elsewhere cannot be ignored especially

TABLE 2
Marker alleles associated with the Acadian Usher chromosome.

D11S1397   D11S921   D11S1310   D11S899   Usher   Non-Usher
3          4         3          2         40      1
1          4         3          2         1       0
3          4         3          9         5       1
3          4         3          6         1       0
3          4         3          4         1       0
3          4         3          8         2       2
3          4         4          7         1       0
3          4         5          6         1       0
3          4         4          2         1       1
1          4         4          9         1       1
Other                                     0       44
Total                                     54      50

TABLE 3
Lod scores for SCA1.

           Recombination fraction
Marker     0.0     .01     .05     .1      .2      .3      .4
HLA        -∞      -3.5    -2.2    -1.5    -0.8    -0.4    -0.2
D6S89      4.9     4.8     4.4     3.9     2.8     1.7     0.7

if the family is small. On the other hand, results that suggest exclusion of a gene from a region may be misleading. Originally the location of SCA1 (spinocerebellar ataxia type I) on chromosome 6 had been demonstrated through linkage to HLA. Keats et al. [8] reported a family where evidence of linkage to HLA was not obtained and the initial conclusion was that a different gene was responsible for the disease in this family. However, a more tightly linked marker, D6S89, was found (Zoghbi et al. [23]), and Keats et al. [8] showed that there was no recombination between this marker and the disease gene in their family. Table 3 gives the lod scores with HLA and D6S89; these two markers are about 15 cM apart on the short arm of chromosome 6.

Unusual segregation. Etiological heterogeneity complicates the interpretation of linkage results and is of major concern because it is relatively common. Unusual segregation patterns appear to be less common but when they occur linkage results can be confusing and misleading.

Charcot-Marie-Tooth Disease. Charcot-Marie-Tooth neuropathy is a heterogeneous disease characterized by slowly progressive muscle weakness and atrophy. The most common mode of inheritance is autosomal dominant and a gene on the short arm of chromosome 17 accounts for the majority of these cases. Vance et al. [20] reported linkage of the disease gene (CMT1A) to markers on chromosome 17. However, the marker D17S122 gave discrepant results. Based on known map distances this marker should have been tightly linked to CMT1A, but many recombination events were observed. This inconsistent result was resolved when Lupski et al. [13] demonstrated the presence of a duplication. In a large family reported by Nicholson et al. [15] the maximum lod score increased from 0.5 at a recombination fraction of 0.3 to 34.3 at zero recombination after taking the duplication into account. The effect of the duplication on recombination is seen in Figure 3 where the father and all of the offspring would be assigned the genotype 1/2 if the duplication were ignored. In this case at least two of the offspring must be recombinants. When the presence of the third allele is recognized the genotypes are consistent with no recombination.

FIG. 3. Genotypes for the marker D17S122. Individuals with Charcot-Marie-Tooth disease (solid squares and circles) have three alleles. [Pedigree not reproduced; genotype labels legible in the source are 1/(1,2) and 1/2.]

Uniparental Disomy. The phenomenon of uniparental disomy, in which both copies of a chromosome are inherited from one parent, also leads to inconsistent linkage results. This event is relatively rare, but it has been documented in several clinical disorders. For example, Spence et al. [19] showed that it was the cause of a case of the recessively inherited disorder, cystic fibrosis. Rather than inheriting one copy of the defective gene from each parent, both copies came from the mother. Genotyping of chromosome 7 markers showed that the child had two maternal copies of this chromosome and no paternal chromosome 7. Although inconsistencies between father and offspring are almost certain to be found in this situation, some markers are likely to give compatible genotypes and recombination would be assumed to have occurred. For example, if the parental genotypes


at a marker tightly linked to the disease gene are 1/2 and both an affected and an unaffected offspring have the genotype 1/1, then one of the offspring would be assumed to be a recombinant. In fact, however, uniparental disomy may explain the affected individual.

Conclusions. The discovery of thousands of highly polymorphic microsatellite markers that span the genome at small intervals has had a huge impact on our understanding of the genetic linkage map. As well as leading to the localization of disease genes, it has provided the tools necessary to study variation in recombination among groups and to examine the phenomenon of interference. Several unexpected results have changed our way of thinking about transmission of alleles from one generation to the next. The research that is resulting from the goals of the Human Genome Project is truly revolutionary and will benefit mankind in many ways.

REFERENCES

[1] Botstein, D., White, R. L., Skolnick, M., Davis, R. Construction of a genetic linkage map in man using restriction fragment length polymorphisms. Am. J. Hum. Genet., 32:314-331, 1980.
[2] Buetow, K. H. Influence of aberrant observations on high-resolution linkage analysis outcomes. Am. J. Hum. Genet., 49:985-994, 1991.
[3] Dausset, J., Cann, H., Cohen, D., et al. Centre d'Etude du Polymorphisme Humain (CEPH): Collaborative genetic mapping of the human genome. Genomics, 6:575-577, 1990.
[4] Human Gene Mapping 5: Fifth International Workshop on Human Gene Mapping. Cytogenet. Cell Genet., 25:1-236, 1979.
[5] Human Gene Mapping 10: Tenth International Workshop on Human Gene Mapping. Cytogenet. Cell Genet., 51:1-1148, 1989.
[6] Kaplan, J., Gerber, S., Bonneau, D., Rozet, J., Delrieu, O., Briard, M., Dollfus, H., Ghazi, I., Dufier, J., Frezal, J., Munnich, A. A gene for Usher syndrome type I (USH1) maps to chromosome 14q. Genomics, 14:979-988, 1992.
[7] Keats, B. J. B., Nouri, N., Pelias, M. Z., Deininger, P. L., Litt, M. Tightly linked flanking microsatellite markers for the Usher syndrome type I locus on the short arm of chromosome 11. Am. J. Hum. Genet., 54:681-686, 1994.
[8] Keats, B. J. B., Pollack, M. S., McCall, A., Wilensky, M. A., Ward, L. J., Lu, M., Zoghbi, H. Y. Tight linkage of the gene for spinocerebellar ataxia to D6S89 on the short arm of chromosome 6 in a kindred for which close linkage to both HLA and F13A1 is excluded. Am. J. Hum. Genet., 49:972-977, 1991.
[9] Keats, B. J. B., Sherman, S. L., Morton, N. E., Robson, E. B., Buetow, K. H., Cartwright, P. E., Chakravarti, A., Francke, U., Green, P. P., Ott, J. Guidelines for human linkage maps: An international system for human linkage maps (ISLM 1990). Genomics, 9:557-560, 1991.
[10] Kimberling, W. J., Moller, C. G., Davenport, S., Priluck, I. A., Beighton, P. H., Greenberg, J., Reardon, W., Weston, M. D., Kenyon, J. B., Grunkmeyer, J. A., Pieke Dahl, S., Overbeck, L. D., Blackwood, D. J., Brower, A. M., Hoover, D. M., Rowland, P., Smith, R. J. H. Linkage of Usher syndrome type I gene (USH1B) to the long arm of chromosome 11. Genomics, 14:988-994, 1992.
[11] Kimberling, W. J., Weston, M. D., Moller, C. G., Davenport, S. L. H., Shugart, Y. Y., Priluck, I. A., Martini, A., Smith, R. J. H. Localization of Usher syndrome type II to chromosome 1q. Genomics, 7:245-249, 1990.
[12] Lewis, R. A., Otterud, B., Stauffer, D., Lalouel, J. M., Leppert, M. Mapping recessive ophthalmic diseases: Linkage of the locus for Usher syndrome type II to a DNA marker on chromosome 1q. Genomics, 7:250-256, 1990.
[13] Lupski, J. R., Montes de Oca-Luna, R., Slaugenhaupt, S., Pentao, L., Guzzetta, V., Trask, B. J., Saucedo-Cardenas, O., Barker, D. F., Killian, J. M., Garcia, C. A., Chakravarti, A., Patel, P. I. DNA duplication associated with Charcot-Marie-Tooth disease type 1A. Cell, 66:219-232, 1991.
[14] Morton, N. E. The detection and estimation of linkage between the genes for elliptocytosis and the Rh blood type. Am. J. Hum. Genet., 8:80-96, 1956.
[15] Nicholson, G. A., Kennerson, M. L., Keats, B. J. B., Mesterovic, N., Churcher, W., Barker, D., Ross, D. A. Charcot-Marie-Tooth neuropathy type 1A mutation: Apparent crossovers with D17S122 are due to a duplication. Am. J. Med. Genet., 44:455-460, 1992.
[16] Rushton, W. F. The Cajuns: From Acadia to Louisiana. New York: Farrar Straus Giroux, 1979.
[17] Sankila, E. M., Pakarinen, L., Sistonen, P., Aittomaki, K., Kaariainen, H., Karjalainen, S., De la Chapelle, A. The existence of Usher syndrome type III proven by assignment of its locus to chromosome 3q by linkage. Am. J. Hum. Genet., (supplement) 55:A15, 1994.
[18] Smith, R. J. H., Lee, E. C., Kimberling, W. J., Daiger, S. P., Pelias, M. Z., Keats, B. J. B., Jay, M., Bird, A., Reardon, W., Guest, M., Ayyagari, R., Hejtmancik, J. F. Localization of two genes for Usher syndrome type 1 to chromosome 11. Genomics, 14:995-1002, 1992.
[19] Spence, J. E., Perciaccante, R. G., Greig, G. M., Willard, H. F., Ledbetter, D. H., Hejtmancik, J. F., Pollack, M. S., O'Brien, W. E., Beaudet, A. L. Uniparental disomy as a mechanism for human genetic disease. Am. J. Hum. Genet., 42:217-226, 1989.
[20] Vance, J. M., Nicholson, G. A., Yamaoka, L. S., Stajich, J., Stewart, C. S., Speer, M. C., Hung, W., Roses, A. D., Barker, D., Pericak-Vance, M. A. Linkage of Charcot-Marie-Tooth neuropathy type 1a to chromosome 17. Exp. Neurol., 104:186-189, 1989.
[21] Weber, J. L., May, P. M. Abundant class of human DNA polymorphisms which can be typed using the polymerase chain reaction. Am. J. Hum. Genet., 44:388-396, 1989.
[22] Weber, J. L., Wang, Z., Hansen, K., Stephenson, M., Kappel, C., Salzman, S., Wilkie, P. J., Keats, B. J., Dracopoli, N. C., Brandriff, B. F., Olsen, A. S. Evidence for human meiotic recombination interference obtained through construction of a short tandem repeat polymorphism linkage map of chromosome 19. Am. J. Hum. Genet., 53:1079-1095, 1993.
[23] Zoghbi, H. Y., Jodice, C., Sandkuijl, L. A., Kwiatkowski, T. J., McCall, A. E., Huntoon, S. A., Lulli, P., Spadaro, M., Litt, M., Cann, H. M., Frontali, M., Luciano, T. The gene for autosomal dominant spinocerebellar ataxia (SCA1) maps telomeric to the HLA complex and is closely linked to the D6S89 locus in three large kindreds. Am. J. Hum. Genet., 49:23-30, 1991.

ESTIMATING CROSSOVER FREQUENCIES AND TESTING FOR NUMERICAL INTERFERENCE WITH HIGHLY POLYMORPHIC MARKERS

JURG OTT*

* Department of Genetics and Development, Columbia University, Unit 58, 722 West 168 Street, New York, NY 10032. E-mail: [email protected]

Abstract. Interference may be viewed as having two aspects, numerical interference referring to the numbers of crossovers occurring, and positional interference referring to the positions of crossovers. Here, the focus is on numerical interference and on methods of testing for its presence. A dense map of highly polymorphic markers is assumed so that each crossover can be observed. General relationships are worked out between crossover distributions and underlying chiasma distributions. It is shown that crossover distributions may be invalid, and methods are developed to estimate valid crossover distributions from observed counts of crossovers. Based on valid estimates of crossover distributions, tests for interference and development of empirical map functions are outlined. The methods are applied to published data on human chromosomes 9 and 19.

1. Introduction. Below, standard genetic terminology is used. To avoid confusion, the following definitions are provided: Chiasma refers to the cytologically observable phenomenon that in meiosis the two homologous chromosomes establish close contact at some point(s) along their lengths. Several such chiasmata per chromosome may occur. Crossing-over (or crossover) is the process of reciprocal exchange between homologous chromosomes in meiosis (Nilsson et al. 1993). On a chromosome received by an individual from one of his parents, blocks of loci originating in one grandparent alternate with blocks of loci from the other grandparent. The switch of grandparental origins is caused by the occurrence of a crossover, which is known to involve one strand (chromatid) from each of the two homologous chromosomes (Mather 1938). In a gamete, the point on a chromosome separating two blocks of loci from different grandparents is called a crossover point or point of exchange. Occurrence of a crossover is believed to be the result of the formation of a chiasma but doubts have been raised whether this 1:1 relationship holds universally (Nilsson et al. 1993). In particular, in plant species, map distance estimates based on chiasma counts were compared with those based on RFLP maps, where the former turned out to be far lower than the latter (Nilsson et al. 1993). On the other hand, as is well known in experimental genetics, crossing-over leads to the formation of the so-called Holliday structure; it may be resolved by a cut of strands in one of two ways, with one cut leading to strands containing a crossover point between two markers on either side of the cut while the other cut does not result in a crossover point (Ayala and Kiger 1984). Thus, chiasma frequencies would be expected to be higher than predicted on genetic grounds. At any rate, the material in this chapter addresses only those chiasmata with genetic consequences, that is, chiasmata of which each results in a crossover point on two of the four gametes.

Consider two alleles, one at each of two loci, received by an offspring from one of his parents. A recombination is said to have occurred between the two loci if the two alleles originated in different grandparents, whereas a nonrecombination corresponds to allelic origin in the same grandparent. When the two loci reside on different chromosomes, recombination is a random event (occurring with probability 1/2) due to random inclusion of either of the two chromatids in a gamete. For loci on the same chromosome, occurrence of recombination depends on the number of crossover points occurring between the two loci in a gamete. An odd number of crossover points between two loci in a gamete is seen as a recombination and an even number as a nonrecombination. The average number, d, of crossover points (per gamete) between two loci on a chromosome is defined as the genetic distance (d Morgans, M, or 100d centimorgans, cM) between them. It is equal to one half the average number of chiasmata occurring between the two loci. For example, on chromosome 1 (male map length ≈ 2 M; Morton 1991), in male meiosis, an average of approximately four chiasmata are formed so that a gamete resulting from such a meiosis carries an average of two crossover points. If an interval is small enough so that at most one crossover occurs in it, each recombination corresponds to a crossover and the recombination fraction coincides with map length of the interval.

Interference is defined as dependence in the occurrence of crossovers. Two types of interference are generally distinguished (Mather 1938).
Chiasma interference (henceforth simply called interference) refers to the number or position of crossovers, and chromatid interference refers to which chromatids are involved in the chiasma formation. The latter is assumed to be absent in most species. In current human genetics papers, chiasma interference has been referred to under various new names, for example, meiotic recombination interference (Weber et al. 1993) and meiotic crossover interference (Kwiatkowski et al. 1993). When crossovers occur according to a Poisson process, interference is absent. Deviations from the Poisson process can be reflected in the numbers of single and multiple crossovers occurring (here called numerical interference) or in the positions where they occur (positional interference). Interference has been thought to be due to some steric chromosomal property such as stiffness (Haldane 1919). Further, simply restricting the number of crossovers to some minimum or maximum also implies (numerical) interference. For example, the assumption of an obligatory chiasma with otherwise random occurrence of chiasmata implies interference, which is reflected in the Sturt map function (Sturt 1976). This type of interference is sometimes considered not to be "real" in a biochemical sense as its nature is more statistical than due to interaction among crossover events, which would be reflected in positional interference.

Below, criteria will be established for estimating valid crossover distributions. Based on these, tests for detecting numerical interference will be discussed. Much of this book chapter is devoted to theory. Applications to published data for chromosomes 9 and 19 are presented in a section towards the end of this chapter. For all derivations it is assumed that each crossover can be observed unless it occurs outside the map of markers considered. This assumption is realistic when a large number of highly polymorphic markers exist on a chromosome such that intervals are so short that the possibility of multiple crossovers in an interval is negligible.

2. Estimating distributions of chiasmata and crossovers. In this section, the statistical relationships between crossover distributions (proportion of gametes carrying a certain number of crossovers on a given chromosome) and chiasma distributions (proportion of meioses showing a certain number of chiasmata on a given chromosome) are explored. Without chromatid interference, as is assumed here throughout, when a chiasma is formed at some location on a chromosome, the probability is 1/2 that a gamete resulting from the given meiosis will carry a crossover point at the location of the chiasma. Thus, for a given number, c, of chiasmata on a chromosome, the number, k, of crossover points on a gamete follows a Binomial(c, 1/2) distribution. The distribution of k can be obtained from the distribution of c by

(2.1)   $P(K = k) = \sum_{c=0}^{N} P(k \mid c)\, P(C = c),$

where N is the maximum number of crossovers occurring. For finite N, the values of P(k|c) form an N x N triangular matrix,

(2.2)   $P(k \mid c) = \begin{cases} \binom{c}{k} \left(\tfrac{1}{2}\right)^{c} & \text{if } k \le c, \\ 0 & \text{if } k > c. \end{cases}$
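As an illustration of (2.1) and (2.2), the following minimal sketch maps a chiasma distribution to the crossover distribution it implies. The function names and the example chiasma distribution are hypothetical and chosen only for illustration.

```python
from math import comb

def crossover_given_chiasmata(k, c):
    """P(k | c) as in (2.2): Binomial(c, 1/2) probability, zero when k > c."""
    return comb(c, k) * 0.5 ** c if k <= c else 0.0

def crossover_distribution(q):
    """Apply (2.1): q[c] = P(C = c) for c = 0..N; returns p[k] = P(K = k) for k = 0..N."""
    n = len(q)
    return [sum(crossover_given_chiasmata(k, c) * q[c] for c in range(n))
            for k in range(n)]

# Hypothetical chiasma distribution: one or two chiasmata, equally likely.
q = [0.0, 0.5, 0.5]
p = crossover_distribution(q)
print(p)  # [0.375, 0.5, 0.125]
```

The mean number of crossovers in this example (0.75) is half the mean number of chiasmata (1.5), in line with the definition of genetic distance given in the introduction.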

This matrix is of full rank and provides for a 1 : 1 mapping between P( k) and P(c). Each P(c) defines a valid unique P(k). The inverse operation, while numerically unique, may lead from a given P(k) to a set of numbers some of which are negative or larger than 1. In other words, there are crossover distributions that do not correspond to a valid chiasma distribution. Such crossover distributions are biologically meaningless and are, thus, invalid. The direct inverse of (2.1) is easily obtained as

(2.3)   $P(c = i) = 2^{i} \left[ P(k = i) - \sum_{j=i+1}^{N} \binom{j}{i} \left(\tfrac{1}{2}\right)^{j} P(c = j) \right],$

which requires an order of evaluation from the top down, that is, P(c = N) must be calculated first, then P(c = N - 1), etc.

Direct estimates of crossover distributions are typically obtained as multinomial proportions of numbers of crossovers. For example, if n(k) is the observed number of gametes carrying k crossovers, the crossover distribution P(k) is estimated directly by the proportions, n(k) / Σ_i n(i), k = 0, 1, ..., N. However, the estimated class proportions may correspond to an invalid associated chiasma distribution, in which case these proportions are not maximum likelihood estimates (MLEs). Then, the MLE of a crossover distribution must be obtained by a different procedure. The procedure proposed in the next paragraph first carries out transformation (2.3) on the direct crossover frequency estimates. If necessary, the resulting values of P(c) are then transformed into a valid chiasma distribution, which, in turn, leads to the MLE of the crossover distribution; because of the 1:1 nature of transformation (2.1), the MLE of a chiasma distribution also defines the MLE of the crossover distribution derived from it.

A convenient iterative method for obtaining MLEs of crossover distributions works via MLEs of associated valid chiasma distributions. It is based on the following representation of the log likelihood:

(2.4)   $\log L = \sum_{k=0}^{N} n(k) \log \left[ \sum_{i=k}^{M} \binom{i}{k} \left(\tfrac{1}{2}\right)^{i} q_i \right],$

where N is the maximum number of crossovers observed, M (> N) is a suitable upper limit, such as 20, for the number of chiasmata, and the q_i = P(c = i), i = 1, ..., M, are the chiasma class probability parameters to be estimated, with q_0 = 1 - q_1 - q_2 - ... (the estimates of p_i, i = N + 1, ..., M, are all equal to zero). Taking partial derivatives of (2.4) and setting them equal to zero leads to

(2.5)   $\frac{\sum_{k=0}^{N} n(k)\, P(c = 1 \mid k)}{q_1} = \frac{\sum_{k=0}^{N} n(k)\, P(c = 2 \mid k)}{q_2} = \cdots$

Based on expression (2.5), the algorithm starts with an initial chiasma distribution, for example, q_i = 1/(M + 1) for all i. Then, for a given class k of the crossover distribution, the conditional chiasma distribution, P(c|k), is computed and the observations n(k) are probabilistically assigned to the chiasma classes, that is, proportional to the P(c|k). Once this is done for all k, those portions of the crossover observations assigned to a class, P(c = j), are added and the result divided by the total number of observations, thus obtaining an updated estimate of the chiasma distribution, which completes one iteration. Once MLEs of valid chiasma class probabilities have been obtained, they are transformed by (2.1) into the corresponding crossover class frequencies, which are then valid MLEs. This method has been implemented in a program, CROSSOVR, which is now one of the Linkage Utility Programs (Ott 1991). While the approach

FIG. 1. [Figure not reproduced: the (p_1, p_2)-plane with the admissible region for M = 2.]

implemented in CROSSOVR works well and is generally fast, occasionally convergence may be slow so that several thousand iterations are necessary to reach an accuracy of, say, 10^-6 for the chiasma class probabilities.

For small values of M, it is easy to demonstrate analytically that crossover distributions can be invalid. Let p_i = probability of i crossovers, and q_i = probability of i chiasmata. Assume, for example, a maximum of M = 1 chiasma (complete interference) on a chromosome. Then, by (2.1), p_1 = q_1/2, and q_1 = 2p_1. Because q_1 ≤ 1, one must have p_1 ≤ 1/2. Whenever an estimate of p_1 exceeds the value 1/2, the associated chiasma probability q_1 exceeds 1 and is, thus, invalid. Of course, in this case, p_1 coincides with the recombination fraction, which is known to be restricted to values up to 1/2 only. The reason that invalid crossover distributions occur is that gametes produced by a parent are sampled at random. With M = 1, when a chiasma has occurred, half of the gametes will carry a crossover and half of them will not. Thus, one might by chance observe too many gametes carrying a crossover.

For M = 2, the chiasma distribution parameters are given by q_1 = 2(p_1 - 2p_2) and q_2 = 4p_2. Restricting each of the q_i to the range (0, 1) leads to the conditions 2p_2 ≤ p_1 ≤ 1/2. In the (p_1, p_2)-plane, as shown in figure 1, the admissible range of values is contained within a triangle 1/8 the surface of the whole parameter space. With small numbers of observations, due to random fluctuations, it will happen relatively frequently that an observed crossover distribution is invalid. The probability that it is valid increases with the number of gametes investigated and with decreasing values of p_1 and p_2.
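The validity check described above can be sketched by inverting (2.1) through back-substitution, starting from the highest class, and then testing whether every resulting chiasma probability lies between 0 and 1. The function names and the two example crossover distributions are hypothetical.

```python
from math import comb

def chiasma_from_crossover(p):
    """Invert (2.1) by back-substitution, highest class first.
    p[k] = P(k crossovers), k = 0..N; returns q[c] = P(c chiasmata), c = 0..N."""
    n = len(p) - 1
    q = [0.0] * (n + 1)
    for i in range(n, -1, -1):
        # contribution of the already computed higher chiasma classes to p[i]
        tail = sum(comb(j, i) * 0.5 ** j * q[j] for j in range(i + 1, n + 1))
        q[i] = 2 ** i * (p[i] - tail)
    return q

def is_valid(q, tol=1e-12):
    """A chiasma distribution is valid when every entry lies in [0, 1]."""
    return all(-tol <= x <= 1.0 + tol for x in q)

valid_q = chiasma_from_crossover([0.375, 0.5, 0.125])   # satisfies 2*p2 <= p1 <= 1/2
invalid_q = chiasma_from_crossover([0.5, 0.3, 0.2])     # violates 2*p2 <= p1
print(valid_q, is_valid(valid_q))
print(invalid_q, is_valid(invalid_q))
```

In the second example the back-substitution yields a negative value for the class of one chiasma, which is exactly the situation in which the direct proportions are not maximum likelihood estimates.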


3. Obligatory chiasma per chromosome. It is generally assumed that crossing-over is required for proper segregation of the homologous chromosomes in meiosis (Kaback et al. 1992). In all organisms in which recombination normally occurs there seems to be at least one chiasma on each chromosome per meiosis (Baker et al. 1976). As is mentioned in the introduction, this obligatory chiasma is assumed to be resolved such that it has genetic consequences. Presence of an obligatory chiasma is formulated as P(c = 0) = 0, that is, the zero class in the chiasma distribution is missing. In the iterative algorithm described in connection with (2.5) above, the c = 0 class frequency was estimated along with all other class frequencies. It is easy to implement the requirement, P(c = 0) = 0, in this algorithm. At the end of each iteration cycle, the estimate for P(c = 0) is set equal to zero, and all other class frequencies are adjusted to again sum to 1.
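The iterative scheme, including the optional requirement P(c = 0) = 0, might be sketched as follows. This is not the CROSSOVR program; it is a minimal illustration of the update described in the text (assign each n(k) to chiasma classes in proportion to P(c|k), then divide by the total), with a hypothetical function name, hypothetical counts, and a simple stopping rule chosen for the example.

```python
from math import comb

def estimate_chiasma_distribution(counts, max_chiasmata=20, obligatory=False,
                                  tol=1e-6, max_iter=100_000):
    """counts[k] = n(k), the number of gametes with k crossovers.
    Returns (q, p): chiasma class probabilities and the crossover class
    probabilities obtained from them via (2.1), both indexed 0..max_chiasmata."""
    m = max_chiasmata
    if m < len(counts) - 1:
        raise ValueError("max_chiasmata must be at least the largest crossover count")
    total = sum(counts)
    # weight[k][c] = P(k | c): Binomial(c, 1/2) probability, zero when k > c
    weight = [[comb(c, k) * 0.5 ** c if k <= c else 0.0 for c in range(m + 1)]
              for k in range(m + 1)]
    q = [1.0 / (m + 1)] * (m + 1)          # uniform starting values
    for _ in range(max_iter):
        new_q = [0.0] * (m + 1)
        for k, n_k in enumerate(counts):
            joint = [weight[k][c] * q[c] for c in range(m + 1)]
            norm = sum(joint)
            if n_k == 0 or norm == 0.0:
                continue
            for c in range(m + 1):
                # share of the n(k) gametes assigned to chiasma class c
                new_q[c] += n_k * joint[c] / norm
        new_q = [x / total for x in new_q]
        if obligatory:
            # obligatory chiasma: drop the zero class and renormalize
            new_q[0] = 0.0
            s = sum(new_q)
            new_q = [x / s for x in new_q]
        change = max(abs(a - b) for a, b in zip(new_q, q))
        q = new_q
        if change < tol:
            break
    p = [sum(weight[k][c] * q[c] for c in range(m + 1)) for k in range(m + 1)]
    return q, p

# Hypothetical counts of gametes with 0, 1 and 2 crossovers.
q_hat, p_hat = estimate_chiasma_distribution([50, 30, 20], max_chiasmata=2)
print(q_hat, p_hat)
```

Because every iterate is itself a probability distribution over chiasma classes, the crossover frequencies returned are valid by construction, whether or not the direct proportions were.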

4. Incomplete chromosome coverage. Thus far it has been assumed that a chromosome is densely covered by markers and that a marker resides at each of the two chromosome ends. In reality, the two flanking markers may not extend all the way to the ends of the chromosome so that only a proportion, f < 1, of the chromosome will be covered by the marker map. Some of the genetic models discussed below allow for such incomplete chromosome coverage. In the context of chiasma frequency estimation discussed above, incomplete chromosome coverage can only be allowed for with assumptions on chromosomal positions of chiasmata. For example, assume occurrence of at least one chiasma per meiosis. For the case that this is the only chiasma occurring, and under the assumption that it is equally likely to occur anywhere on the chromosome, the probability is f that it will be formed on the marker map, and it will lead to a crossover with probability f/2. Then, the proportion of zero chiasmata in the (valid) chiasma distribution is an estimate for 1 - f, and the proportion of gametes without a crossover is an estimate for 1 - f/2. With multiple chiasmata occurring and some regularity assumptions on where they occur, one finds (details not shown here) that f is approximately estimated by (1 - q_0)/E, where q_0 is the proportion of zero chiasmata on the marker map and E is the mean of the numbers of chiasmata occurring on the entire chromosome. Thus, on longer chromosomes (E > 1), the estimate f = 1 - q_0 is likely to overestimate chromosome coverage. As this chapter is on numerical rather than positional interference, these thoughts are not pursued further.

5. Tests for interference. In this section, the null distribution of crossover numbers under no interference will be compared with the observed numbers of crossovers. Null distributions without and with an obligatory chiasma will be considered.
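The two null distributions just mentioned can be computed numerically by applying (2.1) to a Poisson chiasma distribution, once as it stands and once with the zero class removed and the remaining classes rescaled. The sketch below is an illustration only; the Poisson parameter, the truncation point of the sum, and the function names are assumptions made for the example.

```python
from math import comb, exp, factorial

def poisson_pmf(c, mean):
    """Poisson probability of the value c."""
    return mean ** c * exp(-mean) / factorial(c)

def thin(chiasma_pmf, k, c_max=60):
    """P(K = k) from (2.1): sum over c of Binomial(c, 1/2)(k) * P(C = c),
    with the sum truncated at c_max (an assumption; the tail is negligible here)."""
    return sum(comb(c, k) * 0.5 ** c * chiasma_pmf(c) for c in range(k, c_max + 1))

a = 2.0  # hypothetical Poisson parameter for the number of chiasmata

def plain_chiasma(c):
    return poisson_pmf(c, a)

def obligatory_chiasma(c):
    # zero class removed, remaining classes rescaled to sum to 1
    return 0.0 if c == 0 else poisson_pmf(c, a) / (1.0 - exp(-a))

plain = [thin(plain_chiasma, k) for k in range(6)]
oblig = [thin(obligatory_chiasma, k) for k in range(6)]

for k in range(6):
    # the first column agrees with a Poisson distribution of mean a/2
    print(k, round(plain[k], 4), round(poisson_pmf(k, a / 2), 4), round(oblig[k], 4))
```

The last column shows how requiring at least one chiasma shifts probability away from the zero-crossover class relative to the plain Poisson case.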
It will be seen that restricting observed crossover distributions to valid estimates tends to reduce evidence for interference. Absence of interference implies that the number of chiasmata occurring on a chromosome follows a Poisson distribution with parameter $a$, its mean. The crossover distribution corresponding to this chiasma distribution, by virtue of (2.1), is also Poisson but with mean $b = a/2$, which is the genetic length of the chromosome. The numbers of chiasmata or crossovers occurring on a portion of a chromosome also follow Poisson distributions, with means corresponding to the length of the interval considered. With an obligatory chiasma, under no interference, the number of chiasmata on a chromosome follows a truncated Poisson distribution ($c \ge 1$) but, as shown below, the corresponding number of crossovers is no longer Poisson. Sturt (1976) developed a map function based on the assumption of an obligatory chiasma. Here, frequency distributions of chiasmata and crossovers are given under this assumption. Two cases will be considered: 1) full coverage of a chromosome by the marker map, and 2) incomplete chromosome coverage. First, the crossover distribution over a whole chromosome (here called the Sturt crossover distribution) is discussed given that an obligatory chiasma occurs on each chromosome. Based on the truncated Poisson distribution (zero chiasma class missing), this crossover distribution can be derived by elementary statistical techniques as follows:

\[
P(K = k) =
\begin{cases}
\dfrac{e^{-b} - e^{-2b}}{1 - e^{-2b}} & \text{for } k = 0, \\[2ex]
\dfrac{e^{-b}\, b^{k}}{k!\,\bigl(1 - e^{-2b}\bigr)} & \text{for } k = 1, 2, \ldots
\end{cases}
\tag{5.1}
\]
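As a numerical sanity check on a distribution of this type, the following sketch codes the closed form that results from thinning a zero-truncated Poisson chiasma count (parameter $2b$) with probability ½ per chiasma, and compares it with direct summation over chiasma counts. The function names (`sturt_pmf`, `thinned_pmf`, `b_from_mean`) and the value $b = 0.8$ are assumptions of this sketch, not taken from the text; the closed form is derived from the stated model. The last function inverts the mean relation $m = b/(1 - e^{-2b})$ by the fixed-point iteration $b \leftarrow m(1 - e^{-2b})$ started at $b = m$.

```python
import math

def sturt_pmf(k, b):
    """Closed form assumed in this sketch for P(k crossovers)."""
    norm = 1.0 - math.exp(-2.0 * b)          # P(at least one chiasma)
    if k == 0:
        return (math.exp(-b) - math.exp(-2.0 * b)) / norm
    return math.exp(-b) * b**k / (math.factorial(k) * norm)

def thinned_pmf(k, b, cmax=80):
    """Same probability by summing over chiasma counts c >= max(k, 1),
    each chiasma giving a crossover on the gamete with probability 1/2."""
    a = 2.0 * b
    norm = 1.0 - math.exp(-a)
    total = 0.0
    for c in range(max(k, 1), cmax + 1):
        p_c = math.exp(-a) * a**c / math.factorial(c) / norm
        total += p_c * math.comb(c, k) * 0.5**c
    return total

def b_from_mean(m, iterations=200):
    """Fixed-point iteration b <- m * (1 - exp(-2b)), started at b = m."""
    b = m
    for _ in range(iterations):
        b = m * (1.0 - math.exp(-2.0 * b))
    return b

b = 0.8                                       # illustrative parameter value
probs = [sturt_pmf(k, b) for k in range(60)]  # tail beyond 60 is negligible
mean = sum(k * p for k, p in enumerate(probs))
```

The iteration is a contraction near the solution whenever $b > 0$, which requires a mean above ½; for smaller means the only fixed point is $b = 0$.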

The mean of (5.1) is obtained as

\[ m = \frac{b}{1 - e^{-2b}}, \tag{5.2} \]

where $b$, the single parameter of the Sturt crossover distribution (5.1), has no simple direct interpretation except that it is a monotonic function of the mean. To obtain the value of the parameter $b$ corresponding to a given mean, the following equation may be iterated: $b = m(1 - e^{-2b})$, where initially $b$ is set equal to $m$ on the right-hand side of this equation. The MLE, $\hat b$, of $b$ cannot be obtained in closed form, but rearranging the likelihood equation leads to an iterative solution, (5.3), where $u = \sum_k k\,n(k) / \sum_k n(k)$ is the sample mean and $u_0$ is the sample proportion of gametes with zero crossovers; $u/2$ is a suitable initial value for $b$ in the right side of (5.3). Note that $u$ is not the maximum likelihood estimate of $m$ (5.2). Now, extend this approach to the situation that the marker map only incompletely covers the chromosome. Consider the crossover distribution

\[
\sum_{j \ge i'} (-1)^{\sum_k (j_k - i'_k)} \Bigl[\,1 - 2M\Bigl(\sum_{k:\, j_k = 1} d_k\Bigr)\Bigr] \;\ge\; 0,
\]

where $j \ge i'$ means $j_k \ge i'_k = 1 - i_k$, $k = 1, \ldots, m$. Putting $d_1 = x$, $d_2 = \cdots = d_m = h$, doing some simple manipulation and letting $h \downarrow 0$, yields the condition

\[ (-1)^r G^{(r)}(x) \ge 0, \qquad r = 0, 1, 2, \ldots, \]

where $G = 1 - 2M$ and $G^{(r)}$ is the $r$th derivative of $G$. Thus $G$ is completely monotone on $(0, \infty)$, and one can also show that $G(0) = 1$, $G'(0) = -2$.

It should be clear that equation (4.3) and its generalizations do indeed facilitate further mathematical development, but are they necessary constraints on a map function? The answer here must be no, and we offer three reasons why. While one must agree with Karlin and Liberman (1994, p. 212) that "it is essential and natural to operate with a general genomic region composed of a union from among the segments ...", it is neither essential nor natural that this be done via (4.3) or its generalizations. Indeed (4.3) requires that the chance of having an odd number of crossover points in the first and third of three consecutive intervals on a meiotic product is simply a function of the total map length of these two intervals, and is independent of the map distance between them. This is inconsistent with most data on interference, which indicate that the extent of the interference between two intervals decreases from its highest level when they are adjacent, to a negligible level when they are well separated. Furthermore, using adjectives such as "illegitimate", "not valid" or "unrealistic" to describe map functions, whatever their motivation, for failing to be multilocus feasible must be premature, unless it has been shown that such map functions cannot arise in any probability model for recombination. As we shall see shortly, essentially all map functions currently in the literature can arise in association with stationary renewal chiasma processes and the assumption of NCI. Finally, it is still possible to derive non-trivial constraints on $M$ from the incomplete set of equations relating values of $M$ to multilocus recombination probabilities, without completing the set of equations in what now seems to be a somewhat arbitrary manner.
The following argument is meant to be illustrative, for a systematic study along the lines sketched below has yet to be carried out. Let us go back to the six equations involving $(p_{i_1 i_2 i_3})$ and values of $M$ discussed above. A simple calculation yields the following equation
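A sketch of one such calculation, reconstructed here from the definitions rather than quoted, and assuming that the label (4.6) refers to an identity of this form: write each value of $M$ on a contiguous block of the three intervals as the probability of an odd number of recombinant intervals within that block, then take the alternating combination in which all but two of the $p$'s cancel.

```latex
\begin{aligned}
M(d_1+d_2+d_3) &= p_{100}+p_{010}+p_{001}+p_{111},\\
M(d_1+d_2)     &= p_{100}+p_{101}+p_{010}+p_{011},\\
M(d_2+d_3)     &= p_{010}+p_{110}+p_{001}+p_{101},\\
M(d_2)         &= p_{010}+p_{011}+p_{110}+p_{111},\\[1ex]
M(d_1+d_2+d_3) - M(d_1+d_2) - M(d_2+d_3) + M(d_2) &= 2\,(p_{111}-p_{101}).
\end{aligned}
```

With $d_1 = d_3 = h$ and $d_2 = d$ the left-hand side of the last line becomes the second difference $M(d+2h) - 2M(d+h) + M(d)$, which is the form needed for a statement about $M''(d)$.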

What can we learn from this? In general, perhaps not much, but under NCI, it is easy to check that $p_{111} \le p_{101}$. This is a simple consequence of


T.P. SPEED

the generalized Mather formulae: $p_{111} = \tfrac18 q_{111}$ and $p_{101} = \tfrac18 q_{111} + \tfrac14 q_{101}$. Thus we have shown that under NCI the left-hand side of (4.6) above is non-positive. Put $d_1 = d_3 = h$ and $d_2 = d$, and divide by $h^2$; if $M$ is twice differentiable, we deduce that $M''(d) \le 0$. Thus map functions for processes which satisfy NCI must be bounded between $0$ and $\tfrac12$, have non-negative first derivatives and non-positive second derivatives.

5. Connexions between map functions and chiasma processes. In the previous section we saw that a map function $M$ which satisfies not only the constraints defining multilocus feasibility involving unions of not necessarily contiguous intervals, but also the stronger constraints corresponding to NCI, is representable as $M = \tfrac12(1 - G)$, where $G$ is completely monotone, $G(0) = 1$ and $G'(0) = -2$. We now show that in a sense such $M$ only arise in the context of count-location chiasma processes.

PROPOSITION 5.1. Suppose $X$ to be a count-location chiasma process satisfying NCI. Then $X$ has a map function $M$ such that for any union $A$ of intervals in $[0,1)$ with total map length $d_A$, we have

\[ M(d_A) = \tfrac12\,\bigl[1 - Z_X(A)\bigr]. \tag{5.1} \]

Conversely, suppose $X$ to be a chiasma process satisfying NCI, with a map function $M$. If $M$ satisfies (5.1) for every union $A$ of intervals with total map length $d_A$, then there is a discrete distribution $c$ and a diffuse measure $F$ on $[0, 1)$ such that $X$ has the same distribution as the count-location process with count distribution $c$ and location distribution $F$.

Remarks. (a) The first half of this proposition is in the work of Karlin and Liberman (1978, 1979); we simply recall it to set our notation in place. They showed that if $X$ is a count-location chiasma process with count distribution $c = (c_k)$, and we assume NCI, then $X$ has the map function $M$ given by

\[ M(d) = \tfrac12\left[1 - c\!\left(1 - \tfrac{d}{L}\right)\right], \tag{5.2} \]

where $c(s) = \sum_{k \ge 0} c_k s^k$ is the probability generating function of $c$ and $2L = \sum_{k \ge 0} k\,c_k$ is the mean number of crossover events on the bivalent. The same calculation that proves (5.2) also proves (5.1).

(b) The second half of the proposition has also been proved previously; see Evans et al. (1993). We offer here a more analytic although less direct proof, making use of the facts concerning $G = 1 - 2M$ listed before the statement of the proposition.

Proof of the second half. Suppose $X$ and $M$ to be as postulated, and define the measure $F = \lambda\mu_X$ and the sequence $c_k = (-L)^k G^{(k)}(L)/k!$, $k = 0, 1, \ldots$, where $G = 1 - 2M$. We assert that $F$ is a probability measure on $[0, 1)$, that $c = (c_k)$ is a probability distribution on $0, 1, 2, \ldots$, and that $X$ has the same distribution as the count-location chiasma process with count distribution $(c_k)$ and location measure $F$. The first two assertions are easily

WHAT IS A GENETIC MAP FUNCTION?


checked. $F$ is clearly a probability measure on $[0,1)$. As for the numbers $c_k$, they are clearly non-negative, since $G$ must be completely monotone by the argument of the previous section. Here we make our first use of (5.1), not just for intervals $A$, but for unions of intervals. Furthermore, $\sum_{k \ge 0} c_k = \sum_{k \ge 0} (-L)^k G^{(k)}(L)/k! = G(L - L) = G(0) = 1$. We now see that the probability generating function of this discrete distribution is just $G(L(1 - s))$:

\[ c(s) = \sum_{k \ge 0} s^k c_k = \sum_{k \ge 0} (-sL)^k G^{(k)}(L)/k! = G(L(1 - s)), \]

as stated. It follows from Remark (a) above that this count-location process with count distribution $c$ and location density $F$ has map function $M = M_{c,F}$ given by

\[ M_{c,F}(d) = \tfrac12\left[1 - G\!\left(L\left(1 - \left(1 - \tfrac{d}{L}\right)\right)\right)\right] = M(d). \]
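The construction used in this step can be checked numerically in a simple special case. The sketch below assumes Haldane's map function, so that $G(x) = e^{-2x}$ and $G^{(k)}(x) = (-2)^k e^{-2x}$; the value $L = 1.5$ and all function names are illustrative assumptions. It builds $c_k = (-L)^k G^{(k)}(L)/k!$, which in this case is a Poisson distribution with mean $2L$, and confirms that the resulting count-location map function returns $M$.

```python
import math

L = 1.5  # assumed map length in Morgans; the mean chiasma count is then 2L

def G(x):
    """G = 1 - 2M for Haldane's map function M(d) = (1 - exp(-2d)) / 2."""
    return math.exp(-2.0 * x)

def c_k(k):
    """c_k = (-L)^k G^(k)(L) / k!, using G^(k)(x) = (-2)^k exp(-2x)."""
    return (-L) ** k * (-2.0) ** k * math.exp(-2.0 * L) / math.factorial(k)

def pgf(s, kmax=100):
    """Probability generating function c(s), truncated at kmax terms."""
    return sum(c_k(k) * s**k for k in range(kmax))

def count_location_map(d, L, pgf):
    """Map function of a count-location process: M(d) = [1 - c(1 - d/L)] / 2."""
    return 0.5 * (1.0 - pgf(1.0 - d / L))

total = sum(c_k(k) for k in range(100))           # should be G(0) = 1
mean_count = sum(k * c_k(k) for k in range(100))  # should be 2L
M_at_04 = count_location_map(0.4, L, pgf)         # should match Haldane at d = 0.4
```

Other completely monotone choices of $G$ with $G(0) = 1$ and $G'(0) = -2$ can be substituted, provided their derivatives at $L$ are available in closed form.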

Since both $X_{c,F}$ and $X$ have the same map function, and these map functions satisfy (5.1) for unions of intervals, they have the same avoidance functions, and hence the same distribution. This completes the proof.

6. Interference, map distance and differential equations. Crossover interference was described by Sturtevant (1915) and by Muller (1916); see Foss et al. (1993) for a summary of the history of this topic. The traditional measure of interference is the coincidence $c$, which is the ratio of the chance of simultaneous recombination across both of two disjoint intervals $I_1$ and $I_2$ on a chromosome, to the product of the marginal probabilities of recombination across the intervals:

\[ c = \frac{r_{11}}{(r_{10} + r_{11})(r_{01} + r_{11})}. \tag{6.1} \]

In this formula $r_{ij}$ is the chance of $i$ recombinations across interval $I_1$ and $j$ recombinations across $I_2$, $i, j = 0, 1$. If there were no crossover position interference, and no chromatid interference, the coincidence would equal one. Observed coincidences tend to be near zero for small, closely linked intervals, increasing to one for more distant intervals.

A number of forms of $c$ have been used in the literature to describe the dependence of coincidence on map distance, and we refer to two such here. Haldane (1919) introduced what we call the semi-infinitesimal 3-point coincidence function (Liberman and Karlin (1984) call it the marginal coincidence function) $c_3(d) = \lim_{h \to 0} c(d, h)$, where $c(d, h)$ is the coincidence between an interval $I_1$ of map length $d$ and a contiguous interval $I_2$ of map length $h$. Here and in what follows we suppose that all limits exist, and are independent of the locations of the defining intervals, assumptions that are valid when chiasma processes are simple stationary point processes and NCI holds. Haldane (1919) used $c_3$ to obtain the following differential equation for a map function: $M(0) = 0$, and

\[ M'(d) = 1 - 2c_3(d)M(d). \tag{6.2} \]
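A minimal numerical sketch of how (6.2) generates map functions is given below. It assumes that the coincidence depends on $d$ only through $M$, so that the equation becomes autonomous, and integrates it with a classical fourth-order Runge-Kutta step. The function name `solve_map_function` and the step count are hypothetical choices. The labels attached to the three cases (Haldane for $c_3 = 1$, Kosambi for $c_3 = 2M$, Carter-Falconer for $c_3 = (2M)^3$) are the standard closed-form solutions of the corresponding equations and are supplied here as an identification, not taken from the passage above.

```python
import math

def solve_map_function(c3_of_M, d_max, steps=2000):
    """Integrate M'(d) = 1 - 2 * c3(M) * M from M(0) = 0 up to d_max (RK4)."""
    h = d_max / steps
    f = lambda M: 1.0 - 2.0 * c3_of_M(M) * M
    M = 0.0
    for _ in range(steps):
        k1 = f(M)
        k2 = f(M + 0.5 * h * k1)
        k3 = f(M + 0.5 * h * k2)
        k4 = f(M + h * k3)
        M += h * (k1 + 2.0 * k2 + 2.0 * k3 + k4) / 6.0
    return M

d = 0.5  # map distance in Morgans, chosen for illustration
haldane = solve_map_function(lambda M: 1.0, d)                     # c3 = 1
kosambi = solve_map_function(lambda M: 2.0 * M, d)                 # c3 = 2M
carter_falconer = solve_map_function(lambda M: (2.0 * M) ** 3, d)  # c3 = (2M)^3

# Closed forms for comparison
haldane_exact = 0.5 * (1.0 - math.exp(-2.0 * d))
kosambi_exact = 0.5 * math.tanh(2.0 * d)
# Carter-Falconer is usually given through its inverse, d as a function of M
cf_distance = 0.25 * (math.atan(2.0 * carter_falconer)
                      + math.atanh(2.0 * carter_falconer))
```

Increasing interference (coincidence rising more slowly from zero) gives a larger recombination fraction at the same map distance, so the three values are ordered `haldane < kosambi < carter_falconer`.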

We refer to Liberman and Karlin (1984) for more details concerning this approach to map functions, and for a variety of examples obtained by this method. Karlin (1984) lists two difficulties with the construction of map functions using (6.2), the major one being that we do not know in advance which functions $c_3(d)$ will lead to map functions which can arise in practice. As we will see, $c_3(d) = 2M(d)$ and $c_3(d) = (2M(d))^3$ do lead to map functions which can arise, but there is no obvious way in which this could have been known in advance. Just as we saw in section 4 that a map function can define three-locus but not four-locus recombination probabilities, so we can see that the coincidence function $c_3$ can only capture aspects of the chiasma or crossover process involving three but no more loci.

An alternative form of $c$, which we term the infinitesimal 4-point coincidence function $c_4(d)$, is defined as $\lim_{h \to 0} \lim_{k \to 0} c(d, h, k)$, where $c(d, h, k)$ is the coincidence between intervals $I_1$ and $I_2$ of map lengths $h$ and $k$ respectively, separated by map distance $d$. This measure is called $S_4$ by Foss et al. (1993), and seems to capture a more important aspect of crossover position interference than does $c_3$. For example, by their construction, non-Poisson count-location processes manifest no crossover position interference. However, while $c_4(d)$ is constant for such processes, as one might expect, $c_3(d)$ is not constant. The latter results from the fact that the definition of $c_3(d)$ involves a non-infinitesimal interval of length $d$, and so $c_3(d)$ reflects features of the marginal probability of recombinations occurring in an interval more than the interference of recombination events.

7. Stationary renewal chiasma processes. In this section we show that stationary renewal chiasma processes, i.e. renewal chiasma processes that are stationary with respect to their intensity measure, when combined with the assumption of NCI, give rise to a large class of map functions which are not multilocus feasible in the sense of Liberman and Karlin (1984). Indeed we will see in the next section that all of the map functions proposed to date can be associated with stationary renewal chiasma processes. It follows that there are many chiasma processes with map functions $M$ for which (5.1) holds for all intervals $A$, but not all unions $A$ of intervals. We will also find that it is possible for two stationary chiasma processes to have different distributions but the same map function; indeed one can satisfy (5.1) for all unions $A$ of intervals, implying that the map function is multilocus feasible (and more), while the other process does not satisfy (5.1) for all such $A$. The realism or otherwise of stationary renewal chiasma processes is discussed in section 9 below. We begin by listing a set of conditions (A) on a function $M$ from $[0, L)$ to $[0,1)$, where $L$ may be finite or infinite. These conditions and the proposition which follows are from Zhao (1995).


(A0) $M(0) = 0$;
(A1) $\lim_{d \uparrow L} M(d) = \tfrac12$;
(A2) $M'(d) \ge 0$ for all $d$;
(A3) $M'(0) = 1$;
(A4) $\lim_{d \uparrow L} M'(d) = 0$;
(A5) $M''(d) \le 0$ for all $d$.
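Conditions of this kind are easy to check numerically for a candidate map function. The sketch below does so for two familiar examples with $L = \infty$, the Haldane and Kosambi map functions, using their closed-form first and second derivatives. The names `MAP_FUNCTIONS` and `satisfies_A`, the grid, the tolerance, and the use of a large finite distance (`far = 20`) in place of the limit $d \uparrow L$ are all assumptions of this sketch.

```python
import math

# Each entry holds (M, M', M'') in closed form; both examples have L = infinity.
MAP_FUNCTIONS = {
    "haldane": (lambda d: 0.5 * (1.0 - math.exp(-2.0 * d)),
                lambda d: math.exp(-2.0 * d),
                lambda d: -2.0 * math.exp(-2.0 * d)),
    "kosambi": (lambda d: 0.5 * math.tanh(2.0 * d),
                lambda d: 1.0 / math.cosh(2.0 * d) ** 2,
                lambda d: -4.0 * math.tanh(2.0 * d) / math.cosh(2.0 * d) ** 2),
}

def satisfies_A(M, dM, d2M, far=20.0, n=400, tol=1e-9):
    """Check (A0)-(A5) on a grid over [0, far]; far stands in for d -> L."""
    grid = [far * i / n for i in range(n + 1)]
    return {
        "A0": abs(M(0.0)) < tol,
        "A1": abs(M(far) - 0.5) < tol,
        "A2": all(dM(d) >= -tol for d in grid),
        "A3": abs(dM(0.0) - 1.0) < tol,
        "A4": abs(dM(far)) < tol,
        "A5": all(d2M(d) <= tol for d in grid),
    }

results = {name: satisfies_A(*fns) for name, fns in MAP_FUNCTIONS.items()}
```

A grid check of this kind can refute a condition but cannot prove it; for the two examples here the conditions also follow directly from the closed forms.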

We note in passing that if $L = \infty$, then (A4) follows easily from the other conditions. However the (Morgan) map function $M(d) = d$, $0 \le d \le \tfrac12$, shows that (A4) is needed in the following proposition.

PROPOSITION 7.1. Let $M$ be the map function for a stationary renewal chiasma process satisfying NCI on a chromosome arm of infinite map length. Then $M$ satisfies conditions (A). Conversely, suppose that a function $M : [0, L) \to [0,1)$ satisfies conditions (A), where $L$ may be finite or infinite. Then there is a stationary renewal chiasma process satisfying NCI whose map function is $M$. In both cases, the renewal density is $-M''$.

Proof. Suppose that $X$ is such a stationary renewal chiasma process with renewal density $f$. Without loss of generality we may suppose that the mean inter-arrival time is $\tfrac12$, so that the metric with respect to which the process is stationary is that defining map distance. If $F$ is the cumulative distribution function of $f$, then the residual lifetime density of the process is $2(1 - F)$ and the avoidance function for an interval $I$ of map length $d$ is thus

\[ Z_X(I) = \int_d^{\infty} 2\bigl(1 - F(y)\bigr)\,dy. \]

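By Mather's formula (3.2), we have:

\[ M(d) = \tfrac12\left[1 - \int_d^{\infty} 2\bigl(1 - F(y)\bigr)\,dy\right]. \tag{7.1} \]

Conditions (A) are now easily checked.

Conversely, suppose that we have a function $M$ satisfying conditions (A). We can see that $-M''(y) \ge 0$ by (A5). Further, by (A3) and (A4),

\[ \int_0^L -M''(y)\,dy = M'(0) - M'(L) = 1, \]

and

\[ -\int_0^L y\,M''(y)\,dy = -\bigl[\,y\,M'(y)\bigr]_0^L + \int_0^L M'(y)\,dy = \tfrac12 \]

by (A4), (A0), and (A1). Finally, we obtain

\[ M(d) = \tfrac12\left[1 - \int_d^L 2M'(y)\,dy\right]. \]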


Thus $M$ is the map function associated with the stationary renewal chiasma process with the renewal density $-M''$ having mean $\tfrac12$ and residual lifetime density $2M'$. This completes our proof.

As indicated in the introduction to this section, this proposition allows a very wide range of functions to arise as map functions; we will give examples in the next section. It is interesting to note that map functions $M = \tfrac12[1 - G]$, where $G$ is completely monotone, $G(0) = 1$ and $G'(0) = -2$, also satisfy conditions (A) when we permit $L = \infty$. It is immediate that such $M$ satisfy (A0), (A2), (A3) and (A5). To see that they also satisfy (A1) and (A4), it is easiest to use the representation of such a $G$ as the Laplace transform of a positive measure, i.e. to represent $M$ in the form

where